Computation Wireless Communications

Kun Tan

Wireless and Networking Group

MSR Asia

2010 Summer

Agenda

• Software (Defined) Radio

– Basic concept, Architecture and System

• Computational Thinking in Wireless Research

– Examples from three SIGCOMM papers

• Sora Tutorial

– Platform, tools and programming

• Digital Wireless Communication Illustrated (optional)

– Case study of IEEE 802.11

Software Radio: Basic concept, Architecture and System

What is Software (Defined) Radio

• Historically defined by Mitola (1992)

– Radio’s physical layer behavior is primarily defined in software

– Accepts fully programmable data and control traffic

– Supports broad range of frequencies, air interfaces (protocol), and applications

– Is able to dynamically change its configuration according to user requirements

From Hardware to Software

[Figure: radio architecture. The processing chain runs Antenna → RF/IF → Baseband Processing → High Layer Processing, and three designs are compared]

– Hardware Radio: hardware (ASIC) and software (GPP), joined by a network interface

– Software Defined Radio: hardware (ASIC), programmable hardware, and software (GPP), joined by a network interface

– Software Radio: hardware (ASIC) up to the ADC/DAC, then software (GPP) over a bus, exposed through a virtual network interface

Software (Defined) Radio Benefit

• Flexibility and programmability

• Reduced research-product-market cycle

• Scalability to potential large systems

• Open architecture

• Reduce the cost for customized silicon

• Better performance/market trajectory

• Enhanced maintainability

Approaches

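[Figure: design space with programmability (low to high) on the horizontal axis and performance (low to high) on the vertical axis. Two existing approaches are marked: programmable hardware (FPGA + DSP) and conventional GPP software.]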

WARP Platform (Rice University)

• Xilinx XC2VP70 FPGA for DSP

– With 2 embedded PowerPC processor

• Low-level programming on hardware logic (Verilog)

• Matlab + sysgen

• Limited by single chip FPGA, difficult to scale

• High price (~$10,000USD)

• Promised good performance, but…

– No sophisticated PHY (e.g. 802.11 or 3G/4G)

– Limited achievable network throughput demoed (~Mbps)

GNURadio/USRP

• All DSP is performed on CPU

• Modular design

– C/C++ and Python script

• Slow interface to PC

– v1: USB2.0

– v2: Gigabit Ethernet

• Large latency (ms)

• Software is not optimized

– Large overhead

– Not ready for multiple cores

• Limited performance

– Achievable wireless throughput is low

Approaches

[Figure: the same programmability/performance design space, annotated]

– Programmable hardware (FPGA + DSP): difficult to program, does not scale, expensive

– Conventional GPP software: slow, limited capability, no real-time support

– Sweet point: high performance with sophisticated wireless, easy to program, scales well

Programmable Hardware vs. GPP

Programmability & Flexibility

– General Purpose Processor: programming in high-level languages and abstractions; mature tools for programming and debugging

– Programmable Hardware: low-level programming; limited programming tools

Open architecture

– General Purpose Processor: standard instruction set and architecture; portability

– Programmable Hardware: specialized or proprietary architecture; limited interoperability

Scalability

– General Purpose Processor: loosely coupled; scales with state-of-the-art computer systems

– Programmable Hardware: tightly coupled embedded design; difficult to scale

Programmable Hardware vs. GPP (cont.)

Trajectory

– General Purpose Processor: quicker evolution and better optimization driven by a large market; Moore’s law continues in the multi-core trend

– Programmable Hardware: isolated, specialized market

Whole system price

– General Purpose Processor: cost per unit of performance decays exponentially

– Programmable Hardware: holds mostly still, though the price of computing components has fallen

Performance for DSP

– General Purpose Processor: mainly designed for general computing tasks, not specifically for DSP. How well can we do?

– Programmable Hardware: satisfies the requirement, with great engineering effort

Architecture of a GPP-based SDR System

[Figure: Antenna → RF front-end circuit (up/down convertor, ADC/DAC) → PC interface → CPU + memory. RF/IF is handled in the front-end; baseband processing runs as a program on the CPU.]

Fundamental Challenges

• System interface throughput

– Large volume of high-fidelity digital samples

– From 1Gbps to 10Gbps

• Computation

– Large amount of arithmetic calculations for digital signal processing

– Tens of GOPS estimated

• Real-time support

– Hard deadline and accurate timing control for wireless protocols

– From 10ms (multi-media) to 1 μs

How Can We Meet the Requirements?

• Take advantage of modern PC bus technologies

– Driven by high-speed interconnection among PC subsystems

– PCIe – 2Gbps per lane and up to 64Gbps w/ x32; standard in current PCs

– PCIe V2 – 5Gbps per lane, up to 128Gbps w/ x32

– Upcoming PCIe V3 (256Gbps) and Silicon Photonics (billion bps promised)

• Ride the wave of multi-core technology

– Sustain Moore’s law when CPU hits the heat wall

– Four-core processing is standard; six/eight cores for high-end configurations

– 64 and more cores are coming

– Software innovation to unleash the power of parallel programming

The Sora Approach

• New hardware interconnection board based on PCIe

• High performance PHY implementation on multi-core

– Trade memory for computation

– Exploit data parallel with SIMD

– Streamline across multiple cores

– Scale with the ever increased cores

• Core dedication for real-time support

– A dumb idea whose time has come

Sora Architecture

[Figure: Antenna → RF front-end → Sora FRL → RCB (with memory) → PCIe bus → multi-core CPU, whose cores run the SR processing]

Sora Radio Control Board

• PCIe-8x interface: up to 16Gbps throughput

• Sora Fast Radio Link slots

– 2Gbps per channel

– Up to 8 channels (8x8 MIMO)

Sora Fast Radio Link

• An abstract protocol to transfer radio control information and digital samples between RF front-end and baseband

– Abstract registers for RF control and status information

– Common I/Q sample format

• Enable interoperation between any baseband and any RF hardware

[Figure: FRL end-point on the RF side. An interpreter maps the abstract register map to vendor-specific operations on the RF chipset; the FRL end-point connects to the RCB.]

Connect to an RF front-end

• A convertor board implements Sora FRL

• It connects to a 2.4GHz RF front-end board

Sora Software Architecture

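[Figure: Sora software stack. Network applications and Sora tools run above the kernel TCP/IP stack. The Sora stack presents an Ethernet interface and contains the link layer (Sora FSM Lib, Sora framing Lib), SW MAC, and SW PHY, together with Sora DSP Lib, Sora ethread Lib (core dedication), Sora Streaming Lib, radio management, and RCB management, all on top of the Sora RCB.]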

Programming PHY

• Basic model – a pipeline of processing blocks

• Key design questions

– What is the right implementation tradeoff?

– How to exploit the parallelism in PHY and scale with multiple cores?

– How to schedule the execution of processing blocks?

– How to realize real-time services?

[Figure: an example receive pipeline built from the blocks DMA Buf, Remove DC, Filter, Demapper, and Decoder, connected through FIFOs and ending in a Frame Buf]

Key Design Choice One: Trade Memory for Computation

• Exploit large high-speed cache memory of CPU

– Multi-megabyte L2/L3 cache

• Extensive use of lookup tables (LUT) to store precomputed results

– Applicable to more than half of the common algorithms; speedup ranges from 1.5x to 22x

Ex: Convolutional encoder

[Figure: convolutional encoder with six one-bit delay elements (Tb) and two adders producing Output Data A and Output Data B]

– Direct implementation: 8 ops per bit

– LUT implementation: 2 lookup ops for 8 bits (table size 32KB)
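The memory-for-computation trade can be sketched with the convolutional encoder example. This is a minimal illustration, not Sora's code: it assumes the 802.11a generator polynomials (133 and 171 octal, constraint length 7), least-significant-bit-first input order, and a table indexed by (6-bit state, input byte). Under those assumptions the table has 2^14 entries of 16 output bits each, which happens to be 32KB.

```python
from array import array

# Assumed generator polynomials (802.11a, K = 7); the slide does not name them.
G_A, G_B = 0o133, 0o171


def parity(value):
    """Return the XOR of all bits in value."""
    return bin(value).count("1") & 1


def encode_direct(bits, state=0):
    """Bit-by-bit encoder; state holds the previous 6 input bits."""
    out = []
    for b in bits:
        reg = (b << 6) | state          # newest bit sits in the top position
        out.append(parity(reg & G_A))   # Output Data A
        out.append(parity(reg & G_B))   # Output Data B
        state = reg >> 1                # drop the oldest bit
    return out, state


def build_lut():
    """Precompute the 16 output bits for every (state, input byte) pair."""
    lut = array("H", [0]) * (64 * 256)
    for state in range(64):
        for byte in range(256):
            bits = [(byte >> i) & 1 for i in range(8)]   # LSB first (assumption)
            out, _ = encode_direct(bits, state)
            lut[(state << 8) | byte] = sum(o << i for i, o in enumerate(out))
    return lut


LUT = build_lut()


def encode_lut(data, state=0):
    """Encode whole bytes with one table lookup per byte."""
    out = []
    for byte in data:
        word = LUT[(state << 8) | byte]
        out.extend((word >> i) & 1 for i in range(16))
        state = byte >> 2               # the last 6 input bits become the state
    return out, state
```

Both encoders produce the same bit stream; the LUT version replaces the per-bit XOR work with one indexed load per input byte, at the cost of keeping the table resident in cache.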

Key Design Choice Two: Exploit Data Parallelism with SIMD

• Utilize short-vector SIMD extension in CPU

– Simultaneously perform calculations on multiple elements of vectors

Ex. (I)FFT

– Applicable to many PHY algorithms with significant speedups (1.6x ~ 50x)

Key Design Choice Three: Pipeline across Cores

• Partition PHY processing work across cores

• Interconnecting sub-pipeline with light-weight, synchronized FIFOs

[Figure: the receive pipeline split across Core 1 and Core 2, joined by a synchronized FIFO. Blocks shown: Decimation, Remove GI, FFT, Demod + Interleaving, Viterbi decoding, Descramble.]

Key Design Choice Four: Static Scheduling

• PHY pipeline is synchronized

– Behavior of each processing block is predetermined

– Compute a lock-free schedule at compile time

• PHY processing blocks are very fine-grained

• Benefit over multi-thread model

– No context switching overhead

– No need to check execution condition

– Avoid cache flushes/misses

– Mitigate synchronization overhead

[Figure: statically scheduled blocks Filter, Demapper, and Decoder connected through FIFOs]

Key Design Choice Five: Core Dedication for Realtime

• Software is considered “uncertain” in traditional OS – why?

– Multiplexed with multiple tasks/processes/threads

– Contention in memory/bus

– Interrupts

– RTOS – complex/overhead/limited functionality

• Core dedication – a dumb idea for multi-core

– Exclusively allocate enough cores for RT tasks

– Guaranteed resources: CPU/cache/memory

– I/O polling, instead of interrupts

– Precise timing control at CPU clock level

– Simple abstraction, and easier to implement in standard OSes (even implemented in Windows)

Put Them Together

• Turn a commodity multi-core Windows PC into a powerful software radio

• Orders of magnitude better than prior art

• Demonstrated capabilities

– First WiFi (802.11 a/b/g) in pure software

• Standard compliant up to 54Mbps data rate

• Interoperable with commercial NIC

– First LTE (uplink) in pure software running online

• 43Mbps peak data rate

Sora in the Sweet Point

[Figure: the programmability/performance design space again, showing programmable hardware (FPGA + DSP), conventional GPP software, and Sora at the sweet point: a big bet on commodity multi-core CPU and PC]

MSR Software Radio Kits

• For academic non-commercial research

– Pre-order for RCB hardware

– Software download online

• More information on Sora

– http://research.microsoft.com/sora

– http://research.microsoft.com/en-us/projects/sora/academickit.aspx

Summary - Time for Software Radio Has Come

• History will always repeat itself

– General computing platforms always win over specialized platforms, when the processing ability of the former exceeds the application requirements

– Consider multimedia twenty years ago

• Practical applications

– Test, measurement, monitoring instruments

– Cellular base-station/Access points

– Prototype and research

• Challenges remain

– Power consumption

– Security

– People’s fear of the “unknown”

Computational Thinking in Wireless Research

Examples from three SIGCOMM papers

What is Computational Thinking?

• It is the fundamental skill a computer scientist uses to solve problems

– Especially large-scale, difficult problems

• It all originates from the fundamental question: what is computable?

• Computational thinking rests on solid theory, but differs from mathematical thinking

– It is about not only characterizing the solution, but also how efficiently you can obtain it

• Computational thinking relies on excellent engineering, but differs from engineering thinking

– It is about not only a good solution, but also a fundamental understanding of the difficulty and the solution space

Key Mental Tools for Computational Thinking

• Critical thinking on fundamental tradeoffs

– Why is the problem difficult?

– What are the fundamental constraints?

– Is an approximate solution good enough?

• Intuitive thinking with heuristic reasoning

– Getting quick insight on the problem

– Constructive way to find a solution or a counterexample

• Thinking using abstraction and decomposition

– Choosing the right representations

– Separation of concerns

– Divide-and-conquer

Key Mental Tools for Computational Thinking (cont.)

• Thinking recursively

– Thinking about layering, indirections, and architecture things with simple rules

– Recognizing the virtues and the dangers of a concept, an abstraction

– Understanding the power and the cost of a method

• Thinking in terms of prevention, protection and recovery

– Prepare for the worst-case; not optimize for it

• Thinking aesthetically

– Computational thinking is also about the beauty of a solution

Paper I: XORs in The Air: Practical Wireless Network Coding

Published in ACM SIGCOMM’06

Problem

Increasing the throughput of dense wireless mesh networks

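[Figure: Alice and Bob exchange packets through a relay. Traditional approach: four transmissions (1, 2, 3, 4). Network coding approach: three transmissions, with the third being 3 = 1 XOR 2. A further panel adds the nodes Dick and Clare.]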

Two Departures

• Accept wireless as a broadcast medium

– Dispose of the point to point abstraction

• Routers mix bits in packets, then forward them → Network coding

– Dispose of the store-and-forward primitive of routing

COPE Design

• Difficulty

– When and how to code?

– How to decode?

• Key principle – Coding opportunistically

– Gain when chances happen; large throughput increase in general case

– Bounded worst case performance

Snooping

• Exploit wireless broadcast

• Snoops on all frames in a time window

• Tradeoff between storage and coding gain

• Node sends Reception Reports to tell its neighbors what packets it heard

– Piggyback to mitigate overhead

– Periodically send standalone reports when no frames are transmitted

• Preventing deadlocks

Coding

• Bound the worst case

– To send frame p to neighbor A, XOR p with frames already known to A

• Thus, A can always decode; no worse than the traditional design

• Benefit multiple neighbors opportunistically

Coding Illustration

Coding Algorithm

• Simple rule – decodable condition

– XOR n frames together iff the next hop of each frame already has the other n-1 frames, i.e. all of them apart from the one it wants

• How does a node know its neighbors’ frames?

– Reception reports

– Guesses based on delivery rate between the two nodes

• Dealing with lost or delayed reports

– If error occurs, recover by retransmission

• If there is no coding chance, just forward directly; never delay a frame
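The decodable condition can be written down directly. The sketch below is only an illustration of the rule stated above, not COPE's implementation; the function names, the `(frame_id, next_hop)` tuples, and the `has` dictionary (what each neighbor is believed to hold, from reception reports or guesses) are all hypothetical.

```python
def can_code_together(frames, has):
    """Check the decodable condition for a candidate set of frames.

    frames: list of (frame_id, next_hop) pairs queued at the router.
    has:    dict mapping a neighbor to the set of frame ids it already holds.
    """
    ids = {frame_id for frame_id, _ in frames}
    # Every next hop must already hold all the other frames in the mix.
    return all(ids - {frame_id} <= has.get(hop, set()) for frame_id, hop in frames)


def xor_frames(payloads):
    """XOR payloads byte by byte; shorter payloads are treated as zero-padded."""
    out = bytearray(max(len(p) for p in payloads))
    for payload in payloads:
        for i, byte in enumerate(payload):
            out[i] ^= byte
    return bytes(out)
```

In the Alice-Bob case the router holds frame 1 (for Bob) and frame 2 (for Alice); each endpoint still has the frame it sent, so the rule holds and each side recovers the other's frame by XORing the coded frame with its own.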

Performance: Alice-Bob Case

Almost double the throughput

Summary

• First integration of network coding into the network stack

• Simple ideas that work → Elegant

– Opportunistic design

– Optimize for the normal case and bound the worst case

– Carefully making tradeoffs

Paper II: PPR: Partial Packet Recovery for Wireless Networks

Published in ACM SIGCOMM’07

Problem

• How many bits are erroneous when a frame is lost?

[Figure: two packets, P1 and P2, overlap in time; the bits outside the overlap are non-colliding bits]

Key Question – How to Make Use of Correct Bits?

[Figure: colliding packets P1 and P2, each with a preamble and a checksum]

Difficulties

1. How does receiver know which bits are correct?

2. How does receiver know P2 is there at all?

3. How to design an efficient ARQ protocol?

How can receiver identify correct bits?

• Use physical layer (PHY) hints: SoftPHY

– Receiver PHY has the information!

– Pass this confidence information to higher layer as a hint

• New interface abstraction: SoftPHY

– PHY-independent

Opportunistically Guess Erroneous Bits

[Figure: colliding packets P1 and P2 with their preambles; regions of the received bits are marked high uncertainty or low uncertainty]

Example: SoftPHY hint for spread spectrum

• The hint is the Hamming distance between the received chips and the decided-upon codeword

– SoftPHY hint is 2

Receive: 11101101000111000011010110100010

C1:      11101101100111000011010100100010

– SoftPHY hint is 9

Receive: 11001101000111010111011110110111

C1:      11101101100111000011010100100010
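The two hints in this example can be reproduced with a few lines. This is a small sketch of the Hamming-distance computation only; the function name is hypothetical, and a real receiver would first pick the closest codeword among all candidates before reporting the distance.

```python
def softphy_hint(received_chips, codeword):
    """Hamming distance between received chips and the decided-upon codeword."""
    if len(received_chips) != len(codeword):
        raise ValueError("chip sequences must have equal length")
    return sum(a != b for a, b in zip(received_chips, codeword))


C1 = "11101101100111000011010100100010"
RX_CLEAN = "11101101000111000011010110100010"   # first received sequence on the slide
RX_NOISY = "11001101000111010111011110110111"   # second received sequence on the slide
```

A small hint means the decision was made with high confidence; a large one flags the corresponding bits as suspect.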

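[Figure: PPR frame layout: preamble (training sequence, SFD), header, body, trailer, postamble (training sequence, EFD). The fields len, dst, and src appear in both the header and the trailer; a cksum field is also shown.]

Postamble Decoding – Store and Decode

[Figure: colliding packets P1 and P2, showing their preambles and a postamble]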

ARQ with partial packets

• Resend only incorrect bits

• Difficulty

– How to efficiently tell sender about what happened?

• Efficient feedback

– Labeling bits “good” or “bad”

– Merging the good/bad ranges with minimal bit cost

• Dynamic programming problem

• Accept some false positives and false negatives

[Figure: a frame divided into runs of “good” bits and “bad” bits]

Throughput improvement 2.3-2.8x

Summary

• New mechanisms for recovering correct bits from parts of frames

– SoftPHY interface (PHY-independent)

– Postamble decoding

• Inspirations on new applications

– Collision detection

– Opportunistic forwarding

Paper III: Fine-grained Channel Access in WLAN

To appear in ACM SIGCOMM’10

Problem

• How much can we benefit from the increase of PHY data rate?

• Fixed frame size 1500B

Understanding the Overhead

[Figure: air-time timeline of one transmission: Busy, DIFS, Backoff, Transmission (useful air time), SIFS, ACK. Two intervals are labeled 10~16μs and 9~20μs.]

• When the PHY data rate increases, the useful air time shrinks, but the overhead remains the same

• Constrained by fundamental physics laws and electronics

Limitation of Existing MAC

• Allocate whole channel to one user at a time

– Single carrier: too coarse when

• The bandwidth is wide

• Data rate is high

– Aggregate a large amount of data to be efficient

• 23KB per frame for 80% efficiency at 300Mbps data rate

• Adverse interaction w/ RT, interactive, Web traffic, etc.
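The aggregation figure follows from a simple air-time ratio: efficiency = T_data / (T_data + T_overhead). The sketch below is a back-of-the-envelope model, not the paper's analysis. The per-frame overhead is an assumption back-computed from the slide's own numbers (23KB taken as 23 × 1024 bytes, 80% efficiency, 300Mbps), not taken from the 802.11 standard.

```python
def airtime_efficiency(frame_bytes, rate_bps, overhead_s):
    """Fraction of air time spent on useful data for one frame."""
    data_s = frame_bytes * 8 / rate_bps
    return data_s / (data_s + overhead_s)


def implied_overhead(frame_bytes, rate_bps, efficiency):
    """Fixed per-frame overhead that yields the given efficiency."""
    data_s = frame_bytes * 8 / rate_bps
    return data_s * (1 - efficiency) / efficiency


def min_frame_bytes(rate_bps, overhead_s, efficiency):
    """Smallest frame that reaches the target efficiency."""
    return efficiency / (1 - efficiency) * overhead_s * rate_bps / 8


# Assumption: overhead derived from the slide's 23KB / 80% / 300Mbps data point.
OVERHEAD_S = implied_overhead(23 * 1024, 300e6, 0.80)
```

Under this assumed overhead (roughly 157 μs), a 1500-byte frame at 300Mbps occupies only 40 μs of useful air time, so its efficiency comes out near 20%.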

Call for a new access method – Fine Grained Channel Access

Why Difficult?

• Directly reduce the channel width

– Guard-band overhead increases significantly with fine subchannels

• 75% overhead if channel width is 5MHz w/ 802.11a

• Use orthogonal subcarriers

– Overlapped but cross-interference is zero

– OFDM-like transmission among different nodes

– Require coordination among nodes

Design Space

• Tight centralized control

– Tight time synchronization and transmission control

– Mitigate the symbol misalignment

– Proposed in 4G standards; but not fit for unlicensed wireless

• Distributed coordination

– Rely on carrier sensing and broadcasting

– Accommodate the possible misalignment

– Suitable for WLAN in unlicensed band

FICA

FICA Approach

• A new PHY architecture

– Adopt a large guard-interval (or CP) and large OFDM symbol time to accommodate the time misalignment

• A new MAC contention and backoff scheme

– Time-domain random backoff is inefficient

– Frequency-domain contention and backoff

New PHY Architecture

• Timing misalignment in WLAN

– Bounded by carrier sensing (t_mis < 11 μs) and broadcasting (t_mis < 2 μs)

• Symbol structure

– Long CP 11.8μs, short CP 2.8μs

– Data symbol 15.6 μs (20% CP overhead)

• Subchannelization: 1.33MHz (17 subcarriers of 256-FFT)
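The symbol-structure numbers are mutually consistent if the 256-point FFT spans a 20MHz channel. That bandwidth is an assumption here (it is not stated on this slide); the rest comes from the figures above. A quick check:

```python
BANDWIDTH_HZ = 20e6                  # assumption: 20MHz channel
FFT_SIZE = 256
SHORT_CP_S = 2.8e-6
LONG_CP_S = 11.8e-6
SUBCARRIERS_PER_SUBCHANNEL = 17

subcarrier_spacing_hz = BANDWIDTH_HZ / FFT_SIZE          # width of one subcarrier
fft_period_s = FFT_SIZE / BANDWIDTH_HZ                   # useful part of one symbol
data_symbol_s = fft_period_s + SHORT_CP_S                # useful part plus short CP
cp_overhead = SHORT_CP_S / data_symbol_s                 # share of the symbol spent on CP
subchannel_width_hz = SUBCARRIERS_PER_SUBCHANNEL * subcarrier_spacing_hz
```

With these inputs the data symbol lasts 15.6 μs and a subchannel is about 1.33MHz wide, matching the slide; the CP share is about 18%, which the slide rounds to 20%.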

Frequency Domain Contention

• Use PHY signaling

– Specially designed OFDM symbols for M-RTS/CTS

• Modulate contention information on different contention subcarriers

• Contention resolved by choosing a winning subcarrier

[Figure: contention subcarriers with ids 0 to 15, used in the M-RTS and M-CTS symbols]

Frequency Domain Backoff

• Make frequency-domain contention scale

• Apply congestion control principles

– Control the maximal number (cmax) of subchannels to access

• Reduce cmax if collision

• Increase cmax if success

– Various rules: Reset-to-Max, AIMD
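The cmax adjustment can be sketched as a small update function. The slide only names the two rules, so everything concrete below is an illustrative assumption: halving on collision, adding one subchannel on success for AIMD, and jumping back to the total on success for Reset-to-Max. The exact rules and constants are defined in the FICA paper.

```python
def update_cmax(cmax, collided, total_subchannels, rule="aimd"):
    """Return the new cap on the number of subchannels a node may contend for.

    Hypothetical constants: halve on collision; on success either add one
    (rule="aimd") or reset to the total (rule="reset-to-max").
    """
    if collided:
        return max(1, cmax // 2)
    if rule == "reset-to-max":
        return total_subchannels
    return min(total_subchannels, cmax + 1)
```

As in congestion control, the cap shrinks quickly when contention is heavy and grows back as transmissions succeed.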

Access with FICA

• M-RTS: coordinated with carrier-sensing, using long CP

• Other frames: coordinated with previous frames, using short CP

[Figure: channel timeline: Busy, then Multi-tone RTS, then Multi-tone CTS]

Results

• Mixed traffic: 5 saturated; the others have random traffic rates from 800Kbps to 5Mbps

[Figure: two result panels. One compares 11n PHY at 150Mbps with FICA PHY at 145Mbps; the other compares 11n PHY at 600Mbps with FICA PHY at 580Mbps.]

Summary

• A radical design for random access in wireless network

– Balanced design among synchronization requirements and PHY efficiency

– PHY signaling and frequency domain contention to reduce the signaling overhead

– Frequency domain backoff

• Explore another dimension for backoff

• Inspire new thinking on wireless protocol design

Conclusions

• Computational thinking is a fundamental skill that everyone should have

• It is thinking on problems and solutions, but fundamentally on problems

• It is conceptualizing, but not programming or artifact

• It complements and combines both mathematical and engineering thinking, but differs from either of them

• It is an ART

Sora Tutorial: platform, tools and programming

Using Sora for Wireless Experiments

• Option 1: Hardware Unit Test Tool

– A versatile driver for Sora platform, originally developed for hardware testing purpose

– Main functions:

• Transmit a pre-computed waveform – signal generator

• Dump a snap-shot of wireless channel – signal capture

– Simple, but very useful for offline signal processing

• Option 2: User Mode Extension

– User-mode APIs to develop a SDR program

– Simple programming model; easy debug

– Could be interrupted by critical kernel events

Using Sora for Wireless Experiments (cont.)

• Option 3: Kernel SDR driver

– Integration to the network stack; virtual NICs to support existing applications

– High privilege and real-time guarantee (with Core dedication)

– Difficult to program, but loved by system hackers

What you will learn next?

• How are Sora sending and receiving implemented?

• How to use the handy tools provided by the Sora SDK?

• How to develop an SDR application using the Sora User Mode Extension?

• How to program parallel processing using the Vector1 library?

Sora Receiving Architecture

[Figure: receive path: RF front-end → AD/DA → I/Q samples → RCB → DMA → cyclic DMA Rx buffer → receiving baseband]

Rx Buffer

• Rx descriptor format: 16B header followed by 112B of I/Q samples (28 samples × 32 bits)

• The CPU chases the DMA write point and waits when no samples are left to process

[Figure: the Rx buffer as a ring of descriptors holding frame signals, with a read point trailing a write point]
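The descriptor layout and the read-chases-write discipline can be sketched as follows. The 16B header and 28 × 32-bit samples come from the slide; splitting each sample into a signed 16-bit I followed by a signed 16-bit Q in little-endian order is an assumption, and the header is treated as opaque. The function names are hypothetical and this is not the Sora SDK API.

```python
import struct

HEADER_BYTES = 16
SAMPLES_PER_BLOCK = 28
BLOCK_BYTES = HEADER_BYTES + SAMPLES_PER_BLOCK * 4   # 16B + 112B = 128B


def parse_block(block):
    """Split one Rx descriptor into its opaque header and 28 complex samples."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("an Rx descriptor is exactly 128 bytes")
    header = block[:HEADER_BYTES]
    # Assumption: each 32-bit sample is int16 I followed by int16 Q.
    flat = struct.unpack("<56h", block[HEADER_BYTES:])
    samples = [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]
    return header, samples


def read_available(buffer, read_idx, write_idx):
    """Yield parsed blocks from the read point up to (not including) the write point."""
    n_blocks = len(buffer) // BLOCK_BYTES
    while read_idx != write_idx:
        start = read_idx * BLOCK_BYTES
        yield parse_block(buffer[start:start + BLOCK_BYTES])
        read_idx = (read_idx + 1) % n_blocks          # wrap around the cyclic buffer
```

When the read index catches up with the write index the generator stops, which corresponds to the CPU waiting for the DMA engine to deliver more samples.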

Sora Tx Architecture

• Cache waveform before actual transmission

[Figure: transmit path: sender baseband → Tx sample buffer → DMA transfer → waveform cache on the RCB → (Tx command) → AD/DA → I/Q samples → RF front-end]

Hardware Unit Test Tool

• Install driver

Hardware Unit Test Tool (cont.)

• dut.exe

Start Hardware Test

• Start Sora radio

– dut.exe start

• Start receiving module

– dut.exe rx

• Set proper gain values

– Leave Tx gain at its default

– Set Rx gain

– dut.exe rxpa --value 0x2000

– dut.exe rxgain --value 0x1800

Capture Channel

• Run: dut.exe dump

• And you can get a dump file at c:\

Analyze Dump File Offline

• File Format: a direct image of the Rx DMA buffer

Signal Generator

• Prepare a signal file using any favorite tool

• File format

– Each I/Q sample is a 16-bit complex number (I: 8 bits, Q: 8 bits)

– This is for historical reasons and is subject to change in a later version

I/Q I/Q I/Q I/Q …… I/Q
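A signal file in this layout can be produced with any scripting tool. The sketch below generates a complex tone and packs it as interleaved 8-bit I and Q values. Treating the values as signed bytes with I first, and assuming the file has no header, are assumptions; the slide only gives the sample size. The function name and parameters are hypothetical.

```python
import math
import struct


def tone_to_iq_bytes(n_samples, cycles, amplitude=100):
    """Return n_samples of a complex tone as interleaved signed 8-bit I/Q bytes."""
    if not 0 < amplitude <= 127:
        raise ValueError("amplitude must fit in a signed 8-bit value")
    out = bytearray()
    for n in range(n_samples):
        phase = 2 * math.pi * cycles * n / n_samples
        i_val = round(amplitude * math.cos(phase))
        q_val = round(amplitude * math.sin(phase))
        out += struct.pack("bb", i_val, q_val)       # I first, then Q (assumption)
    return bytes(out)
```

The returned bytes can be written to a file with `open(path, "wb")` and then loaded into the board with the transfer command described next.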

Load Signal File

• dut.exe transfer --file d:\signal.tdmp

• Note: the filename must be full path!

• Use dut.exe info to check the status

Transmit Your Signal

• Use dut.exe tx --sid 0x01

• Write a batch command to repeat transmission

– E.g. transmit 100 times

– for /L %i in (1,1,100) do dut.exe tx --sid 0x01

Sora User Mode Extension

• Quick way to compose a SDR application w/o losing too much performance

• Pros:

– Easy programming

– Easy debugging

– Low overhead in receiving samples

– Multiple streams virtualization

– Core dedication support

• Cons:

– Long latency in control path

– No integration in network stack

Sora UMX Architecture

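[Figure: Sora UMX architecture. Kernel mode: the RF front-end (AD/DA) delivers I/Q samples through the RCB and DMA into the cyclic DMA Rx buffer. User mode: Sora UMX apps reach the buffer through virtual memory mapping and send control commands through a Ctrl stub.]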

Initialize and Deinitialize UMX

• Initialize UMX

• Receive or transmit data

• Clean up UMX

Receiving using UMX

• Map Rx buffer to user mode

• Bind a RxStream object to Rx buffer

Receiving using UMX (cont.)

• Get a sample block through RxStream object

• Wait for the block to become valid, then process it

Receiving using UMX (cont.)

• Sample block

– An image of an Rx descriptor: 16B header followed by 112B of I/Q samples (28 samples × 32 bits)

– 28 samples per block

• Clean up after you are done

Sending using UMX

• Map modulation buffer to user mode

• Fill up the buffer with your own sample values

• Allocate RCB resource and transfer samples to RCB

Sending using UMX (cont.)

• Instruct RCB to emit your waveform

• Clean up

More information: UMX_sample at $(sorasdk)\src\AppExt

Programming with Vector1 Library

• M. Flynn’s taxonomy of computer architectures

                 Single Instruction   Multiple Instruction
Single Data      SISD                 MISD
Multiple Data    SIMD                 MIMD

• SIMD to handle data parallelism

– Operates on vectors of parallel data

– Inexpensive compared to MIMD

– Good enough for many apps

• Common SIMD H/W

– Intel SSE/AVX; IBM AltiVec

– GPU

Vector1 Library

• An abstract library to support SIMD programming

• High-level syntax and integration to C++

– Hiding significant details, e.g. register allocation

– Advanced operation abstractions

– Type-checking

• Platform independent

– Current implementation on Intel SSE

– Portable to other platforms

• Highly efficient code generation

– Integrated with the intrinsics of modern C++ compilers

– Same or better performance compared to hand-made optimization

Vector1 Data Type

• Vector size: 128 bits

• Must be 16-byte aligned!

Type       Element type               Element size   Vector length
vb/vub     byte (unsigned byte)       8b             16
vs/vus     short (unsigned short)     16b            8
vi/vui     int (unsigned int)         32b            4
vf         float                      32b            4
vcb/vcub   complex (unsigned) byte    16b            8
vcs/vcus   complex (unsigned) short   32b            4
vci/vcui   complex (unsigned) int     64b            2
vcf        complex float              64b            2

Vector1 Operations

• Arithmetic operations

– Add/saturate_add

– Sub/saturate_sub

– mul_low/mul_high: element-wise multiply, keeping the low (high) part of the results

– conj_mul: complex multiply with a complex conjugate

• Max/Min

– smax/smin: obtain element-wise max/min value

• Absolute value

– Abs

• Initialization

– Set_zero/set_all

Vector1 Operations (cont.)

• Interleave

– Interleave_low/interleave_high

• R = interleave_low (x, y);

x = x0 x1 x2 x3, y = y0 y1 y2 y3 → R = x0 y0 x1 y1

• Permutation

– permutation<idx>

• R = permutation<3,2,1,0> (x)

y0 y1 y2 y3 → R = y3 y2 y1 y0
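The lane movements of these two operations can be modelled with plain lists. This is a behavioural sketch in Python, not the Vector1 C++ API: the functions mirror the diagrams above for 4-element vectors, and how the real library maps template arguments to lanes is an assumption (a full reversal cannot distinguish the two possible conventions).

```python
def interleave_low(x, y):
    """Alternate the low halves of x and y: x0 y0 x1 y1."""
    half = len(x) // 2
    out = []
    for a, b in zip(x[:half], y[:half]):
        out += [a, b]
    return out


def interleave_high(x, y):
    """Alternate the high halves of x and y: x2 y2 x3 y3."""
    half = len(x) // 2
    out = []
    for a, b in zip(x[half:], y[half:]):
        out += [a, b]
    return out


def permutation(idx, x):
    """Result lane i takes element idx[i] of x (assumed convention)."""
    return [x[i] for i in idx]
```

For example, `permutation((3, 2, 1, 0), v)` reverses a 4-element vector, which is the case drawn on the slide.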

Vector1 Sample 1 – Calculate Energy

• Algorithm

• Code

$$E = \mathrm{Re}(x \cdot x^{*}) = \mathrm{Re}\left(\sum_{i=0}^{L} x_i \cdot x_i^{*}\right)$$
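As a scalar reference for the energy formula, here is a minimal Python sketch. It is not the Vector1 code from the slide, which did not survive extraction; in a SIMD version the per-sample products would presumably map onto an operation such as conj_mul, followed by a sum across lanes.

```python
def energy(samples):
    """Sum of x_i times conj(x_i) over all samples; the result is real."""
    return sum((x * x.conjugate()).real for x in samples)
```

For the samples 3+4j and 1-1j this gives 25 + 2 = 27.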

Vector1 Sample 2 – FIR Filter

• Algorithm

• One quick idea

$$y[k] = \sum_{j=0}^{L} c_j \cdot x[k-j]$$

Violates the 16-byte alignment rule!
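A scalar reference for the FIR formula is useful when testing a vectorized version. The sketch below is that reference only, under the assumption that samples before x[0] are zero. It also shows where the alignment problem comes from: for a fixed k the inner loop reads x[k], x[k-1], x[k-2], …, so a SIMD load starting at an arbitrary k - j will generally not begin on a 16-byte boundary.

```python
def fir(x, c):
    """Direct-form FIR: y[k] = sum over j of c[j] * x[k - j], with x[n] = 0 for n < 0."""
    y = []
    for k in range(len(x)):
        acc = 0
        for j, coeff in enumerate(c):
            if k - j >= 0:
                acc += coeff * x[k - j]
        y.append(acc)
    return y
```

Feeding a unit impulse returns the coefficients themselves, which is a quick sanity check for any optimized implementation.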

Vector1 Sample 2 – FIR Filter (cont.)

• A better way

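[Figure: the input vector x0 x1 x2 x3 … xk … xn stays aligned, while several copies of the coefficient vector c0 c1 c2 c3 … cL are laid out, each shifted by one element]

• Use multiple coefficient bands

• Avoids the 16-byte alignment issue, but are there any other issues?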

Vector1 Sample 2 – FIR Filter (cont.)

• Multiple coefficient bands w/ cache tilting

Input: x0 x1 x2 x3 … xk … xn

Coefficient bands, one 4-element vector per row:

0        0        0        c0
0        0        c0       c1
0        c0       c1       c2
c0       c1       c2       c3
…        …        …        …
c(L-3)   c(L-2)   c(L-1)   cL
c(L-2)   c(L-1)   cL       0
c(L-1)   cL       0        0
cL       0        0        0

[Figure: animation over the table above. Labels shown: rows t0, t1, t2, t3, …, t(L-1), tL; R1, R2, R3, R4; outputs Y0 Y1 Y2 Y3.]

Read sample code at $(sdkroot)\src\vector1\hello_vector

Summary

• Vector1 reduces the efforts to program SIMD

• But it is still in the development stage

– Try it now and understand it

– Write your code with it and test performance

– Give us feedback; your suggestions may be reflected in a later release!

Digital Wireless Communication Illustrated

Kun Tan

Wireless and Networking Group

MSR Asia

Outline

• Communication basics

– Sampling Theory and System

– Time Domain and Frequency Domain

• Overview of Digital Communication System

• Case study

– Spectrum spreading – 802.11b

– OFDM - 802.11a

Sampling System: ADC and DAC

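[Figure: sampling system. An ADC samples x(t) at t = nT, giving X(n) = x(nT); a DAC built from a pulse generator and a low-pass filter (LPF) reconstructs x̂(t).]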

Nyquist–Shannon Sampling Theory

• Studies the condition under which we can have x̂(t) = x(t)

• Theory: if a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced T = 1/(2B) seconds apart.

– fs = 1/T is called the sampling rate.

– Given a sampling rate fs, the maximal frequency component that can be reconstructed is limited by fs/2.

– fs = 2B is also called the Nyquist sampling rate

Impact of The Sampling Theory

• With properly taken samples, operating on those discrete samples has exactly the same effect as operating on the analog signal itself

• The foundation of Digital Signal Processing

• Signal reconstruction from samples

[Figure: reconstruction from samples taken at fs = 2B, using a pulse generator followed by a low-pass filter; the spectrum axis is marked at -B/2 and B/2]

Time Domain and Frequency Domain

• Fourier transform

– An analytical tool to better present and study the properties of a function, say f(t)

– Idea: represent an arbitrary function f(t) using a series of simple periodic functions e^{-2πjωt}

– The frequency of e^{-2πjωt} is ω

– F(ω) is called the frequency-domain representation of f(t)

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j 2\pi \omega t}\, dt$$

Discrete Fourier Transform (DFT)

• Operate on N discrete samples

• Provide samples at discrete frequencies at

where B=fs/2

• When N increases, we get finer samples in the frequency domain, but a longer delay to gather enough samples

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi j k n / N}, \quad k = 0, \ldots, N-1$$

kn

NB
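The DFT definition translates almost literally into code. The sketch below is a direct O(N²) evaluation for illustration, not the FFT a real PHY would use. The helper that maps a bin index to a frequency uses the standard relation k·fs/N, which is general DFT knowledge rather than something recovered from the slide.

```python
import cmath


def dft(x):
    """Evaluate X[k] = sum over n of x[n] * exp(-2*pi*j*k*n/N) for k = 0..N-1."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def bin_frequency(k, n_points, fs):
    """Frequency in Hz represented by bin k when sampling at fs."""
    return k * fs / n_points
```

Doubling N halves the spacing between bins, which is the finer frequency resolution mentioned above, at the price of waiting for twice as many samples.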

Overview of Communication System

[Figure: transmit chain: Frames → Channel Coding → coded bits → Digital Modulation → baseband signal samples → DAC → Up-convertor → RF signal (modulated carrier). Receive chain: RF signal → Down-convertor → ADC → baseband signal samples → Digital Demodulation → coded bits → Channel decoding → Frames.]

Digital Modulation and Demodulation

• Digital Modulation

– Map a series of bits onto one of a finite number, M, of alternative symbols

– Fundamental methods (a.k.a. Keying)

• Phase-Shift Keying (PSK)

• Amplitude-Shift Keying (ASK)

• Quadrature Amplitude Modulation (QAM)

QAM is a more general form of digital modulation that combines both PSK and ASK

QAM and I/Q Signals

• Modulate symbols on two quadrature carriers

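– Quadrature: the phase shift between the two carriers is 90 degrees

• cos(2πωt), sin(2πωt)

[Figure: I/Q modulator and demodulator. The in-phase (I) branch is mixed with cos and the quadrature-phase (Q) branch with sin to form the RF signal; the receiver mixes the RF signal with cos and -sin to recover I and Q.]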

Constellation Graph

• Useful to describe QAM

• Present symbols as points on a complex plane

– In-phase amplitude → Real part

– Quadrature-phase amplitude → Imaginary part

[Figure: constellation graph of QPSK on the I/Q plane]

Digital Demodulation

• Determine the transmitted symbol based on received I/Q values

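• Maximum a Posteriori (MAP) rule

– Find the symbol that is most likely given the received I/Q

$$\hat{s} = \arg\max_i\, p(s_i \mid y), \qquad p(s_i \mid y) = \frac{p(y \mid s_i)\, p(s_i)}{p(y)}$$

• Maximum likelihood

– MAP → ML (hard decision)

• Maximum likelihood ratio

– LLR → soft information

$$\Lambda(y) = \frac{\sup\{p(y \mid b_i = 0)\}}{\sup\{p(y \mid b_i = 1)\}}, \qquad \mathrm{LLR}(y) = \log \Lambda(y)$$

[Figure: a received I/Q point, marked x, plotted among the constellation points on the I/Q plane]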

Wireless Channel

• A wireless channel is an abstract concept

• It represents the composite effects that change the properties of a received symbol compared to the transmitted symbol

– Attenuation on the amplitude

– Rotation in phase

– Shift in frequency

– Multi-path superposition

Wireless Channel Model

• Simple flat-fading model

– No multipath

– Model the channel as a single complex value $h(i) = a(i)\, e^{-2\pi j \theta(i)}$, so that $y(i) = h(i)\, x(i)$

– Only the amplitude and phase of a symbol are distorted

• Multi-path fading model

– Model the channel as a series of complex values H(i) = [h1(i), h2(i), h3(i), …, hk(i)]

– Each component represents a path

– Y(i) = H(i) * X(i), where "*" is the convolution operation

$$z(i) = y(i) * x(i) = \sum_{m=-\infty}^{\infty} y(i-m)\, x(m)$$
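Both models can be sketched in a few lines. The tap values, amplitude and phase below are made-up numbers for illustration, and the convolution is truncated to finite-length sequences.

```python
import cmath
import math

def flat_fading(x, a, theta):
    """y(i) = h * x(i) with a single complex gain h = a * exp(-2*pi*j*theta)."""
    h = a * cmath.exp(-2j * math.pi * theta)
    return [h * s for s in x]

def convolve(h, x):
    """Finite-length convolution: y(i) = sum_m h(m) * x(i - m)."""
    y = [0j] * (len(h) + len(x) - 1)
    for i, tap in enumerate(h):
        for m, sample in enumerate(x):
            y[i + m] += tap * sample
    return y

x = [1, -1, 1, 1]
y_flat = flat_fading(x, a=0.5, theta=0.25)  # every symbol scaled and rotated
y_multi = convolve([1.0, 0.5], x)           # direct path plus one delayed echo
print(y_flat)
print(y_multi)
```

In the multi-path output each received sample mixes the current symbol with the previous one, which is the inter-symbol interference an equalizer has to undo.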

Receiver Synchronization and Equalization

• Timing synchronization

– Find out the start point of a symbol

• Frequency synchronization

– Compensate the frequency offset between the sender and the receiver

• Equalization

– Compensate the distortion of wireless channel

– Find H* (t), so that H* (t) H(t) = I

• The synchronization and equalization methods vary across standards

Case Study I: Spectrum Spreading

• Why spreading?

– Gain more reliability at the cost of more spectrum use

– Equivalent to channel coding

• Direct-Sequence Spread Spectrum (DSSS)

[Figure: DSSS waveform. The low-frequency user data (symbols) is multiplied by a high-frequency spreading code (chips)]

IEEE 802.11b

• IEEE Standard for WLAN

• DSSS PHY on a 22 MHz channel in the 2.4 GHz band

– Symbol rate: 1 Msymbol/s

– Spreading sequence: Barker sequence; CCK

• Data rate up to 11Mbps

• Modulation: DBPSK, DQPSK, CCK

• Channel coding: no (why?)

802.11b Transmission Structure

• Scrambler

– Randomize the input bit stream to suit the RF circuits

• Mapper

– Map bits to I/Q symbols: DBPSK/DQPSK

– x[k] = s[k] x[k-1]

• Spreading

– Multiply symbols by the spreading sequence

• Power shaper

– Oversample

– Root-raised-cosine filter

[Figure: frame → Scrambler → DBPSK/DQPSK Mapper → Barker (11) Spreading → Power shaper]
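A sketch of the mapper and spreading stages for the DBPSK case. It assumes bit 0 means no phase change and bit 1 means a 180 degree change, and it leaves out the scrambler and the power shaper. The function names are hypothetical.

```python
# 11-chip Barker sequence used for spreading
BARKER_11 = [1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1]

def dbpsk_map(bits, reference=1):
    """x[k] = s[k] * x[k-1], with s[k] = +1 for bit 0 and -1 for bit 1."""
    symbols, prev = [], reference
    for b in bits:
        prev = (-1 if b else 1) * prev
        symbols.append(prev)
    return symbols

def spread(symbols, code=BARKER_11):
    """Multiply every symbol by the whole spreading sequence."""
    return [s * c for s in symbols for c in code]

bits = [0, 1, 1, 0]
symbols = dbpsk_map(bits)
chips = spread(symbols)
print(symbols)
print(len(chips))
```

Each symbol becomes 11 chips, so a 1 Msymbol/s stream occupies an 11 Mchip/s signal.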

802.11b Receiver Structure

[Figure: Time Sync → Decimation → Despreading → Differential demapper → bits]

Time Synchronization

• Sample time recovery

– How to recover the ideal sampling time?

– Simple solution: over-sampling!

– Search for the sample index with the maximal energy level

– What if the sampling frequencies of the sender and the receiver differ slightly?

[Figure: sending clock versus receiver clock. A sampling-time offset causes SNR loss, and over-sampling reduces that loss]

Time Synchronization (cont.)

• Symbol time recovery

– Correlate with Barker

– Looking for a peak

• Locate the start of data symbols

– Decode to find all ones (training symbols)

– Search for the SFD after locking onto the training symbols

[Figure: preamble layout: training symbols (all ones), then the SFD, then the data symbols]
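Symbol time recovery by Barker correlation can be sketched as a sliding dot product. The chip stream below is a noise-free toy input with an arbitrary 4-chip delay, and `find_symbol_start` is a hypothetical helper name.

```python
BARKER_11 = [1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1]

def correlate(chips, offset, code=BARKER_11):
    """Dot product of the code with the chips starting at `offset`."""
    return sum(c * chips[offset + i] for i, c in enumerate(code))

def find_symbol_start(chips, code=BARKER_11):
    """Return the first offset with the largest correlation magnitude."""
    offsets = range(len(chips) - len(code) + 1)
    return max(offsets, key=lambda k: abs(correlate(chips, k, code)))

# Three identical training symbols arriving after 4 idle chips
stream = [0] * 4 + BARKER_11 * 3
start = find_symbol_start(stream)
print(start, correlate(stream, start))
```

The magnitude is used so that a symbol of either sign produces a peak. The peak repeats every 11 chips, which gives the symbol boundary.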

Differential Demapper

• Calculate the angle rotated and demap it to bits

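• Why use differential modulation?

– Why is the carrier frequency offset not considered?

– Why is the initial symbol phase not considered?

– What is the cost?

[Figure: consecutive received symbols Y[k-1] and Y[k] on the I/Q plane; the angle between them carries the bits]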
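A sketch of a DBPSK demapper that looks only at the rotation between consecutive symbols, Y[k]·conj(Y[k-1]). It assumes bit 1 means a 180 degree phase change, and the channel is modelled as one unknown constant phase rotation with no noise.

```python
import cmath
import math

def dbpsk_demap(received, reference):
    """Recover bits from the phase change between consecutive symbols."""
    bits, prev = [], reference
    for y in received:
        rotation = cmath.phase(y * prev.conjugate())
        bits.append(1 if abs(rotation) > math.pi / 2 else 0)
        prev = y
    return bits

# DBPSK symbols for bits 0, 1, 1, 0 with reference symbol +1
tx_reference, tx = 1, [1, -1, 1, 1]

# The channel rotates everything by an unknown constant phase (1 radian here)
unknown = cmath.exp(1j * 1.0)
rx_reference = tx_reference * unknown
rx = [s * unknown for s in tx]

bits = dbpsk_demap(rx, rx_reference)
print(bits)
```

The unknown phase cancels in the product of a symbol with the conjugate of its predecessor, so the receiver never has to estimate it.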

Case study II : OFDM

• Orthogonal Frequency Division Multiplexing

• A multi-carrier modulation scheme for high-speed wireless systems

• Sub-carrier symbols are modulated on a series of orthogonal sub-carriers

• Pros:

– Simple equalization

– Robust against inter-symbol interference (ISI) and multi-path fading

– High spectral efficiency

– Low sensitivity to time synchronization errors

• Cons:

– Sensitive to Doppler shift

– Sensitive to frequency synchronization errors

– High peak-to-average-power ratio (PAPR), requiring high dynamic range ADC/DAC

OFDM Example

• OFDM is easily implemented using FFT

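• Sub-carriers are located at Bk/N, k=1,2,…,N-1.

[Figure: the sub-carrier symbols I1/Q1, I2/Q2, …, In/Qn enter the IFFT at the transmitter; the FFT at the receiver recovers I1/Q1, I2/Q2, …, In/Qn]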

Cyclic Prefix

• N-point IFFT translates N frequency-domain sub-carrier symbols to N time-domain OFDM samples c_0, c_1, c_2, …, c_{N-1}

• Cyclic prefix (CP): copy the G trailing samples to the front to form a (G+N)-sample OFDM symbol

• Why CP?

– Handling ISI: the CP works as a guard time

– Handling time synchronization error

• Rotation of time-domain samples results only in a phase shift of the frequency-domain symbols

• The latter can be corrected by equalization

[Figure: OFDM symbol layout: G-sample CP (c_{N-G}, c_{N-G+1}, …, c_{N-1}) followed by the N FFT samples (c_0, c_1, …, c_{N-1})]
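The timing-error property can be checked on a toy symbol. This sketch uses N = 8 and G = 2 with arbitrary sub-carrier symbols, and a direct DFT instead of an FFT; all names are hypothetical. If the receiver starts its FFT window d samples early but still inside the CP, sub-carrier k comes out multiplied by exp(-2πjkd/N).

```python
import cmath
import math

def dft(x, sign=-1):
    """Direct DFT (sign=-1) or unnormalised inverse DFT (sign=+1)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(sign * 2j * math.pi * k * i / n)
                for i in range(n))
            for k in range(n)]

def idft(X):
    return [v / len(X) for v in dft(X, sign=+1)]

def add_cp(samples, g):
    """Copy the last g samples to the front."""
    return samples[-g:] + samples

N, G = 8, 2
X = [1, -1, 1j, -1j, 1, 1, -1, 1j]      # arbitrary sub-carrier symbols
ofdm_symbol = add_cp(idft(X), G)         # G + N time-domain samples

early = 1                                # window starts one sample early
window = ofdm_symbol[G - early:G - early + N]
Y = dft(window)

for k in range(N):
    expected = X[k] * cmath.exp(-2j * math.pi * k * early / N)
    print(k, abs(Y[k] - expected) < 1e-9)
```

The magnitudes are untouched and the phase ramp is linear in k, which is why a per-subcarrier equalizer can absorb it.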

Case Study: IEEE 802.11a

• IEEE Standard for WLAN

• OFDM PHY on a 20 MHz channel in the 5 GHz band

– The same PHY is adopted in the 2.4 GHz spectrum as 802.11g

– 64 sub-carriers (FFT size is 64)

• Data rate up to 54Mbps

• Modulation: BPSK, QPSK, 16QAM, 64QAM

• Channel coding: convolutional code with 1/2, 2/3 and 3/4 coding rate

Transmitter structure of 802.11a

• Scrambler

– Randomize the input bit stream to suit the RF circuits

• Interleaver

– Spread coded bits among different subcarriers and bit positions in the subcarrier symbol

• Pilot

– Known symbol sequence on known subcarriers, providing a training sequence to track the channel changes

[Figure: frame → Scrambler → Convolutional encoder → coded bits → Interleaver → Mapper → Pilot → sub-carrier symbols → IFFT → CP → OFDM symbol]

Scrambler

• Simple XOR between the bit stream and a 127-bit template

• The initial state of the scrambler is a pseudo-random non-zero state from a known pattern

– Why?

• This initial state is encoded in a field in the PHY header and transmitted to the receiver
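A sketch of the scrambler as a 7-bit shift register with generator polynomial x^7 + x^4 + 1, which is the polynomial 802.11a specifies. The bit ordering of the `seed` argument and the function name are assumptions made for this example.

```python
def scramble(bits, seed):
    """XOR bits with the sequence of an x^7 + x^4 + 1 shift register.

    seed is the 7-bit non-zero initial state (bit i of seed -> register x(i+1)).
    Scrambling twice with the same seed restores the input.
    """
    if not 0 < seed < 128:
        raise ValueError("seed must be a non-zero 7-bit value")
    state = [(seed >> i) & 1 for i in range(7)]  # state[0] = x1 ... state[6] = x7
    out = []
    for b in bits:
        feedback = state[3] ^ state[6]           # x4 XOR x7
        out.append(b ^ feedback)
        state = [feedback] + state[:-1]          # shift; feedback enters x1
    return out

# With an all-zero input the output is the raw scrambling sequence
sequence = scramble([0] * 254, seed=0b1011101)
print(sequence[:127] == sequence[127:])

data = [1, 0, 1, 1, 0, 0, 1, 0] * 4
print(scramble(scramble(data, 93), 93) == data)
```

The sequence repeats every 127 bits, which matches the 127-bit template. An all-zero state would lock the register at zero and leave the data unscrambled, which is one reason the state must be non-zero.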

Convolutional Encoder

• A type of forward error correction (FEC) code

– Encode m bits into n bits (n>=m); m/n code

• Encoder used in 802.11a

– Basic rate ½

– Achieving higher rates by puncturing

[Figure: original bits → rate-½ coded bits → punctured bits]
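A sketch of the rate-½ encoder with the 802.11a generator polynomials (133 and 171 octal, constraint length 7), followed by puncturing to rate ¾. The keep-mask below drops the 4th and 5th of every 6 coded bits; it reflects my reading of the standard's puncturing pattern and should be treated as an assumption.

```python
G0, G1 = 0o133, 0o171   # generator polynomials, constraint length K = 7

def parity(value):
    return bin(value).count("1") & 1

def conv_encode(bits):
    """Rate-1/2 encoder: every input bit yields the output pair (A, B)."""
    state, out = 0, []               # state holds the 6 previous input bits
    for b in bits:
        reg = (b << 6) | state       # newest bit sits in the top position
        out += [parity(reg & G0), parity(reg & G1)]
        state = reg >> 1
    return out

def puncture(coded, keep):
    """Drop the coded bits whose position in the repeating mask is 0."""
    return [b for i, b in enumerate(coded) if keep[i % len(keep)]]

KEEP_3_4 = [1, 1, 1, 0, 0, 1]        # assumed rate-3/4 pattern

coded = conv_encode([1, 0, 1, 1, 0, 0])
punctured = puncture(coded, KEEP_3_4)
print(len(coded), len(punctured))
```

Six input bits give 12 coded bits and 8 transmitted bits, i.e. rate 6/8 = ¾. The receiver reinserts neutral values at the dropped positions before decoding.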

Interleaver

• Why interleaving?

– Channel code works best with random independent errors

– Wireless errors are correlated in practice

• Error correlation in subcarriers

– If one subcarrier faces deep fading, it is very likely that nearby subcarriers suffer similar fading (coherence bandwidth)

• Spread adjacent coded bits to

– Non-adjacent subcarriers

– Alternate between less and more significant bits of the constellation
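A sketch of the two-step permutation, written from my recollection of the 802.11a interleaver equations, so the formulas should be checked against the standard before use. `n_cbps` is the number of coded bits per OFDM symbol and `n_bpsc` the number of bits per sub-carrier; both names are assumptions.

```python
def interleave_index(k, n_cbps, n_bpsc):
    """Output position of input bit k."""
    s = max(n_bpsc // 2, 1)
    # Step 1: adjacent coded bits go to non-adjacent sub-carriers
    i = (n_cbps // 16) * (k % 16) + k // 16
    # Step 2: alternate between less and more significant constellation bits
    return s * (i // s) + (i + n_cbps - (16 * i) // n_cbps) % s

def interleave(bits, n_bpsc):
    n_cbps = len(bits)
    out = [0] * n_cbps
    for k, b in enumerate(bits):
        out[interleave_index(k, n_cbps, n_bpsc)] = b
    return out

# 16-QAM: 4 bits per sub-carrier, 48 data sub-carriers -> 192 coded bits
N_BPSC, N_CBPS = 4, 192
positions = [interleave_index(k, N_CBPS, N_BPSC) for k in range(N_CBPS)]
for k in range(4):
    print(k, positions[k], positions[k] // N_BPSC)  # bit, position, sub-carrier
```

The deinterleaver applies the inverse permutation, so a deep fade on a few neighbouring sub-carriers turns into scattered errors at the decoder input.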

Pilot and Mapping

• Subcarrier usage

– 48 data subcarriers

– 4 pilot subcarriers

– 12 null subcarriers

• Pilot subcarriers

– Initial {1,1,1,-1}

– Polarity changes according to a pre-defined sequence

[Figure: IFFT input mapping. DC at subcarrier 0; data on subcarriers -26…-1 and 1…26; pilots at ±7 and ±21; guard band on 27…31 and -32…-27; the IFFT outputs 64 time-domain samples]
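The sub-carrier plan can be written down and counted directly. This sketch places negative sub-carrier indexes in the upper half of the IFFT input (index k mod 64), which is the usual convention and an assumption here; the polarity sequence is reduced to a single `polarity` factor.

```python
PILOT_INDEXES = [-21, -7, 7, 21]
PILOT_VALUES = [1, 1, 1, -1]          # initial pilot values
DATA_INDEXES = [k for k in range(-26, 27)
                if k != 0 and k not in PILOT_INDEXES]
NULL_INDEXES = [k for k in range(-32, 32) if k == 0 or abs(k) > 26]

def build_ifft_input(data_symbols, polarity=1):
    """Place 48 data symbols and 4 pilots into the 64 IFFT input bins."""
    if len(data_symbols) != len(DATA_INDEXES):
        raise ValueError("expected 48 data symbols")
    bins = [0j] * 64
    for k, symbol in zip(DATA_INDEXES, data_symbols):
        bins[k % 64] = symbol
    for k, pilot in zip(PILOT_INDEXES, PILOT_VALUES):
        bins[k % 64] = polarity * pilot
    return bins

bins = build_ifft_input([1 + 0j] * 48)
print(len(DATA_INDEXES), len(PILOT_INDEXES), len(NULL_INDEXES))
```

The counts add up to 48 + 4 + 12 = 64, the FFT size.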

PHY Frame Structure

• Preamble

– 10 short training symbols (STS)

– Long training symbol (LTS)

• Two repeated 64 FFT samples (3.2 us * 2) + CP (1.6 us)

• Signal and data symbols

– 64 FFT samples (3.2 us) + CP (0.8 us)

[Figure: frame layout: the preamble consists of the short training symbols (0.8 us x 10) and the long training symbol (8 us), followed by the Signal symbol (PLCP header, 4 us) and the data symbols]

Receiver Structure of 802.11a

• Two states in receiver

– Synchronization state

• Tries to find a preamble (frame detection)

• Synchronizes to the preamble (time sync, freq. sync, equalization, etc.)

• Operates on the preamble

– Decoding state

• Starts to demodulate and decode the transmitted bits

• Tracks the channel changes via pilots

• Operates on data symbols

Receiver Synchronization (I)

• Frame detection

– Idea: utilize the repeating pattern of the STS

– Algorithm: auto-correlation. Detect if a pattern has repeated periodically

• Time synchronization

– Idea: utilize the known waveform of the STS

– Algorithm: cross-correlation. Detect if a pattern has appeared

[Figure: average energy and auto-correlation curves, marking where a periodic pattern is determined, followed by the cross-correlation peaks]
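A sketch of frame detection by auto-correlation. A random 16-sample pattern stands in for one short training symbol, and the noise level, window length and the normalisation by window energy are all assumptions for this example.

```python
import random

def autocorr_metric(samples, start, period, window):
    """|sum y[n] * conj(y[n+period])| / sum |y[n+period]|^2 over the window."""
    span = range(start, start + window)
    corr = sum(samples[n] * samples[n + period].conjugate() for n in span)
    energy = sum(abs(samples[n + period]) ** 2 for n in span)
    return abs(corr) / energy if energy > 0 else 0.0

rng = random.Random(1)

def noise(count, scale):
    return [complex(rng.gauss(0, scale), rng.gauss(0, scale))
            for _ in range(count)]

PERIOD = 16
pattern = noise(PERIOD, 1.0)       # stand-in for one short training symbol
preamble = pattern * 10            # ten repetitions
samples = noise(100, 0.05) + [p + w for p, w in zip(preamble, noise(160, 0.05))]

before = autocorr_metric(samples, 0, PERIOD, 32)     # noise only
inside = autocorr_metric(samples, 100, PERIOD, 32)   # inside the preamble
print(round(before, 3), round(inside, 3))
```

The metric stays small on noise and approaches 1 once the window sits on the repeating pattern, so a threshold on it flags a frame. Cross-correlation against the known waveform then refines the timing.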

Receiver Synchronization (II)

• Frequency offset

– Transmitter’s frequency is not exactly equal to the receiver’s frequency

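– Ranges from 500 Hz to 100 kHz without calibration

– Destroys the orthogonality and causes inter-subcarrier interference

[Figure: Tx baseband (0 Hz) → up-convert → Tx RF (f_tx Hz) → Rx RF (f_rx Hz) → down-convert → Rx baseband (Δf Hz)]

$$y_i = x_i\, e^{j 2\pi i \Delta f}$$

$$Y_n = \sum_{i=0}^{N-1} y_i\, e^{-j 2\pi \frac{in}{N}} = \sum_{i=0}^{N-1} x_i\, e^{-j 2\pi \frac{in}{N}}\, e^{j 2\pi i \Delta f}$$

The extra factor $e^{j 2\pi i \Delta f}$ is the source of the inter-subcarrier interference.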

Frequency Offset Estimation and Compensation

• Utilize the LTS for FOE

– LTS has two repeating blocks, T1 and T2, so that $y_{i+N} = y_i\, e^{j 2\pi N \Delta f}$

– For each sample, we can compute a Δf; the estimate takes the average

– A coarse estimation based on the STS is needed if the frequency offset is very large

• i.e. N Δf > fs

• Freq. offset compensation

– The ith sample is rotated by -2πiΔf (multiplied by $e^{-j 2\pi i \Delta f}$) before being fed to the FFT
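A sketch of the estimator and the compensation step, with Δf expressed in cycles per sample (an assumption; the slides do not state the unit). Instead of averaging per-sample estimates, it sums the products y[i+N]·conj(y[i]) and takes the phase of the sum, a common variant. The 64-sample block is an arbitrary stand-in for the LTS.

```python
import cmath
import math

def estimate_cfo(samples, n):
    """Estimate df from two repeated n-sample blocks: phase = 2*pi*n*df."""
    acc = sum(samples[i + n] * samples[i].conjugate() for i in range(n))
    return cmath.phase(acc) / (2 * math.pi * n)

def compensate(samples, df):
    """Rotate sample i by -2*pi*i*df."""
    return [y * cmath.exp(-2j * math.pi * i * df)
            for i, y in enumerate(samples)]

N = 64
true_df = 0.002                                   # cycles per sample
block = [cmath.exp(2j * math.pi * (i * i % 7) / 7) for i in range(N)]
tx = block + block                                # two repeated blocks
rx = [x * cmath.exp(2j * math.pi * i * true_df) for i, x in enumerate(tx)]

df_hat = estimate_cfo(rx, N)
fixed = compensate(rx, df_hat)
print(df_hat)
```

The estimate is unambiguous only while the phase 2πNΔf stays within ±π, which is why a large offset needs a coarse estimate first from the shorter STS period.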

Equalization

• Separate equalizer for each subcarrier

• Simple flat fading channel for each subcarrier

– Narrow band assumption

• Utilize LTS

– Compute h_i for each subcarrier: h_i = s_i / y_i, where s_i is the known transmitted symbol on subcarrier i, and y_i is the received symbol

• Equivalent to an FIR filter in the time domain

– But the filter parameters are easier to determine
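A sketch of the per-subcarrier equalizer on four subcarriers. The channel gains and symbols are made-up values, there is no noise, and the function names are hypothetical.

```python
def train_equalizer(known, received):
    """h_i = s_i / y_i for every subcarrier, from the training symbol."""
    return [s / y for s, y in zip(known, received)]

def equalize(coeffs, received):
    """Multiply every received subcarrier symbol by its coefficient."""
    return [h * y for h, y in zip(coeffs, received)]

# Unknown per-subcarrier channel gains (flat fading on each subcarrier)
channel = [0.8 + 0.2j, 0.5 - 0.5j, 1.1j, -0.9 + 0.1j]

# Training: known symbols pass through the channel
lts = [1, -1, 1, 1]
rx_lts = [c * s for c, s in zip(channel, lts)]
coeffs = train_equalizer(lts, rx_lts)

# Data: the same coefficients undo the channel on later symbols
data = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]
rx_data = [c * d for c, d in zip(channel, data)]
recovered = equalize(coeffs, rx_data)
print(recovered)
```

One complex multiplication per subcarrier replaces a multi-tap time-domain filter, which is the "simple equalization" advantage of OFDM.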

Demodulation and Decoding Data Symbols

[Figure: receive chain for data symbols: OFDM symbol → FO correction → Remove CP → FFT → Equalization → Pilot tracking → Demapper → soft bits → Deinterleaver → Viterbi decode → Descrambler → Frame]

Viterbi Decode (I)

• Dynamic programming to compute the most likely sequence of hidden states

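• Trellis representation of convolutional codes

– A convolutional code with constraint length K needs 2^(K-1) states

– 64 states in 802.11a

– A path represents an encoded bit sequence

[Figure: trellis fragment with 6-bit states (000000, 000001, …, 111110, 111111); branches are labeled with the input bit (i) and the output bits (O)]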

Viterbi Decode (II)

• Decoding – given the output sequence, find the most likely encoding sequence

– Find the path whose output is most similar to the given sequence

• Path metric

– Hamming distance

– Soft-info

• Algorithm

– Path metric compute and select

– Trace-back along the minimal path

[Figure: trellis with 6-bit states; branches are labeled input:output (e.g. 0:10, 1:01, 1:10) and the minimal path is traced back]
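A sketch of hard-decision Viterbi decoding with the Hamming distance as path metric. To keep it short it uses a toy K = 3 code with generators 7 and 5 (octal), which gives 4 states instead of the 64 in 802.11a; the code choice and all names are assumptions for illustration.

```python
K = 3
GENERATORS = (0b111, 0b101)      # toy code; 802.11a uses K = 7
N_STATES = 1 << (K - 1)

def parity(value):
    return bin(value).count("1") & 1

def branch(state, bit):
    """Next state and output pair for input `bit` leaving `state`."""
    reg = (bit << (K - 1)) | state
    return reg >> 1, tuple(parity(reg & g) for g in GENERATORS)

def encode(bits):
    state, out = 0, []
    for b in bits:
        state, pair = branch(state, b)
        out.extend(pair)
    return out

def viterbi_decode(coded):
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)     # encoder starts in state 0
    history = []
    for t in range(0, len(coded), 2):
        rx = coded[t:t + 2]
        new_metrics = [inf] * N_STATES
        choices = [None] * N_STATES
        for state in range(N_STATES):
            if metrics[state] == inf:
                continue
            for bit in (0, 1):
                nxt, out = branch(state, bit)
                # Path metric: accumulated Hamming distance
                m = metrics[state] + sum(a != b for a, b in zip(out, rx))
                if m < new_metrics[nxt]:       # keep the better survivor
                    new_metrics[nxt] = m
                    choices[nxt] = (state, bit)
        metrics = new_metrics
        history.append(choices)
    # Trace back from the state with the minimal path metric
    state = min(range(N_STATES), key=lambda s: metrics[s])
    decoded = []
    for choices in reversed(history):
        state, bit = choices[state]
        decoded.append(bit)
    return decoded[::-1]

message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
received = encode(message)
received[3] ^= 1                               # one bit error on the channel
print(viterbi_decode(received) == message)
```

With soft information the Hamming distance is replaced by a metric built from the LLRs, while the compare-select and trace-back steps stay the same.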
