
[IEEE 2010 2nd International Workshop on Cognitive Information Processing (CIP) - Elba Island, Italy (2010.06.14-2010.06.16)] 2010 2nd International Workshop on Cognitive Information


Reinforcement Learning-Based Multiband Sensing

Policy for Cognitive Radios

Jan Oksanen#1, Jarmo Lundén#2, and Visa Koivunen#3

#Department of Signal Processing and Acoustics, SMARAD CoE

Aalto University School of Science and Technology, Finland

Abstract—Cognitive radios (CR) and dynamic spectrum access (DSA) have been proposed as a way to exploit the underutilized radio spectrum by allowing secondary users to access the licensed frequencies in an opportunistic manner. The constraint set on the secondary use is that it should not interfere with the primary users, i.e., the license holders. Hence, the secondary users need to sense the spectrum in order to classify a licensed frequency band as vacant or occupied. However, spectrum sensing can be a demanding task for a single user due to the random nature of the wireless channel, and cooperative detection algorithms have been proposed to mitigate the effects of channel fading. In this paper a multiband spectrum sensing policy for coordinating the cooperative sensing is proposed. It is based on dynamically allocating frequency hopping codes to the secondary users. The proposed policy employs the ε-greedy reinforcement learning action selection to prioritize the sensing of different subbands and to select the best secondary users to sense them. The results show that the proposed policy is able to significantly increase the obtained throughput in the secondary network and to reduce the number of missed detections of the primary signal.

Index Terms—Cognitive radio, Spectrum sensing, Reinforcement learning, Spatial diversity, Frequency hopping.

I. INTRODUCTION

The rapidly growing markets for wireless devices have

made the useful radio spectrum a scarce resource. Spectrum

measurement campaigns, however, have shown that although

most of the radio spectrum has been licensed for different

operators, large parts of the spectrum are underutilized [1].

This finding is the key motivation behind cognitive radio [2]

and DSA (dynamic spectrum access), where the goal is to

find unused radio spectrum in time and in space, and to allow

secondary users (SUs) to access such licensed frequencies in a

way that does not interfere with the primary user. Consequently,

spectrum sensing is needed by the cognitive radio system to

identify such spectrum opportunities and to characterize the

possible interference levels to the primary system.

Wireless signals go through random channels causing the

instantaneous signal-to-noise ratio (SNR) to fluctuate in time.

Occasionally the SNR may be reduced by tens of dB. Cooperative

detection schemes have been proposed in the literature to

combat fading by combining the sensing results from

multiple SUs at various locations (see, e.g., [3]).

A cognitive secondary network consists of NS wireless ter-

minals sensing the licensed spectrum of interest. The spectrum

of interest may be noncontiguous as the licensed frequency

bands may be scattered in the radio spectrum depending on the

location and the allocation of the bands to different operators.

This calls for organizing and dividing the sensing task among

the SUs, since a single user may not be able to sense the whole

band at once. Spectrum sensing and allocation policies have

been proposed in the literature [4], [5], [6], [7] for selecting

the subband to be sensed at the next sensing instant. However,

these works do not consider collaborative spectrum sensing

and distribution of work among SUs with different sensing

capabilities. In [8] a diversity based multiband spectrum sens-

ing policy has been proposed, where the design of the sens-

ing policy has been converted into designing and allocating

pseudorandom frequency hopping codes to the SUs, telling

them which subbands to sense and when. The subbands are

sensed by constellations of D SUs consequently exploiting

spatial diversity. In this paper we further develop this sensing

policy and show how reinforcement learning methods can be

embedded in it to actively select the most promising subbands

to be sensed and the SUs to sense them. This method maintains

the desired diversity, i.e., the number D of SUs sensing the

same subband simultaneously. The proposed sensing policy

provides significant gains to the secondary network throughput

and reduces the interference to the primary system caused by

missed detections.

This paper is organized as follows. The system model of

cooperative sensing over multiple subbands is described in

section II. In section III we present a pseudorandom frequency

hopping sensing policy for organizing the multiband sensing

task in a secondary network. In section IV we propose a

reinforcement learning based sensing policy. In sections V and

VI we show the simulation results for the proposed sensing

policy and discuss them. The paper is concluded in

section VII.

II. WIDE–BAND COOPERATIVE SPECTRUM SENSING

The bandwidth of the spectrum of interest for the cognitive

radio system may be wide and in some cases even noncontigu-

ous. Sensing the entire spectrum of interest simultaneously is

demanding for the hardware and may be energy inefficient.

Energy efficiency and low cost are especially important in

mobile applications where the SUs are typically battery-

2010 2nd International Workshop on Cognitive Information Processing

978-1-4244-6459-3/10/$26.00 ©2010 IEEE 316


powered devices. To facilitate easier receiver front-end design

and energy efficient operation it may be more practical to

divide a large, possibly noncontiguous frequency band into

subbands that are sensed separately. Hence, we assume that

the spectrum of interest consists of NP subbands with possibly

different bandwidths Bi, i = 1, . . . , NP .

Assuming that the spectrum of interest consists of a set of

subbands, an SU in a simple case has to do the sensing one

subband at a time. However, if there are multiple SUs in the

network that cooperate, the secondary network is able to sense

multiple subbands simultaneously. Alternatively, multiple SUs

might sense the same subbands at the same time and combine

their results, thus providing spatial diversity. By combining

their sensing results, the probability of detection at a given

SNR is increased, or for equal performance, simpler detector

structures could be employed.

We assume a slotted design for the secondary network. That

is, the operation of the secondary network over time is divided

into sensing time slots and potential transmission time slots as

illustrated in Fig. 1. During each sensing time slot each SU

senses one of the subbands and then sends its local test statistic

or decision to the fusion center (FC) via a control channel. The

local test statistics or decisions are then combined at the FC

using some fusion rule.

Figure 1. Slotting of a secondary user’s operation. After sensing a particular subband and sending its local test statistics to the FC, the FC grants a permission to one of the SUs to transmit.

We propose an approach where the sensing policy is con-

trolled by the FC. That is, for each time slot the FC decides

which subbands are sensed and by which SUs. We propose

a two-stage sensing policy where in the first stage the FC

decides which subbands will be sensed and in the second stage

it decides which SUs they are sensed by.

III. PSEUDORANDOM FREQUENCY HOPPING BASED

SENSING POLICY

In [8] a diversity based multiband spectrum sensing policy

has been proposed, where the design of the sensing policy has

been converted into designing and allocating pseudorandom

frequency hopping codes to the SUs, telling them which

subbands to sense and when. The assumption is that the

frequency hopping codes can be stored into the SU memory

off-line. Frequency hopping code allocation for the SUs is

made by the FC such that after each hopping code different

D-tuples of the NS SUs will be employed to scan the

spectrum of interest together. To speed up the scanning of the

whole spectrum of interest, different D-tuples sense different

subbands simultaneously. Figure 2 shows an example design

of the hopping codes for NS = 4, NP = 3 and D = 2. The

benefit of this kind of sensing policy is that it reduces the

control signaling from the FC to the SUs about the bands to

be sensed to only signaling of the hopping sequence indices

to the SUs.

Figure 2. Pseudorandom frequency hopping codes for NS = 4, NP = 3 and D = 2. At each sensing instance the SU senses the subband pointed to by the current entry of its hopping code. Hopping code entries point to the physical frequencies that are maintained in table F.

In frequency hopping based sensing each SU hops according

to its hopping sequence to sense one of the subbands of

interest. The subband to be sensed at time index i is given

by f(i) = F[Sq(i)], where Sq(i) is the qth frequency hopping

sequence and F is a table containing the mappings to the physical

subbands. The assignment of the sequences Sq to different

SUs is discussed later in sections III-A1 and III-A2. Table

F may include links to the subbands’ center frequencies and

bandwidths. As F is a primary-system-dependent mapping to the

physical frequencies, it may be assumed to be the same for all

SUs exploiting the same primary network. However, frequency

reuse in cellular primary systems would also make F

location dependent, but for simplicity we assume here that

licensed frequencies are not reused. Hence, the examination

of hits among SUs sensing may be limited to the examination

of hits among their frequency hopping sequences Sq(i).

A. Frequency Hopping Sequences

Next, we show a design example of the subband sensing

patterns providing the desired diversity D, i.e., the number

of SUs simultaneously sensing any subband. The patterns

are orthogonal hopping sequences that may be generated by

applying cyclic shifts to any full sequence of integers. By

orthogonality we mean that the codes can be phased so that

all codes in the code family will be pairwise non-overlapping.

The desired diversity and consequently the number of hits

among the SUs may be obtained by assigning the users in

each D-tuple to have the same hopping code for one full

code period. Instead of using cyclic shift sequences other

code designs known in the frequency hopping spread spectrum

(FHSS) literature could be applied.

The simplest way to generate an orthogonal code family is to

cyclically shift any full sequence of integer numbers. Cyclic

shifts may be generated by the modulo operation as

$$S_q(i) \equiv (i + \Delta_q) \bmod N_P, \tag{1}$$

where $i \in [0, N_P - 1]$, $q \in [0, \lfloor N_S / D \rfloor - 1]$ and $\Delta_q$ is the shift

parameter.

When ∆q ≠ ∆w, ∀q ≠ w, the maximum periodic Hamming

[9] correlation between any two codes in the family is NP

and the minimum 0. The shifts between the cooperating D-

tuple can be adjusted so that minimum periodic Hamming

correlation will be obtained.

To obtain a high scanning speed by the secondary system

over the whole spectrum of interest (all subbands) we choose


Table I. CYCLIC SHIFT HOPPING CODE FAMILY FOR NS = 12, NP = 10 AND D = 2

S0(i) 0 1 2 3 4 5 6 7 8 9

S1(i) 2 3 4 5 6 7 8 9 0 1

S2(i) 4 5 6 7 8 9 0 1 2 3

S3(i) 6 7 8 9 0 1 2 3 4 5

S4(i) 8 9 0 1 2 3 4 5 6 7

S5(i) 1 2 3 4 5 6 7 8 9 0

the shift parameter ∆q to be

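$$\Delta_q = \begin{cases} qA \bmod (N_P - 1), & \text{if } A \mid N_P \text{ and } A \neq 1, \\ qA \bmod N_P, & \text{otherwise,} \end{cases} \tag{2}$$

where $A = \lceil N_P D / N_S \rceil$ and where $A \mid N_P$ means that $N_P$ is

divisible by $A$. Table I shows an example hopping code family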

of (2) for NS = 12, NP = 10 and D = 2.
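
As an illustration, the code family of (1) and (2) can be generated with a few lines of Python. This is a minimal sketch, not the authors' implementation; the function names `shift` and `code_family` are hypothetical, and the modulus in the first branch of (2) is read as NP − 1, which is the reading that reproduces Table I.

```python
def shift(q, n_s, n_p, d):
    """Shift parameter of eq. (2) for the q-th hopping code."""
    a = -(-n_p * d // n_s)          # integer ceiling of n_p * d / n_s
    if n_p % a == 0 and a != 1:     # A divides N_P and A != 1
        return (q * a) % (n_p - 1)
    return (q * a) % n_p


def code_family(n_s, n_p, d):
    """Hopping codes S_q(i) of eq. (1) for q = 0, ..., floor(n_s / d) - 1."""
    return [[(i + shift(q, n_s, n_p, d)) % n_p for i in range(n_p)]
            for q in range(n_s // d)]


# Parameters of Table I: N_S = 12, N_P = 10, D = 2.
family = code_family(12, 10, 2)
for q, code in enumerate(family):
    print(f"S{q}(i):", *code)
```

With these parameters A = 2, so the shifts are 0, 2, 4, 6, 8 and 1, and no two codes point to the same subband at the same time index.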

1) Example of Frequency Hopping Sequence Assignment

for D=2: Next, an example of the frequency hopping code

allocation for the pseudorandom scheme for diversity D = 2

is given, which corresponds to an all-play-all round-robin

tournament design [10].

Table II illustrates a round-robin hopping sequence table for

NS = 6.

Table II. ROUND-ROBIN FREQUENCY HOPPING CODE INDEX TABLE FOR 6 SUS. THE CODE ASSIGNMENT CORRESPONDS TO THE HOPPING SEQUENCES SHOWN IN TABLE 2. EACH ENTRY IN THE TABLE CORRESPONDS TO ONE HOPPING CODE IN THE SELECTED CODE FAMILY, I.E., ONE FULL SCAN OF THE SPECTRUM.

time →

SU0: S0 S0 S0 S0 S0

SU1: S0 S1 S2 S2 S1

SU2: S1 S0 S1 S2 S2

SU3: S2 S1 S0 S1 S2

SU4: S2 S2 S1 S0 S1

SU5: S1 S2 S2 S1 S0
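
A round-robin code index table of this kind can be produced, for example, with the classical circle method for tournament scheduling. The sketch below is an assumption about how such a table might be built, not the construction used for Table II, so the column order may differ from the table even though every pair of SUs still shares a code exactly once. The names `round_robin` and `code_index_table` are hypothetical.

```python
def round_robin(n_s):
    """Circle method: one list of disjoint SU pairs per code period (n_s even)."""
    players = list(range(n_s))
    rounds = []
    for _ in range(n_s - 1):
        pairs = [tuple(sorted((players[j], players[n_s - 1 - j])))
                 for j in range(n_s // 2)]
        rounds.append(pairs)
        # Keep the first SU fixed and rotate the others by one position.
        players = [players[0], players[-1]] + players[1:-1]
    return rounds


def code_index_table(rounds, n_s):
    """table[u][t] is the hopping code index q of SU u in code period t."""
    table = [[None] * len(rounds) for _ in range(n_s)]
    for t, pairs in enumerate(rounds):
        for q, (u, v) in enumerate(pairs):
            table[u][t] = table[v][t] = q
    return table


rounds = round_robin(6)
table = code_index_table(rounds, 6)
for u, row in enumerate(table):
    print(f"SU{u}:", *(f"S{q}" for q in row))
```

As in Table II, SU0 keeps code S0 in every period while its partner changes from one period to the next.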

2) Method for Generating All D-tuples: Next we introduce

a computer search based method for finding the frequency

hopping code assignment that employs all possible sensing

constellations of size D (combinations of D SUs out of NS)

in the smallest number of code periods for NS = KD, where

K is a positive integer. The solution is based on finding cliques

from an undirected simple graph G = (V,E), where V is the

set of all possible D-tuples of NS SUs and where two vertices

are adjacent if the intersection of the corresponding D-tuples

is empty. From graph G we then search all cliques of size

K = NS/D. From the found cliques we build a new graph

G2 = (V2, E2) where V2 is the set of the cliques found from

G. In G2 two vertices are adjacent if the two cliques don’t

contain common vertices in G. Finally, the code assignment

may be found by searching one clique of size W in G2, where

$$W = \binom{N_S}{D} \frac{D}{N_S}.$$

Fig. 3 shows graph G for the case NS = 6 and D = 2,

where the SUs are denoted with {SU0,SU1,...,SU5}. Since

there are 15 different 2-tuples in a set of 6 SUs, the number

of vertices is |V| = 15. The non-overlapping cliques of size 3

were searched using the Cliquer software [11]. Finding cliques

of a given size is NP-complete, so the search time may grow very

quickly as NS grows. However, the cliques can be found off-line

and the resulting assignment stored in the SUs and the FC in advance.
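
For small networks the same search can be sketched with plain backtracking instead of a dedicated clique solver. The code below is an illustrative substitute for the Cliquer-based search, under the assumption NS = KD: it enumerates every split of the SUs into disjoint D-tuples (the cliques of size K in G) and then looks for W splits that share no D-tuple (a clique of size W in G2). The function names are hypothetical.

```python
from itertools import combinations
from math import comb


def partitions(users, d):
    """Yield every split of `users` into disjoint d-tuples."""
    if not users:
        yield []
        return
    first, rest = users[0], users[1:]
    for others in combinations(rest, d - 1):
        remaining = [u for u in rest if u not in others]
        for tail in partitions(remaining, d):
            yield [(first,) + others] + tail


def find_assignment(n_s, d):
    """Backtracking search for W splits that use every d-tuple exactly once."""
    parts = [frozenset(p) for p in partitions(list(range(n_s)), d)]
    w = comb(n_s, d) * d // n_s

    def extend(chosen, used, start):
        if len(chosen) == w:
            return chosen
        for idx in range(start, len(parts)):
            if used.isdisjoint(parts[idx]):
                found = extend(chosen + [parts[idx]], used | parts[idx], idx + 1)
                if found:
                    return found
        return None

    return extend([], frozenset(), 0)


assignment = find_assignment(6, 2)
for period, part in enumerate(assignment):
    print(period, sorted(part))
```

For NS = 6 and D = 2 this yields W = 5 code periods covering all 15 pairs. The running time of this brute-force search grows quickly with NS, which is why an off-line search is assumed.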

Figure 3. Five SU cliques of size K = NS/D = 3 in graph G for NS = 6 and D = 2. The five cliques of different colors do not contain common vertices in G, which results in the round-robin tournament shown in Table II so that each clique corresponds to one column of the table.

IV. A TWO-STAGE SENSING POLICY

The goal of the proposed sensing policy is twofold. First, it

is desirable to maximize the throughput of the secondary net-

work. And second, it is important to minimize the probability

of missed detection to avoid collisions with PU transmissions.

Consequently, the secondary network should sense more

frequently those subbands that are persistently vacant and

where the average throughput is high. Furthermore, the SUs

sensing each subband should be the ones that have a high

quality channel to the corresponding primary user and thus

have a high probability of detection.

In order to accomplish the above two goals, we propose a

two-stage sensing policy:

1. Select the subbands to be sensed in the next slot.

2. Select the SUs sensing the subbands selected in stage 1.

To achieve the first goal of maximizing the throughput, in

the first stage we select the subbands that persistently provide

high average throughput for the secondary system. To meet

the second goal about detection performance, in the second

stage we assign those SUs to sense the selected subbands that

appear to have the best channel to the primary signal.

We propose a reinforcement learning-based action selection

in both of the two stages of the sensing policy. Using a


reinforcement learning approach the goal is to estimate the

value of each subband for the secondary network and the value

of each SU for each particular subband. Then the policy selects

the subbands and SUs that have higher value more frequently.

In the first stage of the sensing policy the value of a subband

may be the estimated throughput obtainable by the secondary

network. In the second stage of the sensing policy the value

of the SU for a particular subband is a function describing the

SU's ability to detect the primary signal on that subband. Both these values are

estimated from the past rewards as [12]

$$Q_{k+1}(a) = Q_k(a) + \alpha_k \left[ r_{k+1} - Q_k(a) \right], \tag{3}$$

where $Q(a)$ denotes the value of the action $a$ (i.e., selecting a

particular subband to be sensed or assigning a particular SU

to sense a particular subband), $r_{k+1}$ is the reward at time step

$k + 1$ and $\alpha_k \in [0, 1]$ is a step size parameter.

A natural reward for achieving the first goal of maximiz-

ing the throughput of the secondary system is the achieved

immediate throughput in the sensed subband, hence, resulting

in a sensing policy that favors subbands that are vacated and

have high throughput for secondary transmissions. However,

in certain cases measuring and reporting the throughput to the

FC may not be practical or the subbands sensed as vacant

may not be accessed thus resulting in a zero throughput. An

alternative reward structure could depend only on the sensing

results, e.g., r = 1, if the subband is sensed vacant, and r = 0,

otherwise. Such a scheme would require less control signaling

and would not be affected by the decisions of not accessing a

subband sensed as vacant. However, the second problem could

be also circumvented by not updating the value if the subband

is not accessed although it has been sensed as vacant.
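
The first-stage update with the sensing-based reward can be sketched as follows. This is a minimal example under stated assumptions: the reward is the binary one described above (r = 1 if the subband is sensed vacant, r = 0 otherwise), the step size 0.5 and the observation sequence are made up for illustration, and the function names are hypothetical.

```python
def stage1_reward(sensed_vacant):
    """Binary reward: 1 if the subband was sensed vacant, 0 otherwise."""
    return 1.0 if sensed_vacant else 0.0


def update_value(q, reward, alpha):
    """One step of eq. (3): move the estimate q toward the observed reward."""
    return q + alpha * (reward - q)


# Hypothetical sensing outcomes as (subband index, sensed vacant?) pairs.
observations = [(0, True), (1, False), (0, True), (1, True)]
values = {0: 0.0, 1: 0.0}
for subband, vacant in observations:
    values[subband] = update_value(values[subband], stage1_reward(vacant), 0.5)

print(values)
```

Subband 0, which was vacant both times it was sensed, ends up with the higher value and would therefore be sensed more often by the policy.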

To achieve the second goal of minimizing the probability

of missed detection, the SUs sensing each subband should be

the ones that have high probability of detection. This can be

achieved by favoring SUs with high values according to (3)

where the reward r is a function of the local test statistic i.e.

the local likelihood ratio (LR) as

$$r_{k+1} = \begin{cases} \lambda_a, & d_{FC} = 1, \\ Q_k(a), & d_{FC} = 0, \end{cases} \tag{4}$$

where λa is the local LR of the corresponding SU and dFC

is the global decision at the FC. Alternatively the reward may

be a function of the decisions for example as follows

$$r_{k+1} = \begin{cases} 1 - (1 - d_i), & d_{FC} = 1, \\ Q_k(a), & d_{FC} = 0, \end{cases} \tag{5}$$

where $d_i$ is the local decision of the ith SU. Decision d = 1

denotes a decision that the subband is occupied by a primary

system. This strategy ensures that there is no update in the

value Q(a) when the subband is detected as vacant and a

positive or a negative update takes place when the subband is

detected as occupied depending on whether the local decision

was the same as the global decision at the FC. Consequently,

the proposed strategy favors the SUs that correctly sense the

presence of the primary systems.
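
The effect of the decision-based reward (5) on the update (3) can be checked with a short sketch. The starting value 0.4 and step size 0.1 are arbitrary illustration values, and the function names are hypothetical.

```python
def stage2_reward(q, d_local, d_fc):
    """Reward of eq. (5): depends on the local decision only if the FC says 'occupied'."""
    if d_fc == 1:
        return 1 - (1 - d_local)
    return q  # r = Q_k(a), so the update of eq. (3) leaves the value unchanged


def update_value(q, reward, alpha):
    """One step of eq. (3)."""
    return q + alpha * (reward - q)


q, alpha = 0.4, 0.1
unchanged = update_value(q, stage2_reward(q, d_local=1, d_fc=0), alpha)
agreed = update_value(q, stage2_reward(q, d_local=1, d_fc=1), alpha)
missed = update_value(q, stage2_reward(q, d_local=0, d_fc=1), alpha)
print(unchanged, agreed, missed)
```

The value stays at 0.4 when the FC decides the subband is vacant, increases when the SU agrees with an "occupied" decision, and decreases when the SU misses it.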

To ensure convergence in a stationary problem the step size

parameter αk in (3) should satisfy the following conditions

[12]

$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty. \tag{6}$$

The first condition in (6) guarantees that the step size is

large enough to overcome the initial state, while the second

condition guarantees that the step size is small enough to

assure convergence. Note that a step size αk = 1/(k + 1)

fulfills the conditions of (6) and results in the standard sample-

average of the past rewards. On the other hand, for constant

αk = α the estimates will continue varying in response to

the latest observed rewards instead of converging to steady

state. For tracking a rapidly changing spectrum this is actually

desirable, since we want the algorithm to react to the changes

as quickly as possible. However, there is a trade-off between

adaptability and variance of the value estimates when choosing

αk. A constant αk = α results in a weighted average of past

rewards and the initial value Q0, i.e. [12]

$$Q_{k+1}(a) = (1 - \alpha)^{k+1} Q_0(a) + \sum_{i=1}^{k+1} \alpha (1 - \alpha)^{k+1-i} r_i. \tag{7}$$

A constant step size α is more suitable to nonstationary

problems encountered in cognitive radio applications. We can

notice in (7) that when α is close to 0, the algorithm will

give emphasis on the past rewards, whereas when α is larger

more emphasis is given on the latest rewards. This would

suggest that for heavily nonstationary processes large values

of α would be more suitable, whereas for stationary processes

small α would give better results.
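
The two step size choices can be compared numerically. The sketch below applies the recursion (3) to a made-up reward sequence, once with αk = 1/(k + 1) and once with a constant α = 0.1, and checks the latter against the closed form (7). The reward values, the initial value Q0 = 5 and the function names are illustrative assumptions.

```python
def run_updates(rewards, q0, step):
    """Apply eq. (3) to each reward in turn; step(k) returns alpha_k."""
    q = q0
    for k, r in enumerate(rewards):
        q = q + step(k) * (r - q)
    return q


def closed_form(rewards, q0, alpha):
    """Weighted average of eq. (7) for a constant step size alpha."""
    n = len(rewards)
    return (1 - alpha) ** n * q0 + sum(
        alpha * (1 - alpha) ** (n - i) * r
        for i, r in enumerate(rewards, start=1))


rewards = [1.0, 0.0, 1.0, 1.0, 0.0]
sample_avg = run_updates(rewards, q0=5.0, step=lambda k: 1 / (k + 1))
const_step = run_updates(rewards, q0=5.0, step=lambda k: 0.1)
print(sample_avg, const_step, closed_form(rewards, 5.0, 0.1))
```

With αk = 1/(k + 1) the initial value is forgotten after the first reward and the estimate equals the sample mean of the rewards, whereas with a constant α the initial value still carries the weight (1 − α)^(k+1).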

A. The ε-greedy Action Selection

The ε-greedy action selection is a simple, yet effective

method that balances between exploitation and exploration.

With probability 1 − ε the ε-greedy method selects the

action that has the highest estimated action value, i.e.,

$a_k^* = \arg\max_a Q_k(a)$, and with probability ε it selects a random action

regardless of the action-value estimates [12].

In our framework an action corresponds to selecting the

subbands to be sensed and selecting the sensing constellations.

That is, the proposed policy uses the ε-greedy method to select the

subbands to be sensed and the SUs to sense them. However, in

order to reduce the control signaling between the FC and SUs the

policy takes pseudorandomized actions according to section III

with a probability ε instead of taking purely random actions

(random subband selection and random sensing assignment).
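
The action selection described above can be sketched as a small function in which the exploratory branch is a caller-supplied fallback, so that the pseudorandom policy of section III can be plugged in instead of a uniformly random choice. This is a minimal sketch; `epsilon_greedy`, the `explore` callable and the example values are hypothetical.

```python
import random


def epsilon_greedy(values, epsilon, explore, rng=random):
    """Return the highest-valued action with prob. 1 - epsilon, else explore()."""
    if rng.random() < epsilon:
        return explore()
    return max(values, key=values.get)


# Hypothetical action values for three subbands.
values = {"band0": 0.2, "band1": 0.9, "band2": 0.5}


def pseudorandom_action():
    """Stand-in for the next entry of the pseudorandom hopping schedule."""
    return "band2"


greedy_choice = epsilon_greedy(values, 0.0, pseudorandom_action)
explore_choice = epsilon_greedy(values, 1.0, pseudorandom_action)
print(greedy_choice, explore_choice)
```

Setting ε = 1 reduces the policy to the pure pseudorandom scheme, which is the no-learning baseline used in the simulations.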

B. The Proposed Sensing Policy

The strength of the ε-greedy action selection is that it is able

to learn from the environment by observing the results of its

actions in an adaptive manner. When the secondary network

has been formed there may not be any initial knowledge about

the environment (the subbands and their SNRs). Hence, we

propose that for a short learning period in the beginning the

sensing is done according to the deterministic pseudorandom


sensing policy. After this initial learning period, the ε-greedy

action selection is started. In the exploration state of the policy

instead of randomly selecting the sensed subbands and the

corresponding SUs to sense them, the pseudorandom sensing

policy is called. The proposed sensing policy proceeds then as

follows.

• Start with the deterministic pseudorandom sensing policy

with diversity order D.

• After the initial learning start ε-greedy action selection

and select the action after each action selection period:

– With probability ε continue the pseudorandom sens-

ing policy.

– With probability 1 − ε select NS/D subbands with

the highest values (the first stage of the policy).

∗ Choose the SU-subband sensing assignment with

the highest sum value (the second stage of the

sensing policy) with diversity order D.

– During the action selection period keep the sensing

assignment constant.

After each sensing instant the FC updates the subband and

SU values according to equation (3). Note that in this paper

the sensing constellations are chosen in a suboptimal manner

by iteratively assigning from the available SUs (unassigned)

the one with the highest value to sense a particular subband.

This approach was chosen in order to reduce the computational

complexity of finding the sensing constellations.
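
The exploitation branch of the policy can be sketched as follows. The greedy assignment below is one possible reading of the suboptimal procedure described above (subbands are processed in order of decreasing value and each receives the D best SUs that are still unassigned); it is not claimed to be the authors' exact procedure, and all names and numbers are illustrative.

```python
def select_subbands(subband_values, n_s, d):
    """Stage 1: the n_s // d subbands with the highest estimated values."""
    ranked = sorted(subband_values, key=subband_values.get, reverse=True)
    return ranked[: n_s // d]


def assign_sus(subbands, su_values, n_s, d):
    """Stage 2 (greedy): give each subband the d best still-unassigned SUs."""
    free = set(range(n_s))
    assignment = {}
    for band in subbands:
        best = sorted(free, key=lambda u: su_values[u][band], reverse=True)[:d]
        assignment[band] = best
        free -= set(best)
    return assignment


# Hypothetical estimates for N_S = 4 SUs, three subbands and D = 2.
subband_values = {0: 0.2, 1: 0.9, 2: 0.5}
su_values = [
    {0: 0.5, 1: 0.9, 2: 0.1},  # SU0
    {0: 0.5, 1: 0.8, 2: 0.7},  # SU1
    {0: 0.5, 1: 0.1, 2: 0.6},  # SU2
    {0: 0.5, 1: 0.2, 2: 0.3},  # SU3
]
chosen = select_subbands(subband_values, 4, 2)
assignment = assign_sus(chosen, su_values, 4, 2)
print(chosen, assignment)
```

In this example SU1 has a higher value than SU2 on subband 2 but is already used on subband 1, which illustrates why the greedy assignment may be suboptimal in terms of the sum value.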

V. SIMULATION EXAMPLE

Next, we show how the proposed sensing policy increases

the throughput of the secondary network and reduces the

number of missed detections of the primary signal. In the

simulations we set NS = 6 and NP = 10 and assume that

three of the ten subbands support 10 times higher throughputs

(e.g. 10 times larger bandwidth) for the SUs when the subband

is free and sensed to be free. We model the primary behavior

at each subband as a Markov chain according to Fig. 4

with P11 = P00 = 0.9 and assume that the processes at

different subbands are independent. The codes are updated

and transmitted by the FC to the SUs (action selection period)

after every third sensing instance. To clarify the notation in

this section we denote the αs in the first and second stage of

the sensing policy as α1 and α2 respectively.
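
The primary user model of this setup can be simulated with a two-state Markov chain. The following is a minimal sketch with P00 = P11 = 0.9 as in the simulations; the function name, the seeds and the use of Python's `random` module are assumptions for illustration only.

```python
import random


def simulate_occupancy(n_steps, p00=0.9, p11=0.9, state=0, rng=None):
    """Two-state Markov chain: state 0 = subband free, 1 = occupied."""
    rng = rng or random.Random()
    states = []
    for _ in range(n_steps):
        stay = p00 if state == 0 else p11
        if rng.random() >= stay:   # leave the current state with prob. 1 - stay
            state = 1 - state
        states.append(state)
    return states


# Ten independent subbands, each with its own generator.
subbands = [simulate_occupancy(1000, rng=random.Random(seed)) for seed in range(10)]
print(sum(subbands[0]) / len(subbands[0]))
```

Since P01 = P10, each subband is free half of the time in the long run, and with a stay probability of 0.9 the average length of a free or occupied period is 10 sensing instances.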

For the simulations we employ an energy detection scheme

[13] [14], where the FC sums the locally estimated received

signal energies from different SUs and then makes the global

decision about the availability of the subband. Also, any other

detector such as the cyclostationarity based detector of [3]

could be applied. The number of samples used for sensing

by each SU is 50. The SUs’ mean SNRs w.r.t. different PUs

(subbands) are assumed to be normally distributed with 12 dB

standard deviation and mean −3 dB (see Fig. 5). This scenario

approximates a situation where the SUs are shadowed by

different large objects from different primary signals causing

them to have different mean SNRs. This long term fading

effect provides the environment from where the SU network

then learns which SUs should be sensing which subband. At

each sensing instant the SUs are experiencing i.i.d. block

Rayleigh fading, which will cause randomness to how the

spectrum is observed by different SUs. This effect is mitigated

by setting D = 2.

Figure 4. The Markovian model used for simulating the availability of each subband. State 0 means that the subband is free and state 1 that the subband is occupied by a primary user. The state transition probabilities used in the simulations are P00 = P11 = 0.9 and P01 = P10 = 0.1.

Figure 5. Histogram of the SUs’ mean SNRs over all subbands (horizontal axis: SNR (dB); vertical axis: number of SUs).

Fig. 6 shows the effect of the choice of ε on the convergence

speed of the obtained throughput relative to the ideal sensing

policy. The ideal sensing policy would be the one that senses

at each sensing instant a maximum of 3 free subbands that

would provide the highest throughputs. It can be seen how

the choice of ε (eagerness of exploration vs. exploitation)

affects the convergence speed and the steady state value

of the throughput. For example, using the ε-greedy action

selection with ε = 0.1 increases the converged throughput

approximately 2.5-fold compared to the case without learning

(ε = 1).
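The role of ε can be illustrated with a generic ε-greedy sketch in the spirit of [12]. It is not the paper's exact policy: the exploration branch draws a uniformly random subset as a stand-in for the pseudorandom frequency hopping code, the reward is left abstract, and the names `select_subbands`, `update`, and the parameter `k` are hypothetical. The step size 0.01 matches the value reported for the simulations.

```python
import random

def select_subbands(rng, q, epsilon, k):
    """Pick k subband indices: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Exploration (assumed stand-in for the pseudorandom hopping code).
        return sorted(rng.sample(range(len(q)), k))
    # Exploitation: the k subbands with the highest estimated value.
    ranked = sorted(range(len(q)), key=lambda b: q[b], reverse=True)
    return sorted(ranked[:k])

def update(q, band, reward, alpha=0.01):
    """Constant step-size update Q <- Q + alpha * (reward - Q)."""
    q[band] += alpha * (reward - q[band])

if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.0] * 8  # value estimates for 8 hypothetical subbands
    for _ in range(1000):
        for band in select_subbands(rng, q, epsilon=0.1, k=3):
            # Placeholder reward: higher-indexed subbands pay more on average.
            update(q, band, reward=rng.random() * band)
    print([round(v, 2) for v in q])
```

With ε = 1 the greedy branch is never taken, so the learned values have no influence on the selection. This corresponds to the no-learning baseline in Fig. 6.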

In the second stage of the sensing policy we apply the strategy where the reward is a function of the local signal power (according to eq. (4)). Fig. 7 shows the effect of different values of ε on the convergence speed of the probability of missed detection (P_M). For ε = 0.1 the steady-state probability of missed detection goes down from 0.18 to 0.05 compared to the case with no learning (ε = 1).

Figure 6. Obtained throughputs (% of the ideal policy vs. sensing instance) for the ε-greedy action selection (ε = 0.5, 0.3, 0.1) and the deterministic pseudorandom sensing policy (ε = 1). The step size parameters are α1 = α2 = 0.01.

Figure 7. Convergence speed of the probability of missed detection (P_M vs. sensing instance) for the ε-greedy action selection (ε = 0.5, 0.3, 0.1) and for the deterministic pseudorandom sensing policy (ε = 1). The step size parameters are α1 = α2 = 0.01.

VI. DISCUSSION

In the proposed policy the actions are selected after a fixed number of sensing instances. During this time the sensing assignment stays fixed. The length of this action selection period affects the system’s ability to react to quick changes in the environment, and also the amount of data transmitted in the secondary network’s control channel. When the spectrum availability and fading statistics change slowly enough, the FC needs to transmit the codes to the SUs only when the sensing assignment has changed from the previous action selection stage. Conversely, in an extremely rapidly changing environment, an attempt to learn the statistics may be pointless, and the fixed pseudorandom sensing policy should be applied instead.
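The control-channel saving mentioned above can be sketched as a simple comparison between consecutive action selection stages. The representation of a sensing assignment as a mapping from SU identifier to hopping code, and the per-SU granularity of the update, are assumptions made for illustration; the paper does not specify a message format.

```python
def codes_to_transmit(previous, current):
    """Return the {su_id: code} entries that differ from the previous stage."""
    return {su: code for su, code in current.items()
            if previous.get(su) != code}

if __name__ == "__main__":
    # Hypothetical hopping codes, one tuple of subband indices per SU.
    previous = {"SU1": (0, 2, 1), "SU2": (1, 0, 2)}
    current = {"SU1": (0, 2, 1), "SU2": (2, 1, 0)}
    print(codes_to_transmit(previous, current))
```

An empty result means that the FC has nothing to send for that stage, which is the expected case once the learned assignment has settled in a slowly changing environment.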

So far, no learning is involved in selecting the diversity D for a particular subband to be as small as possible. Selecting the smallest sufficient D for all subbands would ensure that sensing resources are not wasted, for example by assigning an SU to sense a subband that can already be sensed well by other SUs. However, incorporating this kind of selection of D into the proposed sensing policy is outside the scope of this paper.

VII. CONCLUSIONS

In this paper a reinforcement learning based cooperative multiband sensing policy has been proposed. In addition to the deterministic pseudorandom sensing policy, the proposed policy employs the ε-greedy action selection to dynamically select the most promising subbands to be sensed and to arrange the best sensing constellations for those subbands. The simulation results show that the proposed policy is able to provide significant gains in the throughput obtained by the secondary network and to reduce the interference caused to the primary systems by missed detections. To combat shadowing and fast fading the policy guarantees a desired diversity, i.e., the number D of secondary users simultaneously sensing a particular subband.

REFERENCES

[1] D. Cabric, S. M. Mishra, and R. W. Brodersen, “Implementation issues in spectrum sensing for cognitive radios,” in Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, Nov. 2004, vol. 1, pp. 772–776.

[2] J. Mitola III, Cognitive Radio: An Integrated Agent Architecture for Software Defined Radio, Ph.D. thesis, KTH, Stockholm, Sweden, 2000.

[3] J. Lundén, V. Koivunen, A. Huttunen, and H. V. Poor, “Collaborative cyclostationary spectrum sensing for cognitive radio systems,” IEEE Transactions on Signal Processing, vol. 57, no. 11, pp. 4182–4195, Nov. 2009.

[4] Q. Zhao, B. Krishnamachari, and K. Liu, “On myopic sensing for multi-channel opportunistic access: structure, optimality and performance,” IEEE Transactions on Wireless Communications, pp. 5431–5440, Dec. 2008.

[5] Q. Zhao, L. Tong, A. Swami, and Y. Chen, “Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework,” IEEE Journal on Selected Areas in Communications, vol. 25, no. 3, pp. 589–600, Apr. 2007.

[6] S. Filippi, O. Cappé, F. Clérot, and E. Moulines, “A near optimal policy for channel allocation in cognitive radio,” in Recent Advances in Reinforcement Learning: 8th European Workshop (EWRL 2008), July 2008, pp. 69–81.

[7] U. Berthold, F. Fu, M. van der Schaar, and F. Jondral, “Detection of Spectral Resources in Cognitive Radios Using Reinforcement Learning,” in IEEE DySPAN, Oct. 2008, pp. 1–5.

[8] J. Oksanen, V. Koivunen, J. Lundén, and A. Huttunen, “Diversity-based spectrum sensing policy for detecting primary signals over multiple frequency bands,” to appear in the IEEE ICASSP Conference, Dallas, Texas, Mar. 2010.

[9] A. A. Shaar and P. A. Davies, “Prime Sequences: Quasi-Optimal Sequences For OR Channel Code Division Multiplexing,” IEE Electronics Letters, vol. 19, no. 21, pp. 888–890, Oct. 1983.

[10] C. J. Colbourn and J. H. Dinitz, The CRC handbook of combinatorial designs, CRC Press, 1996, 753 p.

[11] “Cliquer,” http://users.tkk.fi/pat/cliquer.html.

[12] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, Cambridge, MA: MIT Press, 1998.

[13] H. Urkowitz, “Energy detection of unknown deterministic signals,” Proceedings of the IEEE, vol. 55, no. 4, pp. 523–531, Apr. 1967.

[14] F. F. Digham, M. S. Alouini, and M. K. Simon, “On the energy detection of unknown signals over fading channels,” in IEEE Conference on Communications, May 2003, vol. 5, pp. 3575–3579.
