Reinforcement Learning-Based Multiband Sensing
Policy for Cognitive Radios
Jan Oksanen#1, Jarmo Lundén#2, and Visa Koivunen#3
#Department of Signal Processing and Acoustics, SMARAD CoE
Aalto University School of Science and Technology, [email protected]
Abstract—Cognitive radios (CR) and dynamic spectrum access (DSA) have been proposed as a way to exploit the underutilized radio spectrum by allowing secondary users to access the licensed frequencies in an opportunistic manner. The constraint set on the secondary use is that it should not interfere with the primary users, i.e., the license holders. Hence, the secondary users need to sense the spectrum in order to classify a licensed frequency band as vacant or occupied. However, spectrum sensing can be a demanding task for a single user due to the random nature of the wireless channel, and to mitigate the effects of channel fading, cooperative detection algorithms have been proposed. In this paper a multiband spectrum sensing policy for coordinating the cooperative sensing is proposed. It is based on dynamically allocating frequency hopping codes to the secondary users. The proposed policy employs the ε-greedy reinforcement learning action selection to prioritize the sensing of different subbands and to select the best secondary users to sense them. The results show that the proposed policy is able to significantly increase the obtained throughput in the secondary network and to reduce the number of missed detections of the primary signal.
Index Terms—Cognitive radio, Spectrum sensing, Reinforcement learning, Spatial diversity, Frequency hopping.
I. INTRODUCTION
The rapidly growing markets for wireless devices have
made the useful radio spectrum a scarce resource. Spectrum
measurement campaigns, however, have shown that although
most of the radio spectrum has been licensed for different
operators, large parts of the spectrum are underutilized [1].
This finding is the key motivation behind cognitive radio [2]
and DSA (dynamic spectrum access), where the goal is to
find unused radio spectrum in time and in space, and to allow
secondary users (SU) to access such licensed frequencies in a
way that does not interfere with the primary user. Consequently,
spectrum sensing is needed by the cognitive radio system to
identify such spectrum opportunities and to characterize the
possible interference levels to the primary system.
Wireless signals go through random channels causing the
instantaneous signal-to-noise ratio (SNR) to fluctuate in time.
Occasionally the SNR may be reduced by tens of dB. Cooperative
detection schemes have been proposed in the literature to
combat fading by combining the sensing results from
multiple SUs from various locations (see, e.g., [3]).
A cognitive secondary network consists of NS wireless ter-
minals sensing the licensed spectrum of interest. The spectrum
of interest may be noncontiguous as the licensed frequency
bands may be scattered in the radio spectrum depending on the
location and the allocation of the bands to different operators.
This calls for organizing and dividing the sensing task among
the SUs, since a single user may not be able to sense the whole
band at once. Spectrum sensing and allocation policies have
been proposed in the literature [4], [5], [6], [7] for selecting
the subband to be sensed at the next sensing instant. However,
these works do not consider collaborative spectrum sensing
and distribution of work among SUs with different sensing
capabilities. In [8] a diversity based multiband spectrum sens-
ing policy has been proposed, where the design of the sens-
ing policy has been converted into designing and allocating
pseudorandom frequency hopping codes to the SUs, which tell
them which subbands to sense and when. The subbands are
sensed by constellations of D SUs, thereby exploiting
spatial diversity. In this paper we further develop this sensing
policy and show how reinforcement learning methods can be
embedded in it to actively select the most promising subbands
to be sensed and the SUs to sense them. This method maintains
the desired diversity, i.e., the number D of SUs sensing the
same subband simultaneously. The proposed sensing policy
provides significant gains to the secondary network throughput
and reduces the interference to the primary system caused by
missed detections.
This paper is organized as follows. The system model of
cooperative sensing over multiple subbands is described in
section II. In section III we present a pseudorandom frequency
hopping sensing policy for organizing the multiband sensing
task in a secondary network. In section IV we propose a
reinforcement learning based sensing policy. In sections V and
VI we show the simulation results for the proposed sensing
policy and discuss them. The paper is concluded in
section VII.
II. WIDE–BAND COOPERATIVE SPECTRUM SENSING
The bandwidth of the spectrum of interest for the cognitive
radio system may be wide and in some cases even noncontigu-
ous. Sensing the entire spectrum of interest simultaneously is
demanding for the hardware and may be energy inefficient.
Energy efficiency and low cost are especially important in
mobile applications where the SUs are typically battery-
2010 2nd International Workshop on Cognitive Information Processing
978-1-4244-6459-3/10/$26.00 ©2010 IEEE 316
powered devices. To facilitate easier receiver front-end design
and energy efficient operation it may be more practical to
divide a large, possibly noncontiguous frequency band into
subbands that are sensed separately. Hence, we assume that
the spectrum of interest consists of NP subbands with possibly
different bandwidths Bi, i = 1, . . . , NP .
Assuming that the spectrum of interest consists of a set of
subbands, a SU in a simple case has to do the sensing one
subband at a time. However, if there are multiple SUs in the
network that cooperate, the secondary network is able to sense
multiple subbands simultaneously. Alternatively, multiple SUs
might sense the same subbands at the same time and combine
their results, thus providing spatial diversity. By combining
their sensing results, the probability of detection at a given
SNR is increased, or for equal performance, simpler detector
structures could be employed.
We assume a slotted design for the secondary network. That
is, the operation of the secondary network over time is divided
into sensing time slots and potential transmission time slots as
illustrated in Fig. 1. During each sensing time slot each SU
senses one of the subbands and then sends its local test statistic
or decision to the fusion center (FC) via a control channel. The
local test statistics or decisions are then combined at the FC
using some fusion rule.
Figure 1. Slotting of a secondary user’s operation. After sensing a particular subband and sending its local test statistics to the FC, the FC grants a permission to one of the SUs to transmit.
We propose an approach where the sensing policy is con-
trolled by the FC. That is, for each time slot the FC decides
which subbands are sensed and by which SUs. We propose
a two-stage sensing policy where in the first stage the FC
decides which subbands will be sensed and in the second stage
it decides which SUs they are sensed by.
III. PSEUDORANDOM FREQUENCY HOPPING BASED
SENSING POLICY
In [8] a diversity based multiband spectrum sensing policy
has been proposed, where the design of the sensing policy has
been converted into designing and allocating pseudorandom
frequency hopping codes to the SUs, which tell them which
subbands to sense and when. The assumption is that the
frequency hopping codes can be stored in the SU memory
off-line. Frequency hopping code allocation for the SUs is
made by the FC such that after each hopping code period different
D-tuples of the NS SUs are employed to scan the
spectrum of interest together. To speed up the scanning of the
whole spectrum of interest, different D-tuples sense different
subbands simultaneously. Figure 2 shows an example design
of the hopping codes for NS = 4, NP = 3 and D = 2. The
benefit of this kind of sensing policy is that it reduces the
control signaling from the FC to the SUs about the bands to
be sensed to only signaling of the hopping sequence indices
to the SUs.
Figure 2. Pseudorandom frequency hopping codes for NS = 4, NP = 3 and D = 2. At each sensing instance the SU senses the subband pointed to by the current entry of its hopping code. Hopping code entries point to the physical frequencies that are maintained in table F.
In frequency hopping based sensing each SU hops according
to its hopping sequence to sense one of the subbands of
interest. The subband to be sensed at time index i is given
by f(i) = F [Sq(i)], where Sq(i) is the qth frequency hopping
sequence, F is a table containing the mappings to the physical
subbands. The assigning of the sequences Sq to different
SUs is discussed later in sections III-A1 and III-A2. Table
F may include links to the subbands’ center frequencies and
bandwidths. As F is a primary-system-dependent mapping to the
physical frequencies, it may be assumed to be the same for all
SUs exploiting the same primary network. Frequency
reuse in cellular primary systems would also make F
location dependent, but for simplicity we assume here that
licensed frequencies are not reused. Hence, the examination
of hits among SUs sensing may be limited to the examination
of hits among their frequency hopping sequences Sq(i).
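The table lookup f(i) = F[Sq(i)] and the notion of a hit can be illustrated with a minimal sketch. The center frequencies and bandwidths below are hypothetical placeholders, and the names `Subband`, `subband_at` and `hits` are introduced here for illustration only; they do not come from the paper.

```python
from typing import NamedTuple

class Subband(NamedTuple):
    center_hz: float      # hypothetical center frequency
    bandwidth_hz: float   # hypothetical bandwidth

def subband_at(F, S_q, i):
    """Physical subband sensed at time index i, i.e. f(i) = F[S_q(i)]."""
    return F[S_q[i % len(S_q)]]

def hits(S_a, S_b):
    """Number of time indices at which two hopping sequences point to the same subband."""
    return sum(1 for a, b in zip(S_a, S_b) if a == b)

# Table F for N_P = 3 subbands (values are illustrative assumptions).
F = [Subband(470e6, 8e6), Subband(600e6, 8e6), Subband(790e6, 8e6)]
S_0 = [0, 1, 2]   # hopping sequence with shift 0
S_1 = [1, 2, 0]   # hopping sequence with shift 1
```

Because all SUs share the same table F, counting hits on the index sequences is enough: two SUs that hold the same sequence hit at every index, and two SUs with different cyclic shifts never hit.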
A. Frequency Hopping Sequences
Next, we show a design example of the subband sensing
patterns providing the desired diversity D, i.e., the number
of SUs simultaneously sensing any subband. The patterns
are orthogonal hopping sequences that may be generated by
applying cyclic shifts to any full sequence of integers. By
orthogonality we mean that the codes can be phased so that
all codes in the code family will be pairwise non-overlapping.
The desired diversity and consequently the number of hits
among the SUs may be obtained by assigning the users in
each D-tuple to have the same hopping code for one full
code period. Instead of using cyclic shift sequences other
code designs known in the frequency hopping spread spectrum
(FHSS) literature could be applied.
The simplest way to generate an orthogonal code family is to
cyclically shift any full sequence of integer numbers. Cyclic
shifts may be generated by the modulo operation as
$$S_q(i) \equiv (i + \Delta_q) \bmod N_P, \qquad (1)$$
where i ∈ [0, NP − 1], q ∈ [0, ⌊NS/D⌋ − 1] and ∆q is the shift
parameter.
When ∆q ≠ ∆w, ∀q ≠ w, the maximum periodic Hamming
[9] correlation between any two codes in the family is NP
and the minimum 0. The shifts between the cooperating D-
tuple can be adjusted so that minimum periodic Hamming
correlation will be obtained.
To obtain a high scanning speed by the secondary system
over the whole spectrum of interest (all subbands) we choose
the shift parameter ∆q to be
$$\Delta_q = \begin{cases} qA \bmod (N_P - 1), & \text{if } A \mid N_P \text{ and } A \neq 1, \\ qA \bmod N_P, & \text{otherwise,} \end{cases} \qquad (2)$$
where A = ⌈NP D/NS⌉ and where A | NP means that NP is
divisible by A. Table I shows an example hopping code family
of (2) for NS = 12, NP = 10 and D = 2.
Table I
CYCLIC SHIFT HOPPING CODE FAMILY FOR NS = 12, NP = 10 AND D = 2
S0(i) 0 1 2 3 4 5 6 7 8 9
S1(i) 2 3 4 5 6 7 8 9 0 1
S2(i) 4 5 6 7 8 9 0 1 2 3
S3(i) 6 7 8 9 0 1 2 3 4 5
S4(i) 8 9 0 1 2 3 4 5 6 7
S5(i) 1 2 3 4 5 6 7 8 9 0
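As a check of (1) and (2), the following sketch generates the cyclic shift code family. It assumes that the first case of (2) is read as qA mod (NP − 1), which is the reading that matches the last row of Table I; the function names are introduced here for illustration.

```python
def shift_parameter(q, n_s, n_p, d):
    """Shift of code q according to (2), with A = ceil(N_P * D / N_S)."""
    a = -(-n_p * d // n_s)               # integer ceiling division
    if a != 1 and n_p % a == 0:          # A divides N_P and A != 1
        return (q * a) % (n_p - 1)       # assumed grouping: mod (N_P - 1)
    return (q * a) % n_p

def code_family(n_s, n_p, d):
    """Cyclic shift hopping codes S_q(i) = (i + shift_q) mod N_P, eq. (1)."""
    family = []
    for q in range(n_s // d):            # q = 0, ..., floor(N_S / D) - 1
        shift = shift_parameter(q, n_s, n_p, d)
        family.append([(i + shift) % n_p for i in range(n_p)])
    return family

if __name__ == "__main__":
    for q, code in enumerate(code_family(12, 10, 2)):
        print(f"S{q}(i)", *code)
```

For NS = 12, NP = 10 and D = 2 this gives A = 2 and the shifts 0, 2, 4, 6, 8, 1, so the six codes are pairwise non-overlapping at every time index.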
1) Example of Frequency Hopping Sequence Assignment
for D=2: Next, an example of the frequency hopping code
allocation for the pseudorandom scheme with diversity D = 2 is given. It corresponds to an all-play-all round-robin
tournament design [10].
Table II illustrates a round-robin hopping sequence table for
NS = 6.
Table II
ROUND-ROBIN FREQUENCY HOPPING CODE INDEX TABLE FOR 6 SUS. THE CODE ASSIGNMENT CORRESPONDS TO THE HOPPING SEQUENCES SHOWN IN TABLE 2. EACH ENTRY IN THE TABLE CORRESPONDS TO ONE HOPPING CODE IN THE SELECTED CODE FAMILY, I.E., ONE FULL SCAN OF THE SPECTRUM.
time →
SU0: S0 S0 S0 S0 S0
SU1: S0 S1 S2 S2 S1
SU2: S1 S0 S1 S2 S2
SU3: S2 S1 S0 S1 S2
SU4: S2 S2 S1 S0 S1
SU5: S1 S2 S2 S1 S0
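A round-robin code index table of this kind can be produced, for example, with the classical circle method for tournament scheduling. The sketch below is one such construction and is not necessarily the one used to build Table II: the ordering of the columns and the code labels may differ, but every pair of SUs shares a code in exactly one code period.

```python
from itertools import combinations

def round_robin_rounds(n_su):
    """Circle method: SU 0 stays fixed, the others rotate one step per round."""
    if n_su % 2:
        raise ValueError("the circle method as written needs an even number of SUs")
    others = list(range(1, n_su))
    rounds = []
    for _ in range(n_su - 1):
        line = [0] + others
        # Pair the k-th SU from the front with the k-th SU from the back.
        pairs = [tuple(sorted((line[k], line[n_su - 1 - k])))
                 for k in range(n_su // 2)]
        rounds.append(pairs)
        others = others[-1:] + others[:-1]   # rotate by one position
    return rounds

def code_index_table(n_su):
    """table[su][t] = index of the hopping code used by SU `su` in code period t."""
    table = [[None] * (n_su - 1) for _ in range(n_su)]
    for t, pairs in enumerate(round_robin_rounds(n_su)):
        for code, (u, v) in enumerate(pairs):
            table[u][t] = code
            table[v][t] = code
    return table

if __name__ == "__main__":
    for su, row in enumerate(code_index_table(6)):
        print(f"SU{su}:", " ".join(f"S{c}" for c in row))
```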
2) Method for Generating All D-tuples: Next we introduce
a computer search based method for finding the frequency
hopping code assignment that employs all possible sensing
constellations of size D (combinations of D SUs out of NS)
in the smallest number of code periods for NS = KD, where
K is a positive integer. The solution is based on finding cliques
from an undirected simple graph G = (V,E), where V is the
set of all possible D-tuples of NS SUs and where two vertices
are adjacent if the intersection of the corresponding D-tuples
is empty. From graph G we then search all cliques of size
K = NS/D. From the found cliques we build a new graph
G2 = (V2, E2) where V2 is the set of the cliques found from
G. In G2 two vertices are adjacent if the two cliques don’t
contain common vertices in G. Finally, the code assignment
may be found by searching one clique of size W in G2, where
$$W = \binom{N_S}{D} \frac{D}{N_S}.$$
Fig. 3 shows graph G for the case NS = 6 and D = 2,
where the SUs are denoted with {SU0,SU1,...,SU5}. Since
there are 15 different 2-tuples in a set of 6 SUs, the number
of vertices is |G| = 15. The non-overlapping cliques of size 3 were searched using the Cliquer software [11]. Finding cliques
of a given size is NP-complete, i.e., no polynomial-time algorithm
is known and the search time grows quickly with NS. However, the cliques can be found off-line
and the resulting assignment stored in the SUs and the FC in advance.
Figure 3. Five SU cliques of size K = NS/D = 3 in graph G for NS = 6 and D = 2. The five cliques of different colors don’t contain common vertices in G, which results in the round-robin tournament shown in Table II so that each clique corresponds to one column of the table.
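For small networks the two-level clique search can be mimicked with plain backtracking, without the Cliquer software. The sketch below is an illustrative substitute under that assumption: a clique of size K in G is a partition of the SUs into disjoint D-tuples, and a clique of size W in G2 is a set of W such partitions that share no D-tuple. The function names are introduced here and are not from the paper.

```python
from itertools import combinations
from math import comb

def partitions_into_tuples(sus, d):
    """Yield every split of `sus` into disjoint D-tuples (size-K cliques of G)."""
    if not sus:
        yield ()
        return
    first, rest = sus[0], sus[1:]
    for mates in combinations(rest, d - 1):
        block = (first, *mates)
        remaining = tuple(s for s in rest if s not in mates)
        for tail in partitions_into_tuples(remaining, d):
            yield (block, *tail)

def find_assignment(n_s, d):
    """Backtracking search for W partitions with no common D-tuple (clique in G2)."""
    w = comb(n_s, d) * d // n_s
    parts = [frozenset(p) for p in partitions_into_tuples(tuple(range(n_s)), d)]

    def extend(chosen, used, start):
        if len(chosen) == w:
            return chosen
        for idx in range(start, len(parts)):
            if used.isdisjoint(parts[idx]):
                found = extend(chosen + [parts[idx]], used | parts[idx], idx + 1)
                if found is not None:
                    return found
        return None

    return extend([], frozenset(), 0)

if __name__ == "__main__":
    for period, partition in enumerate(find_assignment(6, 2)):
        print(period, sorted(partition))
```

For NS = 6 and D = 2 there are 15 candidate partitions and W = 15 · 2/6 = 5, so the search returns five code periods that together use each of the 15 pairs exactly once. The exhaustive search is only practical for small NS, which is consistent with running it off-line.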
IV. A TWO-STAGE SENSING POLICY
The goal of the proposed sensing policy is twofold. First, it
is desirable to maximize the throughput of the secondary net-
work. And second, it is important to minimize the probability
of missed detection to avoid collisions with PU transmissions.
Consequently, the secondary network should sense more
frequently those subbands that are persistently vacant and
where the average throughput is high. Furthermore, the SUs
sensing each subband should be the ones that have a high
quality channel to the corresponding primary user and thus
have a high probability of detection.
In order to accomplish the above two goals, we propose a
two-stage sensing policy:
1. Select the subbands to be sensed in the next slot.
2. Select the SUs sensing the subbands selected in stage 1.
To achieve the first goal of maximizing the throughput, in
the first stage we select the subbands that persistently provide
high average throughput for the secondary system. To meet
the second goal about detection performance, in the second
stage we assign those SUs to sense the selected subbands that
appear to have the best channel to the primary signal.
We propose a reinforcement learning-based action selection
in both of the two stages of the sensing policy. Using a
reinforcement learning approach the goal is to estimate the
value of each subband for the secondary network and the value
of each SU for each particular subband. Then the policy selects
the subbands and SUs that have higher value more frequently.
In the first stage of the sensing policy the value of a subband
may be the estimated throughput obtainable by the secondary
network. In the second stage of the sensing policy the value
of the SU for a particular subband is a function describing the
SU's ability to detect the primary signal on that subband. Both these values are
estimated from the past rewards as [12]
$$Q_{k+1}(a) = Q_k(a) + \alpha_k \left[ r_{k+1} - Q_k(a) \right], \qquad (3)$$
where Q(a) denotes the value of the action a (i.e. selecting a
particular subband to be sensed or assigning a particular SU
to sense a particular subband), rk+1 is the reward at time step
k + 1 and αk ∈ [0, 1] is a step size parameter.
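The update (3) is a one-line recursion. The sketch below applies it to a short, made-up sequence of vacancy rewards (r = 1 if the subband is sensed vacant, r = 0 otherwise) with the decaying step size 1/(k + 1), assuming the step index starts at k = 0 so that the first update discards the initial value. Under that assumption the estimate equals the plain average of the rewards.

```python
def update_value(q, reward, alpha):
    """One step of (3): move the estimate q toward the observed reward."""
    return q + alpha * (reward - q)

def sample_average_estimate(rewards, q0=0.0):
    """Apply (3) with alpha_k = 1 / (k + 1), k = 0, 1, 2, ..."""
    q = q0
    for k, r in enumerate(rewards):
        q = update_value(q, r, 1.0 / (k + 1))
    return q

# Hypothetical sensing outcomes for one subband (1 = vacant, 0 = occupied).
vacancy_rewards = [1, 0, 1, 1, 0, 1, 1, 1]

if __name__ == "__main__":
    print(sample_average_estimate(vacancy_rewards))
    print(sum(vacancy_rewards) / len(vacancy_rewards))
```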
A natural reward for achieving the first goal of maximiz-
ing the throughput of the secondary system is the achieved
immediate throughput in the sensed subband, hence, resulting
in a sensing policy that favors subbands that are vacated and
have high throughput for secondary transmissions. However,
in certain cases measuring and reporting the throughput to the
FC may not be practical or the subbands sensed as vacant
may not be accessed thus resulting in a zero throughput. An
alternative reward structure could depend only on the sensing
results, e.g., r = 1, if the subband is sensed vacant, and r = 0,
otherwise. Such a scheme would require less control signaling
and would not be affected by the decisions of not accessing a
subband sensed as vacant. However, the second problem could
be also circumvented by not updating the value if the subband
is not accessed although it has been sensed as vacant.
To achieve the second goal of minimizing the probability
of missed detection, the SUs sensing each subband should be
the ones that have high probability of detection. This can be
achieved by favoring SUs with high values according to (3)
where the reward r is a function of the local test statistic i.e.
the local likelihood ratio (LR) as
$$r_{k+1} = \begin{cases} \lambda_a, & d_{FC} = 1, \\ Q_k(a), & d_{FC} = 0, \end{cases} \qquad (4)$$
where λa is the local LR of the corresponding SU and dFC
is the global decision at the FC. Alternatively the reward may
be a function of the decisions for example as follows
$$r_{k+1} = \begin{cases} 1 - (1 - d_i), & d_{FC} = 1, \\ Q_k(a), & d_{FC} = 0, \end{cases} \qquad (5)$$
where di is the local decision of the ith SU. Decision d = 1 denotes a decision that the subband is occupied by a primary
system. This strategy ensures that there is no update in the
value Q(a) when the subband is detected as vacant and a
positive or a negative update takes place when the subband is
detected as occupied depending on whether the local decision
was the same as the global decision at the FC. Consequently,
the proposed strategy favors the SUs that correctly sense the
presence of the primary systems.
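The two reward definitions can be written directly as functions of the global decision. The sketch below follows (4) and (5) as printed and shows why setting the reward to Qk(a) when dFC = 0 leaves the value unchanged in (3). The function names and the numeric inputs are illustrative assumptions.

```python
def reward_from_lr(local_lr, q_current, d_fc):
    """Reward (4): the local likelihood ratio when the FC decides 'occupied'."""
    return local_lr if d_fc == 1 else q_current

def reward_from_decision(d_i, q_current, d_fc):
    """Reward (5) as printed: 1 - (1 - d_i) when the FC decides 'occupied'."""
    return 1 - (1 - d_i) if d_fc == 1 else q_current

def update_value(q, reward, alpha):
    """Value update (3)."""
    return q + alpha * (reward - q)

q = 0.3
# FC decides 'vacant': the reward equals Q_k(a), so the update is a no-op.
q_vacant = update_value(q, reward_from_decision(1, q, d_fc=0), alpha=0.1)
# FC decides 'occupied' and the SU agreed (d_i = 1): the value increases.
q_agree = update_value(q, reward_from_decision(1, q, d_fc=1), alpha=0.1)
# FC decides 'occupied' and the SU disagreed (d_i = 0): the value decreases.
q_disagree = update_value(q, reward_from_decision(0, q, d_fc=1), alpha=0.1)
```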
To ensure convergence in a stationary problem the step size
parameter αk in (3) should satisfy the following conditions
[12]
$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty. \qquad (6)$$
The first condition in (6) guarantees that the step size is
large enough to overcome the initial state, while the second
condition guarantees that the step size is small enough to
assure convergence. Note that a step size αk = 1/(k + 1) fulfills the conditions of (6) and results in the standard sample
average of the past rewards. On the other hand, for constant
αk = α the estimates will continue varying in response to
the latest observed rewards instead of converging to steady
state. For tracking a rapidly changing spectrum this is actually
desirable, since we want the algorithm to react to the changes
as quickly as possible. However there is a trade-off between
adaptability and variance of the value estimates when choosing
αk. A constant αk = α results in a weighted average of past
rewards and the initial value Q0, i.e. [12]
$$Q_{k+1}(a) = (1 - \alpha)^{k+1} Q_0(a) + \sum_{i=1}^{k+1} \alpha (1 - \alpha)^{k+1-i} r_i. \qquad (7)$$
A constant step size α is more suitable to nonstationary
problems encountered in cognitive radio applications. We can
notice in (7) that when α is close to 0, the algorithm
places emphasis on the past rewards, whereas when α is larger
more emphasis is placed on the latest rewards. This would
suggest that for heavily nonstationary processes large values
of α would be more suitable, whereas for stationary processes
small α would give better results.
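The equivalence between the recursion (3) with a constant step size and the weighted average (7) can be checked numerically. The reward sequence, the initial value and α below are arbitrary illustrative choices.

```python
def incremental_estimate(q0, rewards, alpha):
    """Run the recursion (3) with a constant step size alpha."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

def closed_form_estimate(q0, rewards, alpha):
    """Evaluate (7): exponentially weighted average of Q_0 and r_1, ..., r_n."""
    n = len(rewards)
    weighted = sum(alpha * (1 - alpha) ** (n - i) * r
                   for i, r in enumerate(rewards, start=1))
    return (1 - alpha) ** n * q0 + weighted

example_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]

if __name__ == "__main__":
    print(incremental_estimate(0.5, example_rewards, 0.1))
    print(closed_form_estimate(0.5, example_rewards, 0.1))
```

The weight of reward ri is α(1 − α)^(n−i), so with α = 0.1 a reward observed 20 steps ago still carries about 12 % of the weight of the newest one, whereas with α = 0.5 it is practically forgotten.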
A. The ε-greedy Action Selection
The ε-greedy action selection is a simple, yet effective
method that balances between exploitation and exploration.
With a probability 1 − ε the ε-greedy method selects the
action that has the highest estimated action value, i.e.,
a∗k = arg maxa Qk(a), and a random action with a probability ε regardless
of the action-value estimates [12].
In our framework an action corresponds to selecting the
subbands to be sensed and selecting the sensing constellations.
That is, the proposed policy uses the ε-greedy method to select the
subbands to be sensed and the SUs to sense them. However, in
order to reduce the control signaling between the FC and SUs the
policy takes pseudorandomized actions according to section III
with a probability ε instead of taking purely random actions
(random subband selection and random sensing assignment).
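A minimal sketch of the ε-greedy rule is given below. The exploration branch is passed in as a callable so that it can be the pseudorandom policy of section III rather than a uniformly random draw; the name `explore` and the example values are assumptions made for illustration.

```python
import random

def epsilon_greedy(values, epsilon, explore, rng=random):
    """Return the index of the greedy action with probability 1 - epsilon,
    otherwise whatever the exploration callable returns."""
    if rng.random() < epsilon:
        return explore()
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical subband values Q_k(a) held by the FC.
subband_values = [0.1, 0.7, 0.3]

# Exploration stand-in: here simply "use the next pseudorandom code period".
def next_pseudorandom_action():
    return "pseudorandom"

if __name__ == "__main__":
    picks = [epsilon_greedy(subband_values, 0.1, next_pseudorandom_action)
             for _ in range(10)]
    print(picks)
```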
B. The Proposed Sensing Policy
The strength of the ε-greedy action selection is that it is able
to learn from the environment by observing the results of its
actions in an adaptive manner. When the secondary network
has been formed there may not be any initial knowledge about
the environment (the subbands and their SNRs). Hence, we
propose that for a short learning period in the beginning the
sensing is done according to the deterministic pseudorandom
sensing policy. After this initial learning period, the ε-greedy
action selection is started. In the exploration state of the policy
instead of randomly selecting the sensed subbands and the
corresponding SUs to sense them, the pseudorandom sensing
policy is called. The proposed sensing policy proceeds then as
follows.
• Start with the deterministic pseudorandom sensing policy
with diversity order D.
• After the initial learning start ε-greedy action selection
and select the action after each action selection period:
– With probability ε continue the pseudorandom sens-
ing policy.
– With probability 1 − ε select NS/D subbands with
the highest values (the first stage of the policy).
∗ Choose the SU-subband sensing assignment with
the highest sum value (the second stage of the
sensing policy) with diversity order D.
– During the action selection period keep the sensing
assignment constant.
After each sensing instant the FC updates the subband and
SU values according to equation (3). Note that in this paper
the sensing constellations are chosen in a suboptimal manner
by iteratively assigning from the available SUs (unassigned)
the one with the highest value to sense a particular subband.
This approach was chosen in order to reduce the computational
complexity of finding the sensing constellations.
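The exploitation branch of the policy can be sketched as follows. The first stage picks the NS/D subbands with the highest values. For the second stage the sketch assumes that the selected subbands are served in decreasing order of value and that each one takes the D unassigned SUs with the highest value for it; the paper only states that the assignment is iterative and suboptimal, so this ordering is an assumption, as are the function names and the example values.

```python
def select_subbands(subband_values, n_s, d):
    """Stage 1: indices of the N_S / D subbands with the highest values."""
    k = n_s // d
    order = sorted(range(len(subband_values)),
                   key=lambda b: subband_values[b], reverse=True)
    return order[:k]

def assign_sus(su_values, subbands, d):
    """Stage 2 (greedy): su_values[s][b] is the value of SU s on subband b."""
    free = set(range(len(su_values)))
    assignment = {}
    for b in subbands:
        best = sorted(free, key=lambda s: su_values[s][b], reverse=True)[:d]
        assignment[b] = best
        free -= set(best)
    return assignment

# Hypothetical values for N_S = 4 SUs, N_P = 4 subbands and D = 2.
subband_values = [0.2, 0.9, 0.5, 0.7]
su_values = [
    [0.0, 0.9, 0.0, 0.1],
    [0.0, 0.8, 0.0, 0.9],
    [0.0, 0.1, 0.0, 0.8],
    [0.0, 0.2, 0.0, 0.3],
]
chosen = select_subbands(subband_values, n_s=4, d=2)
assignment = assign_sus(su_values, chosen, d=2)
```

In this example SU 1 would be the best sensor for subband 3, but it has already been assigned to subband 1, which illustrates why the greedy assignment is suboptimal in general.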
V. SIMULATION EXAMPLE
Next, we show how the proposed sensing policy increases
the throughput of the secondary network and reduces the
number of missed detections of the primary signal. In the
simulations we set NS = 6 and NP = 10 and assume that
three of the ten subbands support 10 times higher throughputs
(e.g. 10 times larger bandwidth) for the SUs when the subband
is free and sensed to be free. We model the primary behavior
at each subband as a Markov chain according to Fig. 4
with P11 = P00 = 0.9 and assume that the processes at
different subbands are independent. The codes are updated
and transmitted by the FC to the SUs (action selection period)
after every third sensing instance. To clarify the notation in
this section we denote the αs in the first and second stage of
the sensing policy as α1 and α2 respectively.
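The primary user model can be simulated with a few lines. The sketch below draws the occupancy of one subband from the two-state Markov chain and computes the stationary probability of the free state, P10/(P01 + P10), which equals 0.5 for P00 = P11 = 0.9. The function names and the use of Python's `random` module are choices made here for illustration.

```python
import random

def simulate_occupancy(n_steps, p00=0.9, p11=0.9, state=0, rng=None):
    """Sample a state sequence (0 = free, 1 = occupied) of the two-state chain."""
    rng = rng or random.Random()
    states = []
    for _ in range(n_steps):
        states.append(state)
        stay = p00 if state == 0 else p11
        if rng.random() >= stay:       # leave the current state
            state = 1 - state
    return states

def stationary_free_probability(p00, p11):
    """Long-run fraction of time the subband is free."""
    p01, p10 = 1 - p00, 1 - p11
    return p10 / (p01 + p10)

if __name__ == "__main__":
    trace = simulate_occupancy(10000, rng=random.Random(1))
    print(trace.count(0) / len(trace), stationary_free_probability(0.9, 0.9))
```

With these transition probabilities the expected sojourn time in each state is 1/(1 − 0.9) = 10 sensing instants, so the availability of a subband changes slowly relative to the action selection period of three sensing instants.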
For the simulations we employ an energy detection scheme
[13] [14], where the FC sums the locally estimated received
signal energies from different SUs and then makes the global
decision about the availability of the subband. Also, any other
detector such as the cyclostationarity based detector of [3]
could be applied. The number of samples used for sensing
by each SU is 50. The SUs’ mean SNRs w.r.t. different PUs
(subbands) are assumed to be normally distributed with 12 dB
standard deviation and mean −3 dB (see Fig. 5). This scenario
approximates a situation where the SUs are shadowed by
different large objects from different primary signals causing
them to have different mean SNRs. This long term fading
effect provides the environment from where the SU network
then learns which SUs should be sensing which subband. At
each sensing instant the SUs are experiencing i.i.d. block
Rayleigh fading, which will cause randomness to how the
spectrum is observed by different SUs. This effect is mitigated
by setting D = 2.
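The fusion step of the energy detector can be sketched as below: each SU reports the energy of its samples and the FC compares the sum with a threshold. The threshold value and the sample values are hypothetical, since the paper does not state how the threshold was set, and the function names are introduced here.

```python
def local_energy(samples):
    """Energy estimate of one SU: sum of squared magnitudes of its samples."""
    return sum(abs(x) ** 2 for x in samples)

def fc_decision(local_energies, threshold):
    """Sum the reported energies at the FC; 1 = occupied, 0 = vacant."""
    return 1 if sum(local_energies) > threshold else 0

# Two SUs (D = 2) with a handful of made-up complex baseband samples each.
su_a = [0.1 + 0.2j, -0.3 + 0.1j, 0.2 - 0.1j]
su_b = [1.0 + 1.0j, -0.9 + 1.1j, 1.2 - 0.8j]
energies = [local_energy(su_a), local_energy(su_b)]
decision = fc_decision(energies, threshold=3.0)   # threshold is an assumption
```

In practice the threshold would be chosen for a target false alarm probability given the noise power and the number of samples, which is 50 per SU in the simulations above.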
Figure 4. The Markovian model used for simulating the availability of each subband. State 0 means that the subband is free and state 1 that the subband is occupied by a primary user. The state transition probabilities used in the simulations are P00 = P11 = 0.9 and P01 = P10 = 0.1.
Figure 5. Histogram of the SUs’ mean SNRs over all subbands. Horizontal axis: SNR (dB); vertical axis: number of SUs.
Fig. 6 shows the effect of the choice of ε on the convergence
speed of the obtained throughput relative to the ideal sensing
policy. The ideal sensing policy would be the one that senses
at each sensing instant a maximum of 3 free subbands that
would provide the highest throughputs. It can be seen how
the choice of ε (eagerness of exploration vs. exploitation)
affects the convergence speed and the steady state value
of the throughput. For example, using the ε-greedy action
selection with ε = 0.1 increases the converged throughput
approximately 2.5-fold compared to the case without learning
(ε = 1).
In the second stage of the sensing policy we apply the strat-
egy where the reward is a function of the local signal power
(according to eq. (4)). Fig. 7 shows the effect of different ε on
the resulting convergence speed of the probability of missed
detection (PM ). For ε = 0.1 the steady state probability of
missed detection goes down from 0.18 to 0.05 compared to
the case with no learning (ε = 1).
Figure 6. Obtained throughputs (% of the ideal policy versus sensing instance) for the ε-greedy action selection with ε = 0.5, 0.3 and 0.1 and for the deterministic pseudorandom sensing policy (ε = 1). The step size parameters are α1 = α2 = 0.01.
Figure 7. Convergence speed of the probability of missed detection (PM) versus sensing instance for the ε-greedy action selection with ε = 0.5, 0.3 and 0.1 and for the deterministic pseudorandom sensing policy (ε = 1). The step size parameters are α1 = α2 = 0.01.
VI. DISCUSSION
In the proposed policy the actions are selected after a fixed
number of sensing instances. During this time the sensing
assignment stays fixed. The length of this action selection
period affects the system's ability to react to quick changes
in the environment, and also the amount of data transmitted
in the secondary network's control channel. When the spectrum
availability and fading statistics are changing slowly enough,
the FC needs to transmit the codes to the SUs only when
the sensing assignment has changed from the previous action
selection stage. Conversely, in an extremely rapidly changing
environment, an attempt to learn the statistics may be
pointless, and the fixed pseudorandom sensing
policy should be applied instead.
So far there is no learning involved in selecting the diversity
D for a particular subband to be as small as possible. Selecting
the smallest sufficient D for all subbands would ensure that
the sensing resources are not wasted, for example by assigning
a SU to sense a subband that can be already sensed well by
other SUs. However, the incorporation of this kind of selection
of D into the proposed sensing policy is out of the scope of
this paper.
VII. CONCLUSIONS
In this paper a reinforcement learning based cooperative
multiband sensing policy has been proposed. The proposed
sensing policy employs, in addition to the deterministic pseu-
dorandom sensing policy, the ε-greedy action selection to
dynamically select the most promising bands to be sensed and
arranges the best sensing constellations for those subbands.
The simulation results show that the proposed policy is able
to provide significant gains in the throughput obtained by the
secondary network and to reduce the interference caused to the
primary systems by missed detections. To combat shadowing
and fast fading the policy guarantees a desired diversity, i.e.,
the number D of secondary users simultaneously sensing a
particular subband.
REFERENCES
[1] D. Cabric, S. M. Mishra, and R. W. Brodersen, “Implementation issues in spectrum sensing for cognitive radios,” in Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, Nov. 2004, vol. 1, pp. 772–776.
[2] J. Mitola III, Cognitive Radio: An Integrated Agent Architecture for Software Defined Radio, Ph.D. thesis, KTH, Stockholm, Sweden, 2000.
[3] J. Lundén, V. Koivunen, A. Huttunen, and H. V. Poor, “Collaborative cyclostationary spectrum sensing for cognitive radio systems,” IEEE Transactions on Signal Processing, vol. 57, no. 11, pp. 4182–4195, Nov. 2009.
[4] Q. Zhao, B. Krishnamachari, and K. Liu, “On myopic sensing for multi-channel opportunistic access: structure, optimality and performance,” IEEE Transactions on Wireless Communications, pp. 5431–5440, Dec. 2008.
[5] Q. Zhao, L. Tong, A. Swami, and Y. Chen, “Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework,” IEEE Journal on Selected Areas in Communications, vol. 25, no. 3, pp. 589–600, Apr. 2007.
[6] S. Filippi, O. Cappé, F. Clérot, and E. Moulines, “A near optimal policy for channel allocation in cognitive radio,” in Recent Advances in Reinforcement Learning: 8th European Workshop (EWRL 2008), July 2008, pp. 69–81.
[7] U. Berthold, F. Fu, M. van der Schaar, and F. Jondral, “Detection of Spectral Resources in Cognitive Radios Using Reinforcement Learning,” in IEEE DySPAN, Oct. 2008, pp. 1–5.
[8] J. Oksanen, V. Koivunen, J. Lundén, and A. Huttunen, “Diversity-based spectrum sensing policy for detecting primary signals over multiple frequency bands,” to appear in the IEEE ICASSP Conference, Dallas, Texas, Mar. 2010.
[9] A. A. Shaar and P. A. Davies, “Prime Sequences: Quasi-Optimal Sequences For OR Channel Code Division Multiplexing,” IEE Electronics Letters, vol. 19, no. 21, pp. 888–890, Oct. 1983.
[10] C. J. Colbourn and J. H. Dinitz, The CRC handbook of combinatorial designs, CRC Press, 1996, 753 p.
[11] “Cliquer,” http://users.tkk.fi/pat/cliquer.html.
[12] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, Cambridge, MA: MIT Press, 1998.
[13] H. Urkowitz, “Energy detection of unknown deterministic signals,” Proceedings of the IEEE, vol. 55, no. 4, pp. 523–531, Apr. 1967.
[14] F. F. Digham, M. S. Alouini, and M. K. Simon, “On the energy detection of unknown signals over fading channels,” in IEEE Conference on Communications, May 2003, vol. 5, pp. 3575–3579.