Channel Hopping as a Bandit Problem
June 25, 2023
Every production WiFi monitoring tool hops channels the same way it did in 2006: cycle through a fixed list, dwell for 250ms, repeat. Can we do better by treating channel selection as a learning problem?
# What Production Tools Do
airodump-ng hops through an interleaved sequence {1, 7, 13, 2, 8, 3, 14, 9, 4, 10, 5, 11, 6, 12} with a fixed 250ms dwell. The interleaving maximizes frequency distance between consecutive hops -- channel 1 at 2412 MHz to channel 7 at 2442 MHz is a 30 MHz gap, well past the 22 MHz channel width. No adaptation.
Kismet hops at 5 Hz (200ms dwell), reasoning from the Nyquist theorem: APs beacon at ~10/sec, so 5 hops/sec captures at least one beacon per AP. It supports manual per-channel weights (channellist=1:3,6:3,11:3,2,3,4,5,7,8,9,10 to spend 3x time on the three non-overlapping 2.4 GHz channels) and splits channel lists across multiple interfaces automatically.
WiFi Explorer Pro is the only commercial tool with adaptive behavior: it halves dwell time to 60ms on channels where no 802.11 activity was detected in the last scan.
tshark/dumpcap does basic round-robin through a channel list with configurable dwell. No adaptation, no interleaving. The typical invocation cycles 1 through 11 sequentially.
None uses a per-hop learning algorithm. Pwnagotchi uses A2C reinforcement learning to select which channels to include in its hop set -- but it optimizes at the epoch level (which channel set to use), not per-hop (which channel next). The bandit formulation below addresses the per-hop decision. The gap between what practitioners have asked for (airodump-ng issue #119, 2007: "dwell longer on channels with traffic") and what tools provide has been open for nearly two decades.
# Background
# 802.11 Channel Architecture
The 2.4 GHz band has 14 channels (11 in the US), each 22 MHz wide, spaced 5 MHz apart. Only channels 1, 6, and 11 are non-overlapping -- the rest bleed into their neighbors. This overlap is not a minor technicality; it's the central complication in any channel selection scheme.
The 5 GHz band is wider and cleaner: channels are 20 MHz wide with no overlap at the base width. UNII-1 through UNII-3 give roughly 25 non-overlapping channels. DFS (Dynamic Frequency Selection) channels add another ~15 but require radar detection, which complicates monitor mode.
WiFi 6E and 7 opened the 6 GHz band: 59 new 20 MHz channels in the US (the full 1200 MHz from 5925 to 7125 MHz). This changes the problem's character entirely. With 14 channels, a static schedule visits each one every 3.5 seconds. With 59 channels, the same schedule takes 14.75 seconds -- long enough to miss transient devices, short-lived connections, and bursty traffic patterns.
# Monitor Mode and Passive Capture
A wireless NIC in monitor mode receives raw 802.11 frames without associating to any access point. The kernel exposes each frame with a radiotap header containing metadata: signal strength, data rate, channel, and flags.
```bash
iw dev wlan0 set channel 6   # tune to channel 6
sleep 0.25                   # dwell for 250ms
iw dev wlan0 set channel 1   # hop to channel 1
```
During each dwell, we observe some number of packets. Between dwells, the channel switch costs 15--60ms of dead time on typical hardware. USB chipsets are worse: the MT7612U averages 450ms per switch, and even single-band adapters like the RT5372 hit 20ms(1). These switching costs are real -- at 250ms dwell with 100ms switching overhead, you lose 29% of observation time to transitions.
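The overhead arithmetic is worth making concrete. A quick sketch using the numbers above (function name mine):

```python
def observation_efficiency(dwell_ms: float, switch_ms: float) -> float:
    """Fraction of wall-clock time spent actually capturing packets."""
    return dwell_ms / (dwell_ms + switch_ms)

# 250ms dwell with 100ms switching overhead: ~29% of time lost to transitions
print(f"{1 - observation_efficiency(250, 100):.1%} lost")  # 28.6% lost
# A slow USB adapter (450ms per switch) loses more time than it observes
print(f"{1 - observation_efficiency(250, 450):.1%} lost")  # 64.3% lost
```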
The question: which channel should we hop to next, and for how long?
# The Multi-Armed Bandit Framework
The multi-armed bandit (MAB) is the simplest formalization of the explore-exploit tradeoff. You face $K$ slot machines (arms), each with an unknown reward distribution. At each time step you pull one arm and observe a reward. The goal is to maximize cumulative reward, or equivalently, minimize regret: the gap between what you earned and what you would have earned always pulling the best arm.
Lai and Robbins (1985) established the fundamental lower bound: for any consistent policy, the expected number of times a suboptimal arm $i$ is pulled must grow at least as $\ln T / d(\lambda_i, \lambda^*)$ , where $d$ is the KL divergence between the suboptimal and optimal arm distributions(2). The intuition: the KL divergence $d(\lambda_i, \lambda^*)$ measures how many samples you need to statistically distinguish arm $i$ from the best arm. Arms that are nearly as good as the best require many pulls before you can confidently stop pulling them, so they contribute more regret. No algorithm can do better than logarithmic regret. The algorithms below differ in how tightly they approach this bound and how they handle the practical complications of WiFi monitoring.
# Modeling as a Bandit
We have $K$ channels. At each time step $t$ , we pick a channel $a_t \in \{1, \ldots, K\}$ , dwell, and observe a reward $r_{a_t}^{(t)}$ -- the number of packets captured.
The standard MAB goal is to minimize cumulative regret: the gap between what we earned and what we would have earned always picking the best channel:

$$R_T = \sum_{t=1}^{T} \left( \mathbb{E}[r_{a^*}] - \mathbb{E}[r_{a_t}] \right)$$

where $a^* = \arg\max_i \mathbb{E}[r_i]$ is the channel with highest expected packet rate.
# The mapping
| MAB concept | WiFi monitoring |
|---|---|
| Arm $i$ | Channel $i$ (e.g., 2.4 GHz channel 6) |
| Pull arm $i$ | Dwell on channel $i$ for one interval |
| Reward $r_i$ | Number of packets captured during dwell |
| Reward distribution | $\text{Poisson}(\lambda_i)$ -- independent frame arrivals |
| Best arm $a^*$ | Channel with highest traffic rate |
| Regret | Packets missed by not dwelling on the best channel |
| Switching cost | Dead time during channel change (15--450ms) |
| Non-stationarity | Devices join/leave, traffic shifts over hours |
# Assumptions and Where They Break
The standard stochastic MAB assumes independent arms with stationary reward distributions. Both assumptions are violated in WiFi monitoring:
Stationarity. Traffic is bursty. Devices join and leave the network. A channel that was quiet at 2 AM will be busy at 10 AM. The Poisson rate $\lambda_i$ changes over time, which means an algorithm that converges to a single channel will eventually be wrong.
Independence. Dwelling on channel 6 picks up packets from channels 5 and 7 due to spectral overlap. The reward from one arm depends on the true state of adjacent arms. This violates the independent-arms assumption and biases naive estimators.
Both assumptions break in practice; the sections below address each in turn.
# A Complication: Cross-Channel Leakage
In practice, dwelling on channel $i$ picks up some packets from adjacent channels $j$ due to spectral overlap. If we observe counts $\{o_{ij}\}$ while dwelling on channel $i$ , the naive reward counts everything:

$$r_i = \sum_j o_{ij}$$

This leakage means channels aren't truly independent arms. We'll return to this in the leakage section below.
# Algorithms
# Uniform Hopping
Visit each channel in round-robin order with equal dwell time. This is what airodump-ng does (modulo the interleaving order).
If channel $a^*$ has rate $\lambda^*$ and the average rate across channels is $\bar{\lambda}$ , uniform hopping incurs linear regret:

$$R_T = T \left( \lambda^* - \bar{\lambda} \right)$$
Regret grows linearly with time because we never stop exploring channels with low traffic. The advantage is coverage: every channel gets visited, so no device goes undetected.
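The linear-regret claim is just arithmetic over the rate gap. A sketch with hypothetical per-channel rates:

```python
def uniform_regret(rates, T):
    """Expected cumulative regret of equal-time round-robin hopping:
    R_T = T * (lambda_star - mean(lambda))."""
    return T * (max(rates) - sum(rates) / len(rates))

rates = [50, 10, 5, 2, 1]           # packets/dwell, hypothetical
print(uniform_regret(rates, 1000))  # grows linearly: doubling T doubles regret
```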
# Epsilon-Greedy
The simplest adaptive policy: with probability $\varepsilon$ , pick a random channel (explore); otherwise, pick the channel with highest observed mean (exploit).
With fixed $\varepsilon$ , regret is linear: $R_T = O(\varepsilon T)$ since the exploration fraction never decays. With a decaying schedule $\varepsilon_t = \min(1, cK / (d^2 t))$ for suitable constants $c, d$, the regret becomes $O(K \ln T)$ (3).
Epsilon-greedy is easy to implement and understand. Its weakness is that exploration is undirected -- it wastes pulls on channels that are clearly bad. UCB1 and Thompson Sampling fix this by exploring where uncertainty is highest.
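A minimal epsilon-greedy hopper with the decaying schedule (class name and default constants are illustrative):

```python
import random

class EpsilonGreedyHopper:
    """Epsilon-greedy channel selection with the decaying schedule
    eps_t = min(1, cK / (d^2 t)). Constants c, d are tuning knobs."""
    def __init__(self, K, c=1.0, d=1.0):
        self.K, self.c, self.d = K, c, d
        self.counts = [0] * K
        self.means = [0.0] * K
        self.t = 0

    def select(self):
        self.t += 1
        eps = min(1.0, self.c * self.K / (self.d ** 2 * self.t))
        if random.random() < eps:
            return random.randrange(self.K)                     # explore
        return max(range(self.K), key=lambda i: self.means[i])  # exploit

    def update(self, channel, packets):
        self.counts[channel] += 1
        # incremental running mean of observed packet counts
        self.means[channel] += (packets - self.means[channel]) / self.counts[channel]
```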
# UCB1
Upper Confidence Bound selects the channel that maximizes the sample mean plus an exploration bonus:

$$a_t = \arg\max_i \left[ \hat{\lambda}_i + \sqrt{\frac{2 \ln t}{n_i}} \right]$$

where $\hat{\lambda}_i$ is the average packet count on channel $i$ and $n_i$ is the number of times we've dwelled there.
The exploration term $\sqrt{2 \ln t / n_i}$ shrinks as a channel is visited more, so UCB1 naturally shifts from exploration to exploitation. The intuition: we're optimistic in the face of uncertainty. A channel we haven't visited much gets a large bonus, so we'll try it. Once we've seen enough to know it's bad, the bonus can't save it.
The regret bound comes from applying Hoeffding's inequality to the confidence width. For sub-Gaussian rewards (which includes bounded packet counts):

$$R_T \le \sum_{i:\, \Delta_i > 0} \left( \frac{8 \ln T}{\Delta_i} + O(1) \right)$$

where $\Delta_i = \lambda^* - \lambda_i$ is the gap between channel $i$ and the best channel(3).
For the 14-channel simulation with $T = 10{,}000$ steps and gaps $\Delta_i$ ranging from 5 to 40 packets/dwell, the dominant term $\sum 8 \ln T / \Delta_i$ evaluates to roughly 150 -- total regret of 150 packets out of a potential 500,000 (the best channel producing ~50 packets/dwell for 10,000 steps). That is 0.03% loss from the optimal policy, compared to uniform hopping's ~40% loss.
Why $\ln T / \Delta_i$ ? The UCB selects arm $i$ only while $\hat{\lambda}_i + \sqrt{2 \ln t / n_i} \ge \hat{\lambda}^*$ , which (by Hoeffding) happens with probability $\exp(-n_i \Delta_i^2 / 2)$ . Setting this equal to $1/t$ and solving gives $n_i \approx 2 \ln t / \Delta_i^2$ pulls before the bound tightens enough to exclude arm $i$ . Each of those pulls costs $\Delta_i$ regret, giving total contribution $\approx 2 \ln T / \Delta_i$ . This is logarithmic in $T$ -- the key improvement over uniform.
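The index computation is a few lines. A sketch (class name mine; note UCB1's analysis assumes bounded rewards, so raw packet counts would need scaling in practice):

```python
import math

class UCB1Hopper:
    """UCB1 index policy: sample mean + sqrt(2 ln t / n_i).
    Plays each channel once before trusting the index."""
    def __init__(self, K):
        self.K = K
        self.counts = [0] * K
        self.means = [0.0] * K
        self.t = 0

    def select(self):
        self.t += 1
        for i in range(self.K):          # initialization: one pull per arm
            if self.counts[i] == 0:
                return i
        ucb = [self.means[i] + math.sqrt(2 * math.log(self.t) / self.counts[i])
               for i in range(self.K)]
        return max(range(self.K), key=lambda i: ucb[i])

    def update(self, channel, packets):
        self.counts[channel] += 1
        self.means[channel] += (packets - self.means[channel]) / self.counts[channel]
```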
# KL-UCB
For Poisson-distributed packet counts, KL-UCB gives tighter bounds by exploiting the distribution structure. It selects the channel maximizing $\lambda$ subject to $n_i \cdot d(\hat{\lambda}_i, \lambda) \le \ln t$ , where $d$ is the Poisson KL divergence:

$$d(\lambda, \mu) = \mu - \lambda + \lambda \ln \frac{\lambda}{\mu}$$
This matches the Lai-Robbins lower bound asymptotically(4). The improvement over UCB1 is not just theoretical: when packet counts follow a Poisson distribution (a reasonable model for independent frame arrivals), KL-UCB converges to the optimal channel in fewer samples because the Poisson KL divergence is tighter than the sub-Gaussian bound that UCB1 uses.
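The KL-UCB index has no closed form, but the Poisson KL is increasing in its second argument above the mean, so bisection finds it quickly. A sketch (function names mine):

```python
import math

def poisson_kl(lam, mu):
    """KL divergence d(Poisson(lam) || Poisson(mu))."""
    if lam == 0:
        return mu
    return mu - lam + lam * math.log(lam / mu)

def kl_ucb_index(mean, n, t, iters=50):
    """Largest mu satisfying n * d(mean, mu) <= ln t, via bisection."""
    target = math.log(max(t, 2)) / n
    lo, hi = mean, mean + max(1.0, mean)
    while poisson_kl(mean, hi) < target:   # expand until the bracket covers target
        hi *= 2
    for _ in range(iters):
        mid = (lo + hi) / 2
        if poisson_kl(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```

More dwells (larger `n`) shrink the index toward the sample mean, just as the UCB1 bonus shrinks with $n_i$.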
# Thompson Sampling
Rather than constructing a confidence bound, Thompson Sampling maintains a posterior distribution over each channel's packet rate and samples from it.
Model packet counts as Poisson with unknown rate: $r_i \sim \text{Poisson}(\lambda_i)$ . The conjugate prior for a Poisson rate is the Gamma distribution:

$$\lambda_i \sim \text{Gamma}(\alpha_i, \beta_i), \qquad p(\lambda_i) \propto \lambda_i^{\alpha_i - 1} e^{-\beta_i \lambda_i}$$
Initialize with a weak prior: $\alpha_i = 1, \beta_i = 1$ for all channels (prior mean of 1 packet per dwell).
At each step:
- Sample $\tilde{\lambda}_i \sim \text{Gamma}(\alpha_i, \beta_i)$ for each channel
- Select $a_t = \arg\max_i \tilde{\lambda}_i$
- Dwell on channel $a_t$ , observe packet count $c$
- Update: $\alpha_{a_t} \leftarrow \alpha_{a_t} + c$ , $\beta_{a_t} \leftarrow \beta_{a_t} + 1$
The posterior mean $\alpha_i / \beta_i$ converges to the true rate $\lambda_i$ . Channels with uncertain estimates (wide posterior) are sampled from more variably, which drives exploration. Channels with well-estimated low rates rarely produce a high sample, so they're naturally avoided.
The regret bound matches KL-UCB asymptotically(5):

$$R_T = O\left( \sum_{i:\, \Delta_i > 0} \frac{\Delta_i \ln T}{d(\lambda_i, \lambda^*)} \right)$$
Why does this match the Lai-Robbins lower bound? The posterior concentrates around the true rate at rate $1/\sqrt{n_i}$ , so the probability of sampling a rate higher than $\lambda^*$ from a suboptimal arm decays as $\exp(-n_i \cdot d(\lambda_i, \lambda^*))$ . The expected number of pulls before this probability becomes negligible is $\ln T / d(\lambda_i, \lambda^*)$ , matching the information-theoretic minimum.
In empirical evaluations, Thompson Sampling matches or beats UCB1 in cumulative reward and convergence speed(4). It's also simpler to implement: the update is two additions, and sampling from a Gamma distribution is a standard library call.
```python
import numpy as np

class ThompsonChannelHopper:
    def __init__(self, K):
        self.alpha = np.ones(K)  # Gamma shape
        self.beta = np.ones(K)   # Gamma rate

    def select(self):
        # numpy parameterizes Gamma by shape and scale = 1/rate
        samples = np.random.gamma(self.alpha, 1.0 / self.beta)
        return np.argmax(samples)

    def update(self, channel, packets):
        self.alpha[channel] += packets
        self.beta[channel] += 1
```
# The Diversity Problem
Pure regret minimization converges to a single channel -- the one with highest traffic. For passive monitoring, this is unacceptable: we need to see devices on all channels, not just the busiest one.
# Entropy Regularization
One approach is an entropy-regularized reward. Define a modified objective:

$$J(p) = (1 - \gamma) \sum_i p_i \lambda_i + \gamma\, H(p)$$

where $p = (p_1, \ldots, p_K)$ is the channel visit distribution (the fraction of time spent on each channel), $H(p) = -\sum_i p_i \ln p_i$ is its entropy, and $\gamma \in [0, 1]$ controls the diversity-exploitation tradeoff. At $\gamma = 0$ , the optimal policy concentrates on the best channel. At $\gamma = 1$ , the optimal policy is uniform.
This has theoretical grounding: Zimmert and Seldin (2021) showed that Tsallis entropy regularization with $\alpha = 1/2$ achieves $O(\sqrt{KT})$ regret in adversarial settings and $O(\ln T / \Delta)$ in stochastic settings simultaneously(6). In the WiFi context, this means the entropy bonus helps when traffic is non-stationary (devices joining and leaving) without sacrificing convergence when traffic is stable.
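For the Shannon-entropy objective above, setting the gradient to zero under the simplex constraint gives a closed-form maximizer: a softmax in the rates with temperature controlled by $\gamma$. A sketch (function name mine; in practice the rates would be learned posterior means):

```python
import math

def entropy_regularized_policy(rates, gamma):
    """Maximizer of (1 - gamma) * sum(p_i * lambda_i) + gamma * H(p).
    Lagrangian stationarity gives p_i proportional to
    exp((1 - gamma) * lambda_i / gamma)."""
    K = len(rates)
    if gamma >= 1.0:
        return [1.0 / K] * K                      # pure entropy: uniform
    if gamma <= 0.0:
        best = max(range(K), key=lambda i: rates[i])
        return [1.0 if i == best else 0.0 for i in range(K)]  # pure exploitation
    logits = [(1 - gamma) * r / gamma for r in rates]
    m = max(logits)                               # stabilize the exponentials
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Sweeping `gamma` from 0 to 1 smoothly interpolates between camping on the busiest channel and uniform hopping.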
# Minimum Dwell Floor
A simpler practical alternative: impose a minimum dwell floor. Visit every channel at least once per sweep (the Kismet model), then allocate remaining time proportional to learned rates. This gives coverage guarantees without modifying the reward.
Concretely: fix a sweep period of $P \ge K$ steps. In each period, dedicate one step to each channel (the "floor"), then allocate the remaining $P - K$ steps using Thompson Sampling. The coverage guarantee is exact: every channel gets at least $T/P$ visits. The regret cost of the floor is at most $K \cdot \Delta_{\max}$ per period -- constant overhead per sweep.
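A sketch of the floor scheduler, assuming a sweep period and any bandit policy exposing a `select()` method. The batch construction is illustrative -- a real loop would interleave each selection with its dwell and posterior update:

```python
import random

def floor_schedule(K, P, policy_select):
    """One sweep of P >= K steps: K guaranteed floor visits (one per
    channel, shuffled so they don't bunch up), then P - K steps chosen
    by the learned policy (e.g. a Thompson Sampling hopper's select)."""
    floor = list(range(K))
    random.shuffle(floor)
    return floor + [policy_select() for _ in range(P - K)]
```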
# Constrained Formulation
A more principled version: minimize regret subject to a minimum visit frequency per channel. This falls under the constrained bandit framework. Define $f_i = n_i / T$ as the fraction of time spent on channel $i$ . The constraint is $f_i \ge f_{\min}$ for all $i$ .
Badanidiyuru et al. (2013) give an algorithm for bandits with knapsack constraints that achieves $O(\sqrt{T})$ regret while satisfying the constraints(7). The WiFi application is a special case where the constraint is a lower bound on each arm's pull frequency.
# Non-Stationary Extensions
Real WiFi traffic is non-stationary. The Poisson rates $\lambda_i(t)$ drift as devices join and leave, as usage patterns shift through the day, and as interference from neighboring networks changes. An algorithm that converges to a fixed channel allocation will eventually be wrong.
# Sliding-Window UCB
Garivier and Moulines (2011) proposed Sliding-Window UCB (SW-UCB): instead of using all historical observations, compute the sample mean and confidence bound using only the last $\tau$ observations(8):

$$a_t = \arg\max_i \left[ \hat{\lambda}_{i,\tau} + \sqrt{\frac{2 \ln \min(t, \tau)}{n_{i,\tau}}} \right]$$

where $\hat{\lambda}_{i,\tau}$ and $n_{i,\tau}$ are the mean and count using only the window $[t-\tau, t]$ .
The window size $\tau$ controls the bias-variance tradeoff: a short window reacts quickly to changes but has high variance; a long window is stable but slow to adapt. For WiFi monitoring with traffic that shifts on the scale of minutes, $\tau$ in the range of 100--500 steps (25--125 seconds at 250ms dwell) is reasonable.
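A minimal SW-UCB sketch using a bounded deque as the window (class name mine; the current window length stands in for $\min(t, \tau)$ in the bonus):

```python
import math
from collections import deque

class SlidingWindowUCB:
    """UCB computed over only the last tau (channel, reward) observations."""
    def __init__(self, K, tau=200):
        self.K = K
        self.window = deque(maxlen=tau)   # old observations fall off automatically

    def select(self):
        t = len(self.window) + 1
        stats = {i: [0, 0.0] for i in range(self.K)}   # per-channel count, sum
        for ch, r in self.window:
            stats[ch][0] += 1
            stats[ch][1] += r
        best, best_idx = 0, -math.inf
        for i in range(self.K):
            n, s = stats[i]
            if n == 0:
                return i                   # not seen in the window: explore it
            idx = s / n + math.sqrt(2 * math.log(t) / n)
            if idx > best_idx:
                best, best_idx = i, idx
        return best

    def update(self, channel, packets):
        self.window.append((channel, packets))
```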
# Discounted Thompson Sampling
An alternative: instead of a hard window, apply exponential discounting to the Gamma posterior. Replace the update rule with:

$$\alpha_i \leftarrow \delta\, \alpha_i + c \cdot \mathbb{1}[a_t = i], \qquad \beta_i \leftarrow \delta\, \beta_i + \mathbb{1}[a_t = i]$$

where $\delta \in (0, 1)$ is the discount factor and $c$ is the observed packet count. This exponentially down-weights old observations, making the posterior responsive to recent traffic. At $\delta = 1$ , this reduces to standard Thompson Sampling. At $\delta = 0.99$ with 250ms dwell, the effective memory is roughly $1/(1-\delta) = 100$ steps or 25 seconds.
Raj and Kalyani (2017) analyzed discounted Thompson Sampling for piecewise-stationary environments, showing that the probability of selecting a suboptimal arm decays with the discount factor in a way that adapts to change points(9).
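A sketch of the discounted variant of the Gamma-Poisson hopper. The decay applies to every arm each step, so long-ignored channels drift back toward an uninformative posterior and get re-explored; the numerical floor is my addition to keep the Gamma parameters sane:

```python
import numpy as np

class DiscountedThompsonHopper:
    """Gamma-Poisson Thompson Sampling with exponential forgetting.
    delta=1.0 recovers the standard stationary hopper."""
    def __init__(self, K, delta=0.99):
        self.delta = delta
        self.alpha = np.ones(K)   # Gamma shape
        self.beta = np.ones(K)    # Gamma rate

    def select(self):
        samples = np.random.gamma(self.alpha, 1.0 / self.beta)
        return int(np.argmax(samples))

    def update(self, channel, packets):
        self.alpha *= self.delta            # decay every arm each step
        self.beta *= self.delta
        self.alpha[channel] += packets
        self.beta[channel] += 1.0
        # floor so long-unvisited arms stay numerically well-defined
        np.maximum(self.alpha, 1e-3, out=self.alpha)
        np.maximum(self.beta, 1e-3, out=self.beta)
```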
# Accounting for Cross-Channel Leakage
Consider three 2.4 GHz channels: 1, 6, and 11. While dwelling on channel 6, you might observe 40 packets from channel 6, 8 from channel 5, and 3 from channel 7. The "leakage" weights $w_{ij}$ capture this: $w_{6,6} \approx 1.0$ , $w_{6,5} \approx 0.2$ , $w_{6,7} \approx 0.07$ . The weights decay roughly with frequency distance.
In general, dwelling on channel $i$ gives us counts $o_{ij}$ from each source channel $j$ . The total reward is $r_i = \sum_j o_{ij} = \sum_j w_{ij} x_j$ , where $x_j \sim \text{Poisson}(\lambda_j)$ is the true traffic rate on channel $j$ and $w_{ij}$ is the overlap weight.
The simplest correction: if the overlap weight matrix $W$ is known (and for the 2.4 GHz band, it is approximately determined by the channel frequency layout -- 22 MHz channels spaced 5 MHz apart give a predictable overlap function), then deconvolve before updating. Stacking the per-dwell observations gives a linear mixture $\mathbf{o} = W \boldsymbol{\lambda}$ ; solving $\hat{\boldsymbol{\lambda}} = W^{-1} \mathbf{o}$ (or a regularized variant, since $W$ is ill-conditioned for closely-spaced channels) recovers estimates of the per-channel rates. In the common case where only the three non-overlapping channels carry most traffic (1, 6, 11 in 2.4 GHz), $W$ is nearly diagonal and the correction reduces to simple rescaling. Update the Gamma posterior on each $\lambda_j$ with the deconvolved count. This is cheap and effective when the geometry is stable.
When the weights are not known (indoor multipath, non-line-of-sight), they must be learned jointly with the rates(10). In practice, the known-geometry approximation is usually sufficient for the 2.4 GHz and 5 GHz bands.
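Under the known-geometry assumption, the deconvolution is a few lines of linear algebra. A sketch with a hypothetical nearly-diagonal $W$ for channels 1/6/11; the ridge term handles ill-conditioning, and the clip enforces non-negative rates:

```python
import numpy as np

def deconvolve_counts(W, observed, ridge=1e-2):
    """Recover per-channel rate estimates from leakage-mixed counts.
    Solves ridge-regularized least squares (W can be ill-conditioned),
    then clips at zero since rates cannot be negative."""
    K = W.shape[0]
    A = W.T @ W + ridge * np.eye(K)
    est = np.linalg.solve(A, W.T @ observed)
    return np.clip(est, 0.0, None)

# Hypothetical overlap matrix for channels 1, 6, 11: nearly diagonal
W = np.array([[1.0, 0.05, 0.0],
              [0.05, 1.0, 0.05],
              [0.0, 0.05, 1.0]])
true_rates = np.array([40.0, 10.0, 2.0])
observed = W @ true_rates          # noiseless mixture for illustration
print(deconvolve_counts(W, observed).round(1))
```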
# Related Work
The channel hopping problem sits at the intersection of cognitive radio (bandit-based channel selection for transmitting, not monitoring(11)(12)), network measurement (coverage metrics for passive scanning(13)), and restless bandits (channels evolve while unobserved(14)). None of these communities has applied learning algorithms to passive WiFi monitoring specifically.
# Regret Comparison
In the high-rate Poisson regime of the 14-channel simulation above, UCB1 outperforms Thompson Sampling because the Poisson variance scales with the mean -- the posterior updates for high-rate channels are noisier, so TS explores the top few channels longer before committing. Both achieve sublinear regret; both vastly outperform uniform hopping.
# What breaks in deployment
The simulation above is clean: stationary rates, independent channels, no switching cost. Real WiFi monitoring breaks all three assumptions simultaneously. Delayed rewards (you don't know what you missed on channel 6 while you were on channel 1), non-stationary interference (a microwave oven turns on, flooding 2.4 GHz), and cold start (a new 6 GHz AP appears with no prior observations) all degrade convergence. Industry deployments of bandits (Spotify for recommendations, Netflix for artwork personalization) report similar gaps between simulation and production -- the common mitigations are contextual features (time of day, known AP locations), conservative exploration bonuses, and explicit warm-up phases. None of these have been tried for WiFi monitoring.
# Open Problems
6 GHz channel explosion. With 59 channels in 6 GHz alone (plus 14 in 2.4 GHz and ~25 in 5 GHz), the exploration budget is severe. A uniform sweep takes ~25 seconds. Contextual information (time of day, known AP locations, channel utilization reports from 802.11k/v/r) could cut the exploration phase from minutes to seconds. This pushes toward contextual bandits, where side information informs the channel selection policy.
Multi-interface parallelism. Kismet already supports splitting channel lists across multiple NICs. The multi-agent bandit formulation -- where $M$ monitors must coordinate which channels to observe simultaneously -- is well-studied theoretically (Anandkumar et al. 2011) but not implemented in any monitoring tool. With USB WiFi adapters at $15 each, running 4--8 monitors in parallel is practical, and the coordination question is open: should they explore independently, or share observations?
Multi-user collision. The formulation above assumes a single monitor. In practice, multiple independent monitors (or a monitor competing with active users) face a coordination problem: naive independent-UCB-per-agent causes all agents to cluster on the same "best" channel, degrading throughput through collisions. Bande and Veeravalli (2019)(17) showed that orthogonalization strategies -- where agents learn to avoid each other's channels -- are needed to achieve sublinear regret in the multi-user case.
Non-stationarity detection. Rather than always applying a sliding window (which adds a tuning parameter), a change-point detection algorithm could trigger re-exploration only when the traffic distribution shifts. Tartakovsky et al. (2014) give sequential change-point tests that could be layered on top of Thompson Sampling(15).
Switching cost. We've treated channel switches as having a fixed time cost, but they also have an information cost: during the switch, no packets are captured on any channel. The bandit-with-switching-costs formulation (Dekel et al. 2014(16)) penalizes frequent switches, which would favor longer dwell times and fewer hops. This is the formal version of the practitioner's intuition that "hopping too fast wastes time."
Integration with active scanning. Passive monitoring can be supplemented with probe requests to solicit responses from hidden APs. The decision of when to send a probe (which temporarily makes the monitor visible) is a separate explore-exploit tradeoff layered on top of the channel selection problem.
# Summary
| Method | Regret | Diversity | Complexity |
|---|---|---|---|
| Uniform | $O(T)$ -- linear | Full coverage | None |
| $\varepsilon$ -greedy (decaying) | $O(K \ln T)$ | Moderate | Sample mean |
| UCB1 | $O(\ln T / \Delta)$ | Decreasing | Sample mean + bonus |
| KL-UCB | $O(\ln T / d(\lambda, \lambda^*))$ | Decreasing | KL optimization |
| Thompson Sampling | $O(\ln T / d(\lambda, \lambda^*))$ | Decreasing | Gamma posterior sampling |
| TS + entropy | $O(\sqrt{KT})$ worst case | Tunable via $\gamma$ | Gamma + entropy term |
| Sliding-Window UCB | $O(\ln T / \Delta)$ per segment | Decreasing | Windowed mean + bonus |
| Discounted TS | $O(\ln T / d)$ per segment | Decreasing | Discounted Gamma |
The right choice depends on the monitoring goal. For intrusion detection (must see every device), uniform or entropy-regularized TS with high $\gamma$ preserves coverage. For traffic analysis (maximize total observed packets), pure Thompson Sampling with Gamma-Poisson conjugate is hard to beat. For environments where traffic patterns shift throughout the day, discounted TS or SW-UCB adapt without manual retuning.
Every production monitoring tool today uses the first row. The per-hop decision problem remains open.
# References
[1] USB-WiFi issue #376. Channel switching latency benchmarks across chipsets. github.com/morrownr/USB-WiFi
[2] Lai, T. L. & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4--22.
[3] Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235--256.
[4] Garivier, A. & Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. COLT.
[5] Agrawal, S. & Goyal, N. (2012). Further optimal regret bounds for Thompson Sampling. arXiv:1209.3353. Finite-time Poisson-Gamma bounds in: Jin, T. et al. (2022). Finite-time regret of Thompson Sampling algorithms for exponential family multi-armed bandits. NeurIPS.
[6] Zimmert, J. & Seldin, Y. (2021). Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. JMLR, 22(28), 1--28.
[7] Badanidiyuru, A., Kleinberg, R., & Slivkins, A. (2013). Bandits with knapsacks. FOCS, 207--216.
[8] Garivier, A. & Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. ALT, 174--188.
[9] Raj, V. & Kalyani, S. (2017). Taming non-stationary bandits: A Bayesian approach. arXiv:1707.09727.
[10] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian Data Analysis (3rd ed., p. 73). Chapman & Hall/CRC.
[11] Jouini, W., Ernst, D., Moy, C., & Palicot, J. (2010). Upper confidence bound based decision making strategies and dynamic spectrum access. ICC, 1--5.
[12] Anandkumar, A., Michael, N., Tang, A. K., & Swami, A. (2011). Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE JSAC, 29(4), 731--745.
[13] Schulman, A., Levin, D., & Spring, N. (2008). On the fidelity of 802.11 packet traces. PAM, 132--141.
[14] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A), 287--298.
[15] Tartakovsky, A., Nikiforov, I., & Basseville, M. (2014). Sequential Analysis: Hypothesis Testing and Changepoint Detection. Chapman & Hall/CRC.
[16] Dekel, O., Ding, J., Koren, T., & Peres, Y. (2014). Bandits with switching costs: $T^{2/3}$ regret. STOC, 459--467.
[17] Bande, M. & Veeravalli, V. V. (2019). Multi-user multi-armed bandits for uncoordinated spectrum access. ICASSP, 4509--4513.