{"version":"https://jsonfeed.org/version/1","title":"attobop.net","home_page_url":"https://attobop.net/","feed_url":"https://attobop.net/feed.json","description":"Technical blog","items":[{"id":"https://attobop.net/posts/region-embeddings/","url":"https://attobop.net/posts/region-embeddings/","title":"Why Regions, Not Points: Geometric Embeddings for Subsumption","content_html":"<p>&quot;Dog is-a Animal.&quot; This is a statement about sets: every dog is an animal, so the set of dogs is contained in the set of animals. If you want an embedding to capture this, the embedding of Dog should be <em>inside</em> the embedding of Animal in some geometric sense. Points have no interior, no volume, and no way to express that one concept is a subset of another — not just close, but <em>contained</em>. You need regions.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">The entire field follows from taking &quot;is-a&quot; literally as set containment.</span></p>\n<!--more-->\n<h2>Background</h2>\n<p><strong>Notation.</strong> Throughout this post: $m, M$ are box corners (min, max); $c, o $ are center and half-width (offset); $ d$ is the embedding dimension;  $\\beta$  is the Gumbel temperature;  $\\sqsubseteq$  is subsumption (&quot;is-a&quot;);  $\\sqcap$  is concept conjunction (&quot;and&quot;);  $P(B \\subseteq A)$  is containment probability.</p>\n<h3>Knowledge graph embeddings</h3>\n<p>A knowledge graph is a collection of triples $(h, r, t)$: head entity, relation, tail entity. (&quot;Berlin&quot;, capitalOf, &quot;Germany&quot;) or (&quot;Dog&quot;, isA, &quot;Animal&quot;). These graphs are large but sparse: 93.8% of people in <a href=\"https://en.wikipedia.org/wiki/Freebase_(database)\">Freebase</a> have no recorded birthplace, and 78.5% have no recorded nationality. The standard task is <em>link prediction</em>: given (Berlin, capitalOf, ?) or (?, isA, Animal), rank the missing entity.</p>\n<p>The dominant approaches since 2013 embed entities as vectors and score triples geometrically — either by treating relations as transformations (<code>TransE</code>, <code>RotatE</code>) or as bilinear forms like  $h^\\top M_r t$  where  $M_r$  is a learnable matrix per relation (<code>DistMult</code>, <code>ComplEx</code>).<span class=\"sidenote-ref\"></span><span class=\"sidenote\">The point-embedding models discussed here are implemented in <a href=\"https://github.com/arclabs561/tranz\">tranz</a> — named after the Trans* family (<code>TransE</code>, <code>TransR</code>, <code>TransH</code>, <code>TransD</code>), which all model relations as geometric operations on points: translations, projections into relation-specific subspaces, or hyperplane reflections. They differ in how they handle 1-to-N and N-to-N relations, but none can represent containment. The region embeddings that follow are in <a href=\"https://github.com/arclabs561/subsume\">subsume</a>, named for the subsumption relation ( $\\sqsubseteq$ ) these embeddings model.</span></p>\n<p><strong><code>TransE</code></strong> (<span id=\"cite2\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref2\">2</a>) — short for &quot;translating embeddings&quot; — models each relation as a translation: score a triple $(h, r, t)$ by  $\\|h + r - t\\|$ .<span class=\"sidenote-ref\"></span><span class=\"sidenote\">Lower is better. The choice of L1 vs L2 norm matters less than you'd expect.</span> If the triple is true, the head plus the relation vector should land near the tail. It is simple, scalable, and became the baseline most subsequent work is measured against, reaching competitive performance on FB15k (a standard link prediction benchmark derived from Freebase, with 15,000 entities) despite the simple scoring function.</p>\n<p>But TransE fails structurally on certain relation types. For 1-to-N relations like (Britain, hasCity, ?), every correct tail entity must satisfy  $h + r \\approx t$ , which forces London, Manchester, and Edinburgh to have nearly identical embeddings. The model can't distinguish cities that share a country. Symmetric relations are worse: if  $h + r = t$  and  $t + r = h$ , then  $r = 0$ , which collapses the relation entirely.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">The &quot;married to&quot; relation forces every married couple to share an embedding. Not ideal.</span></p>\n<p><strong><code>RotatE</code></strong> (<span id=\"cite7\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref7\">7</a>) — &quot;rotation embedding&quot; — fixes the symmetry problem by working in complex space: each relation is an element-wise rotation  $t = h \\circ r$  where  $|r_i| = 1$ . Symmetric relations get angle  $\\pi$  (rotation by 180 degrees is its own inverse). This handles symmetry, antisymmetry, inversion, and composition.</p>\n<p>But neither <code>TransE</code> nor <code>RotatE</code> can represent <em>containment</em>.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">Distance can approximate containment, but it can never encode it. &quot;Close to Animal&quot; is not the same as &quot;inside Animal.&quot;</span> Both embed entities as points. A point has no volume, no interior, no boundary. You can measure the distance between two points, but you can't ask whether one point is <em>inside</em> another.</p>\n<h3>What containment requires</h3>\n<p>Consider the subsumption hierarchy: Dog  $\\sqsubseteq$  Animal  $\\sqsubseteq$  LivingThing, where  $\\sqsubseteq$  (read &quot;is subsumed by&quot;) means every instance of the left concept is also an instance of the right. In set-theoretic terms,  $\\text{ext}(\\text{Dog}) \\subseteq \\text{ext}(\\text{Animal}) \\subseteq \\text{ext}(\\text{LivingThing})$ , where  $\\text{ext}(C)$  is the set of individuals that fall under concept  $C$ .</p>\n<p>An embedding that captures this needs three properties:</p>\n<ol>\n<li>\n<p><strong>Volume.</strong> More general concepts (Animal) should have larger representations than specific ones (Dog). &quot;Animal&quot; covers more ground than &quot;Dog.&quot;</p>\n</li>\n<li>\n<p><strong>Containment.</strong> The representation of Dog should be geometrically inside the representation of Animal. Not just &quot;close to&quot; — <em>inside</em>.</p>\n</li>\n<li>\n<p><strong>Intersection.</strong> The concepts &quot;Animal&quot; and &quot;Pet&quot; overlap (some animals are pets, some aren't). Their representations should intersect, and the intersection should itself be a valid representation — of the concept  $\\text{Animal} \\sqcap \\text{Pet}$ , whether or not that conjunction has a name in the ontology.</p>\n</li>\n</ol>\n<p>Points fail all three. They have no volume. No interior for containment. Two distinct points have empty intersection — there is no partial overlap.</p>\n<h3>Order embeddings: the first attempt</h3>\n<p>Vendrov et al. [<span id=\"cite3\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref3\">3</a>] had the right instinct: embed the partial order directly into the geometry. Their approach mapped concepts into the non-negative orthant  $\\mathbb{R}^d_+$  (the region where all coordinates are  $\\ge 0$ , the  $d$ -dimensional analogue of the first quadrant) and imposed the reverse product order:  $x$  is more general than  $y$  if  $x_i \\le y_i$  for every coordinate. The origin is the top element — the most general concept.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">The &quot;entity&quot; concept, containing everything, sits at zero. Specificity grows with coordinate magnitude.</span> Each point defines two cones: the cone extending toward the origin (smaller coordinates) contains its ancestors (more general concepts); the cone extending away from the origin (larger coordinates) contains its descendants (more specific concepts).</p>\n<figure>\n  <img src=\"https://attobop.net/posts/region-embeddings/order_cones.png\" alt=\"Order embeddings in the positive orthant: sibling cones overlap\" width=\"480\">\n  <figcaption> In order embeddings, each concept</figcaption>\n</figure>\n<p>This gets transitivity for free: if  $x \\le y$  and  $y \\le z$  coordinate-wise, then  $x \\le z$ . But the representation is still points, and the cones are unbounded. Vilnis et al. [<span id=\"cite4\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref4\">4</a>] proved the deeper problem: for <em>any</em> product probability measure  $p(x) = \\prod_i p_i(x_i)$  over the non-negative orthant, the covariance of the indicator functions (1 if a random point lands in the cone, 0 otherwise) of any two cones is non-negative —  $\\text{Cov}(\\mathbf{1}_{C_A}, \\mathbf{1}_{C_B}) \\ge 0$ . If &quot;Mammal&quot; and &quot;Reptile&quot; are both under &quot;Animal,&quot; their cones necessarily overlap — you cannot express that they are disjoint under any such measure. The geometry forces all concepts under a common ancestor to be positively correlated, which is wrong for siblings that should be mutually exclusive.</p>\n<p>This motivated boxes: a bounded region closed under intersection that admits a probabilistic containment score.</p>\n<h2>Box Embeddings</h2>\n<figure>\n  <img src=\"https://attobop.net/posts/region-embeddings/box_taxonomy.png\" alt=\"Box taxonomy showing containment, disjointness, and partial overlap\" width=\"550\">\n  <figcaption> All three properties in one geometry. Containment: Dog, Cat, Fish inside Animal (P=1). Disjointness: Dog and Cat do not overlap (P=0). Partial overlap: Pet (dashed) intersects Dog and Cat but not Fish. Hierarchy: LivingThing contains Animal contains Dog; LivingThing contains Plant contains Tree.</figcaption>\n</figure>\n<p>The idea of representing words as regions rather than points goes back to Erk [<span id=\"cite1\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref1\">1</a>], who embedded words as convex regions in vector space to model graded entailment.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">Erk's 2009 paper predates the deep learning era entirely. The idea of &quot;words as regions&quot; waited nearly a decade for the right optimization tools.</span> Vilnis et al. [<a href=\"https://attobop.net/posts/region-embeddings/#ref4\">4</a>] made this precise for knowledge graphs, proposing axis-aligned hyperrectangles — <em>boxes</em> — in  $\\mathbb{R}^d$ . A box  $B$  is parameterized by its minimum and maximum corners:</p>\n\n\n$$B = \\{x \\in \\mathbb{R}^d : m_i \\le x_i \\le M_i, \\; i = 1, \\ldots, d\\}$$\n\n<p>where  $m = (m_1, \\ldots, m_d)$  and  $M = (M_1, \\ldots, M_d)$  with  $m_i \\le M_i$  for each coordinate.</p>\n<p>This is the simplest region that supports all three properties. Volume is the product of side lengths:</p>\n\n\n$$\\text{Vol}(B) = \\prod_{i=1}^{d} (M_i - m_i)$$\n\n<p>Containment is coordinate-wise:  $A \\subseteq B$  if and only if  $m^B_i \\le m^A_i$  and  $M^A_i \\le M^B_i$  for all  $i$ . The intersection of two boxes is a box (or empty):</p>\n\n\n$$(A \\cap B)_i = [\\max(m^A_i, m^B_i), \\; \\min(M^A_i, M^B_i)]$$\n\n<p>The intersection is non-empty when  $\\max(m^A_i, m^B_i) \\le \\min(M^A_i, M^B_i)$  for every coordinate. If any coordinate has an empty interval, the boxes are disjoint. Disjoint boxes represent mutually exclusive concepts — the thing order embeddings could not express. And when boxes partially overlap, the intersection region is itself a box representing the conjunction: the box for &quot;Animal&quot; intersected with the box for &quot;Pet&quot; gives a box for &quot;things that are both animals and pets.&quot;</p>\n<p>The axis-alignment is a deliberate restriction.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">Why not balls? The intersection of two balls is not a ball. Boxes are the simplest shape where &quot;overlap&quot; stays in the family.</span> Rotated boxes would be more expressive, but their intersection is not necessarily a rotated box — you lose closure under intersection. Axis-aligned boxes form a lattice — a structure where any two elements have a greatest lower bound (intersection/meet) and a least upper bound (bounding box/join) — mirroring concept hierarchies in description logics (formal languages for representing ontologies; see the EL⁺⁺ section below). Each dimension contributes an independent &quot;vote&quot; on containment — the box analogue of diagonal covariance in Gaussian models.</p>\n<h3>Containment probability</h3>\n<p>The natural scoring function for subsumption is the <em>containment probability</em>:</p>\n\n\n$$P(B \\subseteq A) = \\frac{\\text{Vol}(A \\cap B)}{\\text{Vol}(B)}$$\n\n<p>Concretely: draw a point uniformly from  $B$ ; this is the probability it lands in  $A$ .<span class=\"sidenote-ref\"></span><span class=\"sidenote\">This uniform-measure interpretation is what makes the  $P(\\cdot)$  notation honest. Without it,  $P(B \\subseteq A)$  applies probability notation to a deterministic set-theoretic predicate — either  $B$  is inside  $A$  or it isn't. The uniform draw gives it a genuine probabilistic meaning.</span> It returns 1 if  $B$  is entirely inside  $A$ , 0 if they are disjoint, and a value between 0 and 1 for partial overlap. If we want to express &quot;Dog is-a Animal,&quot; we train so that  $P(\\text{Dog} \\subseteq \\text{Animal}) \\approx 1$ .</p>\n<p>The asymmetry is important:  $P(B \\subseteq A) \\ne P(A \\subseteq B)$  in general, because the denominator is always the volume of the thing being tested for containment. A small box inside a large box gives  $P = 1$  in one direction and  $P \\ll 1$  in the other. This matches the semantics: every dog is an animal, but not every animal is a dog.</p>\n<h3>A worked example</h3>\n<p>Take  $d = 2$ . Let Animal  $= [0, 10] \\times [0, 10]$  and Dog  $= [2, 5] \\times [3, 7]$ .</p>\n\n\n$$\\text{Vol}(\\text{Animal}) = 10 \\times 10 = 100$$\n\n\n\n$$\\text{Vol}(\\text{Dog}) = 3 \\times 4 = 12$$\n\n<p>The intersection is Dog itself (Dog is inside Animal), so  $\\text{Vol}(\\text{Animal} \\cap \\text{Dog}) = 12$ .</p>\n\n\n$$P(\\text{Dog} \\subseteq \\text{Animal}) = \\frac{12}{12} = 1$$\n\n\n\n$$P(\\text{Animal} \\subseteq \\text{Dog}) = \\frac{12}{100} = 0.12$$\n\n<p>Now add Cat  $= [6, 9] \\times [3, 7]$ . Dog and Cat are disjoint (no overlap in the first coordinate: Dog's $[2, 5]$ doesn't intersect Cat's $[6, 9]$), so  $P(\\text{Dog} \\subseteq \\text{Cat}) = 0$ . Both are contained in Animal. The geometry directly encodes the taxonomy.</p>\n<pre><code class=\"language-python\">import numpy as np\n\ndef containment_prob(a_min, a_max, b_min, b_max):\n    &quot;&quot;&quot;P(B subset A): fraction of B's volume inside A.&quot;&quot;&quot;\n    inter_min = np.maximum(a_min, b_min)\n    inter_max = np.minimum(a_max, b_max)\n    inter_sides = np.maximum(inter_max - inter_min, 0)\n    b_sides = np.maximum(b_max - b_min, 1e-10)\n    return np.prod(inter_sides) / np.prod(b_sides)\n\nanimal = (np.array([0, 0]), np.array([10, 10]))\ndog    = (np.array([2, 3]), np.array([5, 7]))\ncat    = (np.array([6, 3]), np.array([9, 7]))\n\nprint(f&quot;P(Dog ⊆ Animal) = {containment_prob(*animal, *dog)}&quot;)   # 1.0\nprint(f&quot;P(Animal ⊆ Dog) = {containment_prob(*dog, *animal)}&quot;)   # 0.12\nprint(f&quot;P(Dog ⊆ Cat)    = {containment_prob(*cat, *dog)}&quot;)      # 0.0\n</code></pre>\n<p>In high dimensions, computing the raw volume  $\\prod_i (M_i - m_i)$  numerically is unstable — floating-point underflow to zero occurs well before the full product is computed. The fix is to work in log-space:  $\\log \\text{Vol}(B) = \\sum_{i=1}^{d} \\log(M_i - m_i)$ . The product becomes a sum, and everything stays tractable. All practical implementations operate on log-volumes, not volumes.</p>\n<h2>The Gradient Problem</h2>\n<p>Hard boxes have a serious training defect.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">This section is why the Gumbel paper exists. The gradient problem is not a minor inconvenience — it makes naive box training fail completely.</span> Consider two disjoint boxes — say Dog  $= [2, 5] \\times [3, 7]$  and Fish  $= [20, 25] \\times [30, 35]$ . Their intersection volume is zero. If we move Dog slightly — say shift it by  $\\epsilon$  in any direction — the intersection is still zero. The loss doesn't change. The gradient is zero.</p>\n<p>This is the <em>local identifiability problem</em> — the formal name for the gradient problem in box embeddings.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">The name comes from statistics: a parameter is locally identifiable if small perturbations change the likelihood. Here, they don't. Dasgupta et al. (2020) use &quot;local identifiability&quot; as the motivation for Gumbel boxes.</span> When boxes are disjoint, the containment probability is identically zero in a neighborhood of the current parameters. The optimizer receives no signal about <em>which direction</em> to move to bring the boxes closer.</p>\n<p>The problem gets worse in high dimensions. In  $\\mathbb{R}^d$ , two boxes are disjoint if <em>any single coordinate</em> has non-overlapping intervals. With  $d = 200$  and random initialization, the probability that two boxes overlap in all 200 coordinates simultaneously is negligible. Almost every pair starts disjoint, and the optimizer is blind to almost every relationship it needs to learn.</p>\n<p>The intersection volume is piecewise multilinear in the box coordinates — within each combinatorial regime (fully contained, partially overlapping, disjoint), it is a product of  $d$  linear terms, one per coordinate. The containment probability, as a ratio of two such products, is a piecewise rational function. At the boundaries between regimes, the function is continuous but not differentiable. And in the disjoint regime, it is flat: identically zero, with zero gradient everywhere.</p>\n<p>This is analogous to the dead ReLU problem in neural networks, but worse. A dead ReLU neuron has zero gradient for negative inputs but can recover if the bias shifts; a pair of disjoint boxes in  $d$  dimensions has zero gradient across all  $4d$  parameters (two boxes, each with  $2d$  corner coordinates).</p>\n<figure>\n  <img src=\"https://attobop.net/posts/region-embeddings/gradient_landscape.png\" alt=\"Hard box vs Gumbel box volume as a function of gap between boxes\" width=\"550\">\n  <figcaption> Left: hard box intersection volume drops linearly as overlap decreases, then is flat zero once boxes are disjoint — the gradient vanishes entirely. Right: Gumbel expected volume (softplus) decays smoothly through the transition — gradients are nonzero everywhere, including deep in the disjoint regime (dashed tangent).</figcaption>\n</figure>\n<h3>First fix: smoothing (Li et al., 2019)</h3>\n<p>Li et al. [<span id=\"cite6\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref6\">6</a>] attacked the gradient problem by convolving the hard box indicator functions with Gaussian kernels. The smoothed intersection is never exactly zero — even for disjoint boxes, the Gaussian tails overlap, providing gradient signal.</p>\n<p>The smoothed volume has a closed-form expression involving the Gaussian CDF, which is differentiable everywhere. This works: disjoint boxes now produce nonzero loss, and the optimizer can move them toward overlap.</p>\n<p>Smoothing breaks the lattice.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">Softboxes trade one structural guarantee (closure under intersection) for another (smooth gradients).</span> The intersection of two smoothed boxes is <em>not itself a smoothed box</em>. The representation is no longer closed under intersection — the lattice structure, which was one of the main reasons for choosing boxes in the first place, is lost. The smoothed model is also not idempotent (applying an operation twice gives a different result than once):  $P(A \\subseteq A) \\ne 1$  in general, which is semantically wrong — every set contains itself. And in one-dimensional tree experiments (where the ground truth is recoverable), softbox Mean Reciprocal Rank (MRR, a ranking metric where 1.0 is perfect) plateaued at 0.691 compared to 0.971 for Gumbel boxes — a ceiling imposed by the smoothing's loss of lattice structure. Smoothing fixed the zero-gradient problem but introduced new ones.</p>\n<h3>Second fix: Gumbel boxes (Dasgupta et al., 2020)</h3>\n<p>Dasgupta et al. [<span id=\"cite8\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref8\">8</a>] found a solution that fixes the gradient problem <em>without</em> breaking the lattice.</p>\n<p>Instead of a deterministic box  $[m_i, M_i]$ , make each endpoint a random variable drawn from a Gumbel distribution. The Gumbel distribution arises naturally as the limiting distribution of the maximum (or minimum) of many independent samples — it is to maxima what the Gaussian is to averages. It comes in two variants: Gumbel-max (right-skewed, models maxima) and Gumbel-min (left-skewed, models minima). The lower bound  $m_i$  gets a Gumbel-max because box intersection takes the  $\\max$  of lower bounds; the upper bound  $M_i$  gets a Gumbel-min because intersection takes the  $\\min$  of upper bounds.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">The pairing sounds backwards until you think about intersection: new lower bound = max of old lower bounds (Gumbel-max stays Gumbel-max), new upper bound = min of old upper bounds (Gumbel-min stays Gumbel-min).</span></p>\n\n\n$$m_i \\sim \\text{Gumbel}_{\\max}(\\mu^m_i, \\beta), \\qquad M_i \\sim \\text{Gumbel}_{\\min}(\\mu^M_i, \\beta)$$\n\n<p>where  $\\mu^m_i, \\mu^M_i$  are learnable location parameters and  $\\beta > 0$  is a temperature controlling the &quot;softness&quot; of the boundaries. At  $\\beta \\to 0$ , the Gumbel distributions collapse to point masses and we recover hard boxes.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">Think of  $\\beta$  as wall thickness. Large  $\\beta$ : fuzzy walls, easy to push through. Small  $\\beta$ : sharp walls, crisp decisions.</span></p>\n<p>Why Gumbel specifically, and not Gaussian or logistic or any other smooth distribution? Because the Gumbel family is <em>min-max stable</em>: if you take the maximum of several independent Gumbel-max variables, the result is itself a Gumbel-max (and likewise, the minimum of Gumbel-min variables is Gumbel-min). No other common distribution has this property. The new location parameter is computed via LogSumExp ( $\\text{LSE}(x_1, \\ldots, x_k) = \\log \\sum_i e^{x_i}$ , a smooth approximation to max):</p>\n\n\n$$\\max(X_1, \\ldots, X_k) \\sim \\text{Gumbel}_{\\max}\\!\\left(\\beta \\ln \\sum_i e^{\\mu_i / \\beta}, \\; \\beta\\right)$$\n\n<p>Recall that box intersection takes the coordinate-wise max of lower bounds and min of upper bounds. With Gumbel endpoints, the intersection of two Gumbel boxes is again a Gumbel box — computed in closed form via LogSumExp. The lattice structure that Gaussian smoothing destroyed is preserved.</p>\n<h3>Where the Bessel function comes from</h3>\n<p>The expected side length of a Gumbel box along one coordinate involves a modified Bessel function  $K_0$ . What matters is its behavior:  $K_0$  starts at infinity for argument zero, then drops off smoothly — monotonically decreasing, convex, and asymptotically  $\\sqrt{\\pi / 2z} \\, e^{-z}$ . For large positive gap  $\\Delta_i$  between box endpoints (box is wide open), the expected volume is large. For  $\\Delta_i \\approx 0$  (box is barely open), it is small but nonzero. For  $\\Delta_i < 0$  (endpoints are &quot;inverted,&quot; meaning the box is empty in expectation), it is tiny but still has a nonzero derivative  $-K_1$ . The optimizer always has a signal — this is the property that hard boxes and softboxes lack.</p>\n<p>The formula: the expected side length is  $\\mathbb{E}[\\max(M_i - m_i, 0)]$  where  $M_i$  is Gumbel-min and  $m_i$  is Gumbel-max. Integrating the product of a Gumbel-max PDF and a Gumbel-min survival function gives:</p>\n\n\n$$\\mathbb{E}[\\max(M_i - m_i, 0)] = 2\\beta \\, K_0\\!\\left(\\frac{2}{\\beta} \\, e^{-\\Delta_i / (2\\beta)}\\right)$$\n\n<p>where  $\\Delta_i = \\mu^M_i - \\mu^m_i$  is the gap between the location parameters. Why a Bessel function?  $K_0$  arises whenever you integrate products of exponential-family distributions with a maximum operation. The Gumbel distribution is the Type I extreme value distribution, so the integral reduces directly to the standard Laplace-type representation of  $K_0$ .</p>\n<p>A concrete comparison in 1D: two intervals that should overlap but are currently disjoint, with a gap of 2 units.</p>\n<ul>\n<li><strong>Hard box</strong>: gradient zero everywhere. The optimizer has no direction.</li>\n<li><strong>Softbox</strong> ( $\\sigma = 1$ ): gradient nonzero (Gaussian tails give containment probability  $\\sim 0.05$ ), but the smoothing creates a flat ceiling — different box configurations produce the same loss. 1D tree MRR stalls at 0.691.</li>\n<li><strong>Gumbel box</strong> ( $\\beta = 1$ ): gradient nonzero and correctly directed, scaling with gap size. Expected overlap  $\\sim 0.13$  via  $K_0$ . 1D tree MRR: 0.971.</li>\n</ul>\n<p>Gumbel boxes don't just &quot;smooth a little better&quot; — they fix the gradient problem while keeping the expected intersection volume well-defined and closed-form, so the loss function is coherent across all box configurations, not just the overlapping ones.</p>\n<h3>The softplus approximation</h3>\n<p>Computing Bessel functions in every forward pass is expensive. Dasgupta et al. [<a href=\"https://attobop.net/posts/region-embeddings/#ref8\">8</a>] observed that  $2\\beta K_0(\\frac{2}{\\beta} e^{-x/(2\\beta)})$  is nearly indistinguishable from a shifted <code>softplus</code>, and proposed:</p>\n\n\n$$2\\beta \\, K_0\\!\\left(\\frac{2}{\\beta} \\, e^{-\\Delta_i/(2\\beta)}\\right) \\approx \\beta \\, \\log\\!\\left(1 + \\exp\\!\\left(\\frac{\\Delta_i}{\\beta} - 2\\gamma\\right)\\right)$$\n\n<p>where  $\\gamma \\approx 0.5772$  is the Euler-Mascheroni constant (the mean of a standard Gumbel distribution). The  $2\\gamma$  shift accounts for the difference of means of the two endpoints, each contributing  $\\gamma\\beta$  to its expected value. The approximation error is bounded by  $0.062\\beta$ , which is negligible in training.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">Numerically computed (Appendix C of Dasgupta et al., [<a href=\"https://attobop.net/posts/region-embeddings/#ref8\">8</a>]), not a formal proof. Exact value:  $0.0617013\\beta$ .</span></p>\n<p>This is what implementations actually compute: a <code>softplus</code> with a constant shift. It is cheap, differentiable, monotonic with respect to containment, and numerically stable. The full expected log-volume of a Gumbel box becomes a sum of  $d$  <code>softplus</code> evaluations — the same computational complexity as a hard box, but with smooth gradients everywhere.</p>\n<figure>\n  <img src=\"https://attobop.net/posts/region-embeddings/gumbel_walls.png\" alt=\"Gumbel box membership probability for different beta values\" width=\"500\">\n  <figcaption> Membership probability along one coordinate for three temperatures. Small beta (green): sharp walls, nearly a hard box. Large beta (red): fuzzy boundaries that extend well beyond the box interior. The gradient signal at the boundary decays smoothly with distance — this is why disjoint Gumbel boxes still produce gradients.</figcaption>\n</figure>\n<p>The temperature  $\\beta$  acts as a curriculum. Start with large  $\\beta$  (soft, fuzzy boxes with wide gradients) and anneal toward small  $\\beta$  (hard boxes with crisp boundaries). Early in training, the soft boundaries let the optimizer explore; late in training, the hard boundaries let the model make precise containment judgments.</p>\n<p>Where does the Gumbel model break? The same place boxes do: union and complement. The max of a Gumbel-max and a Gumbel-min is not Gumbel-anything — the distribution family is closed under intersection but not union. And the complement of a Gumbel box is not a Gumbel box. For negation, you still need cones or fuzzy operators. The Gumbel contribution is surgical: it fixes the gradient problem for the operations boxes already support, without pretending to solve the operations they don't.</p>\n<h2>Query2Box: Answering Logical Queries</h2>\n<p>With the gradient problem solved, box embeddings can represent containment and train reliably. The next question: can boxes do more than subsumption? Ren et al. [<span id=\"cite9\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref9\">9</a>] introduced <strong><code>Query2Box</code></strong> — the name says it: translate a logical query into a box — applying box embeddings to <em>multi-hop logical queries</em> over knowledge graphs. The idea: represent a query as a box, and rank answer entities by their distance to the query box.</p>\n<p>Consider the query: &quot;Where did Canadian Turing Award winners graduate?&quot;<span class=\"sidenote-ref\"></span><span class=\"sidenote\">This is the running example from Ren et al. [<a href=\"https://attobop.net/posts/region-embeddings/#ref9\">9</a>]. Leskovec uses it in his <a href=\"https://web.stanford.edu/class/cs224w/\">Stanford CS224W lectures</a> as the canonical multi-hop box query.</span> This decomposes into: start with TuringAward (a point), apply the &quot;Win&quot; relation projection to get a box of Turing winners, separately start with Canada, project via &quot;Citizen&quot; to get a box of Canadians, intersect the two boxes, then project via &quot;Graduate&quot; to get universities. Each relation projection simply translates the box center and adds to the box offset — the box can only grow, never shrink. This makes sense: following a one-to-many relation (like &quot;cities in a country&quot;) should expand the answer set.</p>\n<figure>\n  <img src=\"https://attobop.net/posts/region-embeddings/query2box_pipeline.png\" alt=\"Query2Box pipeline: anchor entities projected into boxes, intersected, then projected again to produce the answer box\" width=\"480\">\n  <figcaption> The full Query2Box pipeline for a two-branch conjunctive query. Anchor entities start as points (zero-offset boxes). Each relation projection translates and expands the box. Intersection shrinks the box to the overlap region. The final answer box contains green points (correct answers); the red point outside has high d_out and scores poorly.</figcaption>\n</figure>\n<p>Note the asymmetry: <em>queries</em> are boxes, but <em>answer entities</em> are still embedded as points.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">After spending the whole post arguing against points, Query2Box puts entities back as points. The resolution: entities are answer candidates, not concepts. The box is the question, not the thing.</span> The box represents the set of plausible answers; each candidate entity is a point that may or may not land inside it. The scoring function for a candidate answer entity  $v$  (embedded as a point) relative to a query box  $q$  is:</p>\n\n\n$$d(q, v) = d_{\\text{out}}(q, v) + \\alpha \\cdot d_{\\text{in}}(q, v)$$\n\n<p>where  $d_{\\text{out}}$  is the L1 distance from  $v$  to the nearest point on the box boundary (zero if  $v$  is inside),  $d_{\\text{in}}$  is the L1 distance from  $v$  to the box center (measuring how deep inside the box  $v$  is), and  $\\alpha \\in (0, 1)$  is a hyperparameter (set to  $0.2$  in the original paper) that downweights the inside distance.</p>\n<p>The asymmetry between  $d_{\\text{out}}$  and  $d_{\\text{in}}$  is intentional. Outside entities should be penalized heavily (they are not answers to the query). Inside entities are all plausible answers, but we mildly prefer those closer to the center — answers nearest the box center.</p>\n<p>Intersection of query boxes models conjunction: &quot;countries that border France <em>and</em> have population &gt; 50M&quot; corresponds to intersecting two query boxes. Recall that geometric intersection is coordinate-wise max of lower bounds, min of upper bounds. This works when boxes overlap, but after multiple projection steps the boxes may not overlap at all — and an exact empty intersection gives zero volume with zero gradient (the same dead-zone problem from the previous section). Query2Box sidesteps this with a learned intersection operator: an attention mechanism over the input boxes' centers produces the new center, combined with a coordinate-wise minimum of offsets to shrink the box. This is an approximation, not a geometric intersection, but it provides gradient signal in all configurations.</p>\n<p>Two operations that boxes cannot handle:</p>\n<p><strong>Union.</strong> The union of two boxes is generally not a box. Query2Box works around this by transforming queries into disjunctive normal form (DNF) — push all disjunctions to the last step, compute each conjunctive branch as a box, then aggregate scores across branches.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">DNF means rewriting the query so all ORs are at the outermost level. Each branch is a pure conjunction, representable as a single box.</span> This is sound but adds computational cost proportional to the number of disjuncts.</p>\n<p><strong>Negation.</strong> The complement of a box in  $\\mathbb{R}^d$  is an unbounded region that is not a box. Queries like &quot;European countries that do <em>not</em> border France&quot; require a different approach. This limitation motivated two lines of follow-up work: geometric alternatives (cones) and algebraic alternatives (fuzzy logic).</p>\n<h2>Beyond Boxes: Cones and Negation</h2>\n<p>Zhang et al. [<span id=\"cite12\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref12\">12</a>] introduced <strong><code>ConE</code></strong> — &quot;cone embeddings&quot; — which represents queries as Cartesian products of two-dimensional cones. The  $d$ -dimensional embedding space is split into  $d/2$  independent 2D planes, and in each plane the query is an angular sector emanating from the origin, parameterized by an axis direction  $\\theta$  and an aperture  $\\phi$ : it covers all directions within angle  $\\phi$  of  $\\theta$ . The full query region is the Cartesian product of these  $d/2$  sectors.</p>\n<p>The key property: within each 2D plane, the complement of an angular sector is another angular sector. If a sector covers the angular range  $[\\theta - \\phi, \\theta + \\phi]$ , the remaining arc of the circle is centered at  $\\theta + \\pi$  with aperture  $\\pi - \\phi$ . Since each plane's complement is again a sector, the Cartesian product of complements is again a valid <code>ConE</code> representation. Negation stays within the representation class.</p>\n<p>The aperture encodes concept generality: wider cones are more general, narrower cones more specific. Containment is angular: cone  $A$  contains cone  $B$  if  $B$ 's angular extent falls entirely within  $A$ 's. Intersection of two angular sectors is a contiguous arc (another cone) whenever the sectors overlap in a connected region — which the model's parameter constraints on aperture ensure.</p>\n<p>ConE was the first geometry-based query embedding model to handle all three propositional connectives — conjunction, disjunction, and negation — over multi-hop queries. The cost is that angular containment is a weaker form of &quot;containment&quot; than volumetric box inclusion — cones have no finite volume, so you lose the probabilistic interpretation that makes box embeddings attractive for subsumption. For query answering, where the goal is ranking rather than exact set membership, angular containment works. For ontology completion, where you want  $P(\\text{Dog} \\subseteq \\text{Animal}) \\approx 1$  as a calibrated score, boxes remain the better tool.</p>\n<p>Ren and Leskovec [<span id=\"cite10\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref10\">10</a>] took a different route with <strong><code>BetaE</code></strong> — &quot;Beta embeddings,&quot; named after the Beta distribution — embed queries as Beta distributions rather than geometric regions. The complement of  $\\text{Beta}(\\alpha, \\beta)$  is  $\\text{Beta}(\\beta, \\alpha)$  — negation is a parameter swap. Conjunction and disjunction use learned neural networks that map pairs of Beta parameters to a new Beta, keeping the output within the representation class (the product of two Betas is not a Beta, so an exact closed form doesn't exist). <code>BetaE</code> is the probabilistic counterpart to <code>ConE</code>'s geometric solution.</p>\n<p>A third line of work avoids new geometry entirely. Arakelyan et al. [<span id=\"cite11\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref11\">11</a>] showed that complex queries can be decomposed into atomic link prediction calls and aggregated via t-norm fuzzy logic — requiring no training on complex queries at all. T-norms are binary operations that generalize logical AND to continuous values in $[0, 1]$; common choices include the product  $x \\cdot y$  and the minimum  $\\min(x, y)$ . Chen et al. [<span id=\"cite13\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref13\">13</a>] applied this directly to box containment probabilities: t-norms for conjunction, t-conorms (the dual, generalizing OR) for disjunction,  $1 - x$  for negation. The geometry handles containment; the algebra handles composition. Because the operators are fixed (not learned), a model trained on simple triples can in principle answer complex queries at test time — though performance depends on how well the embeddings support the assumed t-norm structure.</p>\n<h2>EL⁺⁺ and Ontology Completion</h2>\n<p>The Description Logic (DL) EL⁺⁺ — a fragment of first-order logic designed for efficient automated reasoning — underpins biomedical ontologies like SNOMED CT (the standard vocabulary for electronic health records), <a href=\"http://geneontology.org/\">Gene Ontology</a> (a structured vocabulary of gene functions), and GALEN (a medical terminology system).<span class=\"sidenote-ref\"></span><span class=\"sidenote\">SNOMED CT alone has over 350,000 concepts. You can't inspect this by hand — you need automated reasoning, and embedding methods extend it to plausible inferences.</span> Its language allows concept conjunction ( $C \\sqcap D$ ), existential restriction ( $\\exists r.C$ , &quot;things that have an  $r$ -relationship to some  $C$ &quot;), and a bottom concept ( $\\bot$ ). The key inference is <em>subsumption</em>: given an ontology, determine whether  $C \\sqsubseteq D$  holds (every instance of  $C$  is an instance of  $D$ ).</p>\n<p>Classical reasoners (<a href=\"https://github.com/liveontologies/elk-reasoner\">ELK</a>, Snorocket) compute subsumption exactly by rewriting the ontology's axioms (its TBox, the terminological component that defines concept relationships) into a set of normal forms — standardized shapes like  $C \\sqsubseteq D$  or  $C_1 \\sqcap C_2 \\sqsubseteq D$  that decompose complex axioms into simple building blocks. This is complete but does not generalize — it cannot predict subsumptions that are plausible but not logically entailed by the axioms. Embedding methods can.</p>\n<p>Jackermeier et al. [<span id=\"cite16\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref16\">16</a>] developed <strong><code>Box²EL</code></strong> — &quot;dual box embeddings for EL,&quot; using separate box pairs for concepts and roles — which embeds EL⁺⁺ concepts as boxes and roles as affine transformations (a linear map plus a translation, mapping one box to another by shifting its center and rescaling its half-widths). The four normal forms of EL⁺⁺ each get a geometric loss:</p>\n<p><strong>NF1:  $C_1 \\sqcap C_2 \\sqsubseteq D$ .</strong> The intersection of the boxes for  $C_1$  and  $C_2$  should be contained in the box for  $D$ . Loss: penalize volume of  $(C_1 \\cap C_2) \\setminus D$ .</p>\n<p><strong>NF2:  $C \\sqsubseteq D$ .</strong> The box for  $C$  should be inside the box for  $D$ . In center-offset parameterization (storing a box as its midpoint  $c$  and half-widths  $o$  instead of min/max corners), the boundary extends from  $c - o$  to  $c + o$ . The quantity  $|c_C - c_D| + o_C - o_D$  measures how far  $C$ 's boundary protrudes past  $D$ 's boundary along each coordinate — zero means  $C$  fits inside  $D$ , positive means it sticks out.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">This is the protrusion on the worse of the two sides (left or right boundary), not the total. If  $C$  sticks out equally on both sides by  $\\delta/2$ , the formula gives  $\\delta$ , same as one side sticking out by  $\\delta$ .</span> The loss penalizes protrusion:</p>\n\n\n$$\\mathcal{L}_{\\text{NF2}} = \\left\\| \\text{ReLU}\\!\\left(|c_C - c_D| + o_C - o_D - \\epsilon\\right) \\right\\|_2$$\n\n<p>where  $\\epsilon$  is a margin. This is zero when  $C$  is inside  $D$  with room to spare, and grows as  $C$  protrudes.</p>\n<p><strong>NF3:  $C \\sqsubseteq \\exists r.D$ .</strong> Every instance of  $C$  has an  $r$ -relationship to some instance of  $D$ . Geometrically: applying the role transformation  $r$  to the box for  $C$  should land inside the box for  $D$ .</p>\n<p><strong>NF4:  $\\exists r.C \\sqsubseteq D$ .</strong> Everything that has an  $r$ -relationship to a  $C$  is a  $D$ . Geometrically: the pre-image of the box for  $C$  under the role transformation  $r$  should be contained in the box for  $D$ .</p>\n<p>Training on the GALEN medical ontology (24,353 concepts, 951 roles), <code>Box²EL</code> predicts subsumptions that are plausible but not explicitly stated in the ontology — a form of ontology completion that classical reasoners cannot do. On GALEN, box embeddings substantially outperform point-based methods (<code>TransE</code>, <code>ELEm</code>) because the geometry directly encodes the containment structure of the ontology.</p>\n<h3>Volume correlates with generality</h3>\n<p>Trained box embeddings recover a property the loss never explicitly optimizes: log-volume correlates with hierarchy position. After training on Gene Ontology, concepts near the root (&quot;biological process,&quot; &quot;molecular function&quot;) have large boxes, while leaf concepts (&quot;mitotic spindle assembly checkpoint signaling&quot;) have small ones. The Spearman correlation between log-volume and depth is strongly negative.</p>\n<p>This is not enforced by an explicit loss term — it emerges from the containment constraints alone. If  $C \\sqsubseteq D$ , then  $\\text{Vol}(C) \\le \\text{Vol}(D)$  (a subset can't be larger than its superset). Training on thousands of such constraints pushes the model toward a volume assignment that reflects the concept hierarchy.<span class=\"sidenote-ref\"></span><span class=\"sidenote\">You could add an explicit volume-depth loss term, but it would be redundant. Containment constraints already imply it.</span></p>\n<table>\n<thead>\n<tr>\n<th>Year</th>\n<th>Model</th>\n<th>Geometry</th>\n<th>Key contribution</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>2009</td>\n<td>Regions (Erk)</td>\n<td>Convex regions</td>\n<td>First proposal of words as regions, not points</td>\n</tr>\n<tr>\n<td>2013</td>\n<td>TransE (Bordes)</td>\n<td>Point + translation</td>\n<td>Baseline KGE; simple, scalable, limited</td>\n</tr>\n<tr>\n<td>2016</td>\n<td>Order Emb. (Vendrov)</td>\n<td>Orthant cones</td>\n<td>Partial order from coordinate comparison</td>\n</tr>\n<tr>\n<td>2018</td>\n<td>Box Lattice (Vilnis)</td>\n<td>Axis-aligned boxes</td>\n<td>Boxes as lattice; containment probability</td>\n</tr>\n<tr>\n<td>2018</td>\n<td>Ent. Cones (Ganea)</td>\n<td>Hyperbolic cones</td>\n<td>Hierarchy in hyperbolic space</td>\n</tr>\n<tr>\n<td>2019</td>\n<td>Smoothing (Li)</td>\n<td>Smoothed boxes</td>\n<td>Gaussian smoothing fixes zero gradients</td>\n</tr>\n<tr>\n<td>2019</td>\n<td>RotatE (Sun)</td>\n<td>Point + rotation</td>\n<td>Complex-space rotation handles symmetry</td>\n</tr>\n<tr>\n<td>2020</td>\n<td>Gumbel Box (Dasgupta)</td>\n<td>Probabilistic boxes</td>\n<td>Min-max stable smoothing preserves lattice</td>\n</tr>\n<tr>\n<td>2020</td>\n<td>Query2Box (Ren)</td>\n<td>Boxes for queries</td>\n<td>Multi-hop query answering with boxes</td>\n</tr>\n<tr>\n<td>2020</td>\n<td>BetaE (Ren)</td>\n<td>Beta distributions</td>\n<td>Negation via parameter swap</td>\n</tr>\n<tr>\n<td>2021</td>\n<td>ConE (Zhang)</td>\n<td>2D angular sectors</td>\n<td>Negation via complement sectors</td>\n</tr>\n<tr>\n<td>2021</td>\n<td>CQD (Arakelyan)</td>\n<td>Any + fuzzy logic</td>\n<td>Decompose queries into atomic predictions</td>\n</tr>\n<tr>\n<td>2022</td>\n<td>BoxEL (Xiong)</td>\n<td>Boxes for EL⁺⁺</td>\n<td>First faithful ontology box embedding</td>\n</tr>\n<tr>\n<td>2024</td>\n<td>Box²EL (Jackermeier)</td>\n<td>Dual boxes for EL⁺⁺</td>\n<td>Affine role transformations</td>\n</tr>\n<tr>\n<td>2024</td>\n<td>Octagons (Charpenay)</td>\n<td>Octagonal regions</td>\n<td>Diagonal constraints capture cross-dim correlations</td>\n</tr>\n<tr>\n<td>2025</td>\n<td>TransBox (Yang)</td>\n<td>Boxes for EL⁺⁺</td>\n<td>EL⁺⁺-closed: logical operations stay in representation</td>\n</tr>\n<tr>\n<td>2026</td>\n<td>TaxoBell (Mishra)</td>\n<td>Gaussian boxes</td>\n<td>Self-supervised taxonomy expansion</td>\n</tr>\n</tbody>\n</table>\n<p>After 2018, the box lattice foundation split into two threads: the gradient fix (Gumbel, 2020) and multi-hop query answering (<code>Query2Box</code>, <code>BetaE</code>, <code>ConE</code>, <code>CQD</code>, 2020–2021). Ontology applications followed (<code>BoxEL</code>, <code>Box²EL</code>, <code>TransBox</code>, 2022–2025).</p>\n<h2>The Landscape</h2>\n<table>\n<thead>\n<tr>\n<th>Geometry</th>\n<th>Containment</th>\n<th>Intersection closed?</th>\n<th>Negation?</th>\n<th>Parameters</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Boxes</td>\n<td>Volumetric</td>\n<td>Yes</td>\n<td>No</td>\n<td> $2d$ </td>\n</tr>\n<tr>\n<td>Gumbel boxes</td>\n<td>Soft volumetric</td>\n<td>Yes</td>\n<td>No</td>\n<td> $2d + \\beta$ </td>\n</tr>\n<tr>\n<td>Cones</td>\n<td>Angular</td>\n<td>Yes</td>\n<td>Yes</td>\n<td> $2d$  (2D sectors)</td>\n</tr>\n<tr>\n<td>Octagons</td>\n<td>Volumetric + diagonal</td>\n<td>Yes</td>\n<td>No</td>\n<td> $O(d^2)$ </td>\n</tr>\n<tr>\n<td>Gaussians</td>\n<td>KL divergence</td>\n<td>No</td>\n<td>No</td>\n<td> $2d$ </td>\n</tr>\n<tr>\n<td>Entailment cones (hyperbolic)</td>\n<td>Geodesic</td>\n<td>Yes</td>\n<td>No</td>\n<td> $2d$ </td>\n</tr>\n<tr>\n<td>Subspaces</td>\n<td>Subspace inclusion</td>\n<td>Yes</td>\n<td>No</td>\n<td> $O(d \\cdot k)$ </td>\n</tr>\n</tbody>\n</table>\n<p>No single geometry dominates. In my experience implementing several of these in <a href=\"https://github.com/arclabs561/subsume\">subsume</a>, the right choice depends on the task. For ontology completion (subsumption only, no negation), boxes or Gumbel boxes suffice. For queries requiring negation and disjunction, cones or the fuzzy approach are needed. For taxonomy expansion with uncertainty, Gaussians (Mishra et al., <span id=\"cite21\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref21\">21</a>) encode uncertainty through variance parameters. Entailment cones in hyperbolic space (Ganea et al., <span id=\"cite5\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref5\">5</a>) — a non-Euclidean geometry where volume grows exponentially with radius, naturally accommodating trees with high branching factor — exploit this growth for tree-like hierarchies. Octagons (Charpenay and Schockaert, <span id=\"cite15\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref15\">15</a>) add diagonal constraints  $x_i \\pm x_j \\le c$  to capture correlations between dimensions that axis-aligned boxes miss, at the cost of  $O(d^2)$  parameters per concept. Most recently, Moreira et al. [<span id=\"cite19\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref19\">19</a>] proposed linear subspaces as the geometric primitive: more general concepts get higher-dimensional subspaces, and subsumption is subspace inclusion. This is closed under intersection (subspace intersection is a subspace) but trades the intuitive &quot;volume&quot; interpretation for algebraic structure.</p>\n<p><span class=\"sidenote-ref\"></span><span class=\"sidenote\">Choose your geometry before your optimizer. The shape of the embedding space determines what relationships the model can even express.</span></p>\n<h2>Open Problems</h2>\n<p><strong>Faithfulness.</strong> Box embeddings can approximate but may not perfectly represent the logical structure of an ontology. Xiong et al. [<span id=\"cite14\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref14\">14</a>] proved <em>soundness</em> for <code>BoxEL</code> — if the model scores a subsumption as true (zero loss), then it actually holds in the ontology. The complementary property, completeness (every true subsumption scores high), is not claimed. <strong><code>TransBox</code></strong> (Yang et al., <span id=\"cite20\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref20\">20</a>) — &quot;translation + box,&quot; combining <code>TransE</code>-style relation modeling with box containment — goes further, achieving &quot;EL⁺⁺-closure&quot; — intersections, role-filler constraints, and TBox axioms all stay within the box representation.</p>\n<p>Leemhuis and Kutz [<span id=\"cite18\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref18\">18</a>] formalized what box semantics can and cannot represent, characterizing the expressiveness limits of box-based embeddings for EL⁺⁺ and extensions. A pragmatic alternative: <code>DELE</code> (Mashkova et al., <span id=\"cite17\"></span><a href=\"https://attobop.net/posts/region-embeddings/#ref17\">17</a>) sidesteps geometric faithfulness entirely by precomputing the deductive closure — the complete set of subsumptions that logically follow from the ontology's axioms — with a classical reasoner, then training the embedding to reproduce it. No general impossibility result rules out faithful box embeddings, and no paper has achieved <em>strong</em> faithfulness (entailed axioms score high AND non-entailed axioms score low). Whether the gap matters in practice is unsettled.</p>\n<p><strong>Beyond axis-alignment.</strong> Axis-aligned boxes assume that dimensions are independent — the containment of Dog within Animal decomposes into  $d$  independent interval containments. For concept hierarchies with correlated features, this is restrictive. Octagons partially address this, but only for pairs of dimensions. Full covariance (oriented boxes, ellipsoids) would be more expressive, but the intersection of oriented boxes is not an oriented box. Finding the right trade-off between expressiveness and closure is open.</p>\n<p><strong>Composition of geometries.</strong> Each geometry handles some operations well and others poorly. Can different geometries be composed? For instance, use boxes for conjunction and cones for negation within the same query. The challenge is defining a consistent scoring function across heterogeneous representations. This is related to the broader question of whether there exists a single continuous geometry of polynomial dimension that is closed under intersection, union, complement, and projection — a universal query embedding. No such geometry is currently known for continuous embeddings at practical dimensionality.</p>\n<p><strong>Calibration.</strong> The containment probability  $P(B \\subseteq A) = \\text{Vol}(A \\cap B) / \\text{Vol}(B)$  is a geometric quantity, not a calibrated probability. A value of 0.7 does not mean &quot;70% chance of subsumption&quot; — it means &quot;70% of  $B$ 's volume overlaps with  $A$ .&quot;<span class=\"sidenote-ref\"></span><span class=\"sidenote\">This matters for downstream decisions. If you threshold at 0.5, are you accepting everything with &gt;50% overlap, or everything with &gt;50% probability of being true? The model doesn't know.</span> Whether these geometric scores can be calibrated to meaningful probabilities, and whether that calibration generalizes across domains, is unexplored.</p>\n<p><strong>Density matrices.</strong> Quantum-inspired representations model concepts as positive semidefinite matrices (matrices with all eigenvalues  $\\ge 0$ , used in quantum mechanics to represent probabilistic mixtures of states) on a Hilbert space. One natural analogue of containment is the Loewner order: model  $A \\sqsubseteq B$  by requiring  $B - A$  to be positive semidefinite (all eigenvalues of the difference are  $\\ge 0$ ). This is richer than box containment — it captures arbitrary subspace relationships, not just axis-aligned intervals. The computational cost is  $O(d^2)$  per concept, the intersection structure is more complex, and the connection to DL subsumption semantics is an analogy rather than a derivation. As of 2026, quantum KGE implementations exist but use quantum circuits for scoring, not density matrices for concept representation. The theoretical proposal remains unimplemented.</p>\n<p><strong>Evaluation methodology.</strong> A pervasive but underacknowledged flaw: most ontology embedding benchmarks treat logically entailed axioms as negatives during training and evaluation. <code>DELE</code> [<a href=\"https://attobop.net/posts/region-embeddings/#ref17\">17</a>] showed this is broken — the model is penalized for predicting things that are provably true. The deductive closure must be computed first and filtered from the negative set. More broadly, no standardized benchmark suite exists for ontology embeddings; different papers use incompatible datasets and protocols, making cross-paper comparison unreliable.</p>\n<p><strong>Geometric vs. neural.</strong> For multi-hop query answering, the geometric approach (<code>Query2Box</code>, <code>ConE</code>, <code>BetaE</code>) is being superseded. Newer geometric approaches like cylinder embeddings and fully geometric methods — where all logical operations are geometric transformations, not learned neural operators — show that the &quot;geometric&quot; label applied to <code>Query2Box</code> was partially misleading: its intersection operator was a neural network. GNN-based query answering substantially outperforms all purely geometric methods on standard benchmarks. The niche where geometric methods retain a defensible advantage is ontology-specific settings with explicit DL semantics: GNN foundation models don't enforce logical closure, but <code>TransBox</code> and <code>DELE</code> do.</p>\n<hr>\n<h2>References</h2>\n<p><span id=\"ref1\">[1]</span> Erk, K. (2009). &quot;Representing Words as Regions in Vector Space.&quot; <em>CoNLL</em>, 57--65. <a href=\"https://attobop.net/posts/region-embeddings/#cite1\">↩</a></p>\n<p><span id=\"ref2\">[2]</span> Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J. &amp; Yakhnenko, O. (2013). &quot;Translating Embeddings for Modeling Multi-relational Data.&quot; <em>NeurIPS</em>, 2787--2795. <a href=\"https://attobop.net/posts/region-embeddings/#cite2\">↩</a></p>\n<p><span id=\"ref3\">[3]</span> Vendrov, I., Kiros, R., Fidler, S. &amp; Urtasun, R. (2016). &quot;Order-Embeddings of Images and Language.&quot; <em>ICLR</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite3\">↩</a></p>\n<p><span id=\"ref4\">[4]</span> Vilnis, L., Li, X., Murty, S. &amp; McCallum, A. (2018). &quot;Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures.&quot; <em>ACL</em>, 263--272. <a href=\"https://attobop.net/posts/region-embeddings/#cite4\">↩</a></p>\n<p><span id=\"ref5\">[5]</span> Ganea, O.-E., Becigneul, G. &amp; Hofmann, T. (2018). &quot;Hyperbolic Entailment Cones for Learning Hierarchical Embeddings.&quot; <em>ICML</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite5\">↩</a></p>\n<p><span id=\"ref6\">[6]</span> Li, X. L., Vilnis, L., Zhang, D., Boratko, M. &amp; McCallum, A. (2019). &quot;Smoothing the Geometry of Probabilistic Box Embeddings.&quot; <em>ICLR</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite6\">↩</a></p>\n<p><span id=\"ref7\">[7]</span> Sun, Z., Deng, Z.-H., Nie, J.-Y. &amp; Tang, J. (2019). &quot;RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.&quot; <em>ICLR</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite7\">↩</a></p>\n<p><span id=\"ref8\">[8]</span> Dasgupta, S. S., Boratko, M., Zhang, D., Vilnis, L., Li, X. L. &amp; McCallum, A. (2020). &quot;Improving Local Identifiability in Probabilistic Box Embeddings.&quot; <em>NeurIPS</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite8\">↩</a></p>\n<p><span id=\"ref9\">[9]</span> Ren, H., Hu, W. &amp; Leskovec, J. (2020). &quot;Query2Box: Reasoning over Knowledge Graphs in Vector Space using Box Embeddings.&quot; <em>ICLR</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite9\">↩</a></p>\n<p><span id=\"ref10\">[10]</span> Ren, H. &amp; Leskovec, J. (2020). &quot;Beta Embeddings for Multi-Hop Logical Reasoning in Knowledge Graphs.&quot; <em>NeurIPS</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite10\">↩</a></p>\n<p><span id=\"ref11\">[11]</span> Arakelyan, E., Daza, D., Minervini, P. &amp; Cochez, M. (2021). &quot;Complex Query Answering with Neural Link Predictors.&quot; <em>ICLR</em> (Outstanding Paper). <a href=\"https://attobop.net/posts/region-embeddings/#cite11\">↩</a></p>\n<p><span id=\"ref12\">[12]</span> Zhang, Z., Wang, J., Chen, J., Ji, S. &amp; Wu, F. (2021). &quot;ConE: Cone Embeddings for Multi-Hop Reasoning over Knowledge Graphs.&quot; <em>NeurIPS</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite12\">↩</a></p>\n<p><span id=\"ref13\">[13]</span> Chen, X., Hu, Z. &amp; Sun, Y. (2022). &quot;Fuzzy Logic Based Logical Query Answering on Knowledge Graphs.&quot; <em>AAAI</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite13\">↩</a></p>\n<p><span id=\"ref14\">[14]</span> Xiong, B., Potyka, N., Tran, T.-K., Nayyeri, M. &amp; Staab, S. (2022). &quot;Faithful Embeddings for EL⁺⁺ Knowledge Bases.&quot; <em>ECML-PKDD</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite14\">↩</a></p>\n<p><span id=\"ref15\">[15]</span> Charpenay, V. &amp; Schockaert, S. (2024). &quot;Embedding Ontologies with Octagons.&quot; <em>IJCAI</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite15\">↩</a></p>\n<p><span id=\"ref16\">[16]</span> Jackermeier, M., Chen, J. &amp; Horrocks, I. (2024). &quot;Dual Box Embeddings for the Description Logic EL⁺⁺.&quot; <em>WWW '24</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite16\">↩</a></p>\n<p><span id=\"ref17\">[17]</span> Mashkova, O., Zhapa-Camacho, F. &amp; Hoehndorf, R. (2024). &quot;DELE: Deductive EL++ Embeddings for Knowledge Base Completion.&quot; <em>arXiv:2411.01574</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite17\">↩</a></p>\n<p><span id=\"ref18\">[18]</span> Leemhuis, M. &amp; Kutz, O. (2025). &quot;Understanding the Expressive Capabilities of Knowledge Base Embeddings under Box Semantics.&quot; <em>KR '25</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite18\">↩</a></p>\n<p><span id=\"ref19\">[19]</span> Moreira, G., Marinho, Z., Marques, M., Costeira, J. P. &amp; Xiong, B. (2025). &quot;Native Logical and Hierarchical Representations with Subspace Embeddings.&quot; <em>arXiv:2508.16687</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite19\">↩</a></p>\n<p><span id=\"ref20\">[20]</span> Yang, H., Chen, J. &amp; Sattler, U. (2025). &quot;TransBox: EL⁺⁺-closed Ontology Embedding.&quot; <em>WWW '25</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite20\">↩</a></p>\n<p><span id=\"ref21\">[21]</span> Mishra, S. et al. (2026). &quot;TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion.&quot; <em>arXiv:2601.09633</em>. <a href=\"https://attobop.net/posts/region-embeddings/#cite21\">↩</a></p>\n<h3>Implementations</h3>\n<ul>\n<li><a href=\"https://github.com/arclabs561/tranz\">tranz</a> — <code>TransE</code>, <code>RotatE</code>, <code>ComplEx</code>, and <code>DistMult</code> with GPU training via candle. The point-embedding baseline.</li>\n<li><a href=\"https://github.com/arclabs561/subsume\">subsume</a> — Box, Gumbel box, cone, octagon, Gaussian, and hyperbolic interval embeddings with EL⁺⁺ losses, <code>Query2Box</code> scoring, and fuzzy t-norm query answering.</li>\n</ul>\n","date_published":"Fri, 03 Apr 2026 00:00:00 GMT"},{"id":"https://attobop.net/posts/primality-testing/","url":"https://attobop.net/posts/primality-testing/","title":"From Fermat to AKS: A History of Primality Testing","content_html":"<p>Is 341 prime? Trial division says no ( $11 \\times 31$ ). But if you check  $2^{340} \\bmod 341$ , you get 1 -- exactly what a prime would give. This failure of Fermat's test, and the centuries it took to fix it, is the story of probabilistic primality testing.</p>\n<!--more-->\n<h2>Background</h2>\n<p>A positive integer  $n > 1$  is prime if its only divisors are 1 and itself. This is a clean definition but a hard computational problem: given an arbitrary  $n$ , how do you decide?</p>\n<p>The stakes are not abstract. RSA key generation requires finding two large primes  $p$  and  $q$ , typically 1024 bits each, so that their product  $n = pq$  is hard to factor. The security of RSA rests on the assumption that factoring  $n$  is computationally infeasible -- but that assumption is worthless if  $p$  or  $q$  is not actually prime. A composite that slips through primality testing produces a key that can be factored trivially. RSA, Diffie-Hellman, and most public-key cryptography depend on reliably generating large primes.</p>\n<h3>Trial division</h3>\n<p>The oldest approach: to test whether  $n$  is prime, try dividing it by every integer from 2 up to  $\\sqrt{n}$ . If none divide evenly,  $n$  is prime.</p>\n<p>This works because if  $n = ab$  with  $a \\le b$ , then  $a \\le \\sqrt{n}$ . So we only need to check potential factors up to the square root. A standard optimization: after checking 2, only test odd divisors (or better, only test primes up to  $\\sqrt{n}$ , but generating those primes is itself a sub-problem -- the <a href=\"https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes\">Sieve of Eratosthenes</a> handles it for moderate bounds).</p>\n<p>The complexity is  $O(\\sqrt{n})$  divisions. For a 10-digit number, that is roughly 100,000 divisions -- fast on modern hardware. For a 300-digit number (a typical RSA prime),  $\\sqrt{n}$  has 150 digits. There are approximately  $10^{150}$  candidate divisors. At  $10^{12}$  divisions per second, this would take about  $10^{138}$  seconds. The universe is approximately  $4 \\times 10^{17}$  seconds old.</p>\n<p>The issue is that  $O(\\sqrt{n})$  is exponential in the bit-length of  $n$ . If  $n$  has  $b$  bits, then  $\\sqrt{n} = 2^{b/2}$ , and we need  $2^{b/2}$  divisions. A polynomial-time algorithm would need  $O(b^c)$  operations for some constant  $c$ . Trial division is correct and deterministic, but for cryptographic sizes it is useless -- and this is not a constant-factor problem that faster hardware will solve. The gap is exponential.</p>\n<h3>The probabilistic trade</h3>\n<p>This gap -- between what we need (test 1024-bit numbers) and what deterministic methods can do in reasonable time -- motivates a radical idea: accept a test that might be wrong.</p>\n<p>A probabilistic primality test gives one of two answers: &quot;definitely composite&quot; or &quot;probably prime.&quot; The &quot;probably&quot; comes with a quantifiable error bound. If the probability of a false &quot;probably prime&quot; answer is less than  $2^{-128}$ , that is a smaller failure probability than a cosmic ray flipping a bit in your CPU during the computation. For engineering purposes, &quot;probably prime with error  $< 2^{-128}$ &quot; is as good as &quot;proven prime.&quot;</p>\n<p>The challenge is building a test with: (1) an error probability that decreases exponentially with the number of iterations, and (2) no blind spots -- no class of composites that always fools the test regardless of how many iterations you run.</p>\n<p>The first condition is achievable: run the test multiple times with independent random choices, and error probabilities multiply (shrinking exponentially). The second took 340 years from Fermat to Rabin, because the natural first attempt -- the Fermat test -- has exactly the blind spot we need to avoid.</p>\n<h2>Fermat's Little Theorem</h2>\n<p>In 1640, Pierre de Fermat stated a theorem that would become foundational to primality testing:</p>\n\n\n$$a^{p-1} \\equiv 1 \\pmod{p}$$\n\n<p>for any prime  $p$  and integer  $a$  not divisible by  $p$ .</p>\n<p>Consider the  $p-1$  nonzero residues  $\\{1, 2, \\ldots, p-1\\}$  modulo  $p$ . Multiplying each by  $a$  (with  $\\gcd(a, p) = 1$ ) permutes this set: the map  $x \\mapsto ax \\bmod p$  is a bijection on  $\\{1, \\ldots, p-1\\}$ . Therefore:</p>\n\n\n$$\\prod_{i=1}^{p-1} (a \\cdot i) \\equiv \\prod_{i=1}^{p-1} i \\pmod{p}$$\n\n<p>The left side is  $a^{p-1} \\cdot (p-1)!$  and the right side is  $(p-1)!$ . Since  $p$  is prime,  $(p-1)!$  is invertible mod  $p$ , giving  $a^{p-1} \\equiv 1$ .</p>\n<p>The contrapositive gives us a primality test: if  $a^{n-1} \\not\\equiv 1 \\pmod{n}$  for some  $a$  coprime to  $n$ , then  $n$  is definitely composite.</p>\n<h2>The Fermat Primality Test</h2>\n<p>This observation leads to a simple probabilistic test:</p>\n<ol>\n<li>Pick a random base  $a$  with  $1 < a < n-1$ </li>\n<li>Compute  $a^{n-1} \\mod n$ </li>\n<li>If the result is not 1,  $n$  is composite</li>\n<li>If the result is 1,  $n$  is <em>probably</em> prime</li>\n</ol>\n<p>Step 2 uses modular exponentiation by repeated squaring, which runs in  $O(\\log n)$  multiplications mod  $n$ , each costing  $O((\\log n)^2)$  with schoolbook multiplication. Total:  $O((\\log n)^3)$ . This scales with the number of digits, not the magnitude -- exactly what we need.</p>\n<p>Note an asymmetry: the test can <em>prove</em> compositeness (if  $a^{n-1} \\not\\equiv 1$ ,  $n$  is definitely composite) but can only <em>suggest</em> primality (if  $a^{n-1} \\equiv 1$ ,  $n$  might be prime). This one-sided error structure defines probabilistic primality testing. We will never get a false &quot;composite&quot; answer, only a false &quot;probably prime.&quot;</p>\n<p>The weakness: composite numbers that pass this test exist.</p>\n<h2>Pseudoprimes</h2>\n<p>A composite number  $n$  is called a <em>Fermat pseudoprime</em> to base  $a$  if  $a^{n-1} \\equiv 1 \\pmod{n}$ .</p>\n<p>For example,  $341 = 11 \\times 31$  is a pseudoprime to base 2:</p>\n\n\n$$2^{340} \\equiv 1 \\pmod{341}$$\n\n<p>Why does this happen? By Fermat's theorem applied to the factors:  $2^{10} \\equiv 1 \\pmod{11}$  and  $2^{30} \\equiv 1 \\pmod{31}$ . Since  $10 \\mid 340$  and  $30 \\mid 340$ , we get  $2^{340} \\equiv 1$  modulo both 11 and 31, hence modulo  $341$  by the Chinese Remainder Theorem.</p>\n<p>This means the Fermat test with base 2 incorrectly identifies 341 as &quot;probably prime.&quot;</p>\n<p>The pseudoprimes to base 2 below 1000 are just 341 and 561. They are rare -- there are only 245 base-2 pseudoprimes below  $10^6$  -- but they exist, and their rarity is deceptive. It might suggest that testing multiple bases would catch all composites: if  $n$  is a pseudoprime to base 2, surely it fails for base 3? Often yes. But not always.</p>\n<h2>Carmichael Numbers</h2>\n<p>In 1885, Vaclav Simerka found several composite numbers that are pseudoprimes to <em>every</em> base coprime to them. He published in <em>Casopis pro pestovani mathematiky a fysiky</em>, a Czech-language journal with limited circulation outside Bohemia. His work went unnoticed for over a century.</p>\n<p>In 1899, Alwin Korselt characterized such numbers: a composite  $n$  is a Carmichael number if and only if:</p>\n<ol>\n<li> $n$  is square-free (no repeated prime factors)</li>\n<li>For every prime  $p$  dividing  $n$ , we have  $(p-1) \\mid (n-1)$ </li>\n</ol>\n<p>But Korselt knew of no examples.</p>\n<p>In 1910, Robert D. Carmichael independently discovered the smallest such number:</p>\n\n\n$$561 = 3 \\times 11 \\times 17$$\n\n<p>Verification:  $561 - 1 = 560 = 2^4 \\times 5 \\times 7$ , and indeed:</p>\n<ul>\n<li> $3 - 1 = 2$  divides 560</li>\n<li> $11 - 1 = 10$  divides 560</li>\n<li> $17 - 1 = 16$  divides 560</li>\n</ul>\n<p>The Fermat test fails completely on Carmichael numbers -- no choice of base (coprime to  $n$ ) will reveal their compositeness. For the Fermat test, Carmichael numbers are an unconditional blind spot. No amount of repetition helps.</p>\n<p>Korselt's criterion has an intuitive interpretation: it says that  $n$  &quot;pretends&quot; to be prime by having its prime factors' orders all divide  $n - 1$ . The Chinese Remainder Theorem then forces  $a^{n-1} \\equiv 1 \\pmod{n}$  for all  $a$  coprime to  $n$ . The deception is structural, not coincidental.</p>\n<h3>How many exist?</h3>\n<p>For decades, mathematicians wondered whether there are infinitely many Carmichael numbers. In 1994, Alford, Granville, and Pomerance proved there are: for sufficiently large  $x$ , the number of Carmichael numbers up to  $x$  exceeds  $x^{2/7}$ . The proof is non-constructive and the actual density appears to be much higher than this bound. Erdos conjectured the count up to  $x$  is  $x^{1-o(1)}$ , but this remains open.</p>\n<h2>Solovay-Strassen: The First Probabilistic Test</h2>\n<p>In 1977, Robert Solovay and Volker Strassen gave the first polynomial-time probabilistic primality test with a rigorously proven error bound. Their test uses the Euler criterion, which connects modular exponentiation to the Legendre symbol.</p>\n<p><strong>Lemma</strong> (Euler criterion). For an odd prime  $p$  and  $\\gcd(a, p) = 1$ :</p>\n\n\n$$a^{(p-1)/2} \\equiv \\left(\\frac{a}{p}\\right) \\pmod{p}$$\n\n<p>where  $\\left(\\frac{a}{p}\\right)$  is the Legendre symbol ( $+1$  if  $a$  is a quadratic residue mod  $p$ ,  $-1$  otherwise).</p>\n<p><em>Proof sketch.</em> Since  $\\mathbb{F}_p^\\times$  is cyclic of order  $p-1$ , we have  $a^{p-1} \\equiv 1$ , so  $a^{(p-1)/2}$  is a square root of 1, hence  $\\pm 1$ . It equals  $+1$  if and only if  $a = g^{2k}$  for some generator  $g$  (i.e.,  $a$  is a quadratic residue), because  $a^{(p-1)/2} = g^{k(p-1)} = 1$  when  $k$  is even and  $g^{k(p-1)} = g^{(p-1)/2 \\cdot \\text{odd}} = -1$  when  $k$  is odd.</p>\n<p>The Solovay-Strassen test: pick a random  $a$ , compute both  $a^{(n-1)/2} \\bmod n$  and the Jacobi symbol  $\\left(\\frac{a}{n}\\right)$  (which can be computed in  $O(\\log^2 n)$  by reciprocity, without knowing the factorization of  $n$ ). If they disagree,  $n$  is composite.</p>\n<p>Why this breaks the Carmichael barrier: a Carmichael number  $n$  satisfies  $a^{n-1} \\equiv 1 \\pmod{n}$  for all  $a$  coprime to  $n$ , but this does not force  $a^{(n-1)/2} \\equiv \\left(\\frac{a}{n}\\right)$ . The Jacobi symbol factors multiplicatively over the prime factors of  $n$ , while the power  $a^{(n-1)/2}$  does not decompose the same way. Solovay and Strassen proved that for any composite  $n$ , at least half of all bases  $a$  in  $\\{1, \\ldots, n-1\\}$  are witnesses -- the Euler criterion fails for them. This gives a  $1/2$  error bound per round, worse than Miller-Rabin's  $1/4$  but sufficient to bypass Carmichael numbers entirely.</p>\n<h2>Strong Pseudoprimes and Miller's Test</h2>\n<p>In 1976, Gary Miller observed that we can strengthen the Fermat test by exploiting a deeper property of primes.</p>\n<p>Write  $n - 1 = 2^s \\cdot d$  where  $d$  is odd. For a prime  $p$  and base  $a$  coprime to  $p$ , one of the following must hold:</p>\n\n\n$$a^d \\equiv 1 \\pmod{p}$$\n\n<p>or</p>\n\n\n$$a^{2^r d} \\equiv -1 \\pmod{p} \\quad \\text{for some } 0 \\le r < s$$\n\n<p>Why must one of these hold? Start from Fermat:  $a^{p-1} = a^{2^s d} \\equiv 1 \\pmod{p}$ . Now consider the squaring chain:</p>\n\n\n$$a^d, \\; a^{2d}, \\; a^{4d}, \\; \\ldots, \\; a^{2^s d}$$\n\n<p>The last term is 1. Each term is the square of the previous. Working backwards from  $a^{2^s d} \\equiv 1$ : if  $a^{2^r d} \\equiv 1$ , then  $a^{2^{r-1} d}$  is a square root of 1 modulo  $p$ . In a field (which  $\\mathbb{Z}/p\\mathbb{Z}$  is, since  $p$  is prime), the polynomial  $x^2 - 1$  has at most two roots:  $+1$  and  $-1$ . So either the previous term is also 1 (and we continue backwards) or it is  $-1$  (and we stop). Eventually we either reach  $a^d \\equiv 1$  or find some  $a^{2^r d} \\equiv -1$ .</p>\n<p>The critical point: for a composite  $n$ ,  $\\mathbb{Z}/n\\mathbb{Z}$  is <em>not</em> a field, and  $x^2 \\equiv 1 \\pmod{n}$  can have more than two solutions. For example, modulo  $n = 15$ , we have  $4^2 = 16 \\equiv 1$  -- a &quot;non-trivial square root of unity.&quot; This is the crack that Miller's test exploits.</p>\n<p><strong>Worked example:  $n = 341$ ,  $a = 2$ .</strong> Recall that 341 passes the Fermat test for base 2. Write  $341 - 1 = 340 = 2^2 \\times 85$ , so  $s = 2$  and  $d = 85$ . The squaring chain is:</p>\n\n\n$$2^{85} \\bmod 341 = 32, \\quad 32^2 \\bmod 341 = 1024 \\bmod 341 = 1$$\n\n<figure>\n  <img src=\"https://attobop.net/posts/primality-testing/squaring_comparison.png\" alt=\"Squaring chains compared: prime n=97 reaches -1 cleanly, composite n=341 has a non-trivial square root\" width=\"560\">\n  <figcaption> Top: prime n=97, the chain reaches -1 at the third squaring -- normal behavior in a field. Bottom: composite n=341, the chain jumps from 32 to 1 without passing through +/-1. The non-trivial square root witnesses compositeness.</figcaption>\n</figure>\n<p>The chain went from 32 to 1 in one squaring. But  $32 \\not\\equiv 1$  and  $32 \\not\\equiv -1 \\pmod{341}$  (since  $-1 \\equiv 340$ ). So 32 is a non-trivial square root of 1 modulo 341 -- exactly the kind of witness that cannot exist modulo a prime. Miller's test correctly identifies 341 as composite. The Fermat test missed this because it only checks the final value  $2^{340} \\bmod 341 = 1$ , discarding the information in the intermediate steps.</p>\n<p>A composite number that passes this stronger test for base  $a$  is called a <em>strong pseudoprime</em> to base  $a$ . Unlike the Fermat test, there are no &quot;strong Carmichael numbers&quot; -- no composite passes for all bases.</p>\n<h2>The Miller-Rabin Test</h2>\n<p>In 1980, Michael Rabin made Miller's test probabilistic and proved a key result: for any composite  $n$ , at most  $1/4$  of bases  $a$  in  $\\{2, \\ldots, n-2\\}$  are <em>strong liars</em> (bases for which  $n$  passes the strong test).</p>\n<p><strong>Theorem</strong> (Rabin, 1980). If  $n$  is an odd composite, the number of strong liars in  $\\{1, \\ldots, n-1\\}$  is at most  $\\frac{n-1}{4}$ .</p>\n<p><em>Proof sketch.</em> Write  $n - 1 = 2^s d$  with  $d$  odd. By the Chinese Remainder Theorem,  $(\\mathbb{Z}/n\\mathbb{Z})^\\times \\cong \\prod_{i} (\\mathbb{Z}/p_i^{e_i}\\mathbb{Z})^\\times$ . A strong liar  $a$  must have its squaring chain &quot;look prime&quot; -- either  $a^d \\equiv 1$  or  $a^{2^r d} \\equiv -1$  for some  $r$ .</p>\n<p>The key constraint:  $a^{2^r d} \\equiv -1 \\pmod{n}$  requires  $a^{2^r d} \\equiv -1$  modulo <em>every</em> prime power factor simultaneously. But each factor has its own 2-adic structure ( $p_i - 1 = 2^{s_i} d_i$  with possibly different  $s_i$ ), so the squaring chains must &quot;synchronize&quot; across all factors at the same step  $r$ . The strong liars form a subgroup of  $(\\mathbb{Z}/n\\mathbb{Z})^\\times$ . The synchronization constraint forces this subgroup to have index at least 4.</p>\n<p>The worst case is  $n = pq$  with  $p \\equiv q \\equiv 3 \\pmod{4}$  and  $\\gcd(p-1, q-1) = 2$ , which gives exactly  $\\frac{n-1}{4}$  strong liars. All other composites have fewer.</p>\n<p>This means:</p>\n<ul>\n<li>Each iteration has at most  $1/4$  probability of being fooled</li>\n<li>After  $k$  iterations with independent random bases, the error probability is at most  $(1/4)^k$ </li>\n<li>With 40 iterations, the error probability is less than  $10^{-24}$ </li>\n</ul>\n<p>But most composites are caught by the first random base. Damgard, Landrock, and Pomerance (1993) showed that for random odd  $k$ -bit composites, the average-case error probability after  $t$  rounds drops as  $\\exp(-c\\sqrt{kt})$  for a constant  $c$ , much faster than the worst-case  $4^{-t}$ . For 1024-bit numbers with even a single round, the average-case error probability is negligible.</p>\n<p>Unlike the Fermat test, Miller-Rabin is <em>not</em> fooled by Carmichael numbers. For any Carmichael number  $n$ , at least 75% of bases reveal its compositeness through the strong pseudoprime check. The &quot;non-trivial square roots of unity&quot; that exist in  $\\mathbb{Z}/n\\mathbb{Z}$  (but not in  $\\mathbb{Z}/p\\mathbb{Z}$ ) always provide witnesses.</p>\n<h3>Implementation</h3>\n<p>Modular exponentiation by repeated squaring makes each round efficient. Here is a complete Miller-Rabin test in Python:</p>\n<pre><code class=\"language-python\">import random\n\ndef is_probable_prime(n, k=40):\n    &quot;&quot;&quot;Miller-Rabin with k rounds. False = composite, True = probably prime.&quot;&quot;&quot;\n    if n &lt; 2: return False\n    if n &lt; 4: return True\n    if n % 2 == 0: return False\n    # Write n - 1 = 2^s * d with d odd\n    d, s = n - 1, 0\n    while d % 2 == 0:\n        d //= 2\n        s += 1\n    for _ in range(k):\n        a = random.randrange(2, n - 1)\n        x = pow(a, d, n)          # a^d mod n\n        if x == 1 or x == n - 1:\n            continue\n        for _ in range(s - 1):\n            x = pow(x, 2, n)      # square mod n\n            if x == n - 1:\n                break\n        else:\n            return False           # composite witness found\n    return True\n</code></pre>\n<p>Python's built-in three-argument <code>pow(a, d, n)</code> performs modular exponentiation by repeated squaring internally -- it does not compute  $a^d$  and then reduce. This makes the implementation efficient even for thousand-bit inputs. With  $k = 40$  rounds, the error probability is below  $4^{-40} \\approx 10^{-24}$ , which is the typical choice for cryptographic key generation.</p>\n<p>The <code>for/else</code> construct is Python-specific: the <code>else</code> clause executes only if the inner <code>for</code> loop completes without <code>break</code>, meaning none of the squarings produced  $-1$ , so  $a$  is a witness to compositeness.</p>\n<h2>Deterministic Variants</h2>\n<h3>Miller's test under GRH</h3>\n<p>Miller's original 1976 test was not probabilistic -- it was deterministic, conditional on the extended Riemann hypothesis (ERH). Under ERH, Miller proved that for any composite  $n$ , there exists a witness  $a < 2 (\\ln n)^2$ . This means you only need to test bases up to  $O((\\log n)^2)$ , giving a deterministic polynomial-time test -- <em>if</em> ERH is true.</p>\n<p>ERH remains unproven. The conditional result ties the difficulty of deterministic primality testing to the distribution of zeros of the Riemann zeta function.</p>\n<h3>Known deterministic witness sets</h3>\n<p>Even without ERH, exhaustive computation has established that specific small sets of bases suffice for bounded ranges. These results turn Miller-Rabin into a deterministic test for numbers below each bound:</p>\n<table>\n<thead>\n<tr>\n<th>Bound on  $n$ </th>\n<th>Sufficient bases</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td> $< 2{,}047$ </td>\n<td> $\\{2\\}$ </td>\n</tr>\n<tr>\n<td> $< 1{,}373{,}653$ </td>\n<td> $\\{2, 3\\}$ </td>\n</tr>\n<tr>\n<td> $< 3{,}215{,}031{,}751$ </td>\n<td> $\\{2, 3, 5, 7\\}$ </td>\n</tr>\n<tr>\n<td> $< 3{,}317{,}044{,}064{,}679{,}887{,}385{,}961{,}981$ </td>\n<td> $\\{2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41\\}$ </td>\n</tr>\n</tbody>\n</table>\n<p>The last bound is approximately  $3.3 \\times 10^{24}$ . For 64-bit integers ( $< 2^{64} \\approx 1.8 \\times 10^{19}$ ), the first 12 primes as bases give a deterministic test. This is how many implementations handle &quot;small&quot; numbers: no randomness needed, no probability of error, and the test runs in microseconds.</p>\n<p>The smallest strong pseudoprime to base 2 is 2047 -- which is why base  $\\{2\\}$  alone suffices below that threshold.</p>\n<h3>BPSW: the pragmatic gold standard</h3>\n<p>The <a href=\"https://en.wikipedia.org/wiki/Baillie%E2%80%93PSW_primality_test\">Baillie-PSW test</a>, introduced in 1980 by Robert Baillie, Carl Pomerance, John Selfridge, and Samuel Wagstaff, combines two tests that fail on disjoint sets of composites:</p>\n<ol>\n<li>A Miller-Rabin test to base 2</li>\n<li>A strong Lucas probable prime test (with a specific parameter selection method)</li>\n</ol>\n<p>The idea is that a composite which is a strong pseudoprime to base 2 is unlikely to also be a strong Lucas pseudoprime, and vice versa. This is not just a hope: the number-theoretic properties that make a number a Fermat/Miller-Rabin pseudoprime (multiplicative order structure) are largely independent of those that make it a Lucas pseudoprime (quadratic residue structure).</p>\n<p>The Lucas test works in a different algebraic setting. Given parameters  $P$  and  $Q$  with discriminant  $D = P^2 - 4Q$ , define the Lucas sequences  $U_k(P,Q)$  and  $V_k(P,Q)$  via the recurrence  $U_{k+1} = PU_k - QU_{k-1}$ . For a prime  $p$  with Jacobi symbol  $(D/p) = -1$ , we have  $U_{p+1} \\equiv 0 \\pmod{p}$ . The &quot;strong&quot; variant adds a squaring chain similar to Miller-Rabin. The parameter selection (Selfridge's Method A: choose the first  $D$  in the sequence  $5, -7, 9, -11, \\ldots$  such that  $(D/n) = -1$ ) is designed to maximize the independence between the Miller-Rabin and Lucas tests.</p>\n<p>No BPSW counterexample has ever been found. The original 1980 paper offered a $30 reward for a counterexample (from Pomerance, Selfridge, and Wagstaff). In 2021, Baillie, Fiori, and Wagstaff offered $2,000 for a counterexample to a strengthened variant. Neither reward has been claimed. Exhaustive search has verified that no counterexample exists below  $2^{64}$ .</p>\n<p>BPSW requires only two tests (one modular exponentiation + one Lucas computation), making it faster than running 40 rounds of Miller-Rabin. It is the default primality test in many implementations, including <code>Mathematica</code>, <code>Maple</code>, and <a href=\"https://www.sagemath.org/\"><code>SageMath</code></a>.</p>\n<h2>Beyond Miller-Rabin</h2>\n<p>Miller-Rabin and BPSW are probabilistic (or deterministic only for bounded ranges). For some applications -- primality certificates, mathematical proofs, record-keeping -- we need tests that <em>prove</em> primality unconditionally.</p>\n<h3>AKS: theoretical breakthrough</h3>\n<p>In 2002, Manindra Agrawal, Neeraj Kayal, and Nitin Saxena proved that primality is in P: there exists a deterministic polynomial-time algorithm that decides primality without any unproven hypothesis. Their test, known as AKS, was a landmark in computational number theory.</p>\n<p>The original algorithm runs in  $\\tilde{O}((\\log n)^{12})$  time. Lenstra and Pomerance improved this to  $\\tilde{O}((\\log n)^6)$  in 2005. The key idea starts from a classical observation: if  $n$  is prime, then the polynomial identity</p>\n\n\n$$(x + a)^n \\equiv x^n + a \\pmod{n}$$\n\n<p>holds in  $\\mathbb{Z}_n[x]$  for all  $a$  (this is essentially the Frobenius endomorphism). For composite  $n$ , this identity generally fails. But checking it directly requires working with a polynomial of degree  $n$  -- which is as expensive as trial division.</p>\n<p>AKS's insight is to check this identity modulo a polynomial  $x^r - 1$  for a suitable small  $r$ : verify  $(x + a)^n \\equiv x^{n \\bmod r} + a$  in  $\\mathbb{Z}_n[x]/(x^r - 1)$ . Now the polynomials have degree at most  $r$ , and  $r$  can be chosen as  $O((\\log n)^5)$  or better.</p>\n<p>The key lemma: if  $(x + a)^n \\equiv x^n + a \\pmod{n, x^r - 1}$  holds for  $a = 1, 2, \\ldots, \\lfloor\\sqrt{\\phi(r)} \\log n\\rfloor$ , and  $n$  has no prime factor  $\\le r$ , and  $n$  is not a perfect power, then  $n$  is prime. The proof constructs a group  $G$  of residues in  $(\\mathbb{Z}/n\\mathbb{Z})[x]/(x^r - 1)$  generated by  $\\{x + 1, x + 2, \\ldots\\}$  and shows that  $|G|$  is both large (from the many verified identities) and small (from the structure of the quotient ring), forcing  $n$  to be prime. The total number of checks is polynomial in  $\\log n$ .</p>\n<p>AKS is not used in practice. For 64-bit numbers, BPSW is orders of magnitude faster. For larger numbers, ECPP (below) is faster and also produces a certificate. AKS is a &quot;galactic algorithm&quot; -- polynomial in theory, but with constants so large that Miller-Rabin with 64 rounds is faster by orders of magnitude even for 1024-bit inputs. Its significance is theoretical, settling a long-standing complexity question, not providing a practical tool.</p>\n<h3>ECPP: the practical gold standard for proven primes</h3>\n<p>Elliptic Curve Primality Proving (ECPP) was developed by Shafi Goldwasser and Joe Kilian in 1986, and refined into a practical algorithm by A. O. L. Atkin and Francois Morain in 1993. It runs in  $\\tilde{O}((\\log n)^4)$  heuristic time and produces a primality certificate -- a compact proof that anyone can verify much faster than it took to produce.</p>\n<p>The core idea uses elliptic curves over  $\\mathbb{Z}/n\\mathbb{Z}$ . Given a point  $P$  on an elliptic curve  $E$  modulo  $n$ , if we can show that the group order  $|E(\\mathbb{Z}/n\\mathbb{Z})|$  has a large prime factor  $q$ , and certain conditions hold, then either  $n$  is prime or  $n$  has a very small factor (which we can check by trial division). The problem then reduces to proving  $q$  is prime -- a smaller instance of the same problem. This recursive structure terminates quickly.</p>\n<p>The certificate is an Atkin-Goldwasser-Kilian-Morain certificate: a chain of elliptic curves and points that witnesses primality at each recursive step. Verification runs in polynomial time and does not require trusting the prover's computation. This is the key distinction from probabilistic tests: ECPP does not say &quot;probably prime with high confidence.&quot; It says &quot;here is a proof; check it yourself.&quot; The certificate for a 10,000-digit prime might be a few megabytes, but verifying it takes minutes rather than the hours needed to produce it.</p>\n<p>As of 2025, ECPP holds the record for the largest proven prime with a general-purpose algorithm:  $R(109297) = (10^{109297} - 1)/9$ , a repunit with 109,297 digits. (Mersenne primes are tested with the deterministic Lucas-Lehmer test, which exploits their special form.)</p>\n<h3>Comparison</h3>\n<table>\n<thead>\n<tr>\n<th>Test</th>\n<th>Type</th>\n<th>Complexity</th>\n<th>Certainty</th>\n<th>Practical use</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Trial division</td>\n<td>Deterministic</td>\n<td> $O(\\sqrt{n})$ </td>\n<td>Proven</td>\n<td>Small  $n$  only</td>\n</tr>\n<tr>\n<td>Fermat</td>\n<td>Probabilistic</td>\n<td> $O(k \\log^3 n)$ </td>\n<td>Probable (with blind spots)</td>\n<td>Obsolete</td>\n</tr>\n<tr>\n<td>Solovay-Strassen</td>\n<td>Probabilistic</td>\n<td> $O(k \\log^3 n)$ </td>\n<td>Error  $\\le 2^{-k}$ </td>\n<td>Superseded by Miller-Rabin</td>\n</tr>\n<tr>\n<td>Miller-Rabin</td>\n<td>Probabilistic</td>\n<td> $O(k \\log^3 n)$ </td>\n<td>Error  $\\le 4^{-k}$ </td>\n<td>General purpose</td>\n</tr>\n<tr>\n<td>BPSW</td>\n<td>Probabilistic</td>\n<td> $O(\\log^3 n)$ </td>\n<td>No known counterexample</td>\n<td>Default in most libraries</td>\n</tr>\n<tr>\n<td>Miller (GRH)</td>\n<td>Deterministic (conditional)</td>\n<td> $O(\\log^5 n)$ </td>\n<td>Proven <em>if</em> ERH holds</td>\n<td>Theoretical</td>\n</tr>\n<tr>\n<td>AKS</td>\n<td>Deterministic</td>\n<td> $\\tilde{O}(\\log^6 n)$ </td>\n<td>Proven</td>\n<td>Not practical</td>\n</tr>\n<tr>\n<td>ECPP</td>\n<td>Deterministic</td>\n<td> $\\tilde{O}(\\log^4 n)$  heuristic</td>\n<td>Proven + certificate</td>\n<td>Largest proven primes</td>\n</tr>\n</tbody>\n</table>\n<h2>Timeline</h2>\n<table>\n<thead>\n<tr>\n<th>Year</th>\n<th>Event</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>1640</td>\n<td>Fermat states his &quot;little theorem&quot;</td>\n</tr>\n<tr>\n<td>1885</td>\n<td>Simerka finds first Carmichael numbers (work overlooked)</td>\n</tr>\n<tr>\n<td>1899</td>\n<td>Korselt gives characterization criterion</td>\n</tr>\n<tr>\n<td>1910</td>\n<td>Carmichael discovers 561 independently</td>\n</tr>\n<tr>\n<td>1976</td>\n<td>Miller introduces strong pseudoprime test (deterministic under ERH)</td>\n</tr>\n<tr>\n<td>1977</td>\n<td>Solovay and Strassen give first polynomial-time probabilistic test with proven  $1/2$  error bound</td>\n</tr>\n<tr>\n<td>1980</td>\n<td>Rabin proves probabilistic  $1/4$  error bound</td>\n</tr>\n<tr>\n<td>1980</td>\n<td>Baillie, Pomerance, Selfridge, Wagstaff propose BPSW test</td>\n</tr>\n<tr>\n<td>1986</td>\n<td>Goldwasser and Kilian develop elliptic curve primality proving</td>\n</tr>\n<tr>\n<td>1993</td>\n<td>Atkin and Morain refine ECPP into a practical algorithm</td>\n</tr>\n<tr>\n<td>1994</td>\n<td>Alford-Granville-Pomerance prove infinitely many Carmichael numbers</td>\n</tr>\n<tr>\n<td>2002</td>\n<td>Agrawal-Kayal-Saxena prove primality is in P</td>\n</tr>\n</tbody>\n</table>\n<h2>The Underlying Structure: Finite Fields</h2>\n<p>The preceding sections described <em>what</em> each test checks. This section explains <em>why</em> those checks work -- the algebraic structure that makes primes distinguishable from composites.</p>\n<p>When  $p$  is prime, the integers modulo  $p$  form a <strong>finite field</strong>  $\\mathbb{F}_p$ . Every nonzero element has a multiplicative inverse, and Fermat's little theorem tells us the multiplicative group  $\\mathbb{F}_p^\\times$  has order  $p-1$ .</p>\n<h3>The multiplicative group is cyclic</h3>\n<p>A deeper fact:  $\\mathbb{F}_p^\\times$  is cyclic. There exists a generator  $g$  (called a <em>primitive root</em>) such that every nonzero element of  $\\mathbb{F}_p$  is a power of  $g$ :</p>\n\n\n$$\\mathbb{F}_p^\\times = \\{g^0, g^1, g^2, \\ldots, g^{p-2}\\}$$\n\n<p>This is not obvious. Here is the proof.</p>\n<p><strong>Theorem.</strong> For any prime  $p$ , the group  $\\mathbb{F}_p^\\times$  is cyclic.</p>\n<p><em>Proof.</em> For each divisor  $d$  of  $p-1$ , let  $\\psi(d) = |\\{a \\in \\mathbb{F}_p^\\times : \\text{ord}(a) = d\\}|$ . Every element has some order dividing  $p-1$ , so  $\\sum_{d \\mid (p-1)} \\psi(d) = p-1$ .</p>\n<p>Now we claim  $\\psi(d) \\le \\phi(d)$ . If  $\\psi(d) \\ge 1$ , there exists an element  $g$  of order  $d$ . The  $d$  elements  $g^0, g^1, \\ldots, g^{d-1}$  are all roots of  $x^d - 1$ . Since  $\\mathbb{F}_p$  is a field,  $x^d - 1$  has at most  $d$  roots, so these are <em>all</em> the elements of order dividing  $d$ . Among  $g^0, \\ldots, g^{d-1}$ , exactly  $\\phi(d)$  have order exactly  $d$  (those  $g^k$  with  $\\gcd(k, d) = 1$ ). So  $\\psi(d) \\in \\{0, \\phi(d)\\}$ .</p>\n<p>Since  $\\sum_{d \\mid (p-1)} \\psi(d) = p - 1 = \\sum_{d \\mid (p-1)} \\phi(d)$  (the latter is a standard identity), and  $\\psi(d) \\le \\phi(d)$  for all  $d$ , we must have  $\\psi(d) = \\phi(d)$  for every  $d$ . In particular,  $\\psi(p-1) = \\phi(p-1) \\ge 1$ , so primitive roots exist.</p>\n<h3>Why cyclicity matters for Miller's test</h3>\n<p>The cyclicity of  $\\mathbb{F}_p^\\times$  is what makes the squaring chain in Miller's test work. Write every element as  $g^j$  for some  $j$ . Then  $a = g^j$  and:</p>\n\n\n$$a^d = g^{jd}, \\quad a^{2d} = g^{2jd}, \\quad \\ldots, \\quad a^{2^s d} = g^{2^s jd}$$\n\n<p>The condition  $a^{2^s d} = 1$  means  $g^{2^s jd} = g^0$ , i.e.,  $(p-1) \\mid 2^s jd$ . Since  $p - 1 = 2^s d$ , this is automatic. Now work backwards. At each step, squaring in the cyclic group means doubling the exponent mod  $p-1$ . The only elements whose square is 1 are  $g^0 = 1$  and  $g^{(p-1)/2} = -1$ . This is exactly the &quot;only two square roots of unity&quot; property that Miller's test uses, and it holds <em>because</em> the group is cyclic of known order.</p>\n<p>For composite  $n$ , the group  $(\\mathbb{Z}/n\\mathbb{Z})^\\times$  is not cyclic in general (by the Chinese Remainder Theorem, it decomposes as a product of the groups for each prime power factor). This product structure creates extra square roots of unity -- elements  $x$  with  $x^2 \\equiv 1 \\pmod{n}$  but  $x \\not\\equiv \\pm 1 \\pmod{n}$ . These non-trivial square roots are the witnesses that Miller-Rabin detects.</p>\n<h3>Beyond  $\\mathbb{F}_p$ </h3>\n<p>For prime powers  $q = p^k$ , there exist unique finite fields  $\\mathbb{F}_q$  of each size (constructed as polynomial quotient rings, not just  $\\mathbb{Z}/q\\mathbb{Z}$ ). Their multiplicative groups are also cyclic of order  $q - 1$ .</p>\n<p>The connection between finite fields and primality testing runs deep. The Fermat test is really a test of multiplicative group order. Miller-Rabin refines this by probing the 2-Sylow subgroup structure. The Lucas test probes quadratic extensions  $\\mathbb{F}_{p^2}$ . ECPP uses the group structure of elliptic curves over  $\\mathbb{F}_p$ . Each successive test exploits more algebraic structure of finite fields to distinguish primes from composites.</p>\n<h2>Open Problems</h2>\n<h3>Density of pseudoprimes</h3>\n<p>There are infinitely many Fermat pseudoprimes to base 2. In fact, the count of base-2 pseudoprimes up to  $x$  is  $x / L(x)^{1/2 + o(1)}$  where  $L(x) = e^{\\log x \\cdot \\log \\log \\log x / \\log \\log x}$  -- a result of Pomerance (1981). This grows faster than any power of  $\\log x$  but slower than any positive power of  $x$ . In practical terms: pseudoprimes are sparse but not negligibly so.</p>\n<p>For strong pseudoprimes (the kind Miller-Rabin uses), the density is lower still. The count of base-2 strong pseudoprimes below  $x$  appears to grow roughly as  $x^{1/3}$  empirically, though the exact asymptotics are not fully established.</p>\n<p>Carmichael numbers (pseudoprimes to <em>all</em> coprime bases, a much stricter condition) have a proven lower bound of  $x^{2/7}$  (Alford-Granville-Pomerance 1994). Erdos conjectured the true count up to  $x$  is  $x^{1-o(1)}$ , but this remains open. Pinch computed the count up to  $10^{18}$  as 1,401,644 -- rare compared to Fermat pseudoprimes to any single base, but not as rare as one might expect.</p>\n<h3>The BPSW question</h3>\n<p>The central open problem in practical primality testing: does a BPSW counterexample exist?</p>\n<p>Heuristic arguments suggest counterexamples should exist but be astronomically rare. Pomerance has estimated that if they exist, the smallest might exceed  $10^{10000}$ . Exhaustive search has checked all numbers below  $2^{64}$  with no counterexample found.</p>\n<p>The $2,000 bounty (Baillie-Fiori-Wagstaff, 2021) remains unclaimed. If no counterexample exists, proving this would likely require new techniques in analytic number theory. If one exists, finding it would likely require new techniques in constructive algebra.</p>\n<p>Either resolution would be significant. In the meantime, BPSW occupies a peculiar status: universally trusted in practice, with no theoretical proof of correctness, and with heuristic arguments that counterexamples should exist but haven't been found.</p>\n<h3>How real systems test primality</h3>\n<p>The gap between theory and practice matters here. Real systems do not just run Miller-Rabin in a loop. The full pipeline for generating an RSA prime in <a href=\"https://github.com/openssl/openssl\"><code>OpenSSL</code></a> (as of version 3.x) looks like this:</p>\n<ol>\n<li>Generate a random odd number  $n$  of the desired bit length, with the top two bits set (to ensure the product  $pq$  has the right bit length).</li>\n<li>Check divisibility by a table of small primes (typically the first 2048 primes). This is trial division with a fixed bound, and it rejects about 80% of candidates cheaply.</li>\n<li>Run one round of Miller-Rabin with base 2.</li>\n<li>Run a Lucas test (the second half of BPSW).</li>\n<li>Depending on the security level and standard being followed, run additional Miller-Rabin rounds with random bases.</li>\n</ol>\n<p><a href=\"https://gmplib.org/\">GMP</a>'s <code>mpz_probab_prime_p</code> function follows a similar structure: trial division, then BPSW, then optional additional Miller-Rabin rounds requested by the caller.</p>\n<p>The <a href=\"https://csrc.nist.gov/pubs/fips/186-5/final\">FIPS 186-5</a> standard (the current revision for US government systems) specifies Miller-Rabin with iteration counts that depend on the required error probability and the bit length of the candidate. For 1024-bit primes (half of a 2048-bit RSA modulus), FIPS requires enough rounds to achieve an error probability below  $2^{-112}$ .</p>\n<p><code>libsodium</code>, used widely for NaCl-family cryptography, relies on Ed25519 (which uses a fixed curve, not generated primes) for signatures, so primality testing is not in its hot path. But for applications that do need prime generation, libsodium defers to the OS or to a constant-time Miller-Rabin implementation.</p>\n<p>An important engineering detail: the primality test itself is rarely the bottleneck. For a 2048-bit RSA modulus, the expected number of random candidates before finding a prime is approximately  $\\ln(2^{1024}) \\approx 710$  (by the prime number theorem). Each candidate undergoes trial division first, which filters out most composites in microseconds. Only the ~20% that survive trial division reach the expensive probabilistic test. End-to-end, generating both primes typically takes a few hundred milliseconds on modern hardware.</p>\n<p>Every step in this pipeline must run in constant time. If the Miller-Rabin test takes longer for primes than composites (because primes survive more squaring steps), an attacker observing timing can learn information about the generated primes. Production implementations use Montgomery multiplication with fixed iteration counts -- the same number of multiplications regardless of intermediate values. This is a first-order correctness requirement, not a performance optimization.</p>\n<h2>References</h2>\n<ul>\n<li>Solovay, R. &amp; Strassen, V. (1977). &quot;A Fast Monte-Carlo Test for Primality.&quot; <em>SIAM Journal on Computing</em>, 6(1), 84-85.</li>\n<li>Miller, G. L. (1976). &quot;Riemann's Hypothesis and Tests for Primality.&quot; <em>Journal of Computer and System Sciences</em>, 13(3), 300-317.</li>\n<li>Rabin, M. O. (1980). &quot;Probabilistic Algorithm for Testing Primality.&quot; <em>Journal of Number Theory</em>, 12(1), 128-138.</li>\n<li>Baillie, R. &amp; Wagstaff, S. S. (1980). &quot;Lucas Pseudoprimes.&quot; <em>Mathematics of Computation</em>, 35(152), 1391-1417.</li>\n<li>Pomerance, C., Selfridge, J. L. &amp; Wagstaff, S. S. (1980). &quot;The Pseudoprimes to 25 x 10^9.&quot; <em>Mathematics of Computation</em>, 35(151), 1003-1026.</li>\n<li>Alford, W. R., Granville, A. &amp; Pomerance, C. (1994). &quot;There Are Infinitely Many Carmichael Numbers.&quot; <em>Annals of Mathematics</em>, 139(3), 703-722.</li>\n<li>Agrawal, M., Kayal, N. &amp; Saxena, N. (2004). &quot;PRIMES Is in P.&quot; <em>Annals of Mathematics</em>, 160(2), 781-793.</li>\n<li>Lenstra, H. W. &amp; Pomerance, C. (2005). &quot;Primality Testing with Gaussian Periods.&quot; Manuscript.</li>\n<li>Atkin, A. O. L. &amp; Morain, F. (1993). &quot;Elliptic Curves and Primality Proving.&quot; <em>Mathematics of Computation</em>, 61(203), 29-68.</li>\n<li>Goldwasser, S. &amp; Kilian, J. (1986). &quot;Almost All Primes Can Be Quickly Certified.&quot; <em>Proceedings of the 18th STOC</em>, 316-329.</li>\n<li>Damgard, I., Landrock, P. &amp; Pomerance, C. (1993). &quot;Average Case Error Estimates for the Strong Probable Prime Test.&quot; <em>Mathematics of Computation</em>, 61(203), 177-194.</li>\n<li>Pomerance, C. (1981). &quot;On the Distribution of Pseudoprimes.&quot; <em>Mathematics of Computation</em>, 37(156), 587-593.</li>\n<li>Crandall, R. &amp; Pomerance, C. (2005). <em>Prime Numbers: A Computational Perspective</em>. 2nd ed. Springer.</li>\n<li>Baillie, R., Fiori, A. &amp; Wagstaff, S. S. (2021). &quot;Strengthening the Baillie-PSW Primality Test.&quot; <em>Mathematics of Computation</em>, 90(330), 1931-1955.</li>\n<li><a href=\"https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test\">Miller-Rabin primality test</a> (Wikipedia)</li>\n<li><a href=\"https://en.wikipedia.org/wiki/Carmichael_number\">Carmichael numbers</a> (Wikipedia)</li>\n<li><a href=\"https://en.wikipedia.org/wiki/Finite_field\">Finite field</a> (Wikipedia)</li>\n</ul>\n","date_published":"Tue, 17 Dec 2024 00:00:00 GMT"},{"id":"https://attobop.net/posts/difference-tables/","url":"https://attobop.net/posts/difference-tables/","title":"The Secret Life of Difference Tables","content_html":"<p>Take any sequence of integers. Subtract each term from the next. You get a new sequence. Do it again. And again. What emerges is a <em>difference table</em>---and it reveals far more about your sequence than you might expect.</p>\n<!--more-->\n<h2>A Simple Game</h2>\n<p>Let's play with the triangular numbers: 1, 3, 6, 10, 15, 21, ...</p>\n<p>These are <em>figurate numbers</em> -- they count dots in regular geometric arrangements. The triangular numbers count dots in triangles:</p>\n<pre><code>    •          •          •\n              • •        • •\n                        • • •\n    1          3          6\n</code></pre>\n<p>Now build the difference table. Write the sequence, then below each pair write their difference:</p>\n<figure>\n  <img src=\"https://attobop.net/posts/difference-tables/triangular_diff.png\" alt=\"Difference table of the triangular numbers\" width=\"480\">\n  <figcaption> The triangular numbers differenced: first differences are the counting numbers, second differences are constant at 1, third differences are zero. The left edge reconstructs the sequence via Newton</figcaption>\n</figure>\n<p>The first differences are the counting numbers. The second differences are all 1. The third differences are all 0.</p>\n<p>This is no accident. The triangular numbers are given by  $T_n = \\frac{n(n+1)}{2}$ , a quadratic polynomial---and <em>every</em> quadratic has constant second differences.</p>\n<h2>Background</h2>\n<p>The idea of systematically differencing a sequence is old. Newton used it in the 1660s for interpolation, but the roots go back further---to James Gregory (1668) and Henry Briggs (1624), who built logarithm tables by the method of finite differences. Thomas Harriot had the essential idea even earlier, in unpublished manuscripts from the 1610s.</p>\n<p>The practical motive was always numerical tables. Briggs computed his <em>Arithmetica Logarithmica</em> (1624) by exploiting the fact that consecutive values of a polynomial can be computed by addition alone---no multiplication needed---if you maintain a running difference table. The same idea powered the Nautical Almanac computations for two centuries, and it's what Babbage later mechanized in his Difference Engine.</p>\n<p>Newton's contribution was to see beyond the computational trick: he formulated the general interpolation formula and recognized the connection to the binomial series. In a 1676 letter to Leibniz (the <em>Epistola Prior</em>), Newton described how any function tabulated at integer points could be approximated by its differences, extending finite differences from a computational device to a theoretical tool.</p>\n<h3>Operators:  $\\Delta$ ,  $E$ , and  $D$ </h3>\n<p>The analogy with continuous calculus is not a coincidence. Define the <strong>forward difference operator</strong>  $\\Delta$  by</p>\n\n\n$$\\Delta f(x) = f(x+1) - f(x)$$\n\n<p>and the <strong>shift operator</strong>  $E$  by</p>\n\n\n$$E f(x) = f(x+1).$$\n\n<p>Then  $\\Delta = E - I$ , where  $I$  is the identity operator. This is the discrete analogue of the derivative: where  $\\frac{d}{dx}$  measures instantaneous rate of change,  $\\Delta$  measures the change over a unit step.</p>\n<p>The parallel runs deep. In continuous calculus,  $\\frac{d}{dx} x^n = n x^{n-1}$ . In discrete calculus,  $\\Delta$  acts cleanly not on ordinary powers  $x^n$  but on <strong>falling factorials</strong>:</p>\n\n\n$$x^{\\underline{n}} = x(x-1)(x-2)\\cdots(x-n+1)$$\n\n<p>The rule is  $\\Delta x^{\\underline{n}} = n \\cdot x^{\\underline{n-1}}$ , a perfect mirror of the power rule. The falling factorial plays the role of  $x^n$  in the discrete world, and the binomial coefficient  $\\binom{x}{n} = \\frac{x^{\\underline{n}}}{n!}$  plays the role of  $\\frac{x^n}{n!}$ .</p>\n<p>The parallel between discrete and continuous calculus is systematic:</p>\n<table>\n<thead>\n<tr>\n<th>Continuous</th>\n<th>Discrete</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Derivative  $\\frac{d}{dx}$ </td>\n<td>Forward difference  $\\Delta$ </td>\n</tr>\n<tr>\n<td> $x^n$ </td>\n<td>Falling factorial  $x^{\\underline{n}}$ </td>\n</tr>\n<tr>\n<td> $\\frac{d}{dx} x^n = n x^{n-1}$ </td>\n<td> $\\Delta x^{\\underline{n}} = n \\cdot x^{\\underline{n-1}}$ </td>\n</tr>\n<tr>\n<td> $\\int x^n dx = \\frac{x^{n+1}}{n+1} + C$ </td>\n<td> $\\Delta^{-1} x^{\\underline{n}} = \\frac{x^{\\underline{n+1}}}{n+1} + C$ </td>\n</tr>\n<tr>\n<td>Taylor series:  $f(x) = \\sum \\frac{f^{(k)}(0)}{k!} x^k$ </td>\n<td>Newton series:  $f(n) = \\sum \\Delta^k f(0) \\cdot \\binom{n}{k}$ </td>\n</tr>\n<tr>\n<td> $\\frac{x^n}{n!}$ </td>\n<td> $\\binom{x}{n} = \\frac{x^{\\underline{n}}}{n!}$ </td>\n</tr>\n<tr>\n<td> $e^x$  (eigenfunction of  $\\frac{d}{dx}$ )</td>\n<td> $2^x$  (eigenfunction of  $\\Delta$ :  $\\Delta 2^x = 2^x$ )</td>\n</tr>\n</tbody>\n</table>\n<p>The last row explains why powers of 2 have self-similar difference tables:  $\\Delta$  acts on  $2^x$  the way  $\\frac{d}{dx}$  acts on  $e^x$  -- it returns the function unchanged.</p>\n<p>Higher-order differences compose exactly as you'd expect:</p>\n\n\n$$\\Delta^n f(x) = \\sum_{k=0}^{n} \\binom{n}{k} (-1)^{n-k} f(x+k)$$\n\n<p>This follows from expanding  $\\Delta^n = (E - I)^n$  by the binomial theorem---which works because  $E$  and  $I$  commute. The formula says: to compute the  $n$ -th difference, take an alternating weighted sum of  $n+1$  consecutive values. This is exactly what building the difference table row by row computes, just expressed in closed form.</p>\n<p>The operator-theoretic viewpoint extends further. The formal relation between  $\\Delta$  and the continuous derivative  $D = \\frac{d}{dx}$  is  $\\Delta = e^D - 1$ , since  $Ef(x) = e^D f(x) = f(x+1)$  by Taylor's theorem. Inverting:  $D = \\ln(1 + \\Delta) = \\Delta - \\frac{\\Delta^2}{2} + \\frac{\\Delta^3}{3} - \\cdots$ . These formal power series in operators converge when applied to polynomials (which are annihilated by sufficiently high powers of  $\\Delta$  or  $D$ ), and they form the basis of the operator method in combinatorics.</p>\n<p>The backward difference operator  $\\nabla f(x) = f(x) - f(x-1)$  and the central difference operator  $\\delta f(x) = f(x + \\tfrac{1}{2}) - f(x - \\tfrac{1}{2})$  give rise to their own families of formulas (Stirling's interpolation, Gauss's formulas), but the forward difference is the most natural starting point.</p>\n<h2>The General Pattern</h2>\n<p>For any polynomial of degree  $d$ , the  $d$ -th differences are constant, and the  $(d+1)$ -th differences are zero.</p>\n<p>Try the cubes: 1, 8, 27, 64, 125, 216...</p>\n<pre><code>1     8    27    64   125   216\n   7    19    37    61    91\n     12    18    24    30\n        6     6     6\n           0     0\n</code></pre>\n<p>Third differences constant at 6. Fourth differences zero. The cubes are cubic polynomials---degree 3.</p>\n<p>The figurate number pattern continues: the <em>square pyramidal numbers</em> 1, 5, 14, 30, 55, 91 count balls stacked in a square pyramid. Each layer is a square number, so the pyramidal numbers are partial sums of squares:  $P_n = \\sum_{k=1}^{n} k^2$ . Their third differences are constant at 2, confirming degree 3 -- the sum of a degree-2 sequence is degree 3, exactly as the discrete antidifference predicts.</p>\n<p>This works in reverse too. If someone hands you a mystery sequence and its third differences are constant, you know it's generated by a cubic polynomial. You can even reconstruct the formula.</p>\n<p>Why does the  $d$ -th difference of a degree- $d$  polynomial come out constant? Because  $\\Delta$  reduces degree by exactly 1. To see this: if  $f(x) = cx^d + \\text{lower terms}$ , then  $\\Delta f(x) = c[(x+1)^d - x^d] = c[dx^{d-1} + \\text{lower terms}]$ , so  $\\deg(\\Delta f) = d - 1$  with leading coefficient  $cd$ . Applying this  $d$  times,  $\\Delta^d f(x) = c \\cdot d! $ , a constant. One more application gives  $\\Delta^{d+1} f(x) = 0$ . For the cubes,  $f(x) = x^3$  with  $c = 1$ , so  $\\Delta^3 f(x) = 3! = 6$ .</p>\n<p>This gives a practical test: <strong>if the  $k$ -th row of a difference table is constant, the sequence is generated by a polynomial of degree exactly  $k$ </strong>. The converse holds too---non-polynomial sequences (exponentials, factorials, primes) never terminate. (For primes: the prime counting function grows as  $n / \\ln n$  by the prime number theorem, which is not polynomial in the index, so no finite-degree polynomial can reproduce the prime sequence.)</p>\n<p>The calculus analogy extends to summation. Just as integration undoes differentiation, the <strong>antidifference</strong>  $\\Delta^{-1}$  (also called the indefinite sum) undoes  $\\Delta$ . Since  $\\Delta x^{\\underline{n}} = n \\cdot x^{\\underline{n-1}}$ , we get  $\\Delta^{-1} x^{\\underline{n}} = \\frac{x^{\\underline{n+1}}}{n+1} + C$ ---the discrete analogue of  $\\int x^n dx = \\frac{x^{n+1}}{n+1} + C$ . This gives closed forms for discrete sums:  $\\sum_{k=0}^{n-1} k^{\\underline{2}} = \\sum_{k=0}^{n-1} k(k-1) = \\frac{n(n-1)(n-2)}{3}$ , which you can verify is  $\\frac{x^{\\underline{3}}}{3}\\big|_0^n$ .</p>\n<p>Here is a short Python function that computes the difference table:</p>\n<pre><code class=\"language-python\">def difference_table(seq):\n    &quot;&quot;&quot;Return the difference table as a list of rows.&quot;&quot;&quot;\n    table = [list(seq)]\n    while len(table[-1]) &gt; 1:\n        row = table[-1]\n        table.append([row[i+1] - row[i] for i in range(len(row) - 1)])\n    return table\n\n# Triangular numbers\nfor row in difference_table([1, 3, 6, 10, 15, 21, 28]):\n    print(row)\n# [1, 3, 6, 10, 15, 21, 28]\n# [2, 3, 4, 5, 6, 7]\n# [1, 1, 1, 1, 1]\n# [0, 0, 0, 0]\n</code></pre>\n<h2>Newton's Forward Difference Formula</h2>\n<p>Any sequence can be reconstructed from its difference table. Specifically:</p>\n\n\n$$a_n = \\sum_{k=0}^{n} \\binom{n}{k} \\Delta^k a_0$$\n\n<p>where  $\\Delta^k a_0$  is the  $k$ -th entry on the left edge of the difference table.</p>\n<p>This is <strong>Newton's forward difference formula</strong>. It says: to reconstruct a sequence, take the left edge of its difference table and weight by binomial coefficients.</p>\n<h3>Derivation from the operator identity</h3>\n<p>The formula follows directly from the relation  $E = I + \\Delta$ . Applying  $E^n$  to  $f$  at  $x = 0$ :</p>\n\n\n$$f(n) = E^n f(0) = (I + \\Delta)^n f(0) = \\sum_{k=0}^{n} \\binom{n}{k} \\Delta^k f(0)$$\n\n<p>That's it. The binomial theorem for operators gives Newton's interpolation formula in one line.</p>\n<h3>Example: reconstruct a polynomial</h3>\n<p>For the triangular numbers, the left edge is 1, 2, 1, 0, 0, ... So:</p>\n\n\n$$T_n = \\binom{n}{0} \\cdot 1 + \\binom{n}{1} \\cdot 2 + \\binom{n}{2} \\cdot 1 = 1 + 2n + \\frac{n(n-1)}{2}$$\n\n<p>Simplify and you get  $\\frac{n^2 + 3n + 2}{2} = \\frac{(n+1)(n+2)}{2}$ , which is indeed  $T_{n+1}$ .</p>\n<p>For the cubes, the table starts at  $a_0 = 1 = 1^3$ , so our sequence is really  $f(n) = (n+1)^3$ . The left edge is 1, 7, 12, 6, 0, 0, ... and Newton's formula gives:</p>\n\n\n$$f(n) = 1 \\cdot \\binom{n}{0} + 7 \\cdot \\binom{n}{1} + 12 \\cdot \\binom{n}{2} + 6 \\cdot \\binom{n}{3}$$\n\n<p>Expanding:  $1 + 7n + 6n(n-1) + n(n-1)(n-2) = 1 + 7n + 6n^2 - 6n + n^3 - 3n^2 + 2n = n^3 + 3n^2 + 3n + 1 = (n+1)^3$ . Correct.</p>\n<h3>Connection to polynomial interpolation</h3>\n<p>Newton's formula is an interpolation formula: given  $d+1$  values of a degree- $d$  polynomial, it reconstructs the polynomial exactly. The representation is in the <strong>Newton basis</strong>  $\\{\\binom{x}{0}, \\binom{x}{1}, \\binom{x}{2}, \\ldots\\}$  rather than the monomial basis  $\\{1, x, x^2, \\ldots\\}$ .</p>\n<p>The Newton basis has a computational advantage: adding one more data point only requires computing one more difference, appending one more term. With the monomial basis (Vandermonde system), you'd need to re-solve the entire system. This is why Newton's formula was the workhorse of numerical tables for three centuries, from Briggs's logarithms to the Nautical Almanac.</p>\n<p>There is a subtlety: Newton's formula interpolates exactly through the given points  $a_0, a_1, \\ldots, a_d$  using a polynomial of degree  $d$ . But if the sequence is not polynomial, the formula still produces a value for  $a_n$  at non-integer  $n$ ---it just gives the unique polynomial that passes through the first  $d+1$  data points. For non-polynomial sequences, the approximation degrades as  $n$  moves far from the data. This is the discrete version of the Runge phenomenon in polynomial interpolation.</p>\n<p>For polynomial sequences, of course, the formula is exact for all  $n$ , not just the tabulated values. This is the basis of the difference table test: if the  $d$ -th differences are constant, Newton's formula with  $d+1$  terms reproduces the sequence exactly and extends it to all integers (and even to non-integer arguments).</p>\n<p>Here is a Python function that reconstructs a sequence from its left edge:</p>\n<pre><code class=\"language-python\">from math import comb\n\ndef newton_interpolate(left_edge, n):\n    &quot;&quot;&quot;Evaluate Newton's forward difference formula at n.&quot;&quot;&quot;\n    return sum(c * comb(n, k) for k, c in enumerate(left_edge))\n\n# Recover triangular numbers from left edge [1, 2, 1]\nprint([newton_interpolate([1, 2, 1], n) for n in range(7)])\n# [1, 3, 6, 10, 15, 21, 28]\n</code></pre>\n<h2>The Binomial Transform</h2>\n<p>The operation of &quot;take all the differences and read down the left edge&quot; has a name: the <strong>binomial transform</strong>.</p>\n<p>Given a sequence  $(a_0, a_1, a_2, \\ldots)$ , its binomial transform is the sequence of values at the left edge of the difference table:</p>\n\n\n$$b_n = \\Delta^n a_0 = \\sum_{k=0}^{n} \\binom{n}{k} (-1)^{n-k} a_k$$\n\n<h3>Formal definition and inverse</h3>\n<p>The binomial transform can be written in matrix form. Let  $\\mathbf{b} = B \\mathbf{a}$ , where  $B$  is the lower-triangular matrix with entries  $B_{nk} = \\binom{n}{k}(-1)^{n-k}$ . The inverse is clean: to recover the original sequence, apply the <em>unsigned</em> binomial transform:</p>\n\n\n$$a_n = \\sum_{k=0}^{n} \\binom{n}{k} b_k$$\n\n<p>This is Newton's forward difference formula: the left edge of the difference table (the  $b_k$ ) reconstructs the original sequence when weighted by binomial coefficients. The forward (signed) and inverse (unsigned) transforms are related by  $B^{-1}_{nk} = \\binom{n}{k}$ . To verify the inversion, compose the two. Apply the signed transform to the unsigned:</p>\n\n\n$$\\sum_k \\binom{n}{k}(-1)^{n-k}\\sum_j \\binom{k}{j} c_j = \\sum_j c_j \\underbrace{\\sum_k \\binom{n}{k}\\binom{k}{j}(-1)^{n-k}}_{\\delta_{nj}}$$\n\n<p>The inner sum vanishes for  $n \\ne j$  because  $\\sum_k \\binom{n-j}{k-j}(-1)^{n-k}$  is the alternating row sum of Pascal's triangle, which is zero unless the row has length 1.</p>\n<p>(Some authors define the binomial transform without the signs:  $b_n = \\sum_k \\binom{n}{k} a_k$ . Under that convention, the signed version is the inverse. The two conventions are conjugates of each other, differing by whether you call the &quot;forward&quot; direction signed or unsigned.)</p>\n<h3>Examples</h3>\n<p><strong>Powers of 2.</strong> The sequence  $a_n = 2^n$  gives  $b_n = \\sum_{k=0}^{n} \\binom{n}{k}(-1)^{n-k} 2^k = (2-1)^n = 1$ . So the binomial transform of  $(1, 2, 4, 8, 16, \\ldots)$  is  $(1, 1, 1, 1, 1, \\ldots)$ .</p>\n<p>Inversely,  $(1, 1, 1, \\ldots)$  transforms to  $(1, 2, 4, 8, \\ldots)$ . The all-ones sequence and the powers of 2 are binomial transform pairs.</p>\n<p><strong>Powers of 3.</strong> Similarly,  $b_n = (3-1)^n = 2^n$ . So the transform of  $(1, 3, 9, 27, \\ldots)$  is  $(1, 2, 4, 8, \\ldots)$ . Chains form: all-ones  $\\leftrightarrow$  powers of 2  $\\leftrightarrow$  powers of 3  $\\leftrightarrow$  ... where each application of the binomial transform subtracts 1 from the base. More generally, the binomial transform of  $((a+1)^n)$  is  $(a^n)$ , since  $\\sum_k \\binom{n}{k}(-1)^{n-k}(a+1)^k = ((a+1)-1)^n = a^n$ .</p>\n<p><strong>Bell numbers.</strong> The Bell numbers  $B_n$  (1, 1, 2, 5, 15, 52, 203, ...) count the total number of set partitions of  $\\lbrace 1, \\ldots, n \\rbrace$ . They satisfy  $B_{n+1} = \\sum_{k=0}^{n} \\binom{n}{k} B_k$ ---the unsigned binomial transform of  $(B_n)$  is the shifted sequence  $(B_{n+1})$ . This recurrence says: to partition  $\\lbrace 1, \\ldots, n+1 \\rbrace$ , choose which  $k$  elements share a block with  $n+1$  (the  $\\binom{n}{k}$  factor) and partition the remaining  $n-k$  elements (the  $B_{n-k}$  factor, which after reindexing gives  $B_k$ ).</p>\n<p><strong>Catalan numbers.</strong> The Catalan numbers  $C_n = \\frac{1}{n+1}\\binom{2n}{n}$  (1, 1, 2, 5, 14, 42, ...) have a binomial transform that gives the central Delannoy numbers---which count lattice paths with diagonal steps allowed. The binomial transform acts as a &quot;diagonalization&quot; of the lattice path model.</p>\n<h3>Connection to generating functions</h3>\n<p>If  $A(x) = \\sum a_n x^n$  is the ordinary generating function (OGF) of  $(a_n)$ , and  $(b_n)$  is its binomial transform, then</p>\n\n\n$$\\sum b_n x^n = \\frac{1}{1+x} A\\!\\left(\\frac{x}{1+x}\\right).$$\n\n<p>For exponential generating functions (EGFs), the relationship is cleaner. If  $\\hat{A}(x) = \\sum a_n \\frac{x^n}{n!}$ , then</p>\n\n\n$$\\sum b_n \\frac{x^n}{n!} = e^{-x} \\hat{A}(x).$$\n\n<p>Multiplication by  $e^{-x}$  in the EGF world corresponds to the binomial transform in the coefficient world. This is one reason EGFs are so natural for combinatorics: many transforms that look complicated on OGFs become simple operations on EGFs.</p>\n<p>The EGF perspective makes the inversion transparent. The signed binomial transform corresponds to multiplying by  $e^{-x}$ ; the unsigned inverse corresponds to multiplying by  $e^{x}$ . The composition  $e^{x} \\cdot e^{-x} = 1$  gives the identity, confirming the inversion. In the OGF world, the forward transform is the substitution  $x \\mapsto \\frac{x}{1+x}$  combined with a factor of  $\\frac{1}{1+x}$ , and the inverse substitution  $x \\mapsto \\frac{x}{1-x}$  with factor  $\\frac{1}{1-x}$  undoes it.</p>\n<h2>Fibonacci's Self-Similarity</h2>\n<p>The Fibonacci sequence has a peculiar difference table:</p>\n<figure>\n  <img src=\"https://attobop.net/posts/difference-tables/fibonacci_diff.png\" alt=\"Fibonacci difference table with left edge highlighted\" width=\"500\">\n  <figcaption> The left edge of Fibonacci</figcaption>\n</figure>\n<p>Look at the left edge: 0, 1, -1, 2, -3, 5, -8, 13, -21...</p>\n<p>It's the Fibonacci numbers again, with alternating signs! The binomial transform of Fibonacci is (essentially) Fibonacci.</p>\n<p>Why? The Fibonacci recurrence  $F_n = F_{n-1} + F_{n-2}$  has characteristic roots  $\\phi = \\frac{1+\\sqrt{5}}{2}$  and  $\\psi = \\frac{1-\\sqrt{5}}{2}$ , with the closed form  $F_n = \\frac{\\phi^n - \\psi^n}{\\sqrt{5}}$ . The binomial transform sends  $\\alpha^n \\mapsto (\\alpha - 1)^n$ . The key step is computing  $\\phi - 1 = \\frac{\\sqrt{5}-1}{2} = -\\psi$  and  $\\psi - 1 = \\frac{-\\sqrt{5}-1}{2} = -\\phi$ . (Both follow from the defining property  $\\phi + \\psi = 1$ .) So:</p>\n\n\n$$b_n = \\frac{(\\phi-1)^n - (\\psi-1)^n}{\\sqrt{5}} = \\frac{(-\\psi)^n - (-\\phi)^n}{\\sqrt{5}} = (-1)^{n+1} \\frac{\\phi^n - \\psi^n}{\\sqrt{5}} = (-1)^{n+1} F_n$$\n\n<p>The binomial transform of  $F_n$  is  $(-1)^{n+1} F_n$ . The Fibonacci sequence is an eigensequence of the binomial transform, with eigenvalue  $-1$  (up to the sign shift). This algebraic self-similarity is a shadow of the golden ratio's defining property  $\\phi^2 = \\phi + 1$ , which gives  $\\phi - 1 = 1/\\phi$ .</p>\n<p>More precisely: any sequence whose Binet-style closed form involves roots  $\\alpha$  satisfying  $\\alpha(\\alpha - 1) = \\pm 1$  will be an eigensequence (or near-eigensequence) of the binomial transform. The golden ratio satisfies  $\\phi(\\phi - 1) = 1$ , which is why Fibonacci works. The Lucas numbers  $L_n = \\phi^n + \\psi^n$  have the same property: their binomial transform is  $(-1)^n L_n$ .</p>\n<h2>OEIS Treasures</h2>\n<p>The <a href=\"https://oeis.org/\">On-Line Encyclopedia of Integer Sequences</a> catalogs over 370,000 sequences. Many entries note when one sequence is the binomial transform of another.</p>\n<p>One favorite: <strong>A000079 (Powers of 2)</strong>, 1, 2, 4, 8, 16, 32...</p>\n<pre><code>1    2    4    8   16   32\n   1    2    4    8   16\n      1    2    4    8\n         1    2    4\n            1    2\n               1\n</code></pre>\n<p>Every row is the same sequence. The difference table is self-similar because  $2^n - 2^{n-1} = 2^{n-1}$  -- differencing just shifts the sequence. This is the table-level manifestation of the binomial transform pair we derived above: the all-ones sequence and the powers of 2 are transform partners.</p>\n<h2>Stirling Numbers</h2>\n<p>The connection between ordinary powers  $x^n$  and the Newton basis  $\\binom{x}{k}$  runs through the <strong>Stirling numbers</strong>.</p>\n<h3>Stirling numbers of the second kind</h3>\n<p>The Stirling number of the second kind, written  $S(n,k)$  or  $\\lbrace{n \\atop k}\\rbrace$ , counts the number of ways to partition a set of  $n$  elements into exactly  $k$  non-empty subsets.</p>\n<p>For example,  $S(4,2) = 7$ : the set  $\\lbrace a,b,c,d \\rbrace$  can be split into two non-empty parts in 7 ways ( $\\lbrace a | bcd \\rbrace$ ,  $\\lbrace b | acd \\rbrace$ ,  $\\lbrace c | abd \\rbrace$ ,  $\\lbrace d | abc \\rbrace$ ,  $\\lbrace ab | cd \\rbrace$ ,  $\\lbrace ac | bd \\rbrace$ ,  $\\lbrace ad | bc \\rbrace$ ).</p>\n<p>The connection to difference tables: applying  $\\Delta^k$  to  $x^n$  and evaluating at  $x = 0$  gives</p>\n\n\n$$\\Delta^k x^n \\big|_{x=0} = \\sum_{j=0}^{k} \\binom{k}{j}(-1)^{k-j} j^n = S(n,k) \\cdot k!$$\n\n<p>The left side is the  $k$ -th entry on the left edge of the difference table of the sequence  $0^n, 1^n, 2^n, 3^n, \\ldots$ . Why does the right side equal  $k! \\cdot S(n,k)$ ? Count surjections from  $[n]$  to  $[k]$  by inclusion-exclusion:  $\\sum_{j=0}^{k} \\binom{k}{j}(-1)^{k-j} j^n$  counts functions minus those missing at least one target, plus those missing at least two, and so on. Each surjection is an ordered partition into  $k$  non-empty blocks, and there are  $k!$  orderings per partition. So the alternating sum equals  $k! \\cdot S(n,k)$ , and the difference table of  $n$ -th powers encodes Stirling numbers.</p>\n<p>This is why the change-of-basis formula works:</p>\n\n\n$$x^n = \\sum_{k=0}^{n} S(n,k) \\cdot k! \\cdot \\binom{x}{k} = \\sum_{k=0}^{n} S(n,k) \\cdot x^{\\underline{k}}$$\n\n<p>Taking differences of a polynomial converts from the &quot;power basis&quot; to the &quot;falling factorial basis,&quot; and the conversion coefficients are exactly the Stirling numbers of the second kind.</p>\n<h3>Stirling numbers of the first kind</h3>\n<p>The Stirling numbers of the first kind, written  $s(n,k)$  or  $\\left[{n \\atop k}\\right]$  (unsigned), perform the reverse conversion:</p>\n\n\n$$x^{\\underline{n}} = \\sum_{k=0}^{n} s(n,k) \\cdot (-1)^{n-k} \\cdot x^k$$\n\n<p>Their combinatorial meaning:  $\\left[{n \\atop k}\\right]$  counts the number of permutations of  $n$  elements with exactly  $k$  cycles.</p>\n<p>For instance,  $\\left[{4 \\atop 2}\\right] = 11$ : among the 24 permutations of  $\\lbrace 1,2,3,4 \\rbrace$ , exactly 11 have two cycles. (Examples:  $(1)(234)$ ,  $(12)(34)$ , etc.)</p>\n<p>The two kinds of Stirling numbers are inverses as change-of-basis matrices:</p>\n\n\n$$\\sum_{j} S(n,j) \\cdot s(j,k) \\cdot (-1)^{j-k} = \\delta_{nk}$$\n\n<p>This is the discrete analogue of the fact that differentiation and integration are inverse operations. The first kind converts from falling factorials to powers; the second kind converts from powers to falling factorials. Together they mediate between the two natural bases for polynomial sequences.</p>\n<p>For fixed  $k$ ,  $S(n,k) \\sim \\frac{k^n}{k!}$  as  $n \\to \\infty$ . The number of ways to partition a large set into  $k$  blocks is approximately the number of surjections from  $n$  to  $k$  divided by  $k!$  (correcting for block ordering). The Stirling numbers of the first kind have a more delicate asymptotic:  $\\left[{n \\atop k}\\right] \\sim \\frac{n!}{n} \\cdot \\frac{(\\ln n)^{k-1}}{(k-1)!}$  for fixed  $k$ , reflecting the logarithmic growth of the number of cycles in a random permutation.</p>\n<h3>A small table</h3>\n<p>The Stirling numbers of the second kind for small  $n$ ,  $k$ :</p>\n<pre><code>n\\k   0   1   2   3   4   5\n 0    1\n 1    0   1\n 2    0   1   1\n 3    0   1   3   1\n 4    0   1   7   6   1\n 5    0   1  15  25  10   1\n</code></pre>\n<p>Each entry satisfies the recurrence  $S(n,k) = k \\cdot S(n-1,k) + S(n-1,k-1)$ : either the  $n$ -th element joins one of the  $k$  existing subsets ( $k$  choices, hence the factor of  $k$ ), or it starts a new subset of its own.</p>\n<h3>Worked example:  $x^3$  in the Newton basis</h3>\n<p>Let's verify the change-of-basis formula for  $x^3$ . We need the Stirling numbers  $S(3,1) = 1$ ,  $S(3,2) = 3$ ,  $S(3,3) = 1$  from the table above. The formula gives:</p>\n\n\n$$x^3 = S(3,1) \\cdot 1! \\cdot \\binom{x}{1} + S(3,2) \\cdot 2! \\cdot \\binom{x}{2} + S(3,3) \\cdot 3! \\cdot \\binom{x}{3}$$\n\n\n\n$$= 1 \\cdot x + 6 \\cdot \\frac{x(x-1)}{2} + 6 \\cdot \\frac{x(x-1)(x-2)}{6} = x + 3x(x-1) + x(x-1)(x-2)$$\n\n<p>Expanding:  $x + 3x^2 - 3x + x^3 - 3x^2 + 2x = x^3$ . Correct.</p>\n<p>Notice that the coefficients  $1! \\cdot S(3,1) = 1$ ,  $2! \\cdot S(3,2) = 6$ ,  $3! \\cdot S(3,3) = 6$  are exactly the left edge values of the difference table of  $0^3, 1^3, 2^3, 3^3, \\ldots$  (which is  $0, 1, 6, 6, 0, \\ldots$  starting from  $a_0 = 0^3 = 0$ ). The Stirling numbers are encoded in the difference table, waiting to be read off.</p>\n<h2>Applications</h2>\n<h3>Gregory-Newton interpolation for numerical tables</h3>\n<p>Before computers, mathematical tables (logarithms, trigonometric functions, actuarial tables) were computed by hand and checked by differencing. If you have a table of  $\\sin(x)$  at equally spaced points and the fourth differences are nearly constant, you know the interpolation is accurate to the corresponding degree. Any entry that produces an anomalous difference is a transcription error.</p>\n<p>Charles Babbage designed his <a href=\"https://en.wikipedia.org/wiki/Difference_engine\">Difference Engine</a> (1822) specifically to automate this: it computed polynomial functions by repeated addition, implementing Newton's formula mechanically. The machine had no multiplication unit---it didn't need one, because the difference method reduces polynomial evaluation to addition.</p>\n<p>The method was still in active use well into the 20th century. The <a href=\"https://en.wikipedia.org/wiki/Abramowitz_and_Stegun\"><em>Handbook of Mathematical Functions</em></a> (Abramowitz and Stegun, 1964) includes extensive difference tables for standard functions, and its introduction describes differencing as the primary method for verifying table accuracy. The British Nautical Almanac Office employed human &quot;computers&quot; who differenced tables by hand until the 1950s.</p>\n<p>The connection to error detection is worth emphasizing. A single transcription error in a table entry produces a localized spike in the difference table---the error propagates through subsequent differences but with a characteristic signature (binomial coefficients with alternating signs). An experienced table-maker could spot and localize an error by inspecting the difference columns, a technique that predates modern error-correcting codes by two centuries.</p>\n<h3>Detecting polynomial sequences</h3>\n<p>The most immediate application: if you encounter a sequence and want to know whether a polynomial generates it, compute the difference table. If the  $k$ -th row is constant, the sequence is a polynomial of degree  $k$  and Newton's formula gives you the closed form immediately.</p>\n<p>This is not limited to pure mathematics. Physical quantities that depend polynomially on a parameter (projectile height vs. time, area vs. length) can be identified by differencing a table of measurements. The method is more numerically stable than fitting a polynomial by least squares when the data points are equally spaced, because it avoids solving a Vandermonde system.</p>\n<p>A concrete use case: the sum of  $k$ -th powers,  $S_k(n) = 1^k + 2^k + \\cdots + n^k$ , is always a polynomial of degree  $k+1$  in  $n$ . You can discover this experimentally by computing  $S_k(n)$  for  $n = 0, 1, 2, \\ldots$  and differencing. For  $k = 2$ : the sequence $0, 1, 5, 14, 30, 55, 91$ has constant third differences (equal to 2), confirming degree 3. The left edge is $0, 1, 3, 2$, giving  $S_2(n) = \\binom{n}{1} + 3\\binom{n}{2} + 2\\binom{n}{3} = n + \\frac{3n(n-1)}{2} + \\frac{n(n-1)(n-2)}{3} = \\frac{n(n+1)(2n+1)}{6}$ .</p>\n<p>In competitive programming and puzzle mathematics, the &quot;difference table test&quot; is a standard first move when encountering an unfamiliar integer sequence.</p>\n<h3>Euler-Maclaurin summation</h3>\n<p>The Euler-Maclaurin formula connects discrete sums to integrals:</p>\n\n\n$$\\sum_{k=0}^{n} f(k) = \\int_0^n f(x)\\,dx + \\frac{f(0) + f(n)}{2} + \\sum_{j=1}^{p} \\frac{B_{2j}}{(2j)!}\\left(f^{(2j-1)}(n) - f^{(2j-1)}(0)\\right) + R_p$$\n\n<p>where  $B_{2j}$  are Bernoulli numbers and  $R_p$  is a remainder term. The formula emerges from the operator identity</p>\n\n\n$$\\sum_{k=0}^{n} = \\frac{E^{n+1} - I}{E - I} = \\frac{E^{n+1} - I}{\\Delta}$$\n\n<p>and the formal expansion of  $\\Delta^{-1}$  in terms of the differential operator  $D = \\frac{d}{dx}$  via  $\\Delta = e^D - 1$ , giving  $\\Delta^{-1} = D^{-1} - \\frac{1}{2} + \\frac{B_2}{2!}D - \\frac{B_4}{4!}D^3 + \\cdots$ . This is the operator-theoretic bridge between finite differences and continuous calculus.</p>\n<p>The formula is useful in practice. It gives the asymptotic expansion of the harmonic numbers ( $\\sum 1/k \\approx \\ln n + \\gamma + \\frac{1}{2n} - \\frac{1}{12n^2} + \\cdots$ ), Stirling's approximation for  $n!$ , and sharp estimates of partial sums of  $\\zeta(s)$ . The error term  $R_p$  can be bounded explicitly, making Euler-Maclaurin the standard tool for converting discrete sums into integrals plus computable corrections.</p>\n<h3>The Norlund-Rice integral</h3>\n<p><em>(The following section uses contour integration from complex analysis. If that's not familiar, skip to Umbral Calculus below -- it doesn't depend on this material.)</em></p>\n<p>Many alternating sums in combinatorics have the form</p>\n\n\n$$\\sum_{k=0}^{n} \\binom{n}{k} (-1)^k f(k)$$\n\n<p>which is exactly  $\\Delta^n f(0)$ . When  $f$  is a &quot;nice&quot; function (meromorphic, polynomial growth), the Norlund-Rice integral gives a contour integral representation:</p>\n\n\n$$\\sum_{k=0}^{n} \\binom{n}{k} (-1)^k f(k) = \\frac{(-1)^n}{2\\pi i} \\oint \\frac{n!\\, f(z)}{z(z-1)\\cdots(z-n)}\\,dz$$\n\n<p>The contour encloses  $0, 1, \\ldots, n$ . The representation converts a discrete alternating sum into a contour integral evaluable by residues, yielding closed forms or sharp asymptotics.</p>\n<p>As an example, consider  $f(k) = 1/(k+1)$ . The alternating sum  $\\sum_{k=0}^{n} \\binom{n}{k} \\frac{(-1)^k}{k+1}$  equals  $\\frac{1}{n+1}$ . The contour integral gives this directly: the integrand is  $\\frac{(-1)^n n!}{(k+1) \\cdot z(z-1)\\cdots(z-n)}$ . Besides the poles at  $z = 0, 1, \\ldots, n$  (which reconstruct the sum), there is a pole at  $z = -1$  from  $f(z) = 1/(z+1)$ . Deforming the contour to pick up only this pole, the residue at  $z = -1$  is  $\\frac{(-1)^n n!}{(-1)(-2)\\cdots(-1-n)} = \\frac{(-1)^n n!}{(-1)^{n+1}(n+1)!} = \\frac{-1}{n+1}$ , which after the  $(-1)^n / (2\\pi i)$  prefactor and the contour orientation gives  $1/(n+1)$ .</p>\n<p>Flajolet and Sedgewick use this extensively in <a href=\"https://algo.inria.fr/flajolet/Publications/book.pdf\"><em>Analytic Combinatorics</em></a> for the analysis of algorithms---average-case complexity of search trees, hashing, and digital structures often reduces to alternating sums over binomial coefficients. The connection to difference tables is direct: the alternating sum  $\\sum \\binom{n}{k}(-1)^k f(k)$  is nothing but  $\\Delta^n f(0)$ , the  $n$ -th entry on the left edge of the difference table of  $(f(0), f(1), f(2), \\ldots)$ .</p>\n<h2>Umbral Calculus</h2>\n<p>There is a strange algebraic trick that has been making mathematicians uncomfortable since the 19th century. Write  $B^n$  as a formal symbol, and impose the rule that after expanding any expression, you &quot;lower the index&quot;: replace  $B^n$  with  $B_n$ , the  $n$ -th Bernoulli number. Then identities like</p>\n\n\n$$(B+1)^n = B^n \\quad (n \\geq 2)$$\n\n<p>hold, where both sides are evaluated by the lowering rule. This is the <strong>umbral calculus</strong> (from the Latin <em>umbra</em>, shadow): treat sequence indices as if they were exponents.</p>\n<p>To see the lowering rule in action, verify the identity  $(B+1)^3 = B^3$ . Expand the left side by the binomial theorem:  $(B+1)^3 = B^3 + 3B^2 + 3B^1 + B^0$ . Now lower indices:  $B_3 + 3B_2 + 3B_1 + B_0$ . The Bernoulli numbers are  $B_0 = 1$ ,  $B_1 = -1/2$ ,  $B_2 = 1/6$ ,  $B_3 = 0$ . So the left side becomes  $0 + 3 \\cdot \\frac{1}{6} + 3 \\cdot (-\\frac{1}{2}) + 1 = \\frac{1}{2} - \\frac{3}{2} + 1 = 0 = B_3$ . It works because the Bernoulli numbers satisfy the recurrence  $\\sum_{k=0}^{n} \\binom{n}{k} B_k = B_n$  for  $n \\ge 2$ ---which is exactly what  $(B+1)^n = B^n$  says after lowering.</p>\n<p>Why does this work systematically? <a href=\"https://en.wikipedia.org/wiki/Gian-Carlo_Rota\">Rota</a> (1970s) observed that the lowering rule is a linear functional  $L$  on polynomials defined by  $L[x^n] = B_n$ . The &quot;umbral identity&quot;  $(B+1)^n = B^n$  is really  $L[(x+1)^n] = L[x^n]$ , which holds because the Bernoulli numbers are defined by exactly this shift relation. The trick generalizes: any sequence  $(a_n)$  defines a linear functional  $L[x^n] = a_n$ , and identities among the  $a_n$  become polynomial identities under  $L$ .</p>\n<p>What does this buy us beyond a notational trick? It lets you derive identities mechanically. For instance, Faulhaber's formula for  $\\sum_{k=0}^{n-1} k^p$  (sums of powers) follows in one line: write the sum as  $\\Delta^{-1} x^p \\big|_0^n$ , convert  $x^p$  to the falling factorial basis using Stirling numbers (which the umbral calculus handles via the &quot;lowering&quot; of the Bell polynomial), and read off the Bernoulli coefficients. The result:</p>\n\n\n$$\\sum_{k=0}^{n-1} k^p = \\frac{1}{p+1} \\sum_{j=0}^{p} \\binom{p+1}{j} B_j \\, n^{p+1-j}$$\n\n<p>This is what  $(B + n)^{p+1} - B^{p+1} = (p+1) \\sum k^p$  says after lowering -- the entire derivation is one application of the binomial theorem followed by the lowering rule. Without the umbral calculus, deriving Faulhaber's formula requires either Euler-Maclaurin summation or induction on  $p$ , both of which are longer.</p>\n<p>The connection to difference tables is direct. The forward difference operator  $\\Delta$  satisfies  $\\Delta x^{\\underline{n}} = n \\cdot x^{\\underline{n-1}}$ , mirroring  $\\frac{d}{dx} x^n = n x^{n-1}$ . Newton's forward difference formula is the expansion of a polynomial in the basis  $\\binom{x}{n} = x^{\\underline{n}}/n!$ , just as Taylor's theorem expands in the basis  $x^n/n!$ . This is not a coincidence: both  $\\Delta$  and  $d/dx$  are examples of operators that reduce polynomial degree by 1 and commute with shifts, and any such operator produces its own &quot;Taylor expansion&quot; with its own natural basis. The difference table is the discrete instance of this structure.</p>\n<h2>A Puzzle</h2>\n<p>Here's a sequence: 2, 5, 10, 17, 26, 37, ...</p>\n<p>Build its difference table. What degree polynomial generates it? Can you find the formula?</p>\n<details>\n<summary>Solution</summary>\n<pre><code>2    5   10   17   26   37\n   3    5    7    9   11\n      2    2    2    2\n         0    0    0\n</code></pre>\n<p>Second differences are constant, so it's quadratic.</p>\n<p>Left edge: 2, 3, 2. Using Newton's formula:</p>\n\n\n$$a_n = 2\\binom{n}{0} + 3\\binom{n}{1} + 2\\binom{n}{2} = 2 + 3n + n(n-1) = n^2 + 2n + 2$$\n\n<p>Check:  $a_0 = 2$ ,  $a_1 = 5$ ,  $a_2 = 10$ . Correct.</p>\n</details>\n<h2>Open Connections</h2>\n<h3>p-adic analysis and Mahler's theorem</h3>\n<p>Mahler (1958) proved that every continuous function  $f: \\mathbb{Z}_p \\to \\mathbb{Q}_p$  on the  $p$ -adic integers has a unique expansion</p>\n\n\n$$f(x) = \\sum_{n=0}^{\\infty} \\Delta^n f(0) \\cdot \\binom{x}{n}$$\n\n<p>and this series converges for all  $x \\in \\mathbb{Z}_p$  if and only if  $\\Delta^n f(0) \\to 0$  in the  $p$ -adic absolute value. This is the same Newton series formula, but the convergence condition is  $p$ -adic rather than archimedean.</p>\n<p>A concrete example: consider  $f(x) = (-1)^x$  as a function on  $\\mathbb{Z}_2$  (the 2-adic integers). Its difference table has  $\\Delta^n f(0) = \\sum_k \\binom{n}{k}(-1)^{n-k}(-1)^k = \\sum_k \\binom{n}{k}(-1)^n(-2)^k...$ ---more simply,  $\\Delta f(0) = f(1) - f(0) = -2$ ,  $\\Delta^2 f(0) = f(2) - 2f(1) + f(0) = 1 + 2 + 1 = 4$ ,  $\\Delta^3 f(0) = -8$ , and in general  $\\Delta^n f(0) = (-2)^n$ . In the real numbers,  $|(-2)^n| \\to \\infty$ , so the Newton series diverges. But 2-adically,  $|(-2)^n|_2 = 2^{-n} \\to 0$ , so the series converges. The function  $(-1)^x$ , which makes no sense for non-integer real arguments, extends continuously to all of  $\\mathbb{Z}_2$  via its difference table.</p>\n<h3>Difference tables over finite fields</h3>\n<p>Over  $\\mathbb{F}_p$ , Fermat's little theorem gives  $a^p \\equiv a \\pmod{p}$  for all  $a$ . This means  $\\Delta^p f(x) = f(x+p) - \\binom{p}{1}f(x+p-1) + \\cdots + (-1)^p f(x) \\equiv f(x+p) - f(x) \\pmod{p}$  since  $\\binom{p}{k} \\equiv 0 \\pmod{p}$  for  $0 < k < p$ . So  $\\Delta^p = E^p - I$  over  $\\mathbb{F}_p$ ---the  $p$ -th difference operator &quot;wraps around&quot; to a single shift.</p>\n<p>This periodicity means that over finite fields, difference tables are eventually periodic rather than eventually zero. Over  $\\mathbb{F}_p$ , every function from  $\\mathbb{F}_p$  to  $\\mathbb{F}_p$  is polynomial (there are  $p^p$  functions and  $p^p$  polynomials of degree  $< p$ , and they're all distinct). So Newton's formula works perfectly, but you only need terms up to  $\\binom{x}{p-1}$ . The &quot;degree&quot; of a polynomial over a finite field is bounded by  $p-1$ , no matter what it looks like symbolically.</p>\n<p>The theory connects to Lucas's theorem on binomial coefficients mod  $p$ :  $\\binom{m}{n} \\equiv \\prod_i \\binom{m_i}{n_i} \\pmod{p}$  where  $m_i, n_i$  are the base- $p$  digits. This gives the fractal structure of Pascal's triangle mod  $p$  (the Sierpinski triangle for  $p = 2$ ) and explains the periodic structure of difference tables over finite fields.</p>\n<h3>Connection to discrete Fourier transform</h3>\n<p>The forward difference operator  $\\Delta$  diagonalizes nicely under the discrete Fourier transform. If  $\\hat{f}(k) = \\sum_{j=0}^{N-1} f(j) \\omega^{-jk}$  with  $\\omega = e^{2\\pi i/N}$ , then  $\\widehat{\\Delta f}(k) = (\\omega^k - 1)\\hat{f}(k)$ . So  $\\Delta$  is multiplication by  $\\omega^k - 1$  in frequency space.</p>\n<p>This means the difference table of a sequence is intimately related to its spectral decomposition. The eigenvalues  $\\omega^k - 1$  of  $\\Delta$  on  $\\mathbb{C}^N$  determine how quickly each Fourier mode decays under repeated differencing. The constant sequences (zero frequency) are annihilated by  $\\Delta^1$ ; sequences with low-frequency content are annihilated by low-order differences; high-frequency components persist the longest.</p>\n<p>For a polynomial of degree  $d$ , the Fourier coefficients decay fast enough that  $\\Delta^{d+1}$  annihilates the sequence. For non-polynomial sequences, the Fourier perspective explains <em>why</em> the difference table doesn't terminate: the high-frequency content never vanishes under the multiplication-by- $(\\omega^k - 1)$  map.</p>\n<p>A practical consequence: each application of  $\\Delta$  multiplies the  $k$ -th Fourier mode by  $|\\omega^k - 1|$ . At the Nyquist frequency  $k = N/2$ , this factor is 2. After  $n$  applications of  $\\Delta$ , the Nyquist component is amplified by  $2^n$ . This is why differencing a noisy empirical sequence more than a few times produces garbage---and why Babbage's Difference Engine was designed for exact integer arithmetic, not floating-point approximation.</p>\n<h2>References</h2>\n<ol>\n<li>\n<p>Graham, R.L., Knuth, D.E., and Patashnik, O. <em>Concrete Mathematics: A Foundation for Computer Science</em>. 2nd ed. Addison-Wesley, 1994. Chapters 2 and 5 cover difference operators, Stirling numbers, and the Newton basis systematically.</p>\n</li>\n<li>\n<p>Jordan, C. <em>Calculus of Finite Differences</em>. 3rd ed. Chelsea, 1965. The classical reference. Encyclopedic coverage of difference operators, interpolation, summation, and their applications.</p>\n</li>\n<li>\n<p>Roman, S. <em>The Umbral Calculus</em>. Academic Press, 1984. Rota's program made rigorous. Sheffer sequences, delta operators, and the algebraic foundations of the calculus of finite differences.</p>\n</li>\n<li>\n<p>Riordan, J. <em>Combinatorial Identities</em>. Wiley, 1968. The working reference for binomial transform identities and their combinatorial proofs.</p>\n</li>\n<li>\n<p>Flajolet, P. and Sedgewick, R. <em>Analytic Combinatorics</em>. Cambridge University Press, 2009. Chapter III covers the Norlund-Rice integral and its applications to the analysis of algorithms.</p>\n</li>\n<li>\n<p>Knuth, D.E. <em>The Art of Computer Programming</em>, Vol. 1. 3rd ed. Addison-Wesley, 1997. Section 1.2.6 covers finite differences and their use in numerical computation.</p>\n</li>\n<li>\n<p>Boole, G. <em>A Treatise on the Calculus of Finite Differences</em>. 2nd ed. Macmillan, 1872. The first systematic treatment. Available on the Internet Archive.</p>\n</li>\n<li>\n<p>Mahler, K. &quot;An interpolation series for continuous functions of a  $p$ -adic variable.&quot; <em>Journal fur die reine und angewandte Mathematik</em>, 199:23--34, 1958.</p>\n</li>\n<li>\n<p>Stanley, R.P. <em>Enumerative Combinatorics</em>, Vol. 1. 2nd ed. Cambridge University Press, 2012. Chapter 1 covers Stirling numbers and their combinatorial interpretations in full generality.</p>\n</li>\n<li>\n<p>Goldberg, S. <em>Introduction to Difference Equations</em>. Dover, 1986. Accessible introduction with applications to economics and population dynamics.</p>\n</li>\n<li>\n<p><a href=\"https://oeis.org/wiki/Transforms\">OEIS Wiki: Transforms</a>. Living reference for binomial transforms and their relationships across the OEIS.</p>\n</li>\n<li>\n<p>Spivey, M.Z. &quot;The Euler-Maclaurin formula and sums of powers.&quot; <em>Mathematics Magazine</em>, 79(1):61--65, 2006. A clean derivation of Euler-Maclaurin via the operator formalism.</p>\n</li>\n</ol>\n","date_published":"Tue, 10 Dec 2024 00:00:00 GMT"},{"id":"https://attobop.net/posts/channel-hopping/","url":"https://attobop.net/posts/channel-hopping/","title":"Channel Hopping as a Bandit Problem","content_html":"<p>Every production WiFi monitoring tool hops channels the same way it did in 2006: cycle through a fixed list, dwell for 250ms, repeat. Can we do better by treating channel selection as a learning problem?</p>\n<!--more-->\n<h2>Background</h2>\n<h3>802.11 Channel Architecture</h3>\n<p>The 2.4 GHz band has 14 channels (11 in the US), each 22 MHz wide, spaced 5 MHz apart. Only channels 1, 6, and 11 are non-overlapping -- the rest bleed into their neighbors. This overlap is not a minor technicality; it's the central complication in any channel selection scheme.</p>\n<p>The 5 GHz band is wider and cleaner: channels are 20 MHz wide with no overlap at the base width. UNII-1 through UNII-3 give roughly 25 non-overlapping channels. DFS (Dynamic Frequency Selection) channels add another ~15 but require radar detection, which complicates monitor mode.</p>\n<p>WiFi 6E and 7 opened the 6 GHz band: 59 new 20 MHz channels in the US (the full 1200 MHz from 5925 to 7125 MHz). This changes the problem's character entirely. With 14 channels, a static schedule visits each one every 3.5 seconds. With 59 channels, the same schedule takes 14.75 seconds -- long enough to miss transient devices, short-lived connections, and bursty traffic patterns.</p>\n<h3>Monitor Mode and Passive Capture</h3>\n<p>A wireless NIC in monitor mode receives raw 802.11 frames without associating to any access point. The kernel exposes each frame with a radiotap header containing metadata: signal strength, data rate, channel, and flags.</p>\n<pre><code>iw dev wlan0 set channel 6    # tune to channel 6\nsleep 0.25                    # dwell for 250ms\niw dev wlan0 set channel 1    # hop to channel 1\n</code></pre>\n<p>During each dwell, we observe some number of packets. Between dwells, the channel switch costs 15--60ms of dead time on typical hardware. USB chipsets are worse: the <code>MT7612U</code> averages 140ms per switch, and the <code>RT5572</code> hits 450ms under load<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn1\" id=\"fnref1\">[1]</a></sup>. These switching costs are real -- at 250ms dwell with 100ms switching overhead, you lose 29% of observation time to transitions.</p>\n<p>The question: which channel should we hop to next, and for how long?</p>\n<h3>The Multi-Armed Bandit Framework</h3>\n<p>The multi-armed bandit (MAB) is the simplest formalization of the explore-exploit tradeoff. You face  $K$  slot machines (arms), each with an unknown reward distribution. At each time step you pull one arm and observe a reward. The goal is to maximize cumulative reward, or equivalently, minimize <em>regret</em>: the gap between what you earned and what you would have earned always pulling the best arm.</p>\n<p>Lai and Robbins (1985) established the fundamental lower bound: for any consistent policy, the expected number of times a suboptimal arm  $i$  is pulled must grow at least as  $\\ln T / d(\\lambda_i, \\lambda^*)$ , where  $d$  is the KL divergence between the suboptimal and optimal arm distributions<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn2\" id=\"fnref2\">[2]</a></sup>. The intuition: the KL divergence  $d(\\lambda_i, \\lambda^*)$  measures how many samples you need to statistically distinguish arm  $i$  from the best arm. Arms that are nearly as good as the best require many pulls before you can confidently stop pulling them, so they contribute more regret. No algorithm can do better than logarithmic regret. The algorithms below differ in how tightly they approach this bound and how they handle the practical complications of WiFi monitoring.</p>\n<h2>What Production Tools Do</h2>\n<p><strong><a href=\"https://github.com/aircrack-ng/aircrack-ng\"><code>airodump-ng</code></a></strong> hops through an interleaved sequence <code>{1, 7, 13, 2, 8, 3, 14, 9, 4, 10, 5, 11, 6, 12}</code> with a fixed 250ms dwell. The interleaving maximizes frequency distance between consecutive hops -- channel 1 at 2412 MHz to channel 7 at 2442 MHz is a 30 MHz gap, well past the 22 MHz channel width. No adaptation.</p>\n<p><strong><a href=\"https://www.kismetwireless.net/\">Kismet</a></strong> hops at 5 Hz (200ms dwell), reasoning from the Nyquist theorem: APs beacon at ~10/sec, so 5 hops/sec captures at least one beacon per AP. It supports manual per-channel weights (<code>channellist=1:3,6:3,11:3,2,3,4,5,7,8,9,10</code> to spend 3x time on the three non-overlapping 2.4 GHz channels) and splits channel lists across multiple interfaces automatically.</p>\n<p><strong>WiFi Explorer Pro</strong> is the only commercial tool with adaptive behavior: it halves dwell time to 60ms on channels where no 802.11 activity was detected in the last scan.</p>\n<p><strong><code>tshark</code>/<code>dumpcap</code></strong> does basic round-robin through a channel list with configurable dwell. No adaptation, no interleaving. The typical invocation cycles 1 through 11 sequentially.</p>\n<p>Nobody uses a learning algorithm. The gap between what practitioners have asked for (<a href=\"https://github.com/aircrack-ng/aircrack-ng/issues/119\"><code>airodump-ng</code> issue #119</a>, 2007: &quot;dwell longer on channels with traffic&quot;) and what tools provide has been open for nearly two decades.</p>\n<h2>Modeling as a Bandit</h2>\n<p>We have  $K$  channels. At each time step  $t$ , we pick a channel  $a_t \\in \\{1, \\ldots, K\\}$ , dwell, and observe a reward  $r_{a_t}^{(t)}$  -- the number of packets captured.</p>\n<p>The standard MAB goal is to minimize cumulative <em>regret</em>: the gap between what we earned and what we would have earned always picking the best channel.</p>\n\n\n$$R_T = \\sum_{t=1}^{T} r_{a^*}^{(t)} - r_{a_t}^{(t)}$$\n\n<p>where  $a^* = \\arg\\max_i \\mathbb{E}[r_i]$  is the channel with highest expected packet rate.</p>\n<h3>The mapping</h3>\n<table>\n<thead>\n<tr>\n<th>MAB concept</th>\n<th>WiFi monitoring</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Arm  $i$ </td>\n<td>Channel  $i$  (e.g., 2.4 GHz channel 6)</td>\n</tr>\n<tr>\n<td>Pull arm  $i$ </td>\n<td>Dwell on channel  $i$  for one interval</td>\n</tr>\n<tr>\n<td>Reward  $r_i$ </td>\n<td>Number of packets captured during dwell</td>\n</tr>\n<tr>\n<td>Reward distribution</td>\n<td> $\\text{Poisson}(\\lambda_i)$  -- independent frame arrivals</td>\n</tr>\n<tr>\n<td>Best arm  $a^*$ </td>\n<td>Channel with highest traffic rate</td>\n</tr>\n<tr>\n<td>Regret</td>\n<td>Packets missed by not dwelling on the best channel</td>\n</tr>\n<tr>\n<td>Switching cost</td>\n<td>Dead time during channel change (15--450ms)</td>\n</tr>\n<tr>\n<td>Non-stationarity</td>\n<td>Devices join/leave, traffic shifts over hours</td>\n</tr>\n</tbody>\n</table>\n<h3>Assumptions and Where They Break</h3>\n<p>The standard stochastic MAB assumes independent arms with stationary reward distributions. Both assumptions are violated in WiFi monitoring:</p>\n<p><strong>Stationarity.</strong> Traffic is bursty. Devices join and leave the network. A channel that was quiet at 2 AM will be busy at 10 AM. The Poisson rate  $\\lambda_i$  changes over time, which means an algorithm that converges to a single channel will eventually be wrong.</p>\n<p><strong>Independence.</strong> Dwelling on channel 6 picks up packets from channels 5 and 7 due to spectral overlap. The reward from one arm depends on the true state of adjacent arms. This violates the independent-arms assumption and biases naive estimators.</p>\n<p>We address non-stationarity with sliding-window extensions (see below) and independence with a cross-channel leakage model.</p>\n<h3>A Complication: Cross-Channel Leakage</h3>\n<p>In practice, dwelling on channel  $i$  picks up some packets from adjacent channels  $j$  due to spectral overlap. If we observe counts  $\\{o_{ij}\\}$  while dwelling on channel  $i$ , the naive reward counts everything:</p>\n\n\n$$r_i^{(t)} = \\sum_j o_{ij}^{(t)}$$\n\n<p>This leakage means channels aren't truly independent arms. We'll return to this in the mixture model section.</p>\n<h2>Algorithms</h2>\n<h3>Uniform Hopping</h3>\n<p>Visit each channel in round-robin order with equal dwell time. This is what <code>airodump-ng</code> does (modulo the interleaving order).</p>\n<p>If channel  $a^*$  has rate  $\\lambda^*$  and the average rate across channels is  $\\bar{\\lambda}$ , uniform hopping incurs <em>linear</em> regret:</p>\n\n\n$$R_T = T \\cdot (\\lambda^* - \\bar{\\lambda})$$\n\n<p>Regret grows linearly with time because we never stop exploring channels with low traffic. The advantage is coverage: every channel gets visited, so no device goes undetected.</p>\n<h3>Epsilon-Greedy</h3>\n<p>The simplest adaptive policy: with probability  $\\varepsilon$ , pick a random channel (explore); otherwise, pick the channel with highest observed mean (exploit).</p>\n\n\n$$a_t = \\begin{cases} \\arg\\max_i \\hat{\\lambda}_i & \\text{with probability } 1-\\varepsilon \\\\ \\text{Uniform}(\\{1,\\ldots,K\\}) & \\text{with probability } \\varepsilon \\end{cases}$$\n\n<p>With fixed  $\\varepsilon$ , regret is linear:  $R_T = O(\\varepsilon T)$  since the exploration fraction never decays. With a decaying schedule  $\\varepsilon_t = \\min(1, cK / (d^2 t))$  for suitable constants $c, d$, the regret becomes  $O(K \\ln T)$ <sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn3\" id=\"fnref3\">[3]</a></sup>.</p>\n<p>Epsilon-greedy is easy to implement and understand. Its weakness is that exploration is undirected -- it wastes pulls on channels that are clearly bad. <code>UCB</code> and Thompson Sampling fix this by exploring <em>where uncertainty is highest</em>.</p>\n<h3>UCB1</h3>\n<p>Upper Confidence Bound selects the channel that maximizes the sample mean plus an exploration bonus:</p>\n\n\n$$a_t = \\arg\\max_i \\left[ \\hat{\\lambda}_i + \\sqrt{\\frac{2 \\ln t}{n_i}} \\right]$$\n\n<p>where  $\\hat{\\lambda}_i$  is the average packet count on channel  $i$  and  $n_i$  is the number of times we've dwelled there.</p>\n<p>The exploration term  $\\sqrt{2 \\ln t / n_i}$  shrinks as a channel is visited more, so <code>UCB1</code> naturally shifts from exploration to exploitation. The intuition: we're optimistic in the face of uncertainty. A channel we haven't visited much gets a large bonus, so we'll try it. Once we've seen enough to know it's bad, the bonus can't save it.</p>\n<p>The regret bound comes from applying Hoeffding's inequality to the confidence width. For sub-Gaussian rewards (which includes bounded packet counts):</p>\n\n\n$$R_T \\le \\sum_{i: \\lambda_i < \\lambda^*} \\frac{8 \\ln T}{\\Delta_i} + \\left(1 + \\frac{\\pi^2}{3}\\right) \\sum_i \\Delta_i$$\n\n<p>where  $\\Delta_i = \\lambda^* - \\lambda_i$  is the gap between channel  $i$  and the best channel<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn3\" id=\"fnref3:1\">[3:1]</a></sup>.</p>\n<p>For the 14-channel simulation with  $T = 10{,}000$  steps and gaps  $\\Delta_i$  ranging from 5 to 40 packets/dwell, the dominant term  $\\sum 8 \\ln T / \\Delta_i$  evaluates to roughly 150 -- total regret of 150 packets out of a potential 500,000 (the best channel producing ~50 packets/dwell for 10,000 steps). That is 0.03% loss from the optimal policy, compared to uniform hopping's ~40% loss.</p>\n<p><em>Why  $\\ln T / \\Delta_i$ ?</em> The UCB selects arm  $i$  only while  $\\hat{\\lambda}_i + \\sqrt{2 \\ln t / n_i} \\ge \\hat{\\lambda}^*$ , which (by Hoeffding) happens with probability  $\\exp(-n_i \\Delta_i^2 / 2)$ . Setting this equal to  $1/t$  and solving gives  $n_i \\approx 2 \\ln t / \\Delta_i^2$  pulls before the bound tightens enough to exclude arm  $i$ . Each of those pulls costs  $\\Delta_i$  regret, giving total contribution  $\\approx 2 \\ln T / \\Delta_i$ . This is <em>logarithmic</em> in  $T$  -- the key improvement over uniform.</p>\n<h3>KL-UCB</h3>\n<p>For Poisson-distributed packet counts, <code>KL-UCB</code> gives tighter bounds by exploiting the distribution structure. It selects the channel maximizing  $\\lambda$  subject to  $n_i \\cdot d(\\hat{\\lambda}_i, \\lambda) \\le \\ln t$ , where  $d$  is the Poisson KL divergence:</p>\n\n\n$$d(\\lambda_1, \\lambda_2) = \\lambda_1 \\ln\\frac{\\lambda_1}{\\lambda_2} - \\lambda_1 + \\lambda_2$$\n\n<p>This matches the Lai-Robbins lower bound asymptotically<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn4\" id=\"fnref4\">[4]</a></sup>. The improvement over <code>UCB1</code> is not just theoretical: when packet counts follow a Poisson distribution (a reasonable model for independent frame arrivals), <code>KL-UCB</code> converges to the optimal channel in fewer samples because the Poisson KL divergence is tighter than the sub-Gaussian bound that <code>UCB1</code> uses.</p>\n<h3>Thompson Sampling</h3>\n<p>Rather than constructing a confidence bound, Thompson Sampling maintains a posterior distribution over each channel's packet rate and samples from it.</p>\n<p>Model packet counts as Poisson with unknown rate:  $r_i \\sim \\text{Poisson}(\\lambda_i)$ . The conjugate prior for a Poisson rate is the Gamma distribution:</p>\n\n\n$$\\lambda_i \\sim \\text{Gamma}(\\alpha_i, \\beta_i)$$\n\n<p>Initialize with a weak prior:  $\\alpha_i = 1, \\beta_i = 1$  for all channels (prior mean of 1 packet per dwell).</p>\n<p>At each step:</p>\n<ol>\n<li>Sample  $\\tilde{\\lambda}_i \\sim \\text{Gamma}(\\alpha_i, \\beta_i)$  for each channel</li>\n<li>Select  $a_t = \\arg\\max_i \\tilde{\\lambda}_i$ </li>\n<li>Dwell on channel  $a_t$ , observe packet count  $c$ </li>\n<li>Update:  $\\alpha_{a_t} \\leftarrow \\alpha_{a_t} + c$ ,  $\\beta_{a_t} \\leftarrow \\beta_{a_t} + 1$ </li>\n</ol>\n<p>The posterior mean  $\\alpha_i / \\beta_i$  converges to the true rate  $\\lambda_i$ . Channels with uncertain estimates (wide posterior) are sampled from more variably, which drives exploration. Channels with well-estimated low rates rarely produce a high sample, so they're naturally avoided.</p>\n<p>The regret bound matches <code>KL-UCB</code> asymptotically<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn5\" id=\"fnref5\">[5]</a></sup>:</p>\n\n\n$$R_T = O\\left(\\sum_{i: \\lambda_i < \\lambda^*} \\frac{\\ln T}{d(\\lambda_i, \\lambda^*)}\\right)$$\n\n<p><em>Why does this match the Lai-Robbins lower bound?</em> The posterior concentrates around the true rate at rate  $1/\\sqrt{n_i}$ , so the probability of sampling a rate higher than  $\\lambda^*$  from a suboptimal arm decays as  $\\exp(-n_i \\cdot d(\\lambda_i, \\lambda^*))$ . The expected number of pulls before this probability becomes negligible is  $\\ln T / d(\\lambda_i, \\lambda^*)$ , matching the information-theoretic minimum.</p>\n<p>In empirical evaluations, Thompson Sampling matches or beats <code>UCB1</code> in cumulative reward and convergence speed<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn4\" id=\"fnref4:1\">[4:1]</a></sup>. It's also simpler to implement: the update is two additions, and sampling from a Gamma distribution is a standard library call.</p>\n<pre><code class=\"language-python\">import numpy as np\n\nclass ThompsonChannelHopper:\n    def __init__(self, K):\n        self.alpha = np.ones(K)   # Gamma shape\n        self.beta = np.ones(K)    # Gamma rate\n\n    def select(self):\n        samples = np.random.gamma(self.alpha, 1.0 / self.beta)\n        return np.argmax(samples)\n\n    def update(self, channel, packets):\n        self.alpha[channel] += packets\n        self.beta[channel] += 1\n</code></pre>\n<h2>The Diversity Problem</h2>\n<p>Pure regret minimization converges to a single channel -- the one with highest traffic. For passive monitoring, this is unacceptable: we need to see devices on <em>all</em> channels, not just the busiest one.</p>\n<h3>Entropy Regularization</h3>\n<p>One approach is an entropy-regularized reward. Define a modified objective:</p>\n\n\n$$r^\\star_i = (1 - \\gamma) \\frac{r_i}{\\sum_j r_j} + \\gamma H(p)$$\n\n<p>where  $H(p) = -\\sum_i p_i \\ln p_i$  is the entropy of the channel visit distribution and  $\\gamma \\in [0, 1]$  controls the diversity-exploitation tradeoff. At  $\\gamma = 0$ , this reduces to pure exploitation. At  $\\gamma = 1$ , it reduces to uniform exploration.</p>\n<p>This has theoretical grounding: Zimmert and Seldin (2021) showed that Tsallis entropy regularization with  $\\alpha = 1/2$  achieves  $O(\\sqrt{KT})$  regret in adversarial settings and  $O(\\ln T / \\Delta)$  in stochastic settings simultaneously<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn6\" id=\"fnref6\">[6]</a></sup>. In the WiFi context, this means the entropy bonus helps when traffic is non-stationary (devices joining and leaving) without sacrificing convergence when traffic is stable.</p>\n<h3>Minimum Dwell Floor</h3>\n<p>A simpler practical alternative: impose a minimum dwell floor. Visit every channel at least once per sweep (the Kismet model), then allocate remaining time proportional to learned rates. This gives coverage guarantees without modifying the reward.</p>\n<p>Concretely: in each round of  $K$  steps, dedicate one step per channel (the &quot;floor&quot;), then allocate the remaining  $T - K$  steps using Thompson Sampling. The coverage guarantee is exact: every channel gets at least  $T/K$  visits. The regret cost of the floor is at most  $K \\cdot \\Delta_{\\max}$  per round, which is constant overhead.</p>\n<h3>Constrained Formulation</h3>\n<p>A more principled version: minimize regret subject to a minimum visit frequency per channel. This falls under the constrained bandit framework. Define  $f_i = n_i / T$  as the fraction of time spent on channel  $i$ . The constraint is  $f_i \\ge f_{\\min}$  for all  $i$ .</p>\n<p>Badanidiyuru et al. (2013) give an algorithm for bandits with knapsack constraints that achieves  $O(\\sqrt{T})$  regret while satisfying the constraints<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn7\" id=\"fnref7\">[7]</a></sup>. The WiFi application is a special case where the constraint is a lower bound on each arm's pull frequency.</p>\n<h2>Non-Stationary Extensions</h2>\n<p>Real WiFi traffic is non-stationary. The Poisson rates  $\\lambda_i(t)$  drift as devices join and leave, as usage patterns shift through the day, and as interference from neighboring networks changes. An algorithm that converges to a fixed channel allocation will eventually be wrong.</p>\n<h3>Sliding-Window UCB</h3>\n<p>Garivier and Moulines (2011) proposed Sliding-Window UCB (<code>SW-UCB</code>): instead of using all historical observations, compute the sample mean and confidence bound using only the last  $\\tau$  observations<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn8\" id=\"fnref8\">[8]</a></sup>:</p>\n\n\n$$a_t = \\arg\\max_i \\left[ \\hat{\\lambda}_{i,\\tau} + \\sqrt{\\frac{2 \\ln \\min(t, \\tau)}{n_{i,\\tau}}} \\right]$$\n\n<p>where  $\\hat{\\lambda}_{i,\\tau}$  and  $n_{i,\\tau}$  are the mean and count using only the window  $[t-\\tau, t]$ .</p>\n<p>The window size  $\\tau$  controls the bias-variance tradeoff: a short window reacts quickly to changes but has high variance; a long window is stable but slow to adapt. For WiFi monitoring with traffic that shifts on the scale of minutes,  $\\tau$  in the range of 100--500 steps (25--125 seconds at 250ms dwell) is reasonable.</p>\n<h3>Discounted Thompson Sampling</h3>\n<p>An alternative: instead of a hard window, apply exponential discounting to the Gamma posterior. Replace the update rule with:</p>\n\n\n$$\\alpha_{a_t} \\leftarrow \\delta \\cdot \\alpha_{a_t} + c, \\qquad \\beta_{a_t} \\leftarrow \\delta \\cdot \\beta_{a_t} + 1$$\n\n<p>where  $\\delta \\in (0, 1)$  is the discount factor. This exponentially down-weights old observations, making the posterior responsive to recent traffic. At  $\\delta = 1$ , this reduces to standard Thompson Sampling. At  $\\delta = 0.99$  with 250ms dwell, the effective memory is roughly  $1/(1-\\delta) = 100$  steps or 25 seconds.</p>\n<p>Raj and Kalyani (2017) proved that discounted Thompson Sampling achieves near-optimal regret in piecewise-stationary environments, with a regret bound that depends on the number of change points rather than the time horizon<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn9\" id=\"fnref9\">[9]</a></sup>.</p>\n<h2>Accounting for Cross-Channel Leakage</h2>\n<p>Consider three 2.4 GHz channels: 1, 6, and 11. While dwelling on channel 6, you might observe 40 packets from channel 6, 8 from channel 5, and 3 from channel 7. The &quot;leakage&quot; weights  $w_{ij}$  capture this:  $w_{6,6} \\approx 1.0$ ,  $w_{6,5} \\approx 0.2$ ,  $w_{6,7} \\approx 0.07$ . The weights decay roughly with frequency distance.</p>\n<p>In general, dwelling on channel  $i$  gives us counts  $o_{ij}$  from each source channel  $j$ . The total reward is  $r_i = \\sum_j o_{ij} = \\sum_j w_{ij} x_j$ , where  $x_j \\sim \\text{Poisson}(\\lambda_j)$  is the true traffic rate on channel  $j$  and  $w_{ij}$  is the overlap weight.</p>\n<p>The simplest correction: if the overlap weight matrix  $W$  is known (and for the 2.4 GHz band, it is approximately determined by the channel frequency layout -- 20 MHz channels spaced 5 MHz apart give a predictable overlap function), then deconvolve before updating. The observation vector  $\\mathbf{o}_i = W_i \\boldsymbol{\\lambda}$  is a linear mixture; solving  $\\hat{\\boldsymbol{\\lambda}} = W^{-1} \\mathbf{o}_i$  (or a regularized variant, since  $W$  is ill-conditioned for closely-spaced channels) recovers estimates of the per-channel rates. In the common case where only three non-overlapping channels carry most traffic (1, 6, 11 in 2.4 GHz),  $W$  is nearly diagonal and the correction reduces to simple rescaling. Update the Gamma posterior on each  $\\lambda_j$  with the deconvolved count. This is cheap and effective when the geometry is stable.</p>\n<p>When the weights are <em>not</em> known -- for instance, in indoor environments where multipath reflections and non-line-of-sight propagation distort the overlap pattern -- the weights themselves must be learned. This can be handled with a Bayesian model over the weight matrix (e.g., a Normal-Inverse-Wishart prior on the rows of  $W$ , updated jointly with the Gamma priors on  $\\lambda_j$ ). The machinery is standard conjugate Bayesian inference<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn10\" id=\"fnref10\">[10]</a></sup>, but the convergence is slower because you are learning  $K^2$  weights alongside  $K$  rates. In practice, the known-geometry approximation is usually sufficient for the 2.4 GHz and 5 GHz bands.</p>\n<h2>Related Work</h2>\n<p>The channel hopping problem sits at an intersection of several research communities that don't talk to each other much.</p>\n<p><strong>Cognitive radio and dynamic spectrum access.</strong> The CR literature has studied bandit-based channel selection extensively, but for <em>transmitting</em>, not monitoring. Jouini et al. (2010) applied <code>UCB</code> to opportunistic spectrum access where a secondary user must find unused spectrum<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn11\" id=\"fnref11\">[11]</a></sup>. Anandkumar et al. (2011) extended this to multi-user coordination, where multiple radios must independently select different channels without communication<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn12\" id=\"fnref12\">[12]</a></sup>. The monitoring problem is simpler (no coordination, no transmission constraints) but has the diversity requirement that CR work ignores.</p>\n<p><strong>Network measurement.</strong> The systems community has studied WiFi monitoring from a measurement perspective. Schulman et al. (2008) analyzed the statistical properties of channel scanning for network surveys<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn13\" id=\"fnref13\">[13]</a></sup>. The focus is on coverage metrics (fraction of APs detected) rather than sequential decision-making. Their finding -- that 30 seconds per channel suffices for 95% AP detection -- motivates the bandit approach: if most information arrives in the first few seconds, we should allocate time adaptively.</p>\n<p><strong>Restless bandits.</strong> When channels have state that evolves while unobserved (traffic arrives whether or not we're listening), the problem is formally a restless bandit. Whittle (1988) proposed an index policy for restless bandits that generalizes the Gittins index<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn14\" id=\"fnref14\">[14]</a></sup>. Computing the Whittle index is tractable for certain transition structures, including the binary (active/idle) channel model used in CR. Whether it applies to the Poisson traffic model in WiFi monitoring is an open question.</p>\n<h2>Regret Comparison</h2>\n<figure>\n  <img src=\"https://attobop.net/posts/channel-hopping/bandit_comparison.png\" alt=\"Three-panel comparison: cumulative regret, per-step regret, and channel selection fraction for Uniform, UCB1, and Thompson Sampling\" width=\"680\">\n  <figcaption> 14 channels with Poisson rates 3--48 packets/dwell, averaged over 20 runs. Left: cumulative regret -- uniform grows linearly, UCB1 and TS are sublinear. Center: per-step regret -- UCB1 drops below 1, TS settles around 3.5, uniform stays at ~23. Right: channel selection fraction -- UCB1 concentrates on the best channel; TS spreads across the top 2--3; uniform visits all equally.</figcaption>\n</figure>\n<p>In this high-rate Poisson regime, <code>UCB1</code> outperforms Thompson Sampling because the Poisson variance scales with the mean -- the posterior updates for high-rate channels are noisier, causing TS to explore the top few channels longer before committing. Both achieve sublinear regret; both vastly outperform uniform.</p>\n<h2>Open Problems</h2>\n<p><strong>6 GHz channel explosion.</strong> With 59 channels in 6 GHz alone (plus 14 in 2.4 GHz and ~25 in 5 GHz), the exploration budget is severe. A uniform sweep takes ~25 seconds. Contextual information (time of day, known AP locations, channel utilization reports from 802.11k/v/r) could dramatically reduce the exploration needed. This pushes toward contextual bandits, where side information informs the channel selection policy.</p>\n<p><strong>Multi-interface parallelism.</strong> Kismet already supports splitting channel lists across multiple NICs. The multi-agent bandit formulation -- where  $M$  monitors must coordinate which channels to observe simultaneously -- is well-studied theoretically (Anandkumar et al. 2011) but not implemented in any monitoring tool. With USB WiFi adapters at $15 each, running 4--8 monitors in parallel is practical, and the coordination question is open: should they explore independently, or share observations?</p>\n<p><strong>Non-stationarity detection.</strong> Rather than always applying a sliding window (which adds a tuning parameter), a change-point detection algorithm could trigger re-exploration only when the traffic distribution shifts. Tartakovsky et al. (2014) give sequential change-point tests that could be layered on top of Thompson Sampling<sup class=\"footnote-ref\"><a href=\"https://attobop.net/posts/channel-hopping/#fn15\" id=\"fnref15\">[15]</a></sup>.</p>\n<p><strong>Switching cost.</strong> We've treated channel switches as having a fixed time cost, but they also have an information cost: during the switch, no packets are captured on <em>any</em> channel. The bandit-with-switching-costs formulation (Dekel et al. 2014) penalizes frequent switches, which would favor longer dwell times and fewer hops. This is the formal version of the practitioner's intuition that &quot;hopping too fast wastes time.&quot;</p>\n<p><strong>Integration with active scanning.</strong> Passive monitoring can be supplemented with probe requests to solicit responses from hidden APs. The decision of when to send a probe (which temporarily makes the monitor visible) is a separate explore-exploit tradeoff layered on top of the channel selection problem.</p>\n<h2>Summary</h2>\n<table>\n<thead>\n<tr>\n<th>Method</th>\n<th>Regret</th>\n<th>Diversity</th>\n<th>Complexity</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Uniform</td>\n<td> $O(T)$  -- linear</td>\n<td>Full coverage</td>\n<td>None</td>\n</tr>\n<tr>\n<td> $\\varepsilon$ -greedy (decaying)</td>\n<td> $O(K \\ln T)$ </td>\n<td>Moderate</td>\n<td>Sample mean</td>\n</tr>\n<tr>\n<td>UCB1</td>\n<td> $O(\\ln T / \\Delta)$ </td>\n<td>Decreasing</td>\n<td>Sample mean + bonus</td>\n</tr>\n<tr>\n<td>KL-UCB</td>\n<td> $O(\\ln T / d(\\lambda, \\lambda^*))$ </td>\n<td>Decreasing</td>\n<td>KL optimization</td>\n</tr>\n<tr>\n<td>Thompson Sampling</td>\n<td> $O(\\ln T / d(\\lambda, \\lambda^*))$ </td>\n<td>Decreasing</td>\n<td>Gamma posterior sampling</td>\n</tr>\n<tr>\n<td>TS + entropy</td>\n<td> $O(\\sqrt{KT})$  worst case</td>\n<td>Tunable via  $\\gamma$ </td>\n<td>Gamma + entropy term</td>\n</tr>\n<tr>\n<td>Sliding-Window UCB</td>\n<td> $O(\\ln T / \\Delta)$  per segment</td>\n<td>Decreasing</td>\n<td>Windowed mean + bonus</td>\n</tr>\n<tr>\n<td>Discounted TS</td>\n<td> $O(\\ln T / d)$  per segment</td>\n<td>Decreasing</td>\n<td>Discounted Gamma</td>\n</tr>\n</tbody>\n</table>\n<p>The right choice depends on the monitoring goal. For intrusion detection (must see every device), uniform or entropy-regularized TS with high  $\\gamma$  preserves coverage. For traffic analysis (maximize total observed packets), pure Thompson Sampling with Gamma-Poisson conjugate is hard to beat. For environments where traffic patterns shift throughout the day, discounted TS or <code>SW-UCB</code> adapt without manual retuning.</p>\n<p>Every production tool today uses the first row. The gap is open.</p>\n<hr>\n<h2>References</h2>\n<hr class=\"footnotes-sep\">\n<section class=\"footnotes\">\n<ol class=\"footnotes-list\">\n<li id=\"fn1\" class=\"footnote-item\"><p>USB-WiFi issue #376. Channel switching latency benchmarks across chipsets. <a href=\"https://github.com/morrownr/USB-WiFi/issues/376\">github.com/morrownr/USB-WiFi</a> <a href=\"https://attobop.net/posts/channel-hopping/#fnref1\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn2\" class=\"footnote-item\"><p>Lai, T. L. &amp; Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. <em>Advances in Applied Mathematics</em>, 6(1), 4--22. <a href=\"https://attobop.net/posts/channel-hopping/#fnref2\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn3\" class=\"footnote-item\"><p>Auer, P., Cesa-Bianchi, N., &amp; Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. <em>Machine Learning</em>, 47(2), 235--256. <a href=\"https://attobop.net/posts/channel-hopping/#fnref3\" class=\"footnote-backref\">↩︎</a> <a href=\"https://attobop.net/posts/channel-hopping/#fnref3:1\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn4\" class=\"footnote-item\"><p>Garivier, A. &amp; Cappe, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. <em>COLT</em>. <a href=\"https://attobop.net/posts/channel-hopping/#fnref4\" class=\"footnote-backref\">↩︎</a> <a href=\"https://attobop.net/posts/channel-hopping/#fnref4:1\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn5\" class=\"footnote-item\"><p>Agrawal, S. &amp; Goyal, N. (2012). Further optimal regret bounds for Thompson Sampling. <em>arXiv:1209.3353</em>. Finite-time Poisson-Gamma bounds in: Jin, T. et al. (2022). Finite-time regret of Thompson Sampling algorithms for exponential family multi-armed bandits. <em>NeurIPS</em>. <a href=\"https://attobop.net/posts/channel-hopping/#fnref5\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn6\" class=\"footnote-item\"><p>Zimmert, J. &amp; Seldin, Y. (2021). Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. <em>JMLR</em>, 22(28), 1--28. <a href=\"https://attobop.net/posts/channel-hopping/#fnref6\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn7\" class=\"footnote-item\"><p>Badanidiyuru, A., Kleinberg, R., &amp; Slivkins, A. (2013). Bandits with knapsacks. <em>FOCS</em>, 207--216. <a href=\"https://attobop.net/posts/channel-hopping/#fnref7\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn8\" class=\"footnote-item\"><p>Garivier, A. &amp; Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. <em>ALT</em>, 174--188. <a href=\"https://attobop.net/posts/channel-hopping/#fnref8\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn9\" class=\"footnote-item\"><p>Raj, V. &amp; Kalyani, S. (2017). Taming non-stationary bandits: A Bayesian approach. <em>arXiv:1707.09727</em>. <a href=\"https://attobop.net/posts/channel-hopping/#fnref9\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn10\" class=\"footnote-item\"><p>Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., &amp; Rubin, D. B. (2014). <em>Bayesian Data Analysis</em> (3rd ed., p. 73). Chapman &amp; Hall/CRC. <a href=\"https://attobop.net/posts/channel-hopping/#fnref10\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn11\" class=\"footnote-item\"><p>Jouini, W., Ernst, D., Moy, C., &amp; Palicot, J. (2010). Upper confidence bound based decision making strategies and dynamic spectrum access. <em>ICC</em>, 1--5. <a href=\"https://attobop.net/posts/channel-hopping/#fnref11\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn12\" class=\"footnote-item\"><p>Anandkumar, A., Michael, N., Tang, A. K., &amp; Swami, A. (2011). Distributed algorithms for learning and cognitive medium access with logarithmic regret. <em>IEEE JSAC</em>, 29(4), 731--745. <a href=\"https://attobop.net/posts/channel-hopping/#fnref12\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn13\" class=\"footnote-item\"><p>Schulman, A., Levin, D., &amp; Spring, N. (2008). On the fidelity of 802.11 packet traces. <em>PAM</em>, 132--141. <a href=\"https://attobop.net/posts/channel-hopping/#fnref13\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn14\" class=\"footnote-item\"><p>Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. <em>Journal of Applied Probability</em>, 25(A), 287--298. <a href=\"https://attobop.net/posts/channel-hopping/#fnref14\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n<li id=\"fn15\" class=\"footnote-item\"><p>Tartakovsky, A., Nikiforov, I., &amp; Basseville, M. (2014). <em>Sequential Analysis: Hypothesis Testing and Changepoint Detection</em>. Chapman &amp; Hall/CRC. <a href=\"https://attobop.net/posts/channel-hopping/#fnref15\" class=\"footnote-backref\">↩︎</a></p>\n</li>\n</ol>\n</section>\n","date_published":"Mon, 26 Jun 2023 00:00:00 GMT"}]}