Why Regions, Not Points: Geometric Embeddings for Subsumption
April 2, 2026
"Dog is-a Animal." This is a statement about sets: every dog is an animal, so the set of dogs is contained in the set of animals. If you want an embedding to capture this, the embedding of Dog should be inside the embedding of Animal in some geometric sense. Points have no interior, no volume, and no way to express that one concept is a subset of another — not just close, but contained. You need regions. The entire field follows from taking "is-a" literally as set containment.
Background
Notation. Throughout this post: $m, M$ are box corners (min, max); $c, o$ are center and half-width (offset); $d$ is the embedding dimension; $\beta$ is the Gumbel temperature; $\sqsubseteq$ is subsumption ("is-a"); $\sqcap$ is concept conjunction ("and"); $P(B \subseteq A)$ is containment probability.
Knowledge graph embeddings
A knowledge graph is a collection of triples $(h, r, t)$: head entity, relation, tail entity. ("Berlin", capitalOf, "Germany") or ("Dog", isA, "Animal"). These graphs are large but sparse: 93.8% of people in Freebase have no recorded birthplace, and 78.5% have no recorded nationality. The standard task is link prediction: given (Berlin, capitalOf, ?) or (?, isA, Animal), rank the missing entity.
The dominant approaches since 2013 embed entities as vectors and score triples geometrically — either by treating relations as transformations (TransE, RotatE) or as bilinear forms like $h^\top M_r t$ where $M_r$ is a learnable matrix per relation (DistMult, ComplEx). The point-embedding models discussed here are implemented in tranz — named after the Trans* family (TransE, TransR, TransH, TransD), which all model relations as geometric operations on points: translations, projections into relation-specific subspaces, or hyperplane reflections. They differ in how they handle 1-to-N and N-to-N relations, but none can represent containment. The region embeddings that follow are in subsume, named for the subsumption relation ($\sqsubseteq$) these embeddings model.
TransE [2] — short for "translating embeddings" — models each relation as a translation: score a triple $(h, r, t)$ by $\|h + r - t\|$. Lower is better. The choice of L1 vs L2 norm matters less than you'd expect. If the triple is true, the head plus the relation vector should land near the tail. It is simple and scalable, and it became the baseline most subsequent work is measured against, reaching competitive performance on FB15k (a standard link prediction benchmark derived from Freebase, with roughly 15,000 entities) despite the simple scoring function.
But TransE fails structurally on certain relation types. For 1-to-N relations like (Britain, hasCity, ?), every correct tail entity must satisfy $h + r \approx t$, which forces London, Manchester, and Edinburgh to have nearly identical embeddings. The model can't distinguish cities that share a country. Symmetric relations are worse: if $h + r = t$ and $t + r = h$, then $r = 0$, which collapses the relation entirely. The "married to" relation forces every married couple to share an embedding. Not ideal.
RotatE [7] — "rotation embedding" — fixes the symmetry problem by working in complex space: each relation is an element-wise rotation $t = h \circ r$ where $|r_i| = 1$. Symmetric relations get angle $\pi$ (rotation by 180 degrees is its own inverse). This handles symmetry, antisymmetry, inversion, and composition.
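Both failure modes can be checked numerically. A minimal sketch with made-up embeddings (values are illustrative, not trained):

```python
import numpy as np

# Hypothetical 4-d embeddings for two entities.
h = np.array([0.3, -1.2, 0.8, 0.5])
t = np.array([-0.7, 0.4, 1.1, -0.2])

# TransE: a symmetric relation needs h + r = t AND t + r = h.
# Adding the two conditions gives 2r = 0, so r must vanish.
r_forward = t - h   # the r that makes (h, r, t) score perfectly
r_backward = h - t  # the r that makes (t, r, h) score perfectly
print(np.allclose(r_forward, -r_backward))  # True: only r = 0 satisfies both

# RotatE: complex embeddings, relation is a unit phase per dimension.
hc = np.array([0.3 + 0.4j, -1.2 + 0.1j])
r = np.exp(1j * np.pi) * np.ones(2)  # rotation by pi: r applied twice is identity
tc = hc * r
print(np.allclose(tc * r, hc))       # True: the relation is its own inverse
```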
But neither TransE nor RotatE can represent containment. Distance can approximate containment, but it can never encode it. "Close to Animal" is not the same as "inside Animal." Both embed entities as points. A point has no volume, no interior, no boundary. You can measure the distance between two points, but you can't ask whether one point is inside another.
What containment requires
Consider the subsumption hierarchy: Dog $\sqsubseteq$ Animal $\sqsubseteq$ LivingThing, where $\sqsubseteq$ (read "is subsumed by") means every instance of the left concept is also an instance of the right. In set-theoretic terms, $\text{ext}(\text{Dog}) \subseteq \text{ext}(\text{Animal}) \subseteq \text{ext}(\text{LivingThing})$ , where $\text{ext}(C)$ is the set of individuals that fall under concept $C$ .
An embedding that captures this needs three properties:
- Volume. More general concepts (Animal) should have larger representations than specific ones (Dog). "Animal" covers more ground than "Dog."
- Containment. The representation of Dog should be geometrically inside the representation of Animal. Not just "close to" — inside.
- Intersection. The concepts "Animal" and "Pet" overlap (some animals are pets, some aren't). Their representations should intersect, and the intersection should itself be a valid representation — of the concept $\text{Animal} \sqcap \text{Pet}$, whether or not that conjunction has a name in the ontology.
Points fail all three. They have no volume. No interior for containment. Two distinct points have empty intersection — there is no partial overlap.
Order embeddings: the first attempt
Vendrov et al. [3] had the right instinct: embed the partial order directly into the geometry. Their approach mapped concepts into the non-negative orthant $\mathbb{R}^d_+$ (the region where all coordinates are $\ge 0$, the $d$-dimensional analogue of the first quadrant) and imposed the reverse product order: $x$ is more general than $y$ if $x_i \le y_i$ for every coordinate. The origin is the top element — the most general concept. The "entity" concept, containing everything, sits at zero. Specificity grows with coordinate magnitude. Each point defines two cones: the cone extending toward the origin (smaller coordinates) contains its ancestors (more general concepts); the cone extending away from the origin (larger coordinates) contains its descendants (more specific concepts).
This gets transitivity for free: if $x \le y$ and $y \le z$ coordinate-wise, then $x \le z$ . But the representation is still points, and the cones are unbounded. Vilnis et al. [4] proved the deeper problem: for any product probability measure $p(x) = \prod_i p_i(x_i)$ over the non-negative orthant, the covariance of the indicator functions (1 if a random point lands in the cone, 0 otherwise) of any two cones is non-negative — $\text{Cov}(\mathbf{1}_{C_A}, \mathbf{1}_{C_B}) \ge 0$ . If "Mammal" and "Reptile" are both under "Animal," their cones necessarily overlap — you cannot express that they are disjoint under any such measure. The geometry forces all concepts under a common ancestor to be positively correlated, which is wrong for siblings that should be mutually exclusive.
This motivated boxes: a bounded region closed under intersection that admits a probabilistic containment score.
Box Embeddings
The idea of representing words as regions rather than points goes back to Erk [1], who embedded words as convex regions in vector space to model graded entailment. Erk's 2009 paper predates the deep learning era entirely; the idea of "words as regions" waited nearly a decade for the right optimization tools. Vilnis et al. [4] made this precise for knowledge graphs, proposing axis-aligned hyperrectangles — boxes — in $\mathbb{R}^d$. A box $B$ is parameterized by its minimum and maximum corners:

$$B = [m, M] = \{x \in \mathbb{R}^d : m_i \le x_i \le M_i \text{ for all } i\}$$
where $m = (m_1, \ldots, m_d)$ and $M = (M_1, \ldots, M_d)$ with $m_i \le M_i$ for each coordinate.
This is the simplest region that supports all three properties. Volume is the product of side lengths:

$$\text{Vol}(B) = \prod_{i=1}^{d} (M_i - m_i)$$
Containment is coordinate-wise: $A \subseteq B$ if and only if $m^B_i \le m^A_i$ and $M^A_i \le M^B_i$ for all $i$. The intersection of two boxes is a box (or empty):

$$A \cap B = \prod_{i=1}^{d} \left[\max(m^A_i, m^B_i),\ \min(M^A_i, M^B_i)\right]$$
The intersection is non-empty when $\max(m^A_i, m^B_i) \le \min(M^A_i, M^B_i)$ for every coordinate. If any coordinate has an empty interval, the boxes are disjoint. Disjoint boxes represent mutually exclusive concepts — the thing order embeddings could not express. And when boxes partially overlap, the intersection region is itself a box representing the conjunction: the box for "Animal" intersected with the box for "Pet" gives a box for "things that are both animals and pets."
The axis-alignment is a deliberate restriction. Why not balls? The intersection of two balls is not a ball. Boxes are the simplest shape where "overlap" stays in the family. Rotated boxes would be more expressive, but their intersection is not necessarily a rotated box — you lose closure under intersection. Axis-aligned boxes form a lattice — a structure where any two elements have a greatest lower bound (intersection/meet) and a least upper bound (bounding box/join) — mirroring concept hierarchies in description logics (formal languages for representing ontologies; see the EL⁺⁺ section below). Each dimension contributes an independent "vote" on containment — the box analogue of diagonal covariance in Gaussian models.
Containment probability
The natural scoring function for subsumption is the containment probability:

$$P(B \subseteq A) = \frac{\text{Vol}(A \cap B)}{\text{Vol}(B)}$$
Concretely: draw a point uniformly from $B$; this is the probability it lands in $A$. This uniform-measure interpretation is what makes the $P(\cdot)$ notation honest. Without it, $P(B \subseteq A)$ applies probability notation to a deterministic set-theoretic predicate — either $B$ is inside $A$ or it isn't. The uniform draw gives it a genuine probabilistic meaning. It returns 1 if $B$ is entirely inside $A$, 0 if they are disjoint, and a value between 0 and 1 for partial overlap. If we want to express "Dog is-a Animal," we train so that $P(\text{Dog} \subseteq \text{Animal}) \approx 1$.
The asymmetry is important: $P(B \subseteq A) \ne P(A \subseteq B)$ in general, because the denominator is always the volume of the thing being tested for containment. A small box inside a large box gives $P = 1$ in one direction and $P \ll 1$ in the other. This matches the semantics: every dog is an animal, but not every animal is a dog.
A worked example
Take $d = 2$ . Let Animal $= [0, 10] \times [0, 10]$ and Dog $= [2, 5] \times [3, 7]$ .
The intersection is Dog itself (Dog is inside Animal), so $\text{Vol}(\text{Animal} \cap \text{Dog}) = \text{Vol}(\text{Dog}) = 3 \times 4 = 12$ and $P(\text{Dog} \subseteq \text{Animal}) = 12/12 = 1$.
Now add Cat $= [6, 9] \times [3, 7]$ . Dog and Cat are disjoint (no overlap in the first coordinate: Dog's $[2, 5]$ doesn't intersect Cat's $[6, 9]$), so $P(\text{Dog} \subseteq \text{Cat}) = 0$ . Both are contained in Animal. The geometry directly encodes the taxonomy.
```python
import numpy as np

def containment_prob(a_min, a_max, b_min, b_max):
    """P(B subset A): fraction of B's volume inside A."""
    inter_min = np.maximum(a_min, b_min)
    inter_max = np.minimum(a_max, b_max)
    inter_sides = np.maximum(inter_max - inter_min, 0)
    b_sides = np.maximum(b_max - b_min, 1e-10)
    return np.prod(inter_sides) / np.prod(b_sides)

animal = (np.array([0, 0]), np.array([10, 10]))
dog = (np.array([2, 3]), np.array([5, 7]))
cat = (np.array([6, 3]), np.array([9, 7]))

print(f"P(Dog ⊆ Animal) = {containment_prob(*animal, *dog)}")  # 1.0
print(f"P(Animal ⊆ Dog) = {containment_prob(*dog, *animal)}")  # 0.12
print(f"P(Dog ⊆ Cat) = {containment_prob(*cat, *dog)}")        # 0.0
```
In high dimensions, computing the raw volume $\prod_i (M_i - m_i)$ numerically is unstable — floating-point underflow to zero occurs well before the full product is computed. The fix is to work in log-space: $\log \text{Vol}(B) = \sum_{i=1}^{d} \log(M_i - m_i)$ . The product becomes a sum, and everything stays tractable. All practical implementations operate on log-volumes, not volumes.
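A quick numerical check of the underflow claim, with randomly generated box corners (all names here are illustrative):

```python
import numpy as np

def log_volume(box_min, box_max, eps=1e-30):
    """Log-volume of a box: a sum of log side lengths, stable in high d."""
    sides = np.maximum(box_max - box_min, eps)
    return np.sum(np.log(sides))

d = 500
rng = np.random.default_rng(0)
lo = rng.uniform(0.0, 0.4, size=d)
hi = lo + rng.uniform(0.01, 0.2, size=d)  # side lengths in (0.01, 0.2)

naive = np.prod(hi - lo)     # product of 500 small numbers
stable = log_volume(lo, hi)  # same quantity, computed in log-space

print(naive)   # 0.0: the raw product underflows past the double range
print(stable)  # a finite, large negative log-volume
```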
The Gradient Problem
Hard boxes have a serious training defect. This section is why the Gumbel paper exists. The gradient problem is not a minor inconvenience — it makes naive box training fail completely. Consider two disjoint boxes — say Dog $= [2, 5] \times [3, 7]$ and Fish $= [20, 25] \times [30, 35]$. Their intersection volume is zero. If we move Dog slightly — say shift it by $\epsilon$ in any direction — the intersection is still zero. The loss doesn't change. The gradient is zero.
This is the local identifiability problem — the formal name for the gradient problem in box embeddings. The name comes from statistics: a parameter is locally identifiable if small perturbations change the likelihood. Here, they don't. Dasgupta et al. (2020) use "local identifiability" as the motivation for Gumbel boxes. When boxes are disjoint, the containment probability is identically zero in a neighborhood of the current parameters. The optimizer receives no signal about which direction to move to bring the boxes closer.
The problem gets worse in high dimensions. In $\mathbb{R}^d$ , two boxes are disjoint if any single coordinate has non-overlapping intervals. With $d = 200$ and random initialization, the probability that two boxes overlap in all 200 coordinates simultaneously is negligible. Almost every pair starts disjoint, and the optimizer is blind to almost every relationship it needs to learn.
The intersection volume is piecewise multilinear in the box coordinates — within each combinatorial regime (fully contained, partially overlapping, disjoint), it is a product of $d$ linear terms, one per coordinate. The containment probability, as a ratio of two such products, is a piecewise rational function. At the boundaries between regimes, the function is continuous but not differentiable. And in the disjoint regime, it is flat: identically zero, with zero gradient everywhere.
This is analogous to the dead ReLU problem in neural networks, but worse. A dead ReLU neuron has zero gradient for negative inputs but can recover if the bias shifts; a pair of disjoint boxes in $d$ dimensions has zero gradient across all $4d$ parameters (two boxes, each with $2d$ corner coordinates).
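The flat loss surface can be checked directly with finite differences. A sketch using the Dog and Fish boxes above (`intersection_volume` is my helper, not library code):

```python
import numpy as np

def intersection_volume(a_min, a_max, b_min, b_max):
    """Hard-box intersection volume: zero if any coordinate is disjoint."""
    sides = np.maximum(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0)
    return np.prod(sides)

# The disjoint pair from the text.
dog_min, dog_max = np.array([2.0, 3.0]), np.array([5.0, 7.0])
fish_min, fish_max = np.array([20.0, 30.0]), np.array([25.0, 35.0])

# Central finite differences of the intersection volume w.r.t. each of
# Dog's 4 corner coordinates: every perturbed configuration is still
# disjoint, so every estimated partial derivative is exactly zero.
eps = 1e-3
grads = []
for i in range(2):
    for arr in (dog_min, dog_max):
        arr[i] += eps
        plus = intersection_volume(dog_min, dog_max, fish_min, fish_max)
        arr[i] -= 2 * eps
        minus = intersection_volume(dog_min, dog_max, fish_min, fish_max)
        arr[i] += eps  # restore
        grads.append((plus - minus) / (2 * eps))

print(grads)  # [0.0, 0.0, 0.0, 0.0]: no signal in any direction
```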
First fix: smoothing (Li et al., 2019)
Li et al. [6] attacked the gradient problem by convolving the hard box indicator functions with Gaussian kernels. The smoothed intersection is never exactly zero — even for disjoint boxes, the Gaussian tails overlap, providing gradient signal.
The smoothed volume has a closed-form expression involving the Gaussian CDF, which is differentiable everywhere. This works: disjoint boxes now produce nonzero loss, and the optimizer can move them toward overlap.
Smoothing breaks the lattice. Softboxes trade one structural guarantee (closure under intersection) for another (smooth gradients). The intersection of two smoothed boxes is not itself a smoothed box. The representation is no longer closed under intersection — the lattice structure, which was one of the main reasons for choosing boxes in the first place, is lost. The smoothed model is also not idempotent (applying an operation twice gives a different result than once): $P(A \subseteq A) \ne 1$ in general, which is semantically wrong — every set contains itself. And in one-dimensional tree experiments (where the ground truth is recoverable), softbox Mean Reciprocal Rank (MRR, a ranking metric where 1.0 is perfect) plateaued at 0.691 compared to 0.971 for Gumbel boxes — a ceiling imposed by the smoothing's loss of lattice structure. Smoothing fixed the zero-gradient problem but introduced new ones.
Second fix: Gumbel boxes (Dasgupta et al., 2020)
Dasgupta et al. [8] found a solution that fixes the gradient problem without breaking the lattice.
Instead of a deterministic box $[m_i, M_i]$, make each endpoint a random variable drawn from a Gumbel distribution. The Gumbel distribution arises naturally as the limiting distribution of the maximum (or minimum) of many independent samples — it is to maxima what the Gaussian is to averages. It comes in two variants: Gumbel-max (right-skewed, models maxima) and Gumbel-min (left-skewed, models minima). The lower bound $m_i$ gets a Gumbel-max because box intersection takes the $\max$ of lower bounds; the upper bound $M_i$ gets a Gumbel-min because intersection takes the $\min$ of upper bounds. The pairing sounds backwards until you think about intersection: new lower bound = max of old lower bounds (Gumbel-max stays Gumbel-max), new upper bound = min of old upper bounds (Gumbel-min stays Gumbel-min).

$$m_i \sim \text{GumbelMax}(\mu^m_i, \beta), \qquad M_i \sim \text{GumbelMin}(\mu^M_i, \beta)$$
where $\mu^m_i, \mu^M_i$ are learnable location parameters and $\beta > 0$ is a temperature controlling the "softness" of the boundaries. At $\beta \to 0$, the Gumbel distributions collapse to point masses and we recover hard boxes. Think of $\beta$ as wall thickness. Large $\beta$: fuzzy walls, easy to push through. Small $\beta$: sharp walls, crisp decisions.
Why Gumbel specifically, and not Gaussian or logistic or any other smooth distribution? Because the Gumbel family is min-max stable: if you take the maximum of several independent Gumbel-max variables, the result is itself a Gumbel-max (and likewise, the minimum of Gumbel-min variables is Gumbel-min). No other common distribution has this property. The new location parameter is computed via LogSumExp ($\text{LSE}(x_1, \ldots, x_k) = \log \sum_i e^{x_i}$, a smooth approximation to max):

$$\mu^{m, A \cap B}_i = \beta \, \text{LSE}\!\left(\frac{\mu^{m,A}_i}{\beta}, \frac{\mu^{m,B}_i}{\beta}\right)$$

with the mirror image (negate, apply LSE, negate) for the upper bounds.
Recall that box intersection takes the coordinate-wise max of lower bounds and min of upper bounds. With Gumbel endpoints, the intersection of two Gumbel boxes is again a Gumbel box — computed in closed form via LogSumExp. The lattice structure that Gaussian smoothing destroyed is preserved.
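Under that rule (max of lower bounds via LSE, min of upper bounds via its mirror image), the intersection takes a few lines. A sketch with hypothetical location parameters; `gumbel_intersection` is my name, not a library function:

```python
import numpy as np

def lse(x):
    """Numerically stable log-sum-exp along axis 0."""
    m = np.max(x, axis=0)
    return m + np.log(np.sum(np.exp(x - m), axis=0))

def gumbel_intersection(mu_min_a, mu_max_a, mu_min_b, mu_max_b, beta=1.0):
    """Location parameters of the intersection of two Gumbel boxes.

    Smooth analogues of max (lower bounds) and min (upper bounds):
    the result is again a Gumbel box, so the lattice is preserved.
    """
    mu_min = beta * lse(np.stack([mu_min_a / beta, mu_min_b / beta]))
    mu_max = -beta * lse(np.stack([-mu_max_a / beta, -mu_max_b / beta]))
    return mu_min, mu_max

# Animal-like and Dog-like location parameters from the worked example.
a = (np.array([0.0, 0.0]), np.array([10.0, 10.0]))
b = (np.array([2.0, 3.0]), np.array([5.0, 7.0]))

for beta in (1.0, 0.1, 0.001):
    lo, hi = gumbel_intersection(a[0], a[1], b[0], b[1], beta=beta)
    print(beta, lo, hi)
# As beta -> 0 the locations approach the hard intersection [2, 3] x [5, 7].
```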
Where the Bessel function comes from
The expected side length of a Gumbel box along one coordinate involves a modified Bessel function $K_0$. What matters is its behavior: $K_0$ starts at infinity for argument zero, then drops off smoothly — monotonically decreasing, convex, and asymptotically $\sqrt{\pi/(2z)} \, e^{-z}$. For large positive gap $\Delta_i$ between box endpoints (box is wide open), the expected volume is large. For $\Delta_i \approx 0$ (box is barely open), it is small but nonzero. For $\Delta_i < 0$ (endpoints are "inverted," meaning the box is empty in expectation), it is tiny but still has a nonzero derivative $-K_1$. The optimizer always has a signal — this is the property that hard boxes and softboxes lack.
The formula: the expected side length is $\mathbb{E}[\max(M_i - m_i, 0)]$ where $M_i$ is Gumbel-min and $m_i$ is Gumbel-max. Integrating the product of a Gumbel-max PDF and a Gumbel-min survival function gives:

$$\mathbb{E}\big[\max(M_i - m_i, 0)\big] = 2\beta \, K_0\!\left(\frac{2}{\beta}\, e^{-\Delta_i/(2\beta)}\right)$$
where $\Delta_i = \mu^M_i - \mu^m_i$ is the gap between the location parameters. Why a Bessel function? $K_0$ arises whenever you integrate products of exponential-family distributions with a maximum operation. The Gumbel distribution is the Type I extreme value distribution, so the integral reduces directly to the standard Laplace-type representation of $K_0$ .
A concrete comparison in 1D: two intervals that should overlap but are currently disjoint, with a gap of 2 units.
- Hard box: gradient zero everywhere. The optimizer has no direction.
- Softbox ( $\sigma = 1$ ): gradient nonzero (Gaussian tails give containment probability $\sim 0.05$ ), but the smoothing creates a flat ceiling — different box configurations produce the same loss. 1D tree MRR stalls at 0.691.
- Gumbel box ( $\beta = 1$ ): gradient nonzero and correctly directed, scaling with gap size. Expected overlap $\sim 0.13$ via $K_0$ . 1D tree MRR: 0.971.
Gumbel boxes don't just "smooth a little better" — they fix the gradient problem while keeping the expected intersection volume well-defined and closed-form, so the loss function is coherent across all box configurations, not just the overlapping ones.
The softplus approximation
Computing Bessel functions in every forward pass is expensive. Dasgupta et al. [8] observed that $2\beta K_0(\frac{2}{\beta} e^{-x/(2\beta)})$ is nearly indistinguishable from a shifted softplus, and proposed:

$$\mathbb{E}\big[\max(M_i - m_i, 0)\big] \approx \beta \, \text{softplus}\!\left(\frac{\Delta_i}{\beta} - 2\gamma\right) = \beta \log\!\left(1 + e^{\Delta_i/\beta - 2\gamma}\right)$$
where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant (the mean of a standard Gumbel distribution). The $2\gamma$ shift accounts for the difference of means of the two endpoints, each contributing $\gamma\beta$ to its expected value. The approximation error is bounded by $0.062\beta$, which is negligible in training. Numerically computed (Appendix C of Dasgupta et al. [8]), not a formal proof; the exact value is $0.0617013\beta$.
This is what implementations actually compute: a softplus with a constant shift. It is cheap, differentiable, monotonic with respect to containment, and numerically stable. The full expected log-volume of a Gumbel box becomes a sum of $d$ softplus evaluations — the same computational complexity as a hard box, but with smooth gradients everywhere.
The temperature $\beta$ acts as a curriculum. Start with large $\beta$ (soft, fuzzy boxes with wide gradients) and anneal toward small $\beta$ (hard boxes with crisp boundaries). Early in training, the soft boundaries let the optimizer explore; late in training, the hard boundaries let the model make precise containment judgments.
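The always-on gradient is easy to see numerically. A sketch of the softplus-approximated expected side length (the function name and the finite-difference check are mine), evaluated at open, borderline, and inverted gaps:

```python
import numpy as np

GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_side(delta, beta):
    """Softplus approximation to a Gumbel box's expected side length.

    delta = mu_max - mu_min; a negative delta means the box is empty
    in expectation, yet the value and gradient stay strictly positive.
    """
    return beta * np.log1p(np.exp(delta / beta - 2 * GAMMA))

beta = 1.0
for delta in (3.0, 0.0, -2.0):
    h = 1e-5  # central finite difference
    grad = (expected_side(delta + h, beta) - expected_side(delta - h, beta)) / (2 * h)
    print(f"delta={delta:+.1f}  E[side]={expected_side(delta, beta):.4f}  grad={grad:.4f}")
# Even at delta = -2 both the value and the gradient are positive:
# the signal that hard boxes and softboxes lack.
```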
Where does the Gumbel model break? The same place boxes do: union and complement. The max of a Gumbel-max and a Gumbel-min is not Gumbel-anything — the distribution family is closed under intersection but not union. And the complement of a Gumbel box is not a Gumbel box. For negation, you still need cones or fuzzy operators. The Gumbel contribution is surgical: it fixes the gradient problem for the operations boxes already support, without pretending to solve the operations they don't.
Query2Box: Answering Logical Queries
With the gradient problem solved, box embeddings can represent containment and train reliably. The next question: can boxes do more than subsumption? Ren et al. [9] introduced Query2Box — the name says it: translate a logical query into a box — applying box embeddings to multi-hop logical queries over knowledge graphs. The idea: represent a query as a box, and rank answer entities by their distance to the query box.
Consider the query: "Where did Canadian Turing Award winners graduate?" This is the running example from Ren et al. [9]. Leskovec uses it in his Stanford CS224W lectures as the canonical multi-hop box query. This decomposes into: start with TuringAward (a point), apply the "Win" relation projection to get a box of Turing winners, separately start with Canada, project via "Citizen" to get a box of Canadians, intersect the two boxes, then project via "Graduate" to get universities. Each relation projection simply translates the box center and adds to the box offset — the box can only grow, never shrink. This makes sense: following a one-to-many relation (like "cities in a country") should expand the answer set.
Note the asymmetry: queries are boxes, but answer entities are still embedded as points. After spending the whole post arguing against points, Query2Box puts entities back as points. The resolution: entities are answer candidates, not concepts. The box is the question, not the thing. The box represents the set of plausible answers; each candidate entity is a point that may or may not land inside it. The scoring function for a candidate answer entity $v$ (embedded as a point) relative to a query box $q$ is:

$$\text{dist}(v, q) = d_{\text{out}}(v, q) + \alpha \cdot d_{\text{in}}(v, q)$$
where $d_{\text{out}}$ is the L1 distance from $v$ to the nearest point on the box boundary (zero if $v$ is inside), $d_{\text{in}}$ is the L1 distance from $v$ to the box center (measuring how deep inside the box $v$ is), and $\alpha \in (0, 1)$ is a hyperparameter (set to $0.2$ in the original paper) that downweights the inside distance.
The asymmetry between $d_{\text{out}}$ and $d_{\text{in}}$ is intentional. Outside entities should be penalized heavily (they are not answers to the query). Inside entities are all plausible answers, but we mildly prefer those closer to the box center.
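A sketch of this distance under the definitions above ($d_{\text{in}}$ measured from the box center to the entity's projection onto the box, and $\alpha = 0.2$ as in the paper; the helper name is mine):

```python
import numpy as np

def query2box_score(v, center, offset, alpha=0.2):
    """Distance from entity point v to a query box (center, offset).

    d_out: L1 distance from v to the nearest point of the box (0 inside).
    d_in:  L1 distance from the center to v's projection onto the box,
           i.e. how deep inside the box v sits, downweighted by alpha.
    """
    q_min, q_max = center - offset, center + offset
    d_out = np.linalg.norm(np.maximum(v - q_max, 0) + np.maximum(q_min - v, 0), ord=1)
    projected = np.minimum(np.maximum(v, q_min), q_max)  # v clipped into the box
    d_in = np.linalg.norm(center - projected, ord=1)
    return d_out + alpha * d_in

center, offset = np.array([5.0, 5.0]), np.array([3.0, 3.0])  # box [2,8] x [2,8]
print(query2box_score(np.array([5.0, 5.0]), center, offset))  # 0.0 (at the center)
print(query2box_score(np.array([7.0, 6.0]), center, offset))  # 0.6 (inside, mild)
print(query2box_score(np.array([12.0, 5.0]), center, offset)) # 4.6 (outside, heavy)
```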
Intersection of query boxes models conjunction: "countries that border France and have population > 50M" corresponds to intersecting two query boxes. Recall that geometric intersection is coordinate-wise max of lower bounds, min of upper bounds. This works when boxes overlap, but after multiple projection steps the boxes may not overlap at all — and an exact empty intersection gives zero volume with zero gradient (the same dead-zone problem from the previous section). Query2Box sidesteps this with a learned intersection operator: an attention mechanism over the input boxes' centers produces the new center, combined with a coordinate-wise minimum of offsets to shrink the box. This is an approximation, not a geometric intersection, but it provides gradient signal in all configurations.
Two operations that boxes cannot handle:
Union. The union of two boxes is generally not a box. Query2Box works around this by transforming queries into disjunctive normal form (DNF) — push all disjunctions to the last step, compute each conjunctive branch as a box, then aggregate scores across branches. DNF means rewriting the query so all ORs are at the outermost level. Each branch is a pure conjunction, representable as a single box. This is sound but adds computational cost proportional to the number of disjuncts.
Negation. The complement of a box in $\mathbb{R}^d$ is an unbounded region that is not a box. Queries like "European countries that do not border France" require a different approach. This limitation motivated two lines of follow-up work: geometric alternatives (cones) and algebraic alternatives (fuzzy logic).
Beyond Boxes: Cones and Negation
Zhang et al. [12] introduced ConE — "cone embeddings" — which represents queries as Cartesian products of two-dimensional cones. The $d$ -dimensional embedding space is split into $d/2$ independent 2D planes, and in each plane the query is an angular sector emanating from the origin, parameterized by an axis direction $\theta$ and an aperture $\phi$ : it covers all directions within angle $\phi$ of $\theta$ . The full query region is the Cartesian product of these $d/2$ sectors.
The key property: within each 2D plane, the complement of an angular sector is another angular sector. If a sector covers the angular range $[\theta - \phi, \theta + \phi]$ , the remaining arc of the circle is centered at $\theta + \pi$ with aperture $\pi - \phi$ . Since each plane's complement is again a sector, the Cartesian product of complements is again a valid ConE representation. Negation stays within the representation class.
The aperture encodes concept generality: wider cones are more general, narrower cones more specific. Containment is angular: cone $A$ contains cone $B$ if $B$ 's angular extent falls entirely within $A$ 's. Intersection of two angular sectors is a contiguous arc (another cone) whenever the sectors overlap in a connected region — which the model's parameter constraints on aperture ensure.
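The complement rule can be verified in a single plane. A sketch (helper names and the angle-wrapping check are mine): away from the shared boundary, every direction lies in exactly one of a sector and its complement:

```python
import numpy as np

def sector_complement(axis, aperture):
    """Complement of the 2D angular sector [axis - aperture, axis + aperture].

    The remaining arc is centered at axis + pi with aperture pi - aperture,
    so negation stays inside the representation class.
    """
    return (axis + np.pi) % (2 * np.pi), np.pi - aperture

def in_sector(angle, axis, aperture):
    diff = (angle - axis + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
    return np.abs(diff) <= aperture

axis, ap = 0.5, 0.4
c_axis, c_ap = sector_complement(axis, ap)

# Sampled directions avoid the two shared boundary points, so each one
# falls in exactly one of the sector and its complement.
for angle in np.linspace(0, 2 * np.pi, 12, endpoint=False):
    assert in_sector(angle, axis, ap) != in_sector(angle, c_axis, c_ap)
print("sector and complement partition the circle")
```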
ConE was the first geometry-based query embedding model to handle all three propositional connectives — conjunction, disjunction, and negation — over multi-hop queries. The cost is that angular containment is a weaker form of "containment" than volumetric box inclusion — cones have no finite volume, so you lose the probabilistic interpretation that makes box embeddings attractive for subsumption. For query answering, where the goal is ranking rather than exact set membership, angular containment works. For ontology completion, where you want $P(\text{Dog} \subseteq \text{Animal}) \approx 1$ as a calibrated score, boxes remain the better tool.
Ren and Leskovec [10] took a different route with BetaE — "Beta embeddings," named after the Beta distribution — which embeds queries as Beta distributions rather than geometric regions. The complement of $\text{Beta}(\alpha, \beta)$ is $\text{Beta}(\beta, \alpha)$ — negation is a parameter swap. Conjunction and disjunction use learned neural networks that map pairs of Beta parameters to a new Beta, keeping the output within the representation class (the product of two Betas is not a Beta, so an exact closed form doesn't exist). BetaE is the probabilistic counterpart to ConE's geometric solution.
A third line of work avoids new geometry entirely. Arakelyan et al. [11] showed that complex queries can be decomposed into atomic link prediction calls and aggregated via t-norm fuzzy logic — requiring no training on complex queries at all. T-norms are binary operations that generalize logical AND to continuous values in $[0, 1]$; common choices include the product $x \cdot y$ and the minimum $\min(x, y)$ . Chen et al. [13] applied this directly to box containment probabilities: t-norms for conjunction, t-conorms (the dual, generalizing OR) for disjunction, $1 - x$ for negation. The geometry handles containment; the algebra handles composition. Because the operators are fixed (not learned), a model trained on simple triples can in principle answer complex queries at test time — though performance depends on how well the embeddings support the assumed t-norm structure.
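A sketch of the fixed-operator aggregation over hypothetical atomic scores (the numbers are invented for illustration; the product t-norm is one of the common choices named above):

```python
# Product t-norm family over containment probabilities in [0, 1].
def t_norm(x, y): return x * y            # conjunction (AND)
def t_conorm(x, y): return x + y - x * y  # disjunction (OR), the product dual
def neg(x): return 1.0 - x                # negation

# Hypothetical atomic scores from a trained box model:
p_borders_france = 0.9
p_population_50m = 0.7

p_and = t_norm(p_borders_france, p_population_50m)           # 0.63
p_or = t_conorm(p_borders_france, p_population_50m)          # 0.97
p_and_not = t_norm(p_borders_france, neg(p_population_50m))  # 0.27
print(p_and, p_or, p_and_not)
```

Because the operators are closed-form, no parameters are trained for composition; the quality of the answer depends entirely on the atomic scores.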
EL⁺⁺ and Ontology Completion
The Description Logic (DL) EL⁺⁺ — a fragment of first-order logic designed for efficient automated reasoning — underpins biomedical ontologies like SNOMED CT (the standard vocabulary for electronic health records), Gene Ontology (a structured vocabulary of gene functions), and GALEN (a medical terminology system). SNOMED CT alone has over 350,000 concepts. You can't inspect this by hand — you need automated reasoning, and embedding methods extend it to plausible inferences. Its language allows concept conjunction ($C \sqcap D$), existential restriction ($\exists r.C$, "things that have an $r$-relationship to some $C$"), and a bottom concept ($\bot$). The key inference is subsumption: given an ontology, determine whether $C \sqsubseteq D$ holds (every instance of $C$ is an instance of $D$).
Classical reasoners (ELK, Snorocket) compute subsumption exactly by rewriting the ontology's axioms (its TBox, the terminological component that defines concept relationships) into a set of normal forms — standardized shapes like $C \sqsubseteq D$ or $C_1 \sqcap C_2 \sqsubseteq D$ that decompose complex axioms into simple building blocks. This is complete but does not generalize — it cannot predict subsumptions that are plausible but not logically entailed by the axioms. Embedding methods can.
Jackermeier et al. [16] developed Box²EL — "dual box embeddings for EL," using separate box pairs for concepts and roles — which embeds EL⁺⁺ concepts as boxes and roles as affine transformations (a linear map plus a translation, mapping one box to another by shifting its center and rescaling its half-widths). The four normal forms of EL⁺⁺ each get a geometric loss:
NF1: $C \sqsubseteq D$. The box for $C$ should be inside the box for $D$. In center-offset parameterization (storing a box as its midpoint $c$ and half-widths $o$ instead of min/max corners), the boundary extends from $c - o$ to $c + o$. The quantity $|c_C - c_D| + o_C - o_D$ measures how far $C$'s boundary protrudes past $D$'s boundary along each coordinate — zero means $C$ fits inside $D$, positive means it sticks out. This is the protrusion on the worse of the two sides (left or right boundary), not the total: if $C$ sticks out equally on both sides by $\delta/2$, the formula gives $\delta$, same as one side sticking out by $\delta$. The loss penalizes protrusion:

$$\mathcal{L}_{C \sqsubseteq D} = \sum_{i=1}^{d} \max\!\left(0,\; |c_{C,i} - c_{D,i}| + o_{C,i} - o_{D,i} + \epsilon\right)$$

where $\epsilon$ is a margin. This is zero when $C$ is inside $D$ with room to spare, and grows as $C$ protrudes.

NF2: $C_1 \sqcap C_2 \sqsubseteq D$. The intersection of the boxes for $C_1$ and $C_2$ should be contained in the box for $D$. Loss: penalize the volume of $(C_1 \cap C_2) \setminus D$.
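The protrusion penalty can be sketched directly in center/half-width form (my reading of the loss described above, not Box²EL's exact implementation):

```python
import numpy as np

def subsumption_loss(c_sub, o_sub, c_sup, o_sup, eps=0.05):
    """Hinge loss for C ⊑ D in center/half-width form.

    Per coordinate, |c_C - c_D| + o_C - o_D measures how far C's boundary
    protrudes past D's; the margin eps demands slack before the loss is zero.
    """
    protrusion = np.abs(c_sub - c_sup) + o_sub - o_sup
    return np.sum(np.maximum(protrusion + eps, 0.0))

# Dog = [2,5] x [3,7] inside Animal = [0,10] x [0,10], from the worked example.
c_dog, o_dog = np.array([3.5, 5.0]), np.array([1.5, 2.0])
c_ani, o_ani = np.array([5.0, 5.0]), np.array([5.0, 5.0])

print(subsumption_loss(c_dog, o_dog, c_ani, o_ani))  # 0.0 (contained, with slack)
print(subsumption_loss(c_ani, o_ani, c_dog, o_dog))  # large (Animal not inside Dog)
```

The asymmetry of the containment semantics falls out automatically: the same pair of boxes gives zero loss in one direction and a large loss in the other.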
NF3: $C \sqsubseteq \exists r.D$ . Every instance of $C$ has an $r$ -relationship to some instance of $D$ . Geometrically: applying the role transformation $r$ to the box for $C$ should land inside the box for $D$ .
NF4: $\exists r.C \sqsubseteq D$ . Everything that has an $r$ -relationship to a $C$ is a $D$ . Geometrically: the pre-image of the box for $C$ under the role transformation $r$ should be contained in the box for $D$ .
Trained on the GALEN medical ontology (24,353 concepts, 951 roles), Box²EL predicts subsumptions that are plausible but not explicitly stated in the ontology — a form of ontology completion that classical reasoners cannot do. On GALEN, box embeddings substantially outperform point-based methods (TransE, ELEm) because the geometry directly encodes the containment structure of the ontology.
Volume correlates with generality
Trained box embeddings recover a property the loss never explicitly optimizes: log-volume correlates with hierarchy position. After training on Gene Ontology, concepts near the root ("biological process," "molecular function") have large boxes, while leaf concepts ("mitotic spindle assembly checkpoint signaling") have small ones. The Spearman correlation between log-volume and depth is strongly negative.
This is not enforced by an explicit loss term — it emerges from the containment constraints alone. If $C \sqsubseteq D$ , then $\text{Vol}(C) \le \text{Vol}(D)$ (a subset can't be larger than its superset). Training on thousands of such constraints pushes the model toward a volume assignment that reflects the concept hierarchy. (You could add an explicit volume-depth loss term, but it would be redundant: containment constraints already imply it.)
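A toy check of the volume inequality, using the min/max-corner notation $m, M$ from the background section. The boxes are hand-picked stand-ins for a root and a leaf concept, not trained embeddings:

```python
import numpy as np

def box_volume(m, M):
    """Volume of an axis-aligned box with min corner m, max corner M."""
    return float(np.prod(M - m))

def box_contains(m_out, M_out, m_in, M_in):
    """True if [m_in, M_in] lies inside [m_out, M_out] on every axis."""
    return bool(np.all(m_out <= m_in) and np.all(M_in <= M_out))

# Hypothetical nested boxes: a broad root concept and a narrow leaf.
m_D, M_D = np.zeros(3), np.ones(3)           # root-like box
m_C, M_C = np.full(3, 0.4), np.full(3, 0.6)  # leaf box inside it
assert box_contains(m_D, M_D, m_C, M_C)
assert box_volume(m_C, M_C) <= box_volume(m_D, M_D)  # forced by containment
```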
| Year | Model | Geometry | Key contribution |
|---|---|---|---|
| 2009 | Regions (Erk) | Convex regions | First proposal of words as regions, not points |
| 2013 | TransE (Bordes) | Point + translation | Baseline KGE; simple, scalable, limited |
| 2016 | Order Emb. (Vendrov) | Orthant cones | Partial order from coordinate comparison |
| 2018 | Box Lattice (Vilnis) | Axis-aligned boxes | Boxes as lattice; containment probability |
| 2018 | Ent. Cones (Ganea) | Hyperbolic cones | Hierarchy in hyperbolic space |
| 2019 | Smoothing (Li) | Smoothed boxes | Gaussian smoothing fixes zero gradients |
| 2019 | RotatE (Sun) | Point + rotation | Complex-space rotation handles symmetry |
| 2020 | Gumbel Box (Dasgupta) | Probabilistic boxes | Min-max stable smoothing preserves lattice |
| 2020 | Query2Box (Ren) | Boxes for queries | Multi-hop query answering with boxes |
| 2020 | BetaE (Ren) | Beta distributions | Negation via parameter swap |
| 2021 | ConE (Zhang) | 2D angular sectors | Negation via complement sectors |
| 2021 | CQD (Arakelyan) | Any + fuzzy logic | Decompose queries into atomic predictions |
| 2022 | BoxEL (Xiong) | Boxes for EL⁺⁺ | First faithful ontology box embedding |
| 2024 | Box²EL (Jackermeier) | Dual boxes for EL⁺⁺ | Affine role transformations |
| 2024 | Octagons (Charpenay) | Octagonal regions | Diagonal constraints capture cross-dim correlations |
| 2025 | TransBox (Yang) | Boxes for EL⁺⁺ | EL⁺⁺-closed: logical operations stay in representation |
| 2026 | TaxoBell (Mishra) | Gaussian boxes | Self-supervised taxonomy expansion |
After 2018, the box lattice foundation split into two threads: the gradient fix (Gumbel, 2020) and multi-hop query answering (Query2Box, BetaE, ConE, CQD, 2020–2021). Ontology applications followed (BoxEL, Box²EL, TransBox, 2022–2025).
The Landscape
| Geometry | Containment | Intersection closed? | Negation? | Parameters |
|---|---|---|---|---|
| Boxes | Volumetric | Yes | No | $2d$ |
| Gumbel boxes | Soft volumetric | Yes | No | $2d + \beta$ |
| Cones | Angular | Yes | Yes | $2d$ (2D sectors) |
| Octagons | Volumetric + diagonal | Yes | No | $O(d^2)$ |
| Gaussians | KL divergence | No | No | $2d$ |
| Entailment cones (hyperbolic) | Geodesic | Yes | No | $2d$ |
| Subspaces | Subspace inclusion | Yes | No | $O(d \cdot k)$ |
No single geometry dominates. In my experience implementing several of these in subsume, the right choice depends on the task. For ontology completion (subsumption only, no negation), boxes or Gumbel boxes suffice. For queries requiring negation and disjunction, cones or the fuzzy approach are needed. For taxonomy expansion with uncertainty, Gaussians (Mishra et al. [21]) encode uncertainty through variance parameters. Entailment cones in hyperbolic space (Ganea et al. [5]) — a non-Euclidean geometry where volume grows exponentially with radius, naturally accommodating trees with high branching factor — exploit this growth for tree-like hierarchies. Octagons (Charpenay and Schockaert [15]) add diagonal constraints $x_i \pm x_j \le c$ to capture correlations between dimensions that axis-aligned boxes miss, at the cost of $O(d^2)$ parameters per concept. Most recently, Moreira et al. [19] proposed linear subspaces as the geometric primitive: more general concepts get higher-dimensional subspaces, and subsumption is subspace inclusion. This is closed under intersection (subspace intersection is a subspace) but trades the intuitive "volume" interpretation for algebraic structure.
Choose your geometry before your optimizer. The shape of the embedding space determines what relationships the model can even express.
Open Problems
Faithfulness. Box embeddings can approximate but may not perfectly represent the logical structure of an ontology. Xiong et al. [14] proved soundness for BoxEL — if the model scores a subsumption as true (zero loss), then it actually holds in the ontology. The complementary property, completeness (every true subsumption scores high), is not claimed. TransBox (Yang et al. [20]) — "translation + box," combining TransE-style relation modeling with box containment — goes further, achieving "EL⁺⁺-closure": intersections, role-filler constraints, and TBox axioms all stay within the box representation.
Leemhuis and Kutz [18] formalized what box semantics can and cannot represent, characterizing the expressiveness limits of box-based embeddings for EL⁺⁺ and extensions. A pragmatic alternative: DELE (Mashkova et al. [17]) sidesteps geometric faithfulness entirely by precomputing the deductive closure — the complete set of subsumptions that logically follow from the ontology's axioms — with a classical reasoner, then training the embedding to reproduce it. No general impossibility result rules out faithful box embeddings, and no paper has achieved strong faithfulness (entailed axioms score high AND non-entailed axioms score low). Whether the gap matters in practice is unsettled.
Beyond axis-alignment. Axis-aligned boxes assume that dimensions are independent — the containment of Dog within Animal decomposes into $d$ independent interval containments. For concept hierarchies with correlated features, this is restrictive. Octagons partially address this, but only for pairs of dimensions. Full covariance (oriented boxes, ellipsoids) would be more expressive, but the intersection of oriented boxes is not an oriented box. Finding the right trade-off between expressiveness and closure is open.
Composition of geometries. Each geometry handles some operations well and others poorly. Can different geometries be composed? For instance, use boxes for conjunction and cones for negation within the same query. The challenge is defining a consistent scoring function across heterogeneous representations. This is related to the broader question of whether there exists a single continuous geometry of polynomial dimension that is closed under intersection, union, complement, and projection — a universal query embedding. No such geometry is currently known for continuous embeddings at practical dimensionality.
Calibration. The containment probability $P(B \subseteq A) = \text{Vol}(A \cap B) / \text{Vol}(B)$ is a geometric quantity, not a calibrated probability. A value of 0.7 does not mean "70% chance of subsumption" — it means "70% of $B$ 's volume overlaps with $A$ ." This matters for downstream decisions: if you threshold at 0.5, are you accepting everything with more than 50% overlap, or everything with more than 50% probability of being true? The model doesn't know. Whether these geometric scores can be calibrated to meaningful probabilities, and whether that calibration generalizes across domains, is unexplored.
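A minimal computation of the hard-box containment probability makes the overlap-vs-probability distinction concrete. Illustrative code, not taken from subsume; boxes are given as the (min, max) corners $m, M$ from the notation section:

```python
import numpy as np

def containment_prob(m_a, M_a, m_b, M_b):
    """Geometric 'probability' P(B subset A) = Vol(A ∩ B) / Vol(B) for
    hard boxes with (min, max) corners. Returns an overlap fraction,
    not a calibrated probability."""
    # Per-axis intersection lengths, clamped at 0 when boxes are disjoint.
    inter = np.maximum(0.0, np.minimum(M_a, M_b) - np.maximum(m_a, m_b))
    return float(np.prod(inter) / np.prod(M_b - m_b))

# B = [0,1]^2 with its right half inside A = [0.5,1.5] x [0,1]:
p = containment_prob(np.array([0.5, 0.0]), np.array([1.5, 1.0]),
                     np.array([0.0, 0.0]), np.array([1.0, 1.0]))
assert abs(p - 0.5) < 1e-9  # "50% of B's volume", not "50% chance of B ⊑ A"
```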
Density matrices. Quantum-inspired representations model concepts as positive semidefinite matrices (matrices with all eigenvalues $\ge 0$ , used in quantum mechanics to represent probabilistic mixtures of states) on a Hilbert space. One natural analogue of containment is the Loewner order: model $A \sqsubseteq B$ by requiring $B - A$ to be positive semidefinite (all eigenvalues of the difference are $\ge 0$ ). This is richer than box containment — it captures arbitrary subspace relationships, not just axis-aligned intervals. The computational cost is $O(d^2)$ per concept, the intersection structure is more complex, and the connection to DL subsumption semantics is an analogy rather than a derivation. As of 2026, quantum KGE implementations exist but use quantum circuits for scoring, not density matrices for concept representation. The theoretical proposal remains unimplemented.
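The Loewner-order check itself is a one-liner over eigenvalues. A sketch under the analogy stated above; the matrices are arbitrary examples, and this is not from any published implementation:

```python
import numpy as np

def loewner_leq(A, B, tol=1e-9):
    """A ⊑ B in the Loewner order: B - A is positive semidefinite,
    i.e. all eigenvalues of the symmetric difference are >= 0.
    Sketch of the density-matrix analogue of containment."""
    return bool(np.all(np.linalg.eigvalsh(B - A) >= -tol))

# A scaled-down copy of a PSD matrix sits below it in the Loewner order:
B = np.array([[2.0, 0.5],
              [0.5, 1.0]])
A = 0.5 * B
assert loewner_leq(A, B)
assert not loewner_leq(B, A)  # the order is not symmetric
```

Note the $O(d^2)$ storage and the eigendecomposition cost per comparison, which is where the "$O(d^2)$ per concept" figure above comes from.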
Evaluation methodology. A pervasive but underacknowledged flaw: most ontology embedding benchmarks treat logically entailed axioms as negatives during training and evaluation. DELE [17] showed this is broken — the model is penalized for predicting things that are provably true. The deductive closure must be computed first and filtered from the negative set. More broadly, no standardized benchmark suite exists for ontology embeddings; different papers use incompatible datasets and protocols, making cross-paper comparison unreliable.
Geometric vs. neural. For multi-hop query answering, the geometric approach (Query2Box, ConE, BetaE) is being superseded. Newer geometric approaches like cylinder embeddings and fully geometric methods — where all logical operations are geometric transformations, not learned neural operators — show that the "geometric" label applied to Query2Box was partially misleading: its intersection operator was a neural network. GNN-based query answering substantially outperforms all purely geometric methods on standard benchmarks. The niche where geometric methods retain a defensible advantage is ontology-specific settings with explicit DL semantics: GNN foundation models don't enforce logical closure, but TransBox and DELE do.
References
[1] Erk, K. (2009). "Representing Words as Regions in Vector Space." CoNLL, 57--65.
[2] Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J. & Yakhnenko, O. (2013). "Translating Embeddings for Modeling Multi-relational Data." NeurIPS, 2787--2795.
[3] Vendrov, I., Kiros, R., Fidler, S. & Urtasun, R. (2016). "Order-Embeddings of Images and Language." ICLR.
[4] Vilnis, L., Li, X., Murty, S. & McCallum, A. (2018). "Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures." ACL, 263--272.
[5] Ganea, O.-E., Becigneul, G. & Hofmann, T. (2018). "Hyperbolic Entailment Cones for Learning Hierarchical Embeddings." ICML.
[6] Li, X. L., Vilnis, L., Zhang, D., Boratko, M. & McCallum, A. (2019). "Smoothing the Geometry of Probabilistic Box Embeddings." ICLR.
[7] Sun, Z., Deng, Z.-H., Nie, J.-Y. & Tang, J. (2019). "RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space." ICLR.
[8] Dasgupta, S. S., Boratko, M., Zhang, D., Vilnis, L., Li, X. L. & McCallum, A. (2020). "Improving Local Identifiability in Probabilistic Box Embeddings." NeurIPS.
[9] Ren, H., Hu, W. & Leskovec, J. (2020). "Query2Box: Reasoning over Knowledge Graphs in Vector Space using Box Embeddings." ICLR.
[10] Ren, H. & Leskovec, J. (2020). "Beta Embeddings for Multi-Hop Logical Reasoning in Knowledge Graphs." NeurIPS.
[11] Arakelyan, E., Daza, D., Minervini, P. & Cochez, M. (2021). "Complex Query Answering with Neural Link Predictors." ICLR (Outstanding Paper).
[12] Zhang, Z., Wang, J., Chen, J., Ji, S. & Wu, F. (2021). "ConE: Cone Embeddings for Multi-Hop Reasoning over Knowledge Graphs." NeurIPS.
[13] Chen, X., Hu, Z. & Sun, Y. (2022). "Fuzzy Logic Based Logical Query Answering on Knowledge Graphs." AAAI.
[14] Xiong, B., Potyka, N., Tran, T.-K., Nayyeri, M. & Staab, S. (2022). "Faithful Embeddings for EL⁺⁺ Knowledge Bases." ECML-PKDD.
[15] Charpenay, V. & Schockaert, S. (2024). "Embedding Ontologies with Octagons." IJCAI.
[16] Jackermeier, M., Chen, J. & Horrocks, I. (2024). "Dual Box Embeddings for the Description Logic EL⁺⁺." WWW '24.
[17] Mashkova, O., Zhapa-Camacho, F. & Hoehndorf, R. (2024). "DELE: Deductive EL++ Embeddings for Knowledge Base Completion." arXiv:2411.01574.
[18] Leemhuis, M. & Kutz, O. (2025). "Understanding the Expressive Capabilities of Knowledge Base Embeddings under Box Semantics." KR '25.
[19] Moreira, G., Marinho, Z., Marques, M., Costeira, J. P. & Xiong, B. (2025). "Native Logical and Hierarchical Representations with Subspace Embeddings." arXiv:2508.16687.
[20] Yang, H., Chen, J. & Sattler, U. (2025). "TransBox: EL⁺⁺-closed Ontology Embedding." WWW '25.
[21] Mishra, S. et al. (2026). "TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion." arXiv:2601.09633.