Compressing a Set of IDs to the Combinatorial Minimum
March 20, 2026
You have a set of $n$ document IDs sampled from a universe of $N$ . How small can the encoding be?
Not a list. Not a sequence. A set: the IDs are unique, order doesn't matter, and you want to reconstruct exactly which $n$ values you had. The minimum possible size is

$$\left\lceil \log_2 \binom{N}{n} \right\rceil \text{ bits.}$$
That bound is tight -- there are exactly $\binom{N}{n}$ possible sets, so any lossless scheme needs at least that many distinguishable codewords. The question is how close you can get in practice.
For $n = 1000$ IDs from a universe of $N = 10^6$ , the floor is $11{,}401$ bits (11.40 bits/ID). A naive fixed-width scheme costs 20 bits/ID. A bitmap costs 1,000,000 bits regardless of $n$ . Getting to within a few percent of the floor requires something more deliberate.
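The floor is easy to compute without big-integer arithmetic by summing the logs of the binomial coefficient's factors. A quick sketch (`floor_bits` is an illustrative helper, not part of the crate):

```rust
// Compute log2(C(N, n)) via C(N, n) = prod_{k=0}^{n-1} (N - k) / (k + 1),
// summing logs so the intermediate values never overflow.
fn floor_bits(universe: u64, n: u64) -> f64 {
    (0..n)
        .map(|k| ((universe - k) as f64).log2() - ((k + 1) as f64).log2())
        .sum()
}

fn main() {
    let (universe, n) = (1_000_000u64, 1000u64);
    let floor = floor_bits(universe, n);
    // ~11,401 bits total, ~11.40 bits/ID, versus 20 bits/ID fixed-width
    println!("floor: {:.0} bits ({:.2} bits/ID)", floor, floor / n as f64);
    assert!((floor - 11_401.0).abs() < 2.0);
}
```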
Naive approaches
Fixed-width encoding. The simplest scheme: represent each ID as $\lceil \log_2 N \rceil$ bits. For $N = 10^6$ that's 20 bits per ID, packed into a byte array. Straightforward, random-accessible, and consistently about 40--75% above the floor for typical sparsities.
Bitmap. Allocate $N$ bits and flip bit $i$ for each ID $i$ in the set. Decompression is a popcount scan. The bitmap is exactly $N$ bits regardless of $n$ , which makes it the best choice when $n/N \approx 0.5$ (the floor is also $\approx N$ bits there) and catastrophically bad when $n \ll N$ . At $n = 1000$ , $N = 10^6$ , the bitmap is $87\times$ the floor.
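A minimal sketch of the bitmap roundtrip (illustrative, using `u64` words and `trailing_zeros` for the popcount-style scan):

```rust
// Set bit i for each ID i; the word array is always ceil(N/64) words.
fn encode(ids: &[u32], universe: usize) -> Vec<u64> {
    let mut words = vec![0u64; (universe + 63) / 64];
    for &id in ids {
        words[id as usize / 64] |= 1u64 << (id % 64);
    }
    words
}

// Decode by scanning words and extracting set bits lowest-first.
fn decode(words: &[u64]) -> Vec<u32> {
    let mut ids = Vec::new();
    for (w_idx, &word) in words.iter().enumerate() {
        let mut w = word;
        while w != 0 {
            ids.push(w_idx as u32 * 64 + w.trailing_zeros());
            w &= w - 1; // clear the lowest set bit
        }
    }
    ids
}

fn main() {
    let ids = vec![3u32, 64, 100, 999_999];
    let words = encode(&ids, 1_000_000);
    assert_eq!(decode(&words), ids); // roundtrip, in sorted order
}
```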
Delta coding with varint. Sort the IDs, then encode the first ID followed by the gaps between consecutive IDs. A variable-length integer (varint) represents each gap: gaps below 128 take 1 byte, gaps below 16,384 take 2 bytes, and so on -- 7 usable bits per byte with the high bit as a continuation flag.
For a uniform random sample with $N = 10^6$ , $n = 1000$ , the average gap is $N/n = 1000$ , which needs 2 varint bytes (16 bits). Total: 16,000 bits -- 40% above the 11,401-bit floor. When gaps are small (dense, clustered sets), delta+varint gets close to 1 byte/ID. When gaps are large (sparse, uniform), it stays at 2 bytes/ID.
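The gap/varint mechanics are simple enough to sketch in full. An illustrative standalone version (not the crate's `RocCompressor`):

```rust
// Delta + varint: sort, take gaps, emit 7 payload bits per byte
// with the high bit as a continuation flag (LEB128-style).
fn compress(ids: &[u32]) -> Vec<u8> {
    let mut sorted = ids.to_vec();
    sorted.sort_unstable();
    let (mut out, mut prev) = (Vec::new(), 0u32);
    for &id in &sorted {
        let mut gap = id - prev; // first gap is the first ID itself
        prev = id;
        loop {
            if gap < 0x80 { out.push(gap as u8); break; }
            out.push((gap & 0x7f) as u8 | 0x80);
            gap >>= 7;
        }
    }
    out
}

fn decompress(bytes: &[u8]) -> Vec<u32> {
    let (mut ids, mut prev) = (Vec::new(), 0u32);
    let (mut acc, mut shift) = (0u32, 0u32);
    for &b in bytes {
        acc |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 != 0 { shift += 7; continue; }
        prev += acc;
        ids.push(prev);
        acc = 0; shift = 0;
    }
    ids
}

fn main() {
    let ids = vec![1u32, 5, 10, 200, 50_000];
    let bytes = compress(&ids);
    // gaps 1, 4, 5, 190, 49_800 -> varint sizes 1+1+1+2+3 = 8 bytes
    assert_eq!(bytes.len(), 8);
    assert_eq!(decompress(&bytes), ids);
}
```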
Here is the comparison for a uniform random sample from $N = 10^6$ :
| $n$ | Floor (bits/ID) | Fixed-width | Delta+varint | Elias-Fano |
|---|---|---|---|---|
| 100 | 14.68 | 20.00 | 16 | 15.23 |
| 1,000 | 11.40 | 20.00 | 16 | 11.95 |
| 10,000 | 8.08 | 20.00 | 8 | 8.56 |
| 100,000 | 4.69 | 20.00 | 8 | 5.25 |
| 500,000 | 2.00 | 20.00 | 8 | 3.00 |
Delta+varint is near-optimal when the varint byte boundary happens to land close to the floor (the $n = 10{,}000$ row), but has no mechanism to close the gap otherwise. Fixed-width is uniformly bad. Elias-Fano is consistently within a small factor of the floor.
The information-theoretic floor
Where does $\log_2 \binom{N}{n}$ come from?
A set of $n$ items from $[N]$ has $\binom{N}{n}$ possible values. Any lossless encoding is a bijection between sets and codewords. A bijection needs at least $\lceil \log_2 \binom{N}{n} \rceil$ bits.
There is a cleaner way to see this. The floor satisfies

$$\log_2 \binom{N}{n} \approx N \cdot H\!\left(\frac{n}{N}\right), \qquad H(p) = -p \log_2 p - (1-p) \log_2(1-p),$$

the binary entropy of the density $n/N$ (exact up to an $O(\log N)$ term). For $n \ll N$ , Stirling's approximation gives the leading term:

$$\log_2 \binom{N}{n} \approx n \log_2 \frac{N}{n} + n \log_2 e.$$
This says the cost per ID is roughly $\log_2(N/n)$ bits -- the log of how many IDs you could have picked per slot. At $n/N = 0.001$ , that's about $\log_2(1000) \approx 10$ bits/ID. At $n/N = 0.01$ , it's about 6.6 bits/ID. The floor drops as density rises: at $n = N/2$ it is $N$ bits total, or 2 bits/ID, which is where a bitmap is competitive, and it falls to 0 at $n = N$ , where there is exactly one possible set (all of them) and nothing needs to be transmitted.
Notice what the floor does not include: any bits for ordering. A set has no distinguished ordering, so you should not pay for one. The contrast with sequence encoding is stark: encoding an ordered sequence of $n$ distinct items from $[N]$ costs $\log_2 \frac{N!}{(N-n)!}$ bits -- larger by $\log_2 n!$ . For $n = 1000$ , that difference is 8,529 bits, or 8.53 bits/ID. That is the cost of treating a set as if it were a sequence.
Delta coding pays part of that cost implicitly. By sorting first, you avoid encoding permutations, but you have now committed to a specific representation (sorted order) and your entropy model must respect it. The 40% overhead of delta+varint relative to the floor is partly the varint granularity (byte-aligned, not bit-aligned) and partly the fact that delta coding is an order-1 model that does not exploit global structure.
Elias-Fano: a succinct approach
Elias-Fano (independently developed by Peter Elias and Robert Fano in the early 1970s) is a succinct data structure for monotone integer sequences that supports random access and efficient iteration.
The idea: choose a splitting parameter $\ell = \lfloor \log_2(N/n) \rfloor$ . Split each ID into an $\ell$ -bit lower part and an upper part. Store the lower parts verbatim ( $n \cdot \ell$ bits total, packed). Store the upper parts as a unary-coded gap sequence, using a bitvector of length $n + \lfloor N / 2^\ell \rfloor + 1$ bits. Total:

$$n \cdot \ell + n + \left\lfloor \frac{N}{2^\ell} \right\rfloor + 1 \;\leq\; n \left\lceil \log_2 \frac{N}{n} \right\rceil + 2n + O(1) \text{ bits.}$$
The $2n$ additive term is the per-element constant that separates Elias-Fano from the floor. For the $n = 1000$ , $N = 10^6$ case, EF costs 11,954 bits versus the floor of 11,401 -- overhead of 4.8%.
As density increases, the overhead grows. At $n/N = 0.5$ , EF costs $3n$ bits total while the floor is $2n$ bits -- a 50% overhead. The structure is not designed for dense sets; bitmap wins there.
The key property of Elias-Fano that no entropy coder can match: $O(1)$ random access to any element, and efficient next_geq(x) queries (find the smallest ID $\geq x$ ) in $O(\log n / \log \log n)$ . For posting list traversal, this matters more than the last 5% of compression.
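The split-and-access mechanics can be sketched as a toy implementation (illustrative only, not the sbits crate: real implementations pack the lower bits and use a select structure for $O(1)$ access, where this sketch scans):

```rust
// Toy Elias-Fano: ell low bits per ID (stored unpacked for clarity),
// high parts as a unary-coded bitvector with n ones.
struct EliasFano {
    ell: u32,
    low: Vec<u64>,    // n entries of ell bits each
    upper: Vec<bool>, // the i-th one sits at position (id_i >> ell) + i
}

fn build(sorted_ids: &[u64], universe: u64) -> EliasFano {
    let n = sorted_ids.len() as u64;
    let ell = (universe / n).max(1).ilog2();
    let mut upper = vec![false; (n + (universe >> ell) + 1) as usize];
    let mut low = Vec::with_capacity(sorted_ids.len());
    for (i, &id) in sorted_ids.iter().enumerate() {
        low.push(id & ((1u64 << ell) - 1));
        upper[(id >> ell) as usize + i] = true;
    }
    EliasFano { ell, low, upper }
}

// Access element i: find the i-th one in `upper`; its position minus i
// recovers the high part. A real select structure makes this O(1).
fn get(ef: &EliasFano, i: usize) -> u64 {
    let mut ones = 0usize;
    for (pos, &bit) in ef.upper.iter().enumerate() {
        if bit {
            if ones == i {
                return (((pos - i) as u64) << ef.ell) | ef.low[i];
            }
            ones += 1;
        }
    }
    unreachable!("index out of range");
}

fn main() {
    let ids = vec![3u64, 4, 7, 13, 14, 15, 21, 43];
    let ef = build(&ids, 64);
    for (i, &id) in ids.iter().enumerate() {
        assert_eq!(get(&ef, i), id);
    }
}
```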
Partitioned Elias-Fano (Ottaviano & Venturini, SIGIR 2014) extends this to clustered lists: divide the IDs into blocks, encode each block with its own local Elias-Fano structure. This reduces the effective universe per block, which tightens the per-element cost when IDs cluster spatially -- common in ANN graph structures where nearby vectors share neighbors.
ROC: treating permutation as a latent variable
If the goal is to get as close as possible to $\log_2 \binom{N}{n}$ bits, the right framework is bits-back coding (also known as BB-ANS, after Townsend et al. 2019), applied to sets by Severo et al. in Compressing multisets with large alphabets (2022) and extended to vector ID sets in their 2025 follow-up.
The key insight: the $n!$ orderings of a set are a latent variable. You do not need to transmit which ordering you chose -- you just need to use the randomness in the rANS decoder state to select an ordering for free, and then encode as if you were encoding that ordering under a model that makes the ordering cheap.
Here is the mechanism more concretely.
rANS recap. Asymmetric numeral systems (ANS), in its range variant (rANS), is an entropy coder that maintains a state integer $x$ . Encoding a symbol $s$ with probability $p_s$ over model $T$ replaces $x$ with a new state that carries $-\log_2 p_s$ more bits of information. Decoding extracts the symbol and restores the lower-entropy state. Crucially, the decoder can peek at the current state slot to determine which symbol it encodes, and can advance after an external decision.
Bits-back. The bits-back trick works as follows. Suppose you want to encode a set $S = \{s_1, \ldots, s_n\}$ and you have $\log_2 n!$ bits of "free randomness" available in the rANS decoder state (which you are going to be sending anyway). Use those bits to decode a permutation $\sigma$ of $[n]$ under a uniform model. Now encode the permuted sequence $s_{\sigma(1)}, s_{\sigma(2)}, \ldots, s_{\sigma(n)}$ using a sequential model that assigns equal probability to each of the $N - k$ remaining IDs at step $k$ . Encoding the $k$ -th chosen ID therefore costs $\log_2(N - k)$ bits, and summing over $k = 0, \ldots, n-1$ gives $\sum_{k=0}^{n-1} \log_2(N-k) = \log_2(N!/(N-n)!)$ bits for the sequence -- exactly the sequence entropy. But those bits include the permutation entropy $\log_2 n!$ , which you extracted for free. Net cost: $\log_2(N!/(N-n)!) - \log_2(n!) = \log_2\binom{N}{n}$ .
The net cost: you encoded a permutation for free (you decoded it from the ANS state, so decoding will re-encode it, breaking even), and you encoded the sequence under the ordered model at cost $\log_2(N!/(N-n)!)$ . Subtracting the $\log_2 n!$ bits you got back from the permutation leaves exactly $\log_2 \binom{N}{n}$ bits. The bound is achieved.
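The accounting can be checked numerically. A quick sketch (illustrative, not crate code):

```rust
// Verify: sequence cost minus permutation bits equals the set floor,
// log2(N!/(N-n)!) - log2(n!) = log2(C(N, n)), for N = 10^6, n = 1000.
fn main() {
    let (big_n, n) = (1_000_000u64, 1000u64);
    // log2(N!/(N-n)!) = sum over k of log2(N - k)
    let seq_bits: f64 = (0..n).map(|k| ((big_n - k) as f64).log2()).sum();
    // log2(n!) = sum over k of log2(k)
    let perm_bits: f64 = (1..=n).map(|k| (k as f64).log2()).sum();
    // log2(C(N, n)) computed directly from the factor ratios
    let set_bits: f64 = (0..n)
        .map(|k| ((big_n - k) as f64).log2() - ((k + 1) as f64).log2())
        .sum();
    assert!((seq_bits - perm_bits - set_bits).abs() < 1e-6);
    // matches the 8,529-bit permutation overhead quoted in the text
    assert!((perm_bits - 8529.0).abs() < 1.0);
}
```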
In practice, rANS uses a quantized frequency table with total mass $T = 2^b$ for some precision $b$ (typically 12--16). The quantization introduces overhead of roughly $O(\log T / T)$ bits per symbol -- a few millibits at $b = 12$. So the realized cost is $\log_2 \binom{N}{n} + \epsilon$ bits where $\epsilon$ is a small constant overhead.
The ans crate provides the rANS primitives needed for this: RansEncoder/RansDecoder for symbol-at-a-time control, RansDecoder::peek to inspect the state before advancing, and RansDecoder::advance to step forward after the bits-back decision. The streaming API is required because the bits-back construction interleaves encoding and decoding of different symbols.
use ans::{RansEncoder, RansDecoder, FrequencyTable};
// Build a model. For ROC's inner loop, the model at step k
// assigns equal weight to the N-k remaining IDs.
let table = FrequencyTable::from_counts(&[3, 7], 12)?;
// Encode in reverse order (rANS requirement).
let mut enc = RansEncoder::new();
for &sym in message.iter().rev() {
enc.put(sym, &table)?;
}
let bytes = enc.finish();
// Decode, using peek+advance for bits-back interleaving.
let mut dec = RansDecoder::new(&bytes)?;
let sym = dec.peek(&table); // inspect without consuming
dec.advance(sym, &table)?; // step forward after decision
The full ROC construction requires a slightly more involved loop that builds the sequential model adaptively as elements are selected, but the core rANS primitives are the same.
When each method wins
The choice depends on three variables: density $n/N$ , whether you need random access, and how much implementation complexity you can afford.
Use a bitmap when $n/N > 0.3$ or so. At half density, the bitmap is within a fraction of a bit of the floor, supports $O(1)$ lookup by ID, and has trivial implementation. At 10% density, the bitmap is $2\times$ above the floor; Elias-Fano wins. At 1% density, the bitmap is $12\times$ above the floor.
Use Elias-Fano when you need random access or sequential traversal with next_geq, and density is low-to-medium ( $n/N < 0.1$ or so). EF stays within 5--12% of the floor across that range, supports $O(1)$ random access, and is well-studied with mature implementations. Partitioned EF reduces overhead further for spatially clustered lists.
Use ROC when you want to get as close to the floor as possible and do not need random access. ROC is a stream codec: you decompress the entire set in one pass. For batch workloads -- building an index offline, transmitting a posting list over the wire -- the lack of random access is not a constraint, and the compression savings matter. At $n/N = 10^{-4}$ , the gap between EF (15.23 bits/ID) and the floor (14.68 bits/ID) is 0.55 bits/ID. For $n = 10^6$ IDs, that is 68 KB. Whether that matters depends on your workload.
Use delta+varint when you want a simple, allocation-free baseline without the sbits dependency. It is not close to optimal, but it is predictable: for uniform samples with $N/n \in [128, 16384)$ , it costs 16 bits/ID, which is the 2-byte varint regime. The overhead relative to the floor ranges from near zero (when $N/n \approx 100$ , the 1-byte regime) to $40\%$ (when $N/n \approx 1000$ , the 2-byte regime).
Using the crate
use cnk::{IdSetCompressor, RocCompressor};
// Default: delta+varint baseline
let compressor = RocCompressor::new();
let ids = vec![1u32, 5, 10, 20, 50];
let universe_size = 1000;
let compressed = compressor.compress_set(&ids, universe_size)?;
let recovered = compressor.decompress_set(&compressed, universe_size)?;
assert_eq!(ids, recovered);
With the sbits feature enabled, Elias-Fano is available:
use cnk::{EliasFanoCompressor, IdSetCompressor};
let ef = EliasFanoCompressor::new();
let compressed = ef.compress_set(&ids, universe_size)?;
let recovered = ef.decompress_set(&compressed, universe_size)?;
// supports random access via ef.get(i)
For cases where you have multiple lists with different density profiles, the auto selector picks the best available codec based on per-list statistics:
use cnk::{compress_set_auto, decompress_set_auto, AutoConfig};
let (choice, bytes) = compress_set_auto(&ids, universe_size, AutoConfig::default())?;
// record `choice` alongside `bytes` -- decompression needs it
let recovered = decompress_set_auto(choice, &bytes, universe_size)?;
The returned CodecChoice records which codec was used, so decompression does not depend on build-time configuration matching at the original compression site. This is important when the compressor and decompressor run in different processes or different builds.
Status of full ROC
The current RocCompressor in version 0.1.3 is delta+varint, not bits-back ANS. The name reflects the intended final state, not the present implementation. The ans crate provides the necessary streaming primitives (RansEncoder, RansDecoder, peek, advance), and the connection is wired up in cnk::ans. The full construction, which requires maintaining the adaptive sequential model and interleaving the bits-back permutation extraction, is planned but not yet implemented.
The distinction matters: the current implementation is 40% above the floor for typical posting-list densities. Full ROC would be within 1%. If you need the floor today, Elias-Fano at 4--12% overhead is the practical choice for read-heavy workloads.