You've Seen 100 Species. How Many Exist?

April 6, 2026

In the 1940s, the lepidopterist Alexander Corbet spent two years trapping butterflies in Malaya. He returned with a frequency table: 118 species observed exactly once, 74 observed twice, 44 three times, and so on. He brought the table to Ronald Fisher and asked: if I went back for another two years, how many new species would I find?

Fisher had to estimate a column of the table that didn't exist -- the zero column, the species Corbet never saw. Around the same time, at Bletchley Park, Alan Turing was working on the same problem in a different guise: estimating the probability of encountering a cipher pattern that had never appeared in intercepted traffic.

Both problems have the same structure: a finite sample from a distribution whose support is unknown and possibly much larger than what you've observed. The answer depends on the frequency of frequencies -- not just how many species you saw, but how many you saw exactly once, exactly twice, exactly three times. This vector is the fingerprint of the sample, and it turns out to contain nearly all the information you need.

The fingerprint

Let $X_1, X_2, \ldots, X_n$ be $n$ i.i.d. draws from a discrete distribution $P$ over an unknown support. Define:

$$\phi_j = |\{i : \text{species } i \text{ was observed exactly } j \text{ times}\}|$$

$\phi_1$ is the number of singletons (species seen exactly once). $\phi_2$ is the number of doubletons. The vector $(\phi_1, \phi_2, \ldots)$ is the fingerprint.

The fingerprint of a sample with counts [5, 4, 3, 2, 2, 1, 1, 1]. Three species appear once, two appear twice. The tall left bar -- many singletons relative to n -- signals that rare species dominate and much of the distribution is unseen.

For estimating symmetric properties of a distribution -- properties that depend only on the multiset of probabilities, not on which symbol has which probability -- the fingerprint is a sufficient statistic. Any estimator based on the full sample can be matched (up to small error) by one using only the fingerprint.(1) A practical consequence: you can share a fingerprint without revealing which specific symbols were observed, and any downstream analysis of symmetric properties loses essentially nothing.

$\phi_1$ is large when there are many rare species -- species you barely caught. If $\phi_1$ is large relative to $n$ , you've probably missed a lot.

The code examples below use fingerprints, a Rust crate named after this data structure.

use fingerprints::Fingerprint;

let counts = [5usize, 4, 3, 2, 2, 1, 1, 1];
let fp = Fingerprint::from_counts(counts).unwrap();

assert_eq!(fp.sample_size(), 19);      // n = 5+4+3+2+2+1+1+1
assert_eq!(fp.observed_support(), 8);   // S_obs
assert_eq!(fp.singletons(), 3);         // phi_1
assert_eq!(fp.doubletons(), 2);         // phi_2

Good-Turing: the probability of the unseen

Good (1953) gave an answer to "what fraction of the next observation will be a species we haven't seen yet?"(2)

$$\hat{p}_0 = \frac{\phi_1}{n}$$

The missing mass -- the total probability of all unseen species -- is estimated by the fraction of singletons. If 3 of your 19 observations come from species seen only once, then roughly $3/19 \approx 16\%$ of the distribution's probability mass belongs to species you never saw.

Why does this work? Under Poisson sampling, the expected number of singletons is $\sum_i n p_i e^{-n p_i}$ , which is approximately $n \sum_i p_i \mathbb{1}[p_i \text{ small}]$ -- that is, $n$ times the total probability of rare species. Dividing by $n$ recovers the missing mass. The estimator is approximately unbiased and makes no assumptions about the shape of $P$ .

The complementary quantity is sample coverage: $\hat{C} = 1 - \phi_1/n$ . Coverage of 84% means the observed species account for 84% of draws from the underlying distribution.
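Both quantities reduce to a few lines once you have the fingerprint. A from-scratch sketch (the function names `phi_j`, `missing_mass`, and `coverage` here are illustrative, not the fingerprints crate's API):

```rust
// Good-Turing missing mass and sample coverage, computed from raw
// species counts. Illustrative from-scratch helpers.

/// phi_j: number of species observed exactly j times.
fn phi_j(counts: &[usize], j: usize) -> usize {
    counts.iter().filter(|&&c| c == j).count()
}

/// Good-Turing estimate of the total probability of unseen species.
fn missing_mass(counts: &[usize]) -> f64 {
    let n: usize = counts.iter().sum();
    phi_j(counts, 1) as f64 / n as f64
}

/// Estimated sample coverage: 1 - phi_1 / n.
fn coverage(counts: &[usize]) -> f64 {
    1.0 - missing_mass(counts)
}

fn main() {
    let counts = [5usize, 4, 3, 2, 2, 1, 1, 1]; // n = 19, three singletons
    println!("p0_hat   = {:.3}", missing_mass(&counts)); // 3/19 ≈ 0.158
    println!("coverage = {:.3}", coverage(&counts));     // 16/19 ≈ 0.842
}
```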

More generally, Good-Turing smoothing reassigns probabilities to every frequency class: a species seen $r$ times gets estimated probability $\tilde{r}/n$ where $\tilde{r} = (r+1)\phi_{r+1}/\phi_r$ . This has a pathology at $r = r_{\max}$ : the most frequent species gets probability zero because $\phi_{r_{\max}+1} = 0$ . Gale and Sampson (1995) fix this with log-linear smoothing of the frequency spectrum. The missing mass formula is the special case $r = 0$ .
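The adjusted count $\tilde{r}$ is straightforward to compute from the fingerprint. This sketch (function names are illustrative, not the crate's API) applies no spectrum smoothing, so the $r_{\max}$ pathology is directly visible:

```rust
// Good-Turing adjusted probability r~/n for a species seen r times,
// with r~ = (r + 1) * phi_{r+1} / phi_r. From-scratch sketch.
use std::collections::BTreeMap;

/// Map r -> phi_r for the observed counts.
fn fingerprint(counts: &[usize]) -> BTreeMap<usize, usize> {
    let mut fp = BTreeMap::new();
    for &c in counts {
        *fp.entry(c).or_insert(0) += 1;
    }
    fp
}

fn good_turing_prob(counts: &[usize], r: usize) -> f64 {
    let n: usize = counts.iter().sum();
    let fp = fingerprint(counts);
    let phi_r = *fp.get(&r).unwrap_or(&0) as f64;
    let phi_r1 = *fp.get(&(r + 1)).unwrap_or(&0) as f64;
    (r as f64 + 1.0) * phi_r1 / phi_r / n as f64
}

fn main() {
    let counts = [5usize, 4, 3, 2, 2, 1, 1, 1];
    // A singleton: r~ = 2 * phi_2 / phi_1 = 4/3, so p ≈ 0.070.
    println!("p(r = 1) = {:.4}", good_turing_prob(&counts, 1));
    // The most frequent species: phi_6 = 0, so the estimate collapses to 0.
    println!("p(r = 5) = {:.4}", good_turing_prob(&counts, 5));
}
```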

The minimal-bias extension

Good-Turing uses only $\phi_1$ , discarding information in $\phi_2, \phi_3, \ldots$ . A minimal-bias estimator uses all fingerprint entries:

$$\hat{M}_0^{\text{MB}} = \sum_{i=1}^{r_{\max}} (-1)^{i-1} \frac{\phi_i}{\binom{n}{i}}$$

The alternating signs come from inclusion-exclusion: each term corrects the over- or under-counting of the previous one. The first term $\phi_1/n$ is exactly Good-Turing; the remaining terms are its exact bias correction. This has exponentially smaller bias than Good-Turing, at the cost of higher variance for highly skewed distributions.
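A direct transcription of the sum, with the binomial coefficient evaluated in floating point to avoid integer overflow (a from-scratch sketch; `missing_mass_mb` is an illustrative name, not the crate's API):

```rust
// Minimal-bias missing-mass estimator: alternating sum over the
// fingerprint, weighted by inverse binomial coefficients.

/// Binomial coefficient C(n, k) in floating point.
fn binom(n: usize, k: usize) -> f64 {
    (0..k).fold(1.0, |acc, i| acc * (n - i) as f64 / (i + 1) as f64)
}

fn missing_mass_mb(counts: &[usize]) -> f64 {
    let n: usize = counts.iter().sum();
    let r_max = counts.iter().copied().max().unwrap_or(0);
    (1..=r_max)
        .map(|i| {
            let phi_i = counts.iter().filter(|&&c| c == i).count() as f64;
            let sign = if i % 2 == 1 { 1.0 } else { -1.0 };
            sign * phi_i / binom(n, i)
        })
        .sum()
}

fn main() {
    let counts = [5usize, 4, 3, 2, 2, 1, 1, 1]; // n = 19
    // First term phi_1/n = 3/19 is Good-Turing; later terms correct its bias.
    println!("M0_GT = {:.4}", 3.0 / 19.0);
    println!("M0_MB = {:.4}", missing_mass_mb(&counts));
}
```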

How many species total?

Chao1: a lower bound

Chao (1984) gave a lower bound on the total number of species $S$ :(3)

$$\hat{S}_{\text{Chao1}} = S_{\text{obs}} + \frac{n-1}{n} \cdot \frac{\phi_1^2}{2\phi_2}$$

This is a lower bound: in expectation, it never exceeds the true support size. The proof uses the Cauchy-Schwarz inequality on the expected values of $\phi_0$ (unseen count), $\phi_1$ , and $\phi_2$ under multinomial sampling. The inequality is tight only when all unseen species have equal probability -- any heterogeneity in abundances means the true $S$ is strictly larger.
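A sketch of that Cauchy-Schwarz step, in the Poisson approximation where species $i$ is observed $\mathrm{Poisson}(np_i)$ times:

$$\mathbb{E}[\phi_0] = \sum_i e^{-np_i}, \qquad \mathbb{E}[\phi_1] = \sum_i np_i\, e^{-np_i}, \qquad \mathbb{E}[\phi_2] = \sum_i \frac{(np_i)^2}{2}\, e^{-np_i}.$$

Applying Cauchy-Schwarz to the vectors with entries $e^{-np_i/2}$ and $np_i\, e^{-np_i/2}$:

$$\mathbb{E}[\phi_1]^2 = \left(\sum_i np_i\, e^{-np_i}\right)^2 \le \left(\sum_i e^{-np_i}\right)\left(\sum_i (np_i)^2\, e^{-np_i}\right) = \mathbb{E}[\phi_0] \cdot 2\,\mathbb{E}[\phi_2],$$

so $\mathbb{E}[\phi_0] \ge \mathbb{E}[\phi_1]^2 / (2\,\mathbb{E}[\phi_2])$, with equality only when the $np_i$ are all equal -- the homogeneity condition noted above.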

The $\phi_1^2 / 2\phi_2$ term estimates the unseen count from the ratio of singletons to doubletons. If you have many singletons but few doubletons, the singletons look like the tip of an iceberg.[1]

use fingerprints::{Fingerprint, support_chao1, support_chao1_with_ci};

let fp = Fingerprint::from_counts([5, 3, 2, 1, 1]).unwrap();
let s_hat = support_chao1(&fp);
assert!(s_hat >= fp.observed_support() as f64);

// With variance and 95% CI (log-transformation, Chao 1987):
let est = support_chao1_with_ci(&fp);
println!("[{:.1}, {:.1}, {:.1}]", est.ci_lower, est.point, est.ci_upper);

iChao1: pulling in higher frequencies

Chiu, Wang, Walther, and Chao (2014) improved the lower bound using $\phi_3$ (tripletons) and $\phi_4$ (quadrupletons):(4)

$$\hat{S}_{\text{iChao1}} = \hat{S}_{\text{Chao1}} + \frac{\phi_3}{4\phi_4} \max\!\left(\phi_1 - \frac{n\phi_2 \phi_3}{2(n-2)\phi_4},\; 0\right)$$

It is always at least as large as Chao1 and uses more of the fingerprint. The correction term captures information about the tail that $\phi_1$ and $\phi_2$ alone miss.
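A from-scratch sketch of both bounds as written above (function names are illustrative, not the crate's API; it assumes $\phi_2 > 0$ and $\phi_4 > 0$ so neither term degenerates):

```rust
// Chao1 and iChao1 computed from raw counts, per the formulas above.

/// phi_j as f64: number of species observed exactly j times.
fn phi(counts: &[usize], j: usize) -> f64 {
    counts.iter().filter(|&&c| c == j).count() as f64
}

fn chao1(counts: &[usize]) -> f64 {
    let n: f64 = counts.iter().sum::<usize>() as f64;
    let s_obs = counts.len() as f64;
    s_obs + (n - 1.0) / n * phi(counts, 1).powi(2) / (2.0 * phi(counts, 2))
}

fn ichao1(counts: &[usize]) -> f64 {
    let n: f64 = counts.iter().sum::<usize>() as f64;
    let (f1, f2, f3, f4) = (
        phi(counts, 1), phi(counts, 2), phi(counts, 3), phi(counts, 4),
    );
    let correction =
        f3 / (4.0 * f4) * (f1 - n * f2 * f3 / (2.0 * (n - 2.0) * f4)).max(0.0);
    chao1(counts) + correction
}

fn main() {
    // phi_1 = 3, phi_2 = 2, phi_3 = 1, phi_4 = 2; n = 29, S_obs = 10.
    let counts = [6usize, 5, 4, 4, 3, 2, 2, 1, 1, 1];
    println!("Chao1  = {:.2}", chao1(&counts));
    println!("iChao1 = {:.2}", ichao1(&counts));
}
```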

Valiant-Valiant: certified bounds via linear programming

When you need rigorous bounds rather than point estimates, Valiant and Valiant (2011) showed that linear programming can extract tight bounds on any symmetric property from the fingerprint.(5)

The fingerprint $(\phi_1, \phi_2, \ldots)$ from $n$ samples constrains the underlying histogram of probabilities. Formulate an LP whose variables are the counts of probability bins, whose constraints require the expected fingerprint to match the observed one (under Poisson sampling), and whose objective minimizes or maximizes the property of interest. This yields certified upper and lower bounds.

The following example shows how wide the bounds can be from 19 observations -- the interval itself quantifies uncertainty:

use fingerprints::{Fingerprint, vv_support_bounds, vv_entropy_bounds_nats, to_bits};

let fp = Fingerprint::from_counts([5, 4, 3, 2, 2, 1, 1, 1]).unwrap();

let (s_lo, s_hi) = vv_support_bounds(&fp);
println!("support in [{:.0}, {:.0}]", s_lo, s_hi);
// support in [8, 144]

let (h_lo, h_hi) = vv_entropy_bounds_nats(&fp);
println!("entropy in [{:.2}, {:.2}] bits", to_bits(h_lo), to_bits(h_hi));
// entropy in [0.16, 4.92] bits

Entropy in the unseen regime

Estimating Shannon entropy $H = -\sum_i p_i \log p_i$ from a finite sample is biased. The naive plug-in estimator $\hat{H} = -\sum_i \hat{p}_i \log \hat{p}_i$ systematically underestimates $H$ because it misses unseen species entirely.

Four estimators form a bias-correction hierarchy.

1. Plug-in (maximum likelihood)

$$\hat{H}_{\text{plug}} = -\sum_{j \ge 1} \phi_j \cdot \frac{j}{n} \ln\!\left(\frac{j}{n}\right)$$

Negatively biased. Appropriate when $n \gg S$ .

2. Miller-Madow

$$\hat{H}_{\text{MM}} = \hat{H}_{\text{plug}} + \frac{S_{\text{obs}} - 1}{2n}$$

A first-order bias correction due to Miller (1955).(6) Simple, but when $S_{\text{obs}} \approx n$ (many singletons), the correction can overshoot.
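These first two estimators are a few lines each. A sketch computed from raw counts -- equivalent to the fingerprint sums, since species with equal counts contribute identically (function names are illustrative, not the crate's API):

```rust
// Plug-in and Miller-Madow entropy estimators, in nats.

fn entropy_plugin_nats(counts: &[usize]) -> f64 {
    let n: f64 = counts.iter().sum::<usize>() as f64;
    -counts
        .iter()
        .map(|&c| {
            let p = c as f64 / n; // empirical probability of this species
            p * p.ln()
        })
        .sum::<f64>()
}

fn entropy_miller_madow_nats(counts: &[usize]) -> f64 {
    let n: f64 = counts.iter().sum::<usize>() as f64;
    let s_obs = counts.len() as f64;
    // First-order bias correction: add (S_obs - 1) / (2n).
    entropy_plugin_nats(counts) + (s_obs - 1.0) / (2.0 * n)
}

fn main() {
    let counts = [5usize, 4, 3, 2, 2, 1, 1, 1]; // n = 19, S_obs = 8
    println!("H_plug = {:.4} nats", entropy_plugin_nats(&counts));
    println!("H_MM   = {:.4} nats", entropy_miller_madow_nats(&counts));
}
```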

3. Jackknife

Delete-one jackknifing reduces bias to $O(1/n^2)$ :(7)

$$\hat{H}_{\text{JK}} = n \hat{H}_n - (n-1)\,\mathbb{E}[\hat{H}_{n-1}]$$

Computed efficiently from the fingerprint without enumerating all leave-one-out samples.
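That efficiency claim can be made concrete: the removed observation belongs to a species seen $r$ times with probability $r\phi_r/n$, and every such removal yields the same plug-in value, so $\mathbb{E}[\hat{H}_{n-1}]$ needs only one evaluation per distinct frequency rather than $n$ leave-one-out passes. A from-scratch sketch (names illustrative, not the crate's API):

```rust
// Delete-one jackknife entropy (nats), computed from the fingerprint.
use std::collections::BTreeMap;

/// Plug-in entropy from a fingerprint (map r -> phi_r) and sample size n.
fn plugin_nats(fp: &BTreeMap<usize, usize>, n: f64) -> f64 {
    -fp.iter()
        .map(|(&j, &phi_j)| {
            let p = j as f64 / n;
            phi_j as f64 * p * p.ln()
        })
        .sum::<f64>()
}

fn entropy_jackknife_nats(counts: &[usize]) -> f64 {
    let n: f64 = counts.iter().sum::<usize>() as f64;
    let mut fp: BTreeMap<usize, usize> = BTreeMap::new();
    for &c in counts {
        *fp.entry(c).or_insert(0) += 1;
    }
    let h_n = plugin_nats(&fp, n);

    // E[H_{n-1}]: group the n leave-one-out subsamples by the
    // frequency r of the removed observation.
    let mut h_loo = 0.0;
    for (&r, &phi_r) in &fp {
        let mut fp1 = fp.clone(); // one species drops from r to r - 1
        *fp1.get_mut(&r).unwrap() -= 1;
        if fp1[&r] == 0 {
            fp1.remove(&r);
        }
        if r > 1 {
            *fp1.entry(r - 1).or_insert(0) += 1;
        }
        let weight = (r * phi_r) as f64 / n; // P(removed obs had frequency r)
        h_loo += weight * plugin_nats(&fp1, n - 1.0);
    }
    n * h_n - (n - 1.0) * h_loo
}

fn main() {
    let counts = [5usize, 4, 3, 2, 2, 1, 1, 1];
    println!("H_JK = {:.4} nats", entropy_jackknife_nats(&counts));
}
```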

4. Pitman-Yor

Archer, Park, and Pillow (2012, 2014) showed that the Pitman-Yor process -- a two-parameter generalization of the Dirichlet process -- is a better prior for entropy estimation because it generates power-law tails.(8) The process has discount $d \in [0, 1)$ and concentration $\alpha > -d$ . When $d > 0$ , the number of distinct symbols grows as $O(n^d)$ rather than $O(\log n)$ for the Dirichlet process ( $d = 0$ ), making it appropriate for natural language, species distributions, and network degree distributions.

use fingerprints::{Fingerprint, entropy_pitman_yor_nats, pitman_yor_params_hat, to_bits};

let fp = Fingerprint::from_counts([5, 4, 3, 2, 2, 1, 1, 1]).unwrap();
let h_py = entropy_pitman_yor_nats(&fp);
let py = pitman_yor_params_hat(&fp);

println!("H = {:.3} bits  (d={:.3}, alpha={:.3})",
    to_bits(h_py), py.d, py.alpha);

A worked example

We sample from a Zipf $(s=1.1)$ distribution over $K=5000$ types and measure each estimator's signed error in bits relative to the true $H = 7.91$ . Columns: sample size N, observed species S, singletons F1, estimated missing mass p0_hat, then the errors of Plug-in, Miller-Madow, Jackknife, and Pitman-Yor.

    N     S    F1  p0_hat err_plug   err_MM   err_JK   err_PY
   50    34    26  0.5200  -3.1185  -2.6424  -2.2469  -1.2323
  100    54    41  0.4100  -2.8684  -2.4861  -2.1760  -1.0118
  200   101    82  0.4100  -2.2451  -1.8845  -1.5798  -0.2188
  400   192   148  0.3700  -1.3858  -1.0413  -0.7629  +0.5478
  800   299   228  0.2850  -1.4200  -1.1513  -0.9360  +0.5788
 1600   510   382  0.2387  -1.0018  -0.7723  -0.5924  +0.9947
 3200   847   576  0.1800  -0.6455  -0.4548  -0.3162  +1.2493
 6400  1350   857  0.1339  -0.4401  -0.2880  -0.1835  +1.3877
Entropy estimation error vs sample size. Plug-in, Miller-Madow, and Jackknife are consistently negatively biased (they underestimate). Pitman-Yor has the smallest absolute error at small n but overshoots at larger n. True H = 7.91 bits, Zipf(1.1) over 5,000 types.

At $n = 50$ , coverage is only 48% and the plug-in underestimates by 3.1 bits. The Pitman-Yor estimator has the smallest absolute error at small sample sizes -- at $n = 200$ , it's off by just 0.2 bits while the plug-in is off by 2.2.

But the plot reveals a limitation: as $n$ grows, PY overshoots. By $n = 6400$ it overestimates by 1.4 bits while the jackknife is off by only 0.2.[2] The classical estimators are boring but reliable: consistently negatively biased, converging steadily toward zero.

When estimators disagree

Different estimators make different assumptions about the tail. When they converge to similar values, the tail assumption doesn't matter. When they diverge, the divergence tells you the sample is too small to resolve the tail -- and no estimator can fix that.

Regime                                       Recommended estimator
$n \gg S_{\text{obs}}$, few singletons       Plug-in or Miller-Madow
Moderate undersampling                       Jackknife
Severe undersampling, power-law tails        Pitman-Yor
Need certified bounds, not point estimates   Valiant-Valiant LP
Need lower bound on support size             Chao1 or iChao1

When sample coverage $\hat{C} < 0.8$ (i.e., $\phi_1/n > 0.2$ ), plug-in and Miller-Madow should be treated with caution. At very small $n$ , Pitman-Yor produces a less biased estimate -- but as the convergence plot shows, it can overshoot at larger sample sizes. No single estimator dominates across all regimes. The safest strategy: report multiple estimates and the VV bounds, letting the spread quantify your uncertainty.

Connections

The same fingerprint structure recurs in settings that have no apparent connection to ecology.

Cryptanalysis to NLP. Good formalized and published Turing's wartime work in 1953.(9) The same smoothing technique -- estimating the probability of unseen events from the frequency of rare ones -- became the foundation of statistical language modeling. Every n-gram model before neural methods used Good-Turing smoothing or a direct descendant.

Shakespeare's vocabulary. Efron and Thisted (1976) used the fingerprint of word frequencies in Shakespeare's known canon to predict how many new words would appear if an additional text were discovered. Shakespeare's known works contain 31,534 distinct word types across roughly 884,000 tokens. Of those, 14,376 appear exactly once -- hapax legomena. That $\phi_1$ alone is 46% of the vocabulary. When an unknown poem attributed to "W.S." surfaced in the Bodleian Library in 1985, their model predicted $7.0 \pm 2.6$ new words. The actual count was 9. They also tested poems by Jonson, Marlowe, and Donne against the Shakespearean fingerprint -- all failed. The fingerprint became a forensic signature.(10)

Immune repertoire diversity. Your immune system contains $\sim\!10^8$ distinct T-cell receptor clonotypes generated by VDJ recombination, but any blood draw captures a tiny fraction. Each clonotype is a species, each sequenced cell is a sample. Laydon et al. (2015) showed that classical estimators severely undercount because the TCR frequency distribution is heavy-tailed.(11) An unusually diverse repertoire correlates with effective viral control -- the unseen fraction has direct clinical relevance for HIV, cancer immunotherapy, and aging.

Forensic DNA and the rare-type match. When a suspect's DNA profile matches a crime scene sample but has never appeared in the reference database, what is the probability an innocent person has this profile? Classical frequency estimation gives zero -- absurd. This is the missing mass problem. Favaro and Naulet (2023) frame it as a Good-Turing problem and prove optimal bounds.(12)

Metagenomics. Microbiome research surveys microbial diversity from 16S rRNA sequencing. Chao1 is a standard alpha-diversity metric in QIIME2 and mothur. But modern denoising pipelines (ASVs) collapse rare variants, destroying the singleton count that Chao1 depends on. Deng, Umbach, and Neufeld (2024) warned that Chao1 "must not be used" with ASV data -- a case where the estimator's assumption is violated by preprocessing.(13)

Software fuzzing. Böhme (2018) mapped fuzzing directly onto the unseen species framework: execution paths are species, fuzzer runs are samples. Good-Turing estimates how many undiscovered paths remain and the residual risk that a vulnerability lurks in uncovered code.(14)

Distinct counting at scale. HyperLogLog (Flajolet et al., 2007) estimates the number of distinct elements in a data stream -- the same "how many species?" question, solved with hash-based sketches instead of fingerprints. The connection goes deeper than analogy. Cohen (2016) extended HyperLogLog to estimate the full frequency-of-frequencies structure, explicitly bridging the database and ecological traditions. It's deployed in Redis, PostgreSQL, Spark, and Google's infrastructure. The question Fisher and Good posed for butterfly species in the 1940s is now answered billions of times per day in production databases.(15)

The German tank problem. Allied statisticians estimated German tank production from captured serial numbers. The MVUE $\hat{N} = m + m/k - 1$ (where $m$ is the max serial, $k$ is sample size) gave estimates within 1% of actual production. This is the uniform-distribution special case: the fingerprint framework generalizes it to arbitrary abundance distributions.
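The estimator itself is a one-liner. A sketch with made-up serial numbers (the function name is illustrative):

```rust
// German tank MVUE: N_hat = m + m/k - 1, where m is the largest
// observed serial number and k the number of captured tanks.

fn tank_estimate(serials: &[u32]) -> f64 {
    let m = *serials.iter().max().expect("need at least one serial") as f64;
    let k = serials.len() as f64;
    m + m / k - 1.0
}

fn main() {
    // Five hypothetical captured serials, maximum 240:
    let serials = [19u32, 40, 42, 60, 240];
    println!("N_hat = {:.0}", tank_estimate(&serials)); // 240 + 48 - 1 = 287
}
```

Intuitively, the gap above the maximum should look like the average gap between observed serials, which is what the $m/k$ term adds.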


The unseen species problem is one of those places where a single mathematical structure -- the fingerprint -- connects Bletchley Park to microbiome sequencing to forensic evidence to database cardinality. The question is always the same: you've seen some of it. How much is left?

References

[1] Acharya, Das, Orlitsky, Suresh, "A unified maximum likelihood approach for estimating symmetric properties of discrete distributions," ICML 2017.

[2] I.J. Good, "The population frequencies of species and the estimation of population parameters," Biometrika 40(3/4), 237--264, 1953. The core idea is due to Turing; Good credits him explicitly but is the sole listed author.

[3] Anne Chao, "Nonparametric estimation of the number of classes in a population," Scandinavian Journal of Statistics 11(4), 265--270, 1984.

[4] Chiu, Wang, Walther, Chao, "An improved nonparametric lower bound of species richness via a modified Good-Turing frequency formula," Biometrics 70(3), 671--682, 2014.

[5] Gregory Valiant, Paul Valiant, "Estimating the unseen: An $n/\log n$ -sample estimator for entropy and support size, shown optimal via new CLTs," STOC, 685--694, 2011. Journal version: JACM 64(6), 2017.

[6] G.A. Miller, "Note on the bias of information estimates," in Information Theory in Psychology, Free Press, 95--100, 1955. Often called "Miller-Madow" by convention, though Madow is not a co-author.

[7] S. Zahl, "Jackknifing an index of diversity," Ecology 58(4), 907--913, 1977.

[8] Evan Archer, Il Memming Park, Jonathan W. Pillow, "Bayesian entropy estimation for countable discrete distributions," JMLR 15, 2833--2868, 2014. Conference version: NeurIPS 2012.

[9] I.J. Good, "Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma," Journal of Statistical Computation and Simulation 66, 101--111, 2000.

[10] Bradley Efron, Ronald Thisted, "Estimating the number of unseen species: How many words did Shakespeare know?", Biometrika 63(3), 435--447, 1976. The validation: Thisted and Efron, "Did Shakespeare write a newly-discovered poem?", Biometrika 74(3), 445--455, 1987.

[11] Laydon, Bangham, et al., "Estimating T-cell repertoire diversity: limitations of classical estimators and a new approach," Phil. Trans. R. Soc. B 370, 20140291, 2015.

[12] Favaro, Naulet, "Optimal estimation of high-order missing masses, and the rare-type match problem," arXiv:2306.14998, 2023.

[13] Deng, Umbach, Neufeld, "Nonparametric richness estimators Chao1 and ACE must not be used with amplicon sequence variant data," The ISME Journal, 2024.

[14] Marcel Böhme, "STADS: Software Testing as Species Discovery," ACM TOSEM 27(2), 2018.

[15] Flajolet, Fusy, Gandouet, Meunier, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm," AofA, 2007. Cohen, "HyperLogLog Hyper Extended," arXiv:1607.06517, 2016.


  1. When $\phi_2 = 0$ , the bias-corrected fallback is $S_{\text{obs}} + \phi_1(\phi_1 - 1)/2$ . ↩︎

  2. The fitted discount $d = 0$ across all sample sizes means PY is collapsing to the Dirichlet process, failing to capture the power-law tail of this Zipf distribution. The PY estimator works best when the data genuinely follows a power-law species-abundance distribution. ↩︎