Adam Murray - Research Explorations

Information-Theoretic Limits and Phase Transitions in Combinatorial Library Sampling

A mathematical framework for optimal sampling strategies in massive combinatorial chemical spaces, establishing fundamental limits, phase transitions, and experimental design principles through quotient space theory and information geometry.

30 min read


Abstract

We present a mathematical framework for sampling in massive combinatorial chemical libraries where complete enumeration is computationally prohibitive. By modeling chemical libraries as quotient spaces where multiple synthesis instances map to unique chemical entities, we establish fundamental information-theoretic bounds on sample complexity, prove the existence of sharp phase transitions in coverage, and derive optimal experimental design strategies. The framework demonstrates that sampling can achieve 1000× computational speedup while maintaining complete coverage guarantees, with the efficiency gain arising from the inherent redundancy in combinatorial synthesis. We prove that adaptive sampling strategies enjoy a logarithmic advantage over uniform sampling, identify critical sampling fractions where phase transitions occur, and show that optimal batch sizes are surprisingly independent of library size under high-redundancy conditions.

Interactive Demonstration

Explore the phase transition phenomenon below. The sharp S-curve transition demonstrates how sampling efficiency exhibits critical thresholds—adjust library parameters to see when dramatic phase transitions occur.

Phase Transition Explorer

Explore the sharp phase transition in probability P(Coverage ≥ 1-ε) at critical threshold α_c. Adjust parameters to see the dramatic S-curve behavior predicted by Theorem 4.1.

Parameter Guide:

  • |U| (Unique entities): Number of distinct chemical entities in the library
  • R (Redundancy): Average number of synthesis paths per unique entity (|S|/|U|)
  • ε (Tolerance): Maximum acceptable fraction of uncovered entities (target coverage = 1-ε)
  • α_c (Critical point): Sampling fraction where phase transition occurs = log(1/ε)/R

Parameter effects: larger |U| makes the phase transition sharper; higher R moves α_c left (fewer samples needed); smaller ε moves α_c right (stricter coverage).

Example readout (|U| = 10⁴, R = 100, ε = 0.05): synthesis space |S| = |U|×R = 1.00×10⁶; samples needed α_c × |S| ≈ 29,958; speedup |S|/samples ≈ 33.4×; critical point α_c = log(1/ε)/R ≈ 2.996×10⁻²; true inflection α_infl ≈ 2.994×10⁻²; probability at α_c ≈ 50.9%; inflection offset ≈ −0.07%.

Interpretation:

  • Blue curve: Probability P(Coverage ≥ 1-ε) using exact Poisson CDF from Theorem 4.1
  • Red line: Theoretical critical threshold α_c = log(1/ε)/R (where E[uncovered] = ε|U|)
  • Orange line: True inflection point α_infl = log(1/(ε + 1/|U|))/R (from calculus, shown if different)
  • Yellow region: Phase transition zone exhibiting sharp S-curve behavior
  • For large |U|, α_c ≈ α_infl (offset → 0%). For small |U|, the offset is measurable.
  • Hover over the curve to see exact probability values at each sampling fraction
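
The sketch below recomputes the explorer's readouts offline under the Poisson model described in the interpretation above. The parameter values (|U| = 10⁴, R = 100, ε = 0.05) are assumptions chosen to match the example readout, and scipy is used only for the exact Poisson CDF.

```python
# Sketch: reproduce the Phase Transition Explorer readouts (assumed parameters).
import math
from scipy.stats import poisson

U = 10_000      # |U|: unique entities (assumed)
R = 100         # redundancy |S|/|U| (assumed)
eps = 0.05      # coverage tolerance (assumed)

S = U * R                                      # synthesis space size
alpha_c = math.log(1 / eps) / R                # critical sampling fraction (Def. 4.6)
alpha_infl = math.log(1 / (eps + 1 / U)) / R   # true inflection point
samples = alpha_c * S                          # samples needed at the critical point
speedup = S / samples                          # speedup over full enumeration

# Uncovered entities are approximately Poisson(|U| * exp(-alpha * R));
# coverage >= 1 - eps  <=>  uncovered <= eps * |U|.
def p_coverage(alpha: float) -> float:
    mean_uncovered = U * math.exp(-alpha * R)
    return poisson.cdf(math.floor(eps * U), mean_uncovered)

print(f"|S| = {S:.2e}, samples = {samples:,.0f}, speedup = {speedup:.1f}x")
print(f"alpha_c = {alpha_c:.3e}, alpha_infl = {alpha_infl:.3e}")
print(f"P(coverage >= 1-eps) at alpha_c = {p_coverage(alpha_c):.1%}")
```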

1. Introduction

1.1 The Combinatorial Explosion in Chemical Discovery

Modern chemical discovery relies on vast combinatorial libraries—DNA-encoded libraries (DELs), peptide arrays, fragment libraries—that can contain billions of synthesis products. A typical DEL combines building blocks at several variable positions, so the number of synthesized instances far exceeds the number of unique chemical entities it contains.

Complete enumeration requires computational resources that scale with the synthesis space, not the chemical diversity, creating a fundamental bottleneck.

1.2 The Sampling Alternative

Rather than complete enumeration requiring O(|S|) operations where S is the synthesis space, sampling strategies can achieve comprehensive coverage with O(|U| log |U|) operations where U is the unique chemical space. When |S|/|U| >> log |U|, sampling provides dramatic speedups.

1.3 Contributions

We establish:

  1. Information-theoretic lower bounds on sample complexity given partial knowledge
  2. Sharp phase transitions in chemical coverage as sampling fraction increases
  3. Optimal experimental design strategies for multi-modal sampling
  4. Submodular optimization framework for resource allocation
  5. Practical algorithms with provable approximation guarantees

2. Mathematical Framework

2.1 Quotient Space Structure

Definition 2.1 (Chemical Quotient Space)

A combinatorial library forms a quotient space $(S, U, \pi)$ where:

  • $S$ = synthesis space (all synthesized instances)
  • $U$ = chemical space (unique chemical entities)
  • $\pi: S \to U$ is the canonical projection

The equivalence relation $s_1 \sim s_2 \iff \pi(s_1) = \pi(s_2)$ partitions $S$ into equivalence classes.

Definition 2.2 (Redundancy)

The redundancy factor $R = |S|/|U|$ measures the average equivalence class size.

2.2 The Sampling Problem

Problem (Coverage with Minimal Sampling): Given quotient space $(S, U, \pi)$ and coverage requirement $C \in [0,1]$, find the minimal $n$ such that:

$$P(|\{\pi(s_1), \ldots, \pi(s_n)\}| \geq C|U|) \geq 1 - \delta$$

2.3 Building Block Structure

For combinatorial libraries with $N$ positions and $B_i$ choices at position $i$:

Definition 2.3 (Compositional Space)
  • $N$ = number of variable positions in combinatorial synthesis
  • $B_i$ = number of building block choices at position $i \in \{1,\ldots,N\}$
  • $|U|_{\max} = \prod_i B_i$ = maximum diversity (if all combinations were unique)
  • $\rho = |U|_{\max}/|U|$ = chemical redundancy (ratio of possible to actual unique entities)
  • $R = |S|/|U|$ = synthesis redundancy (average synthesis paths per entity)

3. Information-Theoretic Bounds

3.1 Fundamental Limits

Theorem 3.1 (Information-Theoretic Lower Bound)

For any sampling strategy with $n$ samples and unknown $|U|$:

$$E[\text{samples needed}] \geq \frac{|U|\,H(P)}{\log|U|}$$

where $H(P)$ is the entropy of the distribution of equivalence class sizes.

Proof:

Each sample provides at most $\log|U|$ bits of information. Reconstructing the quotient structure requires $|U|\,H(P)$ bits in total. Therefore:

$$n \times \log|U| \geq |U|\,H(P)$$

Rearranging gives the bound.

3.2 Adaptive vs. Uniform Sampling

Definition 3.1 (Information Gain)

The information gain from sample $s_n$ given previous samples $S_{n-1}$ is:

$$I(s_n) = H(U \mid S_{n-1}) - H(U \mid S_{n-1} \cup \{s_n\})$$

Theorem 3.2 (Logarithmic Advantage of Adaptivity)

For uniform sampling:

$$E[I(s_n)] = O\!\left(\frac{|U| - \text{observed}}{|U|} \times \log|U|\right)$$

For optimal adaptive sampling:

$$E[I(s_n)] = O(\log(|U| - \text{observed}))$$

Proof:

Uniform sampling discovers a new entity with probability $(|U| - \text{observed})/|U|$, yielding $\log|U|$ bits when successful.

Adaptive sampling can target the highest-uncertainty regions, guaranteeing a discovery until coverage is complete and yielding $\log(\text{remaining})$ bits deterministically.

Corollary

Adaptive sampling requires $O(|U|)$ samples versus $O(|U| \log|U|)$ for uniform sampling.
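
As a rough empirical check on this corollary, the sketch below compares draws to full coverage for uniform sampling against an idealized "adaptive" sampler that always hits an undiscovered entity; the toy library size and the idealized adaptive oracle are assumptions, not a real strategy.

```python
# Sketch: uniform vs. idealized adaptive sampling until full coverage (toy setup).
import math
import random

def uniform_samples_to_cover(num_unique: int, rng: random.Random) -> int:
    """Draw uniformly from U until every entity is seen (coupon collector)."""
    seen, draws = set(), 0
    while len(seen) < num_unique:
        seen.add(rng.randrange(num_unique))
        draws += 1
    return draws

def adaptive_samples_to_cover(num_unique: int) -> int:
    """Idealized adaptive sampler: every draw discovers a new entity."""
    return num_unique

rng = random.Random(0)
U = 2_000
uniform = sum(uniform_samples_to_cover(U, rng) for _ in range(5)) / 5
print(f"uniform  ~ {uniform:,.0f} draws  (theory ~ |U| ln|U| = {U * math.log(U):,.0f})")
print(f"adaptive ~ {adaptive_samples_to_cover(U):,} draws (theory ~ |U|)")
```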

3.3 Partial Information Regimes

Definition 3.2 (Knowledge Level)

$\kappa \in [0,1]$ = fraction of the quotient structure known a priori, where $\kappa = 0$ means no knowledge and $\kappa = 1$ means complete knowledge.

Theorem 3.3 (Sample Complexity with Partial Information)

Given knowledge level $\kappa$ about the quotient structure:

$$n^*(\kappa) = \frac{|U|\left(\log|U| + (1-\kappa)\log R\right)}{1 + \kappa(R-1)/R}$$

Proof:

With partial knowledge $\kappa$:

  • Known structure reduces the search space by a factor of $1 + \kappa(R-1)/R$
  • Unknown structure requires an additional $(1-\kappa)\log R$ samples
  • The two effects combine multiplicatively due to information independence
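
To see how the bound of Theorem 3.3 interpolates between the no-knowledge and full-knowledge regimes, the short sketch below simply evaluates $n^*(\kappa)$; the library parameters are illustrative assumptions.

```python
# Sketch: evaluate the Theorem 3.3 sample-complexity formula (illustrative parameters).
import math

def n_star(U: int, R: float, kappa: float) -> float:
    """n*(kappa) = |U| (log|U| + (1-kappa) log R) / (1 + kappa (R-1)/R)."""
    return U * (math.log(U) + (1 - kappa) * math.log(R)) / (1 + kappa * (R - 1) / R)

U, R = 10_000, 100  # assumed library parameters
for kappa in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"kappa = {kappa:.2f}  ->  n* ~ {n_star(U, R, kappa):,.0f} samples")
```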

4. Phase Transitions

4.1 Coverage Phase Transition

Definition 4.1 (Sample Count)

$n$ = number of samples drawn from the synthesis space $S$.

Definition 4.2 (Sampling Fraction)

$\alpha = n/|S|$ = fraction of the synthesis space sampled.

Definition 4.3 (Coverage Function)

$C(\alpha) = E[|\text{observed unique}|/|U|]$ = expected fraction of unique entities discovered.

Definition 4.4 (Coverage Tolerance)

$\varepsilon \in (0,1)$ = maximum acceptable fraction of uncovered entities. Target coverage is $1-\varepsilon$.

Definition 4.5 (Confidence Parameter)

$\delta \in (0,1)$ = maximum acceptable failure probability. We seek $P(\text{success}) \geq 1-\delta$.

Definition 4.6 (Critical Sampling Fraction)

The critical sampling fraction $\alpha_c(\varepsilon) = \frac{\log(1/\varepsilon)}{R}$ is the threshold where the phase transition occurs.

Theorem 4.1 (Sharp Phase Transition)

For combinatorial libraries with redundancy $R = |S|/|U|$, the probability of achieving coverage $1-\varepsilon$ exhibits a sharp transition at $\alpha_c$:

$$\lim_{|U| \to \infty} P(C(\alpha) > 1-\varepsilon) = \begin{cases} 0 & \text{if } \alpha < \alpha_c(\varepsilon) \\ 1 & \text{if } \alpha > \alpha_c(\varepsilon) \end{cases}$$

Proof:

Model the process as a coupon collector problem with $R$ copies per coupon.

For $\alpha < \alpha_c$: the expected number of uncovered coupons is $|U|\exp(-\alpha R) > \varepsilon|U|$. By concentration inequalities:

$$P(\text{uncovered} < \varepsilon|U|/2) \leq \exp(-\varepsilon|U|/8)$$

For $\alpha > \alpha_c$: the expected number of uncovered coupons is $< \varepsilon|U|$. Similarly:

$$P(\text{uncovered} > 2\varepsilon|U|) \leq \exp(-\varepsilon|U|/8)$$

As $|U| \to \infty$, both probabilities tend to $0$, giving the sharp transition.
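
The sharpening predicted by Theorem 4.1 can also be checked empirically. The sketch below runs a small Monte Carlo simulation of sampling with replacement from the synthesis space for two library sizes; the sizes, redundancy, and trial counts are arbitrary assumptions chosen to keep the run fast.

```python
# Sketch: Monte Carlo check of the coverage phase transition (toy parameters).
import math
import random

def coverage_prob(U: int, R: int, alpha: float, eps: float, trials: int, rng) -> float:
    """Estimate P(coverage >= 1-eps) when sampling alpha*|S| instances uniformly."""
    n = int(alpha * U * R)
    hits = 0
    for _ in range(trials):
        seen = {rng.randrange(U) for _ in range(n)}   # each instance maps to a uniform entity
        if len(seen) >= (1 - eps) * U:
            hits += 1
    return hits / trials

rng = random.Random(1)
R, eps = 20, 0.05
alpha_c = math.log(1 / eps) / R
for U in (500, 5_000):
    probs = [coverage_prob(U, R, a * alpha_c, eps, trials=50, rng=rng) for a in (0.8, 1.0, 1.2)]
    print(f"|U|={U:>5}: P at 0.8*alpha_c={probs[0]:.2f}, at alpha_c={probs[1]:.2f}, at 1.2*alpha_c={probs[2]:.2f}")
```

For the larger library, the jump from near 0 below the threshold to near 1 above it should be visibly sharper, as the theorem predicts.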

4.2 Multi-Scale Transitions

Theorem 4.2 (Hierarchical Phase Transitions)

For libraries with building blocks $B$, fragments $F$, and molecules $U$:

$$\begin{aligned} \alpha_1 &= 1/|S| && \text{(first discovery)} \\ \alpha_2 &= \frac{\log|B|}{R_B} && \text{(building block coverage)} \\ \alpha_3 &= \frac{\log|F|}{R_F} && \text{(fragment coverage)} \\ \alpha_4 &= \frac{\log|U|}{R} && \text{(molecular coverage)} \end{aligned}$$

Each transition enables qualitatively different chemical insights.

4.3 Percolation in Chemical Space

Theorem 4.3 (Chemical Percolation)

When chemicals form a similarity graph $G$ with average degree $d$:

$$\alpha_{\text{perc}} = \frac{\log d}{d}$$

Below $\alpha_{\text{perc}}$: isolated clusters of discovered chemicals. Above $\alpha_{\text{perc}}$: a giant component containing $\Theta(|U|)$ chemicals forms.

Proof sketch: This maps to an Erdős–Rényi random graph with edge probability $p = \alpha R|U|/|S| = \alpha$ (since $R = |S|/|U|$). The giant component emerges when $p > \log d/d$.

5. Optimal Experimental Design

5.1 Multi-Modal Sampling

Definition 5.1a (Multi-Modal Variables)

  • $K$ = number of experimental modalities
  • $Q_k$ = cost per sample under modality $k$
  • $I_k$ = information rate of modality $k$
  • $C_k(n)$ = coverage achieved by $n$ samples of modality $k$
  • $B$ = total experimental budget

Problem (Multi-Modal Resource Allocation): Given $K$ experimental modalities with costs $Q_k$, information rates $I_k$, and coverage functions $C_k(n)$, maximize joint coverage subject to budget $B$.

Theorem 5.1 (Submodular Coverage)

The joint coverage function $C(n_1,\ldots,n_K)$ is submodular.

Proof:

For submodularity, we must show diminishing returns:

$$C(n_1,\ldots,n_i+1,\ldots) - C(n_1,\ldots,n_i,\ldots) \geq C(n_1,\ldots,n_i+1,\ldots,n_j+1,\ldots) - C(n_1,\ldots,n_i,\ldots,n_j+1,\ldots)$$

Each additional sample discovers a new entity with some probability $p$. With more samples already drawn, fewer entities remain undiscovered, so $p$ decreases and the marginal benefit shrinks.

Corollary

Greedy allocation achieves a $(1-1/e)$ approximation of the optimal coverage.
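
A minimal sketch of the greedy allocation implied by this corollary appears below; the exponential-saturation coverage curves, per-sample costs, and budget are hypothetical stand-ins, not part of the framework itself.

```python
# Sketch: greedy budget allocation across K modalities with submodular coverage
# (coverage curves, costs, and budget are hypothetical illustrations).
import math

def coverage(U_k: float, n: float) -> float:
    """Expected unique entities seen after n uniform samples: |U_k| (1 - exp(-n/|U_k|))."""
    return U_k * (1.0 - math.exp(-n / U_k))

def greedy_allocate(U_sizes, costs, budget, step=1.0):
    """Repeatedly spend `step` units of budget on the modality with the best marginal coverage per cost."""
    n = [0.0] * len(U_sizes)
    spent = 0.0
    while spent + step <= budget:
        gains = [
            (coverage(U_sizes[k], n[k] + step / costs[k]) - coverage(U_sizes[k], n[k])) / step
            for k in range(len(U_sizes))
        ]
        best = max(range(len(U_sizes)), key=lambda k: gains[k])
        n[best] += step / costs[best]
        spent += step
    return n

allocation = greedy_allocate(U_sizes=[1_000, 5_000, 20_000], costs=[1.0, 2.0, 5.0], budget=10_000)
print("samples per modality:", [round(x) for x in allocation])
```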

5.2 Adaptive Experimental Design

Definition 5.1 (Exploration-Exploitation Trade-off)

At each step, choose between:

  • Exploration: Sample from high-uncertainty regions
  • Exploitation: Sample from high-value regions

Theorem 5.2 (Optimal Sampling Policy)

The optimal policy samples region $r$ with probability:

$$P(r) \propto \sqrt{\text{value}(r) \times \text{uncertainty}(r)}$$

Proof:

This follows from the upper confidence bound (UCB) framework: the optimal exploration bonus scales with $\sqrt{\text{uncertainty}}$, the exploitation weight scales with value, and their product gives the optimal trade-off.
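
The policy of Theorem 5.2 reduces to a simple weighting rule; the snippet below turns hypothetical per-region value and uncertainty estimates into sampling probabilities.

```python
# Sketch: sampling probabilities P(r) proportional to sqrt(value(r) * uncertainty(r))
# (the value/uncertainty numbers are made-up placeholders).
import math

value       = {"region_A": 3.0, "region_B": 1.0, "region_C": 0.2}
uncertainty = {"region_A": 0.1, "region_B": 0.5, "region_C": 2.0}

weights = {r: math.sqrt(value[r] * uncertainty[r]) for r in value}
total = sum(weights.values())
probs = {r: w / total for r, w in weights.items()}
print(probs)
```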

5.3 Batch Optimization

Definition 5.2 (Batch Economy)
  • $b$ = batch size (number of samples processed together)
  • $\beta = \text{setup cost}/\text{per-sample cost}$ = ratio of fixed setup cost to variable per-sample cost

Theorem 5.3 (Optimal Batch Size)

The cost-minimizing batch size is:

$$b^* = \sqrt{\beta \times |U| \times \log|U|}$$

independent of $|S|$ when $R \gg \log|U|$.

Proof:

Total cost for achieving coverage:

$$\begin{aligned} C(b) &= \frac{|U|\log|U|}{b} \times \text{setup cost} + |U|\log|U| \times \text{per-sample cost} \\ &= |U|\log|U| \times \left(\frac{\beta}{b} + 1\right) \times \text{per-sample cost} \end{aligned}$$

Minimizing with respect to $b$:

$$\frac{dC}{db} = -|U|\log|U| \times \frac{\beta}{b^2} \times \text{per-sample cost} = 0$$

Since the coverage requirement $|U|\log|U|$ is independent of $|S|$ under high redundancy, so is $b^*$.
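
To illustrate the independence claim, the sketch below evaluates the stated formula for $b^*$ for two synthesis-space sizes with the same $|U|$ and $\beta$; the numeric values are arbitrary assumptions.

```python
# Sketch: b* = sqrt(beta * |U| * log|U|) does not depend on |S|
# (parameter values are illustrative assumptions).
import math

def optimal_batch(beta: float, U: int) -> float:
    return math.sqrt(beta * U * math.log(U))

beta, U = 50.0, 10_000
for S in (10**6, 10**9):                       # |S| never enters the formula
    print(f"|S| = {S:.0e}: b* ~ {optimal_batch(beta, U):,.0f}")
```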

5.4 Temporal Optimization

Explore the phase-space dynamics of temporal optimization below. Watch the optimal sampling trajectory evolve according to exponential decay, demonstrating the “front-loading principle” in action.

Temporal Optimization Flow

Phase-space trajectory showing optimal sampling rate over time with exponential decay. The "front-loading principle": sample heavily early, then taper off.

Example readout (λ = 0.01, horizon T = 300): total samples under the optimal schedule ≈ 7,769; equivalent uniform rate ≈ 25.9 samples per unit time; quality-decay half-life ln 2/λ ≈ 69.3.

Interpretation:

  • Blue curve: Optimal sampling rate n*(t) ∝ exp(-λt/2)
  • Red dashed line: Constant uniform sampling rate
  • Shaded area: Total samples acquired (area under curve)
  • Green marker: Current time position (during animation)
  • Gray arrows show flow direction (exponential decay)
  • Front-loading principle: Sample heavily when information is fresh, reduce rate as it decays

Definition 5.3 (Temporal Variables)
  • $t$ = time from start of experiment
  • $\lambda$ = exponential decay rate of sample quality
  • $Q(t) = Q_0\exp(-\lambda t)$ = sample quality at time $t$
  • $I(t) = I_0\exp(-\lambda t)$ = information gained per sample at time $t$
  • $n(t)$ = sampling rate (samples per unit time) at time $t$
  • $C$ = cost per sample (constant)
  • $B$ = total budget constraint
  • $\mu$ = Lagrange multiplier for the budget constraint

Problem: Given sample quality decay $Q(t) = Q_0\exp(-\lambda t)$, optimize the sampling schedule.

Theorem 5.4 (Front-Loading Principle)

Optimal sampling rate is:

$$n^*(t) \propto \exp(-\lambda t/2)$$

Proof:

Information per sample at time $t$: $I(t) = I_0\exp(-\lambda t)$. Cost per sample: constant $C$. The Lagrangian for maximizing $\int I(t)\,n(t)\,dt$ subject to $\int C\,n(t)\,dt \leq B$ is:

$$L = \int \left[ I_0\exp(-\lambda t)\,n(t) - \mu C\,n(t) \right] dt$$

First-order condition: $I_0\exp(-\lambda t) = \mu C$. Solving for $n^*(t)$ gives the result.
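
A minimal sketch of the front-loaded schedule, normalized so that its total sample count matches a fixed budget, is shown below; the decay rate, horizon, and budget are illustrative assumptions that roughly match the demo readout above.

```python
# Sketch: front-loaded sampling schedule n*(t) proportional to exp(-lambda*t/2), normalized to a budget
# (lambda, horizon, and budget are illustrative assumptions).
import math

lam, T = 0.01, 300.0          # quality decay rate and experiment horizon
budget_samples = 7_769        # total samples we can afford (matches the demo readout)

# Normalize: integral of n0 * exp(-lam*t/2) over [0, T] equals the budget.
integral = (1 - math.exp(-lam * T / 2)) / (lam / 2)
n0 = budget_samples / integral

for t in (0, 50, 100, 200, 300):
    rate = n0 * math.exp(-lam * t / 2)
    print(f"t = {t:>3}: optimal rate ~ {rate:5.1f} samples/time (uniform would be {budget_samples / T:.1f})")
```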

6. Algorithmic Implementations

6.1 Coverage Estimation Algorithm

Algorithm (Adaptive Coverage Estimator)

1. Sample batch of size √|S|
2. Estimate |U| using Good-Turing estimator
3. While coverage < target:
   a. Compute α_c for current |U| estimate
   b. Sample additional α_c|S|/2 points
   c. Update |U| estimate
4. Return coverage estimate

Theorem 6.1: This algorithm achieves target coverage with probability 1-δ using O(|U|log(|U|/δ)) samples.
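
A compact sketch of this estimator is given below. The Good-Turing-style correction used here (sample coverage ≈ 1 − singletons/samples, then |U| ≈ distinct/coverage) is one standard choice, and the stopping rule and batch sizes are simplifications of the algorithm above; the synthetic library is an assumption for demonstration.

```python
# Sketch: adaptive coverage estimator with a Good-Turing-style |U| estimate
# (the sampler draws from a synthetic quotient space; batch sizes are simplified).
import math
import random
from collections import Counter

def estimate_unique(counts: Counter, n_samples: int) -> float:
    """Good-Turing: sample coverage ~ 1 - f1/n; then |U| ~ distinct / coverage."""
    f1 = sum(1 for c in counts.values() if c == 1)
    coverage = max(1.0 - f1 / n_samples, 1e-9)
    return len(counts) / coverage

def adaptive_coverage(sample_entity, S_size: int, target: float, rng) -> tuple[float, int]:
    counts, n = Counter(), 0
    batch = int(math.isqrt(S_size))                   # initial batch ~ sqrt(|S|)
    while True:
        for _ in range(batch):
            counts[sample_entity(rng)] += 1
        n += batch
        U_hat = estimate_unique(counts, n)
        cov = len(counts) / U_hat
        if cov >= target:
            return cov, n
        alpha_c = math.log(U_hat) / (S_size / U_hat)  # critical fraction for the current estimate
        batch = max(1, int(alpha_c * S_size / 2))     # sample an additional alpha_c*|S|/2 points

# Synthetic library: |U| = 2,000 unique entities, redundancy R = 50.
U_true, R = 2_000, 50
rng = random.Random(0)
cov, n = adaptive_coverage(lambda r: r.randrange(U_true), U_true * R, target=0.95, rng=rng)
print(f"estimated coverage {cov:.2%} after {n:,} samples (|S| = {U_true * R:,})")
```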

6.2 Stratified Sampling

For files F₁,…,F_m with unknown unique counts:

Algorithm (Stratified Sampler)

1. Initial: Sample √|Fᵢ| from each file
2. Estimate |Uᵢ| for each file
3. Allocate: nᵢ ∝ |Fᵢ| × |Uᵢ|log|Uᵢ|
4. Sample according to allocation

Theorem 6.2: Stratified sampling reduces variance by the factor:

$$\text{Var reduction} = \frac{\left(\sum_i |F_i|\,|U_i|\right)^2}{\sum_i |F_i|^2\,|U_i|}$$
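
A minimal sketch of the allocation rule and the Theorem 6.2 variance-reduction factor follows; the file sizes and unique-count estimates are placeholder assumptions.

```python
# Sketch: stratified allocation n_i proportional to |F_i| * |U_i| log|U_i|, plus the
# Theorem 6.2 variance-reduction factor (file sizes / unique counts are placeholders).
import math

F = [1_000_000, 250_000, 50_000]     # |F_i|: file sizes (assumed)
U = [40_000, 20_000, 5_000]          # |U_i|: estimated unique counts per file (assumed)
budget = 100_000                     # total samples to allocate

weights = [f * u * math.log(u) for f, u in zip(F, U)]
alloc = [budget * w / sum(weights) for w in weights]

var_reduction = sum(f * u for f, u in zip(F, U)) ** 2 / sum(f**2 * u for f, u in zip(F, U))

for i, (n_i, f) in enumerate(zip(alloc, F), start=1):
    print(f"file {i}: sample {n_i:,.0f} of {f:,}")
print(f"variance reduction factor ~ {var_reduction:.2f}")
```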

7. Practical Applications

7.1 DNA-Encoded Libraries

For a typical DEL with roughly $10^9$ synthesized instances, about $10^4$ unique chemical entities (so $R = 10^5$), and a per-instance evaluation cost of 261 operations:

Complete enumeration: $10^9 \times 261 = 2.61 \times 10^{11}$ operations. Optimal sampling: $10^4 \times \log(10^4) \times 261 \approx 2.4 \times 10^7$ operations.

Speedup: over $1000\times$ (about $10^4\times$ in this example).

7.2 Critical Sampling Fraction

For the DEL example, with $R = |S|/|U| = 10^9/10^4 = 10^5$:

$$\alpha_c = \frac{\log(10^4)}{10^5} \approx 9.2 \times 10^{-5}$$

We need to sample only about $0.01\%$ of the library for complete coverage.

7.3 Confidence Bounds

Theorem 7.1 (Finite Sample Bounds)

For $n$ samples with $n = |U|\left(\log|U| + \log(1/\delta)\right)$:

$$P(\text{complete coverage}) \geq 1 - \delta$$

For $99.9\%$ confidence ($\delta = 0.001$):

$$n = 10^4\left(\log(10^4) + \log(1000)\right) = 10^4 \times 16.1 \approx 161{,}000$$

Still roughly $6000\times$ fewer samples than complete enumeration.
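
The arithmetic in this section can be reproduced directly; the sketch below uses the DEL parameters from Section 7.1 (|S| = 10⁹, |U| = 10⁴), which are themselves illustrative worked-example values.

```python
# Sketch: reproduce the Section 7 DEL numbers (|S| = 1e9, |U| = 1e4 are the worked-example values).
import math

S, U = 10**9, 10**4
R = S / U                                          # redundancy = 1e5

alpha_c = math.log(U) / R                          # fraction needed for complete coverage
n_basic = U * math.log(U)                          # ~ |U| ln|U| samples
delta = 0.001
n_conf = U * (math.log(U) + math.log(1 / delta))   # Theorem 7.1 budget for 99.9% confidence

print(f"alpha_c ~ {alpha_c:.1e} ({alpha_c:.4%} of the library)")
print(f"basic budget ~ {n_basic:,.0f} samples, speedup ~ {S / n_basic:,.0f}x")
print(f"99.9% confidence budget ~ {n_conf:,.0f} samples, speedup ~ {S / n_conf:,.0f}x")
```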

8. Extensions and Open Problems

8.1 Non-Uniform Distributions

The analysis assumes uniform distribution of equivalence classes. For non-uniform cases:

Open Problem: Characterize phase transitions for power-law distributions of equivalence class sizes.

8.2 Correlated Samples

When samples are correlated (e.g., similar molecules appear together):

Open Problem: How does correlation structure affect optimal sampling strategies?

8.3 Active Learning

Open Problem: Can we use machine learning to predict which regions of chemical space to sample next?

8.4 Quantum Speedup

Conjecture: Quantum sampling could achieve O(√|U|) sample complexity using amplitude amplification.

9. Connections to Other Fields

9.1 Compressed Sensing

Our framework connects to compressed sensing—recovering high-dimensional signals from few measurements. The quotient structure provides the sparsity needed for efficient recovery.

9.2 Multi-Armed Bandits

The exploration-exploitation trade-off maps to multi-armed bandit problems, where each region of chemical space is an “arm” with unknown reward.

9.3 Statistical Physics

Phase transitions in coverage mirror percolation transitions in statistical physics, suggesting deep connections between chemical sampling and critical phenomena.

10. Conclusions

10.1 Summary of Results

We have established:

  1. Fundamental bounds: Information theory limits sample complexity to Ω(|U|H(P)/log|U|)
  2. Sharp transitions: Coverage exhibits a phase transition at the critical fraction α_c = log(1/ε)/R
  3. Adaptive advantage: Logarithmic improvement over uniform sampling
  4. Optimal design: Submodular structure enables approximation algorithms
  5. Batch independence: Optimal batch size independent of library size under high redundancy

10.2 Practical Impact

The framework provides concrete experimental prescriptions: critical sampling fractions, sample budgets with confidence guarantees, greedy allocation rules for multi-modal experiments, and optimal batch sizes.

10.3 Theoretical Contributions

Beyond practical speedups, the work reveals sharp phase transitions in chemical coverage, a logarithmic advantage of adaptive over uniform sampling, and structural connections to percolation, compressed sensing, and multi-armed bandits.

10.4 Future Directions

The mathematical framework opens several avenues:

  1. Machine learning integration for predictive sampling
  2. Quantum algorithms for further speedup
  3. Online optimization for streaming chemical data
  4. Transfer learning across chemical libraries

The synthesis of information theory, phase transitions, and experimental design provides a principled foundation for efficient exploration of vast combinatorial spaces—a challenge central to modern chemical discovery.

References

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley, 2006.

[2] D. J. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge: Cambridge University Press, 2003.

[3] P. Erdős and A. Rényi, “On a classical problem of probability theory,” Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei, vol. 6, pp. 215-220, 1961.

[4] D. J. Newman and L. Shepp, “The double dixie cup problem,” American Mathematical Monthly, vol. 67, no. 1, pp. 58-61, 1960.

[5] D. Achlioptas and A. Naor, “The two possible values of the chromatic number of a random graph,” Annals of Mathematics, vol. 162, no. 3, pp. 1335-1351, 2005.

[6] B. Bollobás, Random Graphs, 2nd ed. Cambridge: Cambridge University Press, 2001.

[7] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions,” Mathematical Programming, vol. 14, no. 1, pp. 265-294, 1978.

[8] A. Krause and D. Golovin, “Submodular Function Maximization,” in Tractability: Practical Approaches to Hard Problems. Cambridge: Cambridge University Press, 2014.

[9] K. Chaloner and I. Verdinelli, “Bayesian experimental design: A review,” Statistical Science, vol. 10, no. 3, pp. 273-304, 1995.

[10] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527-535, 1952.

[11] S. Brenner and R. A. Lerner, “Encoded combinatorial chemistry,” Proceedings of the National Academy of Sciences, vol. 89, no. 12, pp. 5381-5383, 1992.

[12] R. M. Franzini et al., “DNA-encoded chemical libraries,” Annual Review of Biochemistry, vol. 83, pp. 727-743, 2014.


This framework demonstrates that accepting information constraints—sampling rather than enumeration—and applying rigorous mathematics yields algorithms that are both theoretically optimal and practically superior.