Adam Murray - Research Explorations

Graph-Theoretic Validation of Combinatorial Synthesis via Signal Processing

A mathematically rigorous framework for validating DNA-encoded library synthesis using only one-dimensional chromatographic signals and graph-theoretic constraints, without requiring mass spectrometry or molecular structure information.

Abstract

This framework provides a mathematical approach for validating combinatorial chemical synthesis using minimal information inputs. By design, the system operates without mass spectrometry or molecular structure information, relying instead on rigorous application of graph theory, discrete Morse theory, and statistical inference to one-dimensional chromatographic signals. The framework proves that truncation relationships in combinatorial libraries form a directed acyclic graph (DAG), establishes the mathematical foundations for using discrete Morse theory in chromatographic peak detection, and demonstrates how global constraint propagation through the DAG enables robust peak classification. This approach achieves synthesis validation through mathematical rigor rather than information abundance, proving that careful application of pure mathematics can solve underdetermined chemical analysis problems.

1. Introduction

1.1 The Synthesis Validation Problem

DNA-encoded libraries (DELs) enable parallel synthesis of millions of compounds through combinatorial chemistry. Each synthesis position can either successfully add its intended building block or fail (truncation), creating a complex mixture of products and truncation byproducts. The validation challenge: determine which syntheses succeeded using analytical chemistry.

Traditional approaches employ information-rich methods, chiefly mass spectrometry and molecular structure characterization.

1.2 The Minimalist Approach

This work takes a deliberately constrained approach, using only one-dimensional chromatographic signals and the graph-theoretic structure of truncation relationships.

This constraint-by-design philosophy forces mathematical rigor to compensate for limited observables.

1.3 Key Contributions

This work establishes:

  1. Proof that combinatorial truncation relationships form a DAG

  2. Mathematical justification for discrete Morse theory in peak detection
  3. Statistical framework using Poisson models for significance testing
  4. Optimal assignment via the Hungarian algorithm
  5. Convergence analysis of positional variants to chemical identities

2. Mathematical Foundations

2.1 The Truncation Graph Structure

Definition 2.1 (Truncation Relation)

For compounds A and B in a combinatorial library, A ⪯ B (A is a truncation of B) if A can be obtained from B by replacing zero or more building blocks with null.

Theorem 2.1 (DAG Structure)

The truncation relation ⪯ forms a directed acyclic graph.

Proof :

The relation ⪯ defines a partial order:

Reflexivity: A ⪯ A (every compound truncates to itself by replacing zero blocks)

Antisymmetry: If A ⪯ B and B ⪯ A, then A = B

  • A ⪯ B implies |A| ≤ |B| (truncation never increases the number of non-null blocks)
  • B ⪯ A implies |B| ≤ |A|
  • Therefore |A| = |B|
  • With equal lengths and mutual truncation, no block was nulled, so A = B

Transitivity: If A ⪯ B and B ⪯ C, then A ⪯ C

  • Truncation operations compose: nulling blocks of a compound obtained by nulling blocks of C is itself a nulling of blocks of C

Acyclicity: Assume a cycle A → B → … → A of proper truncations exists

  • Each edge represents a proper truncation (at least one block replaced with null)
  • Following the cycle: |A| > |B| > … > |A|
  • Contradiction: |A| > |A|

Therefore, the truncation relation forms a DAG.
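The construction behind the proof can be sketched in a few lines. This is a minimal illustration, assuming building blocks are strings and null positions are `None`; the function names (`truncations`, `edges`) are hypothetical, not part of the framework:

```python
# Enumerate the truncation DAG for one maximal compound and verify that
# every proper-truncation edge strictly decreases the non-null block count,
# which rules out directed cycles (Theorem 2.1).
from itertools import combinations

def truncations(compound):
    """All positional sequences obtained by nulling zero or more blocks."""
    positions = [i for i, b in enumerate(compound) if b is not None]
    result = set()
    for r in range(len(positions) + 1):
        for nulled in combinations(positions, r):
            result.add(tuple(None if i in nulled else b
                             for i, b in enumerate(compound)))
    return result

def edges(vertices):
    """Directed edges B -> A for each single-block truncation."""
    out = []
    for v in vertices:
        for i, b in enumerate(v):
            if b is not None:
                child = tuple(None if j == i else x for j, x in enumerate(v))
                out.append((v, child))
    return out

full = ("Ala", "Gly", "Val")
verts = truncations(full)            # 2^3 = 8 positional sequences
assert len(verts) == 8

count = lambda v: sum(x is not None for x in v)
# Each edge strictly decreases the non-null count, so the graph is acyclic.
assert all(count(a) > count(b) for a, b in edges(verts))
```

The same enumeration also exhibits the O(N × 2^B) vertex count derived in Section 6.1.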

2.2 Chemical Identity vs. Positional Identity

Definition 2.2 (Positional Sequence)

An n-tuple where position i contains building block bᵢ or null.

Definition 2.3 (Chemical Identity)

The actual molecule, independent of synthesis position.

Example
  • Positional: [Val, Null, Null], [Null, Val, Null], [Null, Null, Val] (different)
  • Chemical: All three are “Val” (same molecule)

Theorem 2.2 (Convergence Property)

Multiple positional sequences can represent the same chemical identity, creating convergence in the DAG.

Proof :

Consider positional sequences P₁ = [b, Null, Null] and P₂ = [Null, b, Null].

  • Chemically, both represent the single building block b
  • In the DAG, P₁ and P₂ are distinct vertices
  • Both map to the same chemical entity b
  • This creates convergence (diamond patterns) without violating acyclicity

The convergence is not a cycle because edges only point toward simpler structures.

2.3 Equivalence Classes

Definition 2.4 (Equivalence Class)

The set of all positional sequences with the same chemical identity.

Theorem 2.3 (Partition Property)

Chemical equivalence partitions the positional sequence space into disjoint equivalence classes.

Proof :

Define the relation ∼: P₁ ∼ P₂ iff P₁ and P₂ have the same chemical identity.

Reflexivity: P ∼ P (same chemical identity as itself)
Symmetry: P₁ ∼ P₂ ⟹ P₂ ∼ P₁ (chemical identity comparison is symmetric)
Transitivity: P₁ ∼ P₂ and P₂ ∼ P₃ ⟹ P₁ ∼ P₃

Therefore ∼ is an equivalence relation, partitioning the space into disjoint classes.
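Definitions 2.2–2.4 can be illustrated by grouping positional sequences under a `chemical_identity` map. Taking the identity to be the ordered tuple of non-null blocks is a simplifying assumption for illustration, not the framework's definition:

```python
# Group positional sequences into equivalence classes by chemical identity.
# Three distinct DAG vertices converge to the single identity ("Val",),
# demonstrating Theorem 2.2 without creating a cycle.
def chemical_identity(seq):
    """Illustrative identity map: the ordered tuple of non-null blocks."""
    return tuple(b for b in seq if b is not None)

seqs = [("Val", None, None), (None, "Val", None), (None, None, "Val"),
        ("Val", "Gly", None)]
classes = {}
for s in seqs:
    classes.setdefault(chemical_identity(s), []).append(s)

assert len(classes[("Val",)]) == 3      # three positional variants, one molecule
assert len(classes[("Val", "Gly")]) == 1
```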

3. Signal Processing Framework

3.1 Discrete Morse Theory for Peak Detection

Definition 3.1 (Discrete Morse Function)

For a discrete signal s: ℤ → ℝ, position i is a critical point if:

  • Local maximum: s[i] > s[i−1] and s[i] ≥ s[i+1]
  • Local minimum: s[i] < s[i−1] and s[i] ≤ s[i+1]

Theorem 3.1 (Topological Invariance)

Critical points of discrete Morse functions are invariant under strictly increasing transformations of the signal.

Proof :

Let f: ℝ → ℝ be strictly increasing. For signal s, consider the transformed signal t = f ∘ s.

At position ii:

  • s[i] > s[i−1] ⟹ f(s[i]) > f(s[i−1]) ⟹ t[i] > t[i−1]
  • s[i] ≥ s[i+1] ⟹ f(s[i]) ≥ f(s[i+1]) ⟹ t[i] ≥ t[i+1]

Critical points are preserved.

Corollary

Peak detection via discrete Morse theory is robust to any strictly increasing rescaling of the signal, such as detector nonlinearity or gain drift.
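Definition 3.1 and Theorem 3.1 admit a direct sketch; the signal values below are illustrative:

```python
# Find discrete Morse local maxima and check that a strictly increasing
# transform (here exp) leaves the critical-point set unchanged.
import math

def local_maxima(s):
    """Indices i with s[i] > s[i-1] and s[i] >= s[i+1] (Definition 3.1)."""
    return [i for i in range(1, len(s) - 1)
            if s[i] > s[i - 1] and s[i] >= s[i + 1]]

signal = [1, 3, 2, 2, 5, 4, 1]
peaks = local_maxima(signal)                        # maxima at indices 1 and 4
transformed = [math.exp(0.5 * x) for x in signal]   # strictly increasing f
assert peaks == local_maxima(transformed) == [1, 4]
```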

3.2 Poisson Statistical Model

Assumption: Chromatographic detector counts follow Poisson statistics (reasonable for photon/ion counting).

Definition 3.2 (Poisson Z-score)

For an observed count x with background rate λ:

Z = (x − λ) / √(λ + ε)

where ε > 0 prevents division by zero when λ = 0.

Theorem 3.2 (False Positive Bound)

For threshold Zₜ, the false positive rate is bounded by:

P(Z > Zₜ | H₀) ≤ Φ(−Zₜ)

where Φ is the standard normal CDF and H₀ is the null hypothesis (no peak).

Proof :

Under H₀, x ∼ Poisson(λ). For large λ, Poisson(λ) ≈ N(λ, λ) by the central limit theorem, so Z ≈ N(0, 1) under H₀. Hence P(Z > Zₜ) = 1 − Φ(Zₜ) = Φ(−Zₜ), and the bound holds in the large-λ limit.

3.3 Peak Detection Pipeline

Algorithm: Combined Morse-Poisson Peak Detection

  1. Find all local maxima via discrete Morse theory
  2. Compute Poisson Z-scores for each maximum
  3. Retain peaks with Z > Z_threshold
  4. Calculate prominence for chromatographic significance
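Steps 1–3 above can be sketched as follows; the background rate and threshold are illustrative assumptions, and the prominence step is omitted for brevity:

```python
# Combined Morse-Poisson detection: enumerate topological maxima, then
# keep only those whose Poisson Z-score exceeds the threshold.
import math

def detect_peaks(signal, background, z_threshold=3.0, eps=1e-9):
    # Step 1: discrete Morse local maxima (Definition 3.1)
    maxima = [i for i in range(1, len(signal) - 1)
              if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]]
    # Steps 2-3: Poisson Z-score filter (Definition 3.2)
    def z(x):
        return (x - background) / math.sqrt(background + eps)
    return [i for i in maxima if z(signal[i]) > z_threshold]

# Background rate ~4 counts: a bump to 30 is significant, a bump to 6 is not.
signal = [4, 5, 4, 30, 4, 6, 4]
assert detect_peaks(signal, background=4.0) == [3]
```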

Theorem 3.3 (Completeness)

The algorithm finds all statistically significant topological peaks.

Proof :
  • Morse theory finds all local maxima (complete enumeration)
  • Poisson filtering removes only statistically insignificant peaks
  • No true peaks are missed, provided they exceed the noise threshold

4. Global Constraint Propagation

4.1 The Classification Problem

Given detected peaks in related compounds, classify each peak as a NULL peak (inherited from L₀), a product peak, a truncation peak, or UNKNOWN when the constraints conflict.

4.2 Constraint Propagation Algorithm

Definition 4.1 (L₀)

The minimal element of the DAG, representing the fully-null compound.

Theorem 4.1 (NULL Peak Inheritance)

All compounds must contain peaks corresponding to L₀'s peaks.

Proof :

L₀ is a truncation of every compound (replace all blocks with null). Any chemical species in L₀ must also be present in all compounds. Therefore, L₀'s peaks appear in every chromatogram.

Algorithm: Bottom-Up Constraint Propagation

  1. Start from L₀ (the minimal element)
  2. Identify NULL peaks (present in L₀)
  3. For each compound in topological order:
     - Inherit NULL peaks from L₀
     - Match peaks with descendants via the Hungarian algorithm
     - Classify based on retention time constraints

4.3 The Hungarian Algorithm Application

Problem: Optimally assign peaks between related compounds.

Formulation: Bipartite matching with cost matrix C[i, j] = |RTᵢ − RTⱼ|

Theorem 4.2 (Optimal Assignment)

The Hungarian algorithm finds the globally optimal peak assignment minimizing total retention time deviation in O(n³) time.

Proof :

The Hungarian algorithm solves the assignment problem optimally. Our cost matrix satisfies its requirements (non-negative, real-valued). The solution minimizes ∑ᵢ C[i, π(i)] over all permutations π.
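The objective can be illustrated with a brute-force search over permutations, feasible only for a handful of peaks; production code would use an O(n³) Hungarian solver such as `scipy.optimize.linear_sum_assignment`. The retention times below are illustrative:

```python
# Peak assignment minimizing total |RT_i - RT_j| deviation. Brute force
# over all n! permutations, shown only to make the objective concrete.
from itertools import permutations

def best_assignment(rt_parent, rt_child):
    """Permutation p minimizing sum |rt_parent[i] - rt_child[p[i]]|."""
    n = len(rt_parent)
    return min(permutations(range(n)),
               key=lambda p: sum(abs(rt_parent[i] - rt_child[p[i]])
                                 for i in range(n)))

# Retention times in minutes: the nearest-neighbour pairing wins.
parent = [2.1, 5.0, 7.4]
child = [7.5, 2.0, 5.1]
assert best_assignment(parent, child) == (1, 2, 0)
```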

5. Chemical Assumptions and Validity

5.1 The Additivity Assumption

Assumption 5.1 (Retention Time Additivity)

RT(compound) ≈ ∑ᵢ RT(blockᵢ) + ε

where ε is small noise.

Justification: Valid for:

  • Peptides in reverse-phase LC (hydrophobicity is approximately additive)
  • Building blocks without strong interactions
  • Linear synthesis without cyclization

Theorem 5.1 (Retention Order)

Under additivity, products elute after their truncations.

Proof :

For a compound C with blocks {b₁, …, bₙ} and a truncation T missing block bᵢ:

RT(C) ≈ ∑ⱼ RT(bⱼ)
RT(T) ≈ ∑ⱼ RT(bⱼ) − RT(bᵢ)
RT(C) − RT(T) ≈ RT(bᵢ) > 0

Therefore RT(C) > RT(T).
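The derivation can be checked numerically under the additivity assumption; the block retention times below are illustrative values, not measured data:

```python
# Under additivity, a product elutes after any of its truncations
# (Theorem 5.1): the RT difference is exactly the missing block's RT.
block_rt = {"Ala": 1.2, "Gly": 0.8, "Val": 2.5}   # minutes, all positive

def rt(blocks):
    """Additive retention time of a positional sequence (nulls contribute 0)."""
    return sum(block_rt[b] for b in blocks if b is not None)

product = ("Ala", "Gly", "Val")
truncation = ("Ala", None, "Val")                  # Gly coupling failed

assert abs(rt(product) - rt(truncation) - block_rt["Gly"]) < 1e-9
assert rt(product) > rt(truncation)
```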

5.2 Pre-validation Assumption

Assumption 5.2 (Pre-validation)

Building blocks are validated before library synthesis.

Implication: Intra-block truncations represent coupling failures, not defective blocks.

Consequence: Block-level analysis captures the primary failure modes.

6. Complexity Analysis

6.1 Graph Construction

Theorem 6.1: DAG construction has complexity O(N × 2^B), where N is the number of maximal compounds and B is the maximum number of blocks.

Proof :

Each maximal compound generates at most 2^B truncations (each block present or absent). For N maximal compounds this gives O(N × 2^B) vertices. Edge construction is O(V²) in the worst case, but the graph is sparse in practice.

6.2 Signal Processing

Theorem 6.2: Peak detection complexity is O(n) per chromatogram, where n is the signal length.

Proof :
  • Discrete Morse theory: single pass to find local maxima, O(n)
  • Poisson Z-score: O(1) per peak, O(p) total where p ≪ n
  • Prominence calculation: O(n) single pass
  • Total: O(n) per signal

6.3 Constraint Propagation

Theorem 6.3: Global classification has complexity O(V × p³), where V is the number of compounds and p is the maximum number of peaks per compound.

Proof :
  • Topological sort: O(V + E)
  • Per compound: Hungarian algorithm, O(p³)
  • Total: O(V × p³)

7. Theoretical Guarantees

7.1 Soundness

Theorem 7.1 (Classification Soundness)

The constraint propagation algorithm never produces inconsistent classifications.

Proof :

The algorithm only propagates constraints that must hold:

  • NULL peaks inherited from L₀ (Theorem 4.1)
  • Retention order constraints (Theorem 5.1)
  • No conflicting assignments possible (Hungarian optimality)

The DAG structure ensures no circular dependencies.

7.2 Completeness Under Assumptions

Theorem 7.2 (Conditional Completeness)

Given perfect additivity and sufficient peak separation, the algorithm correctly classifies all peaks.

Proof sketch: Under perfect conditions:

  • All peaks are detected (Morse completeness)
  • Retention order is preserved (additivity)
  • Optimal assignment is unique (Hungarian algorithm)
  • Constraints determine unique classification

7.3 Robustness

Theorem 7.3 (Graceful Degradation)

Violations of assumptions lead to “UNKNOWN” classifications, not incorrect classifications.

Proof :

When constraints conflict:

  • Algorithm detects inconsistency
  • Assigns “UNKNOWN” rather than forcing classification
  • Never produces provably incorrect result

8. Information-Theoretic Perspective

8.1 The Information Deficit

Observation: The system operates under severe information constraints: no mass spectra, no molecular structures, and only one-dimensional retention-time signals.

8.2 Information Recovery

Theorem 8.1 (Information Sufficiency)

The DAG structure plus 1D signals contain sufficient information for synthesis validation under the stated assumptions.

Proof sketch: Information sources:

  1. Topological peaks (presence/absence of compounds)
  2. Retention order (truncation relationships)
  3. Peak inheritance (NULL reference)
  4. DAG constraints (global consistency)

Together, these constrain the solution space to enable validation.

9. Practical Implications

9.1 Why Constraints by Design?

The deliberate constraints remove the dependence on mass spectrometry and force mathematical rigor to compensate for the limited observables (Section 1.2).

9.2 When the Framework Applies

This approach is suitable when retention times are approximately additive (Assumption 5.1), building blocks are pre-validated (Assumption 5.2), and peaks are sufficiently separated for reliable detection.

9.3 Limitations

The framework assumes retention-time additivity, pre-validated building blocks, and adequate chromatographic resolution; strong molecular interactions or cyclization violate the first of these (Section 5.1).

10. Related Work

10.1 Traditional Approaches

Most synthesis validation relies on mass spectrometry and molecular structure information.

This framework proves these are not necessary.

10.2 Mathematical Innovations

The novel applications include discrete Morse theory for chromatographic peak detection, constraint propagation over a truncation DAG, and Hungarian-algorithm peak assignment.

10.3 Theoretical Connections

The framework connects graph theory, discrete Morse theory, statistical inference, and combinatorial optimization in a single validation pipeline.

11. Open Problems

11.1 Theoretical Questions

  1. Optimal truncation boundary: Can chromatographic theory provide principled boundary settings?

  2. Minimal information requirements: What is the information-theoretic minimum for synthesis validation?

  3. Convergence rates: How quickly does constraint propagation converge?

11.2 Extensions

  1. Non-additive retention: How to handle strong molecular interactions?

  2. Partial information: Can mass or structure information be incorporated when available?

  3. Online analysis: Can the framework operate in streaming mode?

12. Conclusions

12.1 Summary

This framework demonstrates that rigorous mathematics can solve underdetermined chemical problems. By deliberately constraining available information, the approach forces mathematical rigor that paradoxically makes it more robust than information-rich approaches.

Key insights:

  • Truncation relationships form a DAG, enabling global constraint propagation
  • Discrete Morse theory gives peak detection invariant to monotonic signal transforms
  • Poisson statistics bound the false positive rate
  • Violated assumptions degrade to UNKNOWN, never to incorrect classifications

12.2 Broader Implications

This work exemplifies a philosophy: mathematical rigor over information abundance. Many chemical analysis problems may be solvable with less information but more mathematics.

The framework proves that careful application of graph theory, discrete Morse theory, and statistical inference can extract sufficient information from minimal observations.

12.3 Future Directions

The mathematical framework suggests extensions to non-additive retention models, incorporation of partial mass or structure information when available, and streaming (online) analysis (Section 11).

The core insight—that mathematical structure compensates for limited observations—has broad applicability in analytical chemistry.



This framework demonstrates that mathematical rigor can replace information abundance in chemical analysis, achieving robust synthesis validation through graph theory and signal processing alone.