Adam Murray - Research Explorations

Generalized Adduct Intervals Problem: A Formal Mathematical Proof

A rigorous mathematical framework establishing optimal mass spacing strategies for mass spectrometry-based detection of combinatorial cyclic peptide libraries, providing provable upper bounds for multi-objective evolutionary algorithms in drug discovery.

Generalized Adduct Intervals Problem

Abstract

This work presents a rigorous mathematical framework for the Generalized Adduct Intervals Problem, establishing optimal mass spacing strategies for mass spectrometry-based detection of combinatorial cyclic peptide libraries. The framework provides provable upper bounds on library size that are essential for multi-objective evolutionary algorithms in drug discovery applications. By bridging algorithmic optimization theory with analytical chemistry constraints, this work enables principled design of large-scale peptide libraries while guaranteeing unique MS identification of each library member.

Interactive Exploration

Explore the Adduct Intervals Problem interactively below. Adjust ionization modes, mass ranges, and resolution to see how parameter choices affect library size and interval spacing. This visualizer demonstrates the core optimization challenge addressed by the mathematical framework presented in this work.

Computed Parameters

Number of peptides (n):0
Spacing (δ):0.000 Da
Critical separation (κ):0
Valid configuration:✓ No overlaps

1. Background and Motivation

1.1 The Cyclic Peptide Library Challenge

Cyclic peptides have emerged as powerful therapeutic candidates due to their high biological activity, selectivity, target affinity, and proteolytic stability. Large combinatorial libraries of cyclic peptides can be synthesized using split-and-pool methods and screened against biological targets. However, a critical bottleneck exists: cyclic peptides produce complex fragment ions in mass spectrometry, making post-screening sequence determination problematic.

The fundamental question this work addresses is: Given mass spectrometry detection constraints, what is the maximum number of distinct cyclic peptides that can be included in a combinatorial library while ensuring each can be uniquely identified?

1.2 Mass Spectrometry Detection: The Physical Foundation

1.2.1 Fundamental Principle

Mass spectrometry detects charged species by measuring their mass-to-charge ratios in electromagnetic fields. Neutral molecules cannot be detected—ionization is mandatory, not optional.

Critical Implication: For a molecule with bare (neutral) mass mm:

1.2.2 Adduct Formation Process

During ionization, molecules acquire charge by forming adducts—complexes with charged species from the ionization environment. A single molecule with bare mass mm can form multiple distinct adduct species simultaneously, each appearing at a different mass-to-charge ratio.

General phenomenon:

For adduct set A={a1,a2,,ak}A = \{a_1, a_2, \ldots, a_k\}:

Example for clarification:

Consider two peptides with H⁺ adduct (a1=1.008a_1 = 1.008 Da):

Even though mB=501.0m_B = 501.0 numerically equals the observable peak mA+a1=501.008m_A + a_1 = 501.008, no ambiguity arises because mBm_B itself is invisible to the instrument.

1.2.3 Detection Uncertainty

Each observed peak has intrinsic width ±T\pm T due to:

1.2.4 The Library Identification Challenge

For a combinatorial library with nn peptides having bare masses M={m0,m1,,mn1}M = \{m_0, m_1, \ldots, m_{n-1}\}:

Challenge: Maximize nn subject to non-overlapping detection intervals for all adduct peaks.

1.2.5 Ionization Method Examples

Different ionization techniques produce characteristic adduct sets:

MethodCommon AdductsTypical k
ESIH⁺, Na⁺, K⁺, NH₄⁺3-4
APCIH⁺, (H₂O)H⁺1-2
MALDIH⁺, Na⁺, K⁺, matrix3-5
Multiply-chargedVarious z statesVariable

The framework presented applies to any adduct set AA, regardless of ionization method or chemical composition.

1.3 Multi-Objective Optimization Context

In designing combinatorial peptide libraries, multiple competing objectives must be balanced. Example objectives include:

  1. Maximize library diversity (number of unique peptides)
  2. Maximize chemical space coverage (structural diversity)
  3. Maximize synthetic accessibility (cost and feasibility)
  4. Optimize cheminformatic descriptors (e.g., logP, TPSA, rotatable bonds, hydrogen bond donors/acceptors)
  5. Maximize sequence diversity (residue composition, stereochemistry)

The specific objectives and their relative importance depend on the therapeutic target, screening strategy, and development stage. For beyond Rule of Five (BRoF) drugs like cyclic peptides, objectives often emphasize chemical space exploration over traditional small molecule drug-likeness metrics. This mathematical framework provides the theoretical upper bound for library size (objective 1), which serves as a critical constraint in multi-objective evolutionary algorithms (MOEAs) such as NSGA-III, regardless of which additional objectives are included.

1.4 Novel Contributions

This work uniquely:

2. Problem Statement and Theoretical Foundation

The core theoretical innovation of this work is the derivation of a formal mathematical proof for the Generalized Adduct Intervals Problem, which provides a complete mathematical foundation for mass differentiation optimization with arbitrary adduct sets.

2.1 Given Inputs

Physical Interpretation of Inputs:

2.2 Mathematical Formulation

2.2.1 Design Variables and Observables

Design Variables: Bare (neutral) peptide masses M={m0,m1,,mn}M = \{m_0, m_1, \ldots, m_n\}

Observable Quantities: Peak intervals Ij(mi)I_j(m_i) for all (i,j)(i,j) pairs

2.2.2 Objective and Constraints

Objective: Construct mass set MM to maximize M|M| (library size)

Constraint 1 (Interval Definition): Each designed mass mim_i produces kk observable peaks. Peak jj from mass ii creates interval:

Ij(mi)=(mi+ajT,mi+aj+T)I_j(m_i) = (m_i + a_j - T, m_i + a_j + T)

Physical meaning: Any signal detected in this interval could be peptide ii with adduct jj.

Constraint 2 (Non-Overlap): All observable intervals must be pairwise disjoint:

(i,j)(i,j),Ij(mi)Ij(mi)=\forall (i,j) \neq (i',j'), \quad I_j(m_i) \cap I_{j'}(m_{i'}) = \emptyset

Physical meaning: Unique identification—no ambiguity about which peptide-adduct pair produced each observed peak.

Constraint 3 (Range Coverage): All observable peaks must fall within detection range:

i{0,,n},  j{1,,k},Ij(mi)R\forall i \in \{0, \ldots, n\}, \; \forall j \in \{1, \ldots, k\}, \quad I_j(m_i) \subseteq R

Physical meaning: All adduct forms of all peptides are detectable by the instrument.

Note on Bare Mass Overlaps:

The non-overlap constraint applies only to the knkn observable intervals, not the nn bare masses. Bare masses can overlap with:

This is permissible because P(observing mi)=0P(\text{observing } m_i) = 0 for all ii.

2.3 Validity Conditions

For the construction to accommodate at least one mass, we require: UL>aka1+2TU - L > a_k - a_1 + 2T

Proof of Necessity:

If ULaka1+2TU - L \leq a_k - a_1 + 2T, consider any mass mm that could fit both its smallest and largest adduct intervals within RR.

The span required for all adduct intervals of mass mm is: Span=[m+a1T,m+ak+T]\text{Span} = [m + a_1 - T, m + a_k + T] Width=(m+ak+T)(m+a1T)=aka1+2T\text{Width} = (m + a_k + T) - (m + a_1 - T) = a_k - a_1 + 2T

For this to fit in R=[L,U]R = [L, U]: ULaka1+2TU - L \geq a_k - a_1 + 2T

Strict inequality ensures at least one mass can be accommodated: UL>aka1+2TU - L > a_k - a_1 + 2T \quad \square

2.4 Applicability

This formulation generalizes beyond specific cases (such as three-adduct systems with [M+H]+[\text{M+H}]^+, [M+Na]+[\text{M+Na}]^+, [M+K]+[\text{M+K}]^+) to provide optimal solutions for arbitrary adduct sets, applicable to diverse mass spectrometry ionization conditions and instrumental configurations.

3. Fundamental Theorems

Theorem 3.1 (Optimal Spacing for Same-Adduct Non-Overlap)

Statement: For intervals of the same adduct type from consecutive masses to be non-overlapping, the minimum spacing between consecutive masses is δ=2T\delta = 2T.

Proof :

Consider intervals Ij(mi)I_j(m_i) and Ij(mi+1)I_j(m_{i+1}) for some fixed j{1,2,,k}j \in \{1, 2, \ldots, k\}.

Ij(mi)=(mi+ajT,mi+aj+T)I_j(m_i) = (m_i + a_j - T, m_i + a_j + T) Ij(mi+1)=(mi+1+ajT,mi+1+aj+T)I_j(m_{i+1}) = (m_{i+1} + a_j - T, m_{i+1} + a_j + T)

For non-overlap, we require: mi+aj+Tmi+1+ajTm_i + a_j + T \leq m_{i+1} + a_j - T

Simplifying: mi+Tmi+1Tm_i + T \leq m_{i+1} - T mi+1mi2Tm_{i+1} - m_i \geq 2T

Let δ=mi+1mi\delta = m_{i+1} - m_i be the spacing between consecutive masses. Then δ2T\delta \geq 2T.

For optimal density (maximum number of masses), we choose the minimum allowable spacing: δ=2T\boxed{\delta = 2T}

Optimality Argument:

The spacing δ=2T\delta = 2T is not merely necessary but also sufficient:

  • Necessity: Any δ<2T\delta' < 2T causes consecutive same-adduct intervals to overlap
  • Sufficiency: With δ=2T\delta = 2T, same-adduct intervals exactly touch at boundaries
  • Uniqueness: Any δ>2T\delta' > 2T wastes space, reducing maximum library size

Therefore, δ=2T\delta = 2T is the unique optimal spacing for maximizing density under same-adduct constraints.

Theorem 3.2 (Critical Constraint for Cross-Adduct Non-Overlap)

Statement: The critical constraint determining the maximum number of consecutive masses occurs between Ik(mi)I_k(m_i) and I1(mi+κ)I_1(m_{i+\kappa}) for some κ1\kappa \geq 1.

Proof :

Consider any two intervals Ij(mi)I_j(m_i) and Ij(mi)I_{j'}(m_{i'}) where (i,j)(i,j)(i,j) \neq (i',j') and i<ii < i'.

For non-overlap: mi+aj+T<mi+ajTm_i + a_j + T < m_{i'} + a_{j'} - T

This gives us: mimi>ajaj+2Tm_{i'} - m_i > a_j - a_{j'} + 2T

To find the most restrictive constraint, we need to maximize the right-hand side. This occurs when:

  • aja_j is maximized: aj=aka_j = a_k (largest adduct)
  • aja_{j'} is minimized: aj=a1a_{j'} = a_1 (smallest adduct)
  • iii' - i is minimized subject to satisfying the constraint

Therefore, the critical constraint is: mi+κmi>aka1+2Tm_{i+\kappa} - m_i > a_k - a_1 + 2T

where κ\kappa is the minimum integer such that this constraint is satisfied.

Lemma 3.3 (Determination of Critical Parameter )

Statement: With uniform spacing δ=2T\delta = 2T, the critical parameter is: κ=aka12T+1\kappa = \left\lceil\frac{a_k - a_1}{2T} + 1\right\rceil

Proof :

With uniform spacing δ=2T\delta = 2T, we have: mi+κ=mi+κ2Tm_{i+\kappa} = m_i + \kappa \cdot 2T

The critical constraint from Theorem 2.2 becomes: mi+κ2Tmi>aka1+2Tm_i + \kappa \cdot 2T - m_i > a_k - a_1 + 2T κ2T>aka1+2T\kappa \cdot 2T > a_k - a_1 + 2T κ>aka1+2T2T=aka12T+1\kappa > \frac{a_k - a_1 + 2T}{2T} = \frac{a_k - a_1}{2T} + 1

Since κ\kappa must be a positive integer: κ=aka12T+1\boxed{\kappa = \left\lceil\frac{a_k - a_1}{2T} + 1\right\rceil}

Physical Interpretation: κ\kappa represents the minimum separation (in number of masses) required between two masses to ensure their most extreme adduct intervals don’t overlap.

Theorem 3.4 (Optimal Offset for Boundary Alignment)

Statement: To maximize range utilization by ensuring the leftmost interval starts at LL, the optimal offset is: ε=Ta1\varepsilon = T - a_1

Proof :

Let mi=L+iδ+εm_i = L + i \cdot \delta + \varepsilon for i=0,1,2,i = 0, 1, 2, \ldots

The leftmost interval is I1(m0)=(m0+a1T,m0+a1+T)I_1(m_0) = (m_0 + a_1 - T, m_0 + a_1 + T).

For this interval to start exactly at LL: m0+a1T=Lm_0 + a_1 - T = L L+ε+a1T=LL + \varepsilon + a_1 - T = L ε=Ta1\boxed{\varepsilon = T - a_1}

Theorem 3.5 (Explicit Mass Construction Formula)

Statement: With optimal parameters δ=2T\delta = 2T and ε=Ta1\varepsilon = T - a_1, the mass values are: M={L+Ta1,L+3Ta1,L+5Ta1,,L+(2n+1)Ta1}M = \{L + T - a_1, L + 3T - a_1, L + 5T - a_1, \ldots, L + (2n+1)T - a_1\}

Or equivalently: mi=L+(2i+1)Ta1for i=0,1,2,,nm_i = L + (2i+1)T - a_1 \quad \text{for } i = 0, 1, 2, \ldots, n

Proof :

Substituting the optimal parameters into the general form: mi=L+iδ+εm_i = L + i \cdot \delta + \varepsilon mi=L+i2T+(Ta1)m_i = L + i \cdot 2T + (T - a_1) mi=L+2iT+Ta1m_i = L + 2iT + T - a_1 mi=L+(2i+1)Ta1\boxed{m_i = L + (2i+1)T - a_1}

Theorem 3.6 (Maximum Number of Masses)

Statement: The maximum number of masses that can be accommodated is: nmax=ULak+a12T2Tn_{\max} = \left\lfloor\frac{U - L - a_k + a_1 - 2T}{2T}\right\rfloor

with the total number of masses being nmax+1n_{\max} + 1 (since we index from 0).

Proof :

For all intervals to lie within R=[L,U]R = [L, U], the rightmost interval Ik(mn)I_k(m_n) must satisfy: mn+ak+TUm_n + a_k + T \leq U

Substituting mn=L+(2n+1)Ta1m_n = L + (2n+1)T - a_1: L+(2n+1)Ta1+ak+TUL + (2n+1)T - a_1 + a_k + T \leq U L+2nT+Ta1+ak+TUL + 2nT + T - a_1 + a_k + T \leq U L+2nT+aka1+2TUL + 2nT + a_k - a_1 + 2T \leq U 2nTULak+a12T2nT \leq U - L - a_k + a_1 - 2T nULak+a12T2Tn \leq \frac{U - L - a_k + a_1 - 2T}{2T}

Since nn must be a non-negative integer: nmax=ULak+a12T2T\boxed{n_{\max} = \left\lfloor\frac{U - L - a_k + a_1 - 2T}{2T}\right\rfloor}

Note: If nmax<0n_{\max} < 0, the range is too small to accommodate any mass with all its adduct intervals.

4. Correctness and Optimality Guarantees

Theorem 4.1 (Greedy Interval Packing Algorithm)
Definition 4.1

(Non-overlap Constraint): Intervals Ij(mi)I_j(m_i) and Ij(mi)I_{j'}(m_{i'}) are non-overlapping if and only if: mi+aj+T<mi+ajTormi+aj+T<mi+ajTm_i + a_j + T < m_{i'} + a_{j'} - T \quad \text{or} \quad m_{i'} + a_{j'} + T < m_i + a_j - T

Equivalently: mi<mi+ajaj2Tormi>mi+ajaj+2Tm_i < m_{i'} + a_{j'} - a_j - 2T \quad \text{or} \quad m_i > m_{i'} + a_{j'} - a_j + 2T

Definition 4.2

(Forbidden Zone): Given previously placed mass mm_\ell and adduct indices j,j{1,,k}j, j' \in \{1, \ldots, k\}, the forbidden zone is: F,j,j=[m+ajaj2T,m+ajaj+2T]\mathcal{F}_{\ell,j,j'} = [m_\ell + a_{j'} - a_j - 2T, m_\ell + a_{j'} - a_j + 2T]

A candidate mass mm violates non-overlap with Ij(m)I_{j'}(m_\ell) if Ij(m)Ij(m)I_j(m) \cap I_{j'}(m_\ell) \neq \emptyset, which occurs precisely when mF,j,jm \in \mathcal{F}_{\ell,j,j'}.

Definition 4.3

(Cumulative Forbidden Set): After placing masses Mi={m0,m1,,mi1}\mathcal{M}_i = \{m_0, m_1, \ldots, m_{i-1}\}, the cumulative forbidden set is: Fi==0i1j=1kj=1kF,j,j\mathcal{F}_i = \bigcup_{\ell=0}^{i-1} \bigcup_{j=1}^k \bigcup_{j'=1}^k \mathcal{F}_{\ell,j,j'}

Lemma 4.4

(Valid Position Characterization): A mass mm is valid for the ii-th position if and only if:

  1. m[L,U]m \in [L, U] (within mass range)
  2. mmi1+2Tm \geq m_{i-1} + 2T (minimum spacing for same-mass intervals)
  3. mFim \notin \mathcal{F}_i (avoids all forbidden zones)

Proof of Lemma 4.4

Conditions (1) and (2) are immediate requirements. Condition (3) follows from Definition 4.2: mFim \notin \mathcal{F}_i if and only if mm satisfies the non-overlap constraint with all previously placed intervals.

Definition 4.5

(Greedy Mass Sequence): The greedy sequence is defined recursively:

m0=L+Ta1m_0 = L + T - a_1

mi=inf{m[L,U]:mmi1+2T and mFi}m_i = \inf\{m \in [L, U] : m \geq m_{i-1} + 2T \text{ and } m \notin \mathcal{F}_i\}

with termination when mi+ak+T>Um_i + a_k + T > U or no valid mim_i exists.

Lemma 4.6

(Well-definedness): The infimum in Definition 4.5 is attained and yields mi[L,U]m_i \in [L, U] for all ii before termination.

Proof of Lemma 4.6

Since Fi\mathcal{F}_i is a finite union of closed intervals, the set Si=[mi1+2T,U]FiS_i = [m_{i-1} + 2T, U] \setminus \mathcal{F}_i is either empty or a finite union of intervals.

If SiS_i is empty, the algorithm terminates.

If SiS_i is non-empty, the infimum equals the left endpoint of the leftmost available gap. Since forbidden zones are closed intervals, this infimum is attained and belongs to SiS_i.

Theorem 4.7

(Greedy Optimality): The greedy algorithm produces a sequence of maximum cardinality among all valid sequences satisfying the non-overlap constraint.

Proof :

Let G=(g0,g1,,gng1)\mathcal{G} = (g_0, g_1, \ldots, g_{n_g-1}) be the greedy sequence and O=(o0,o1,,ono1)\mathcal{O} = (o_0, o_1, \ldots, o_{n_o-1}) be any other valid sequence. We prove ngnon_g \geq n_o by showing gioig_i \leq o_i for all i<min(ng,no)i < \min(n_g, n_o).

Base case: By construction, g0=L+Ta1g_0 = L + T - a_1 is the minimum valid starting position. Thus g0o0g_0 \leq o_0.

Inductive step: Assume gjojg_j \leq o_j for all j<ij < i.

For any <i\ell < i, since gog_\ell \leq o_\ell, the forbidden zones generated by the greedy sequence are left-shifted or equivalent compared to those generated by O\mathcal{O}. Therefore, valid positions for O\mathcal{O} at step ii include or precede valid positions for G\mathcal{G}.

Since gig_i is the earliest valid position for G\mathcal{G} and oio_i is a valid position for O\mathcal{O}, we have gioig_i \leq o_i.

By induction, gioig_i \leq o_i for all i<min(ng,no)i < \min(n_g, n_o).

If ng<non_g < n_o, then after placing gng1g_{n_g-1}, no valid position exists for greedy. But since gioig_i \leq o_i for all i<ngi < n_g and O\mathcal{O} successfully places ongo_{n_g}, this contradicts the greedy termination condition.

Therefore, ngnon_g \geq n_o, proving optimality. ✓

Remark

: This greedy algorithm is provably correct by construction—it explicitly verifies all non-overlap constraints through forbidden zone checking. The previous approach using fixed κ\kappa-based stepping fails to prevent overlaps when ii<κi' - i < \kappa for certain adduct configurations.

Theorem 4.2 (Asymptotic Density)

Statement: For the greedy sequence with ULaka1U - L \gg a_k - a_1, the achieved spacing approaches δ=2T\delta = 2T and the number of masses is asymptotically: nUL(aka1+2T)2T+1n \approx \frac{U - L - (a_k - a_1 + 2T)}{2T} + 1

Proof :

When ULU - L is large compared to aka1a_k - a_1, boundary effects at LL and UU become negligible.

In the interior of [L,U][L, U], if no forbidden zones beyond the minimum 2T2T spacing interfere, the greedy algorithm places masses at uniform spacing mimi1=2Tm_i - m_{i-1} = 2T, which is the minimum required for same-mass intervals to not overlap (Theorem 3.1).

For generic adduct sets where cross-adduct forbidden zones do not create long-range obstructions, the greedy placement maintains this uniform spacing, yielding the asymptotic density formula. The floor function in practice accounts for discrete positions and boundary constraints. ✓

Note on Correctness vs. Density: The key advantage of the greedy algorithm over previous approaches is correctness—it guarantees zero overlaps by construction through explicit forbidden zone checking. The density achieved depends on the specific adduct configuration and whether forbidden zones create gaps, but the greedy algorithm provably achieves the maximum possible density for any given adduct set (Theorem 4.7).

5. Implementation Summary

5.1 Greedy Algorithm Implementation

Given inputs R=[L,U]R = [L, U], A={a1,,ak}A = \{a_1, \ldots, a_k\}, and TT:

  1. Verify Validity: Check that UL>aka1+2TU - L > a_k - a_1 + 2T and j<j,  ajaj>2T\forall j < j', \; a_{j'} - a_j > 2T

  2. Initialize: Set m0=L+Ta1m_0 = L + T - a_1 and i=0i = 0

  3. Greedy Placement Loop:

    • Compute Forbidden Zones: For all <i\ell < i, j,j{1,,k}j, j' \in \{1, \ldots, k\}: F,j,j=[m+ajaj2T,m+ajaj+2T]\mathcal{F}_{\ell,j,j'} = [m_\ell + a_{j'} - a_j - 2T, m_\ell + a_{j'} - a_j + 2T]

    • Merge Overlapping Zones: Construct Fi=,j,jF,j,j\mathcal{F}_i = \bigcup_{\ell,j,j'} \mathcal{F}_{\ell,j,j'} by sorting and merging overlapping intervals

    • Find Next Valid Position: Search for: mi=min{mmi1+2T:mFi and m+ak+TU}m_i = \min\{m \geq m_{i-1} + 2T : m \notin \mathcal{F}_i \text{ and } m + a_k + T \leq U\}

    • Termination: If no valid mim_i exists or mi+ak+T>Um_i + a_k + T > U, stop. Otherwise, increment ii and repeat.

  4. Output: Return sequence (m0,m1,,mn1)(m_0, m_1, \ldots, m_{n-1}) with nn total masses

5.2 Computational Complexity

Note: While asymptotically slower than the fixed-stepping approach, the greedy algorithm guarantees correctness. For typical problem sizes in combinatorial library design (n102n \sim 10^2-10310^3, k3k \sim 3-55), the computational cost remains negligible.

5.3 Application Examples Across Ionization Methods

The algorithm applies uniformly to any adduct set. We demonstrate with multiple realistic scenarios.

Interactive Exploration: The scenarios below can be explored using the interactive visualizer in the Interactive Exploration section above. Toggle between ionization modes and adjust parameters to see real-time updates.

The following sections provide detailed worked examples corresponding to common experimental scenarios.

5.3.1 Standard ESI-MS: The Three-Adduct Case

Experimental Context: Electrospray ionization time-of-flight mass spectrometry for cyclic peptide libraries

Given Parameters:

Step 1: Verify Validity Conditions

Range check: UL>aka1+2TU - L > a_k - a_1 + 2T 1000100>38.9641.008+2(0.5)1000 - 100 > 38.964 - 1.008 + 2(0.5) 900>37.956+1.0=38.956900 > 37.956 + 1.0 = 38.956 \quad \checkmark

Adduct separation check: minj<j(ajaj)=min{21.982,15.974}=15.974>2T=1.0\min_{j<j'}(a_{j'} - a_j) = \min\{21.982, 15.974\} = 15.974 > 2T = 1.0 \quad \checkmark

Step 2: Compute Algorithm Parameters

Spacing: δ=2T=2(0.5)=1.0 Da\delta = 2T = 2(0.5) = 1.0 \text{ Da}

Offset: ε=Ta1=0.51.008=0.508 Da\varepsilon = T - a_1 = 0.5 - 1.008 = -0.508 \text{ Da}

Maximum count: nmax=ULak+a12T2Tn_{\max} = \left\lfloor\frac{U - L - a_k + a_1 - 2T}{2T}\right\rfloor =100010038.964+1.0081.01.0= \left\lfloor\frac{1000 - 100 - 38.964 + 1.008 - 1.0}{1.0}\right\rfloor =861.0441.0=861= \left\lfloor\frac{861.044}{1.0}\right\rfloor = 861

Total masses: nmax+1=862n_{\max} + 1 = 862 (indexed from i=0i=0 to i=861i=861)

Critical separation: κ=aka12T+1=37.9561.0+1=38.956=39\kappa = \left\lceil\frac{a_k - a_1}{2T} + 1\right\rceil = \left\lceil\frac{37.956}{1.0} + 1\right\rceil = \lceil 38.956 \rceil = 39

Step 3: Generate Mass Values

Formula: mi=L+(2i+1)Ta1=100+(2i+1)(0.5)1.008m_i = L + (2i+1)T - a_1 = 100 + (2i+1)(0.5) - 1.008

Index iiCalculationMass mim_i (Da)
0100+1(0.5)1.008100 + 1(0.5) - 1.00899.492
1100+3(0.5)1.008100 + 3(0.5) - 1.008100.492
2100+5(0.5)1.008100 + 5(0.5) - 1.008101.492
3100+7(0.5)1.008100 + 7(0.5) - 1.008102.492
38100+77(0.5)1.008100 + 77(0.5) - 1.008137.492
39100+79(0.5)1.008100 + 79(0.5) - 1.008138.492
860100+1721(0.5)1.008100 + 1721(0.5) - 1.008959.492
861100+1723(0.5)1.008100 + 1723(0.5) - 1.008960.492

Step 4: Demonstrate Observable Intervals

For m0=99.492m_0 = 99.492 Da, the three adduct peaks appear at:

I1(m0)=(99.492+1.0080.5,99.492+1.008+0.5)=(100.000,101.000)I_1(m_0) = (99.492 + 1.008 - 0.5, 99.492 + 1.008 + 0.5) = (100.000, 101.000) I2(m0)=(99.492+22.9900.5,99.492+22.990+0.5)=(121.982,122.982)I_2(m_0) = (99.492 + 22.990 - 0.5, 99.492 + 22.990 + 0.5) = (121.982, 122.982) I3(m0)=(99.492+38.9640.5,99.492+38.964+0.5)=(137.956,138.956)I_3(m_0) = (99.492 + 38.964 - 0.5, 99.492 + 38.964 + 0.5) = (137.956, 138.956)

For m1=100.492m_1 = 100.492 Da:

I1(m1)=(101.000,102.000)I_1(m_1) = (101.000, 102.000) I2(m1)=(122.982,123.982)I_2(m_1) = (122.982, 123.982) I3(m1)=(138.956,139.956)I_3(m_1) = (138.956, 139.956)

Step 5: Verify Non-Overlap (Sample Cases)

Case A: Same adduct, consecutive masses

Compare I1(m0)I_1(m_0) and I1(m1)I_1(m_1):

Case B: Different adducts, same mass

Compare I1(m0)I_1(m_0) and I2(m0)I_2(m_0):

Case C: Critical separation—extreme adducts

Compare I3(m0)I_3(m_0) and I1(m39)I_1(m_{39}) (at critical κ=39\kappa = 39 steps):

This confirms masses separated by κ\kappa steps have their extreme adduct intervals just barely non-overlapping.

Physical Interpretation:


5.3.2 Reduced Adduct Set: APCI with Two Adducts

Experimental Context: Atmospheric pressure chemical ionization (fewer adduct types)

Given:

Calculation: δ=2(0.3)=0.6 Da\delta = 2(0.3) = 0.6 \text{ Da} ε=0.31.008=0.708 Da\varepsilon = 0.3 - 1.008 = -0.708 \text{ Da} nmax=60017.0260.60.6=970n_{\max} = \left\lfloor\frac{600 - 17.026 - 0.6}{0.6}\right\rfloor = 970 κ=17.0260.6+1=30\kappa = \left\lceil\frac{17.026}{0.6} + 1\right\rceil = 30

Result: 971 peptides (13% increase over ESI case)

Analysis: Fewer adducts + better resolution → larger library capacity


5.3.3 Ultra-High Resolution: Orbitrap with Multiple Adducts

Experimental Context: Orbitrap mass spectrometry with extended adduct set

Given:

Calculation: δ=2(0.01)=0.02 Da\delta = 2(0.01) = 0.02 \text{ Da} nmax=100053.9300.020.02=47,303n_{\max} = \left\lfloor\frac{1000 - 53.930 - 0.02}{0.02}\right\rfloor = 47,303

Result: 47,304 peptides (55× improvement over standard ESI!)

Analysis: Resolution dominates capacity—ultra-high resolution enables massive libraries despite more adducts


5.3.4 Comparative Analysis

MethodkT (Da)Range (Da)n_maxPeaksLimiting Factor
ESI-TOF30.59008622,586Resolution
APCI20.36009711,942Range
Orbitrap50.01100047,304236,520Range

Key Insights:

  1. Resolution impact: Halving TT approximately doubles capacity (inverse linear relationship)
  2. Adduct count: More adducts reduce capacity, but effect is sublinear
  3. Range size: Direct linear impact on capacity
  4. Optimization trade-off: High resolution compensates for many adducts

Practical Implications:

6. Connections to Existing Theory

6.1 Relationship to Classical Interval Scheduling

This problem extends the classical interval scheduling problem from computer science in several novel ways:

Classical Interval SchedulingGeneralized Adduct Intervals Problem
Given set of intervalsGenerates intervals from masses
One interval per taskkk intervals per mass (one per adduct)
Select subset to maximize countConstruct masses to maximize count
Arbitrary interval positionsUniform spacing required for synthesis
Greedy algorithm optimalConstructive algorithm with proven bounds

While classical interval scheduling uses a greedy algorithm selecting intervals by earliest finishing time, our problem requires a constructive approach that generates optimally-spaced masses.

6.2 Gap in Mass Spectrometry Literature

Existing MS tools focus on post-acquisition handling of adduct overlaps:

This approach is fundamentally different: it prevents overlap by design through optimal mass spacing, enabling larger libraries with guaranteed MS resolution.

6.3 Bridge to Combinatorial Library Design

Combinatorial cyclic peptide libraries face unique challenges:

This framework provides the missing theoretical foundation for determining maximum library size given MS constraints.

7. Applications to Multi-Objective Optimization

7.1 Integration with Multi-Objective Evolutionary Algorithms

The theoretical upper bound serves multiple critical functions in MOEAs:

7.1.1 Fitness Function Normalization

For a multi-objective problem with objectives:

The normalized fitness for f1f_1 becomes: f^1=nactualnmax[0,1]\hat{f}_1 = \frac{n_{actual}}{n_{\max}} \in [0,1]

where nmaxn_{\max} is computed using our formula.

7.1.2 Constraint Handling

Solutions proposing more than nmaxn_{\max} peptides are automatically infeasible: g(x)=nproposed(x)nmax0g(x) = n_{proposed}(x) - n_{\max} \leq 0

This hard constraint prevents the algorithm from exploring impossible regions of the search space.

7.1.3 Reference Point Generation (NSGA-III)

NSGA-III uses reference points to maintain diversity. The theoretical maximum helps define the aspiration level for the “number of peptides” objective, ensuring reference points are realistically achievable.

7.2 Example: NSGA-III Implementation

def calculate_upper_bound(mass_range, adducts, resolution)
    """
    Calculate maximum number of peptides using our formula

    Args:
        mass_range: tuple (L, U) defining mass range
        adducts: list of adduct masses [a1, a2, ..., ak]
        resolution: half-width T of detection intervals

    Returns:
        n_max: maximum number of distinguishable peptides
    """
    L, U = mass_range
    a_min, a_max = min(adducts), max(adducts)
    n_max = floor((U - L - a_max + a_min - 2*resolution) / (2*resolution))
    return max(0, n_max)

def fitness_function(solution, upper_bound)
    """
    Multi-objective fitness evaluation

    Returns:
        objectives: [normalized_count, diversity, feasibility]
        constraints: [count_constraint]
    """
    n_peptides = len(solution.peptides)

    # Objective 1: Normalized peptide count
    f1 = n_peptides / upper_bound

    # Objective 2: Chemical diversity (example metric)
    f2 = calculate_diversity(solution.peptides)

    # Objective 3: Synthesis feasibility
    f3 = evaluate_synthesis_feasibility(solution)

    # Constraint: Cannot exceed theoretical maximum
    g = n_peptides - upper_bound

    return [f1, f2, f3], [g]

7.3 Dynamic Bound Adjustment

As the MOEA explores different conditions:

The upper bound can be dynamically recalculated to provide accurate constraints for each scenario.

7.4 Performance Metrics

The theoretical bound enables rigorous performance assessment:

Efficiency=Achieved Library SizeTheoretical Maximum×100%\text{Efficiency} = \frac{\text{Achieved Library Size}}{\text{Theoretical Maximum}} \times 100\%

This metric allows fair comparison across different:

8. Conclusions

This mathematical framework provides:

  1. Rigorous Foundation: Complete proofs for all constraints and optimality claims within uniform-spacing constructions
  2. General Applicability: Works for arbitrary adduct sets beyond standard configurations
  3. Practical Implementation: Direct translation to efficient algorithms with O(n) generation complexity
  4. MOEA Integration: Essential upper bounds for fitness normalization and constraint handling
  5. Extensibility: Framework can be adapted for non-uniform spacing or additional constraints

Significance for Drug Discovery

This work bridges a critical gap between:

By providing provable upper bounds on library size, this framework enables:

Future Directions

  1. Extension to non-uniform spacing: Investigate whether relaxing uniform spacing can increase nmaxn_{\max}
  2. Multiple charge states: Extend framework to handle z>1z > 1 charge states
  3. Tandem MS constraints: Incorporate MS/MS fragmentation patterns
  4. Machine learning integration: Use bounds to guide neural architecture search for library design
  5. Experimental validation: Synthesize optimally-spaced libraries and verify MS resolution

The theoretical foundation presented here directly informs optimization algorithms for mass spectrometry experimental design, enabling maximum differentiation of molecular species while guaranteeing no spectral overlaps. This represents a significant advance in the rational design of large-scale cyclic peptide libraries for drug discovery applications.

References

[1] K. Deb and H. Jain, “An Evolutionary Many-Objective Optimization Algorithm Using Reference-Point-Based Nondominated Sorting Approach, Part I: Solving Problems With Box Constraints,” IEEE Transactions on Evolutionary Computation, vol. 18, no. 4, pp. 577-601, 2014.

[2] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182-197, 2002.

[3] I. Das and J. E. Dennis, “Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems,” SIAM Journal on Optimization, vol. 8, no. 3, pp. 631-657, 1998.

[4] J. Kleinberg and E. Tardos, Algorithm Design. Boston, MA: Addison-Wesley, 2005.

[5] “Interval Scheduling,” Algorithm Notes. [Online]. Available: https://stumash.github.io/Algorithm_Notes/greedy/intervals/intervals.html

[6] S. H. Joo et al., “High-throughput sequence determination of cyclic peptide library members by partial Edman degradation/mass spectrometry,” Journal of the American Chemical Society, vol. 128, no. 39, pp. 13000-13009, 2006.

[7] J. E. Redman et al., “Automated mass spectrometric sequence determination of cyclic peptide library members,” Journal of Combinatorial Chemistry, vol. 5, no. 1, pp. 33-40, 2003.

[8] P. I. Kitov et al., “Sliding Window Adduct Removal Method (SWARM) for enhanced electrospray ionization mass spectrometry binding data,” Journal of The American Society for Mass Spectrometry, vol. 30, no. 8, pp. 1446-1454, 2019.

[9] A. F. M. Gavriilidou et al., “AdductHunter: identifying protein-metal complex adducts in mass spectra,” Journal of Cheminformatics, vol. 16, no. 1, p. 15, 2024.

[10] C. C. Coello, G. B. Lamont, and D. A. Van Veldhuizen, Evolutionary Algorithms for Solving Multi-Objective Problems. New York: Springer, 2007.

[11] K. Miettinen, Nonlinear Multiobjective Optimization. Boston, MA: Kluwer Academic Publishers, 1999.