Adam Murray - Research Explorations

Parallel Chemical Hierarchies: A Multi-Perspective Embedding Strategy for Cyclic Peptide Drug Discovery

A theoretical framework for molecular embeddings that preserves both synthetic and biological organizational principles through parallel hierarchical decomposition, with applications to cyclic peptide permeability prediction.

Abstract

We present a theoretical framework for molecular embeddings based on parallel hierarchical decomposition, motivated by the unique challenges of cyclic peptide drug discovery. Unlike traditional nested hierarchies, our approach maintains two parallel organizational views (synthetic fragments and biological residues) that capture orthogonal chemical principles. We prove that this parallel structure preserves at least as much information as serial hierarchies, enables efficient long-range interaction modeling, and naturally represents the dual nature of modified cyclic peptides. The framework is particularly suited for permeability prediction, where both synthetic accessibility and biological activity contribute to drug-like properties.

1. Introduction

1.1 The Cyclic Peptide Permeability Challenge

Cyclic peptides occupy a unique space in drug discovery—large enough to target “undruggable” protein-protein interactions, yet potentially permeable enough to be orally bioavailable. The key challenge is predicting and optimizing cell permeability, which depends on a complex interplay of factors:

These properties emerge from both the peptide sequence (biological view) and chemical modifications (synthetic view), neither of which fully captures the complete picture.

1.2 The Representation Problem

Traditional molecular representations force a choice:

Option 1: Chemical graph representation

Option 2: Sequence-based representation

Option 3: Hierarchical representation

For modified cyclic peptides—with N-methylations, D-amino acids, and non-natural residues—none of these approaches is sufficient.

1.3 Our Contribution

We propose a parallel hierarchical decomposition that maintains both synthetic and biological views simultaneously:

        Atoms
       /      \
  Fragments  Residues
       \      /
       Molecule

This structure:

  1. Preserves at least as much information as serial hierarchies (proven via information theory)
  2. Enables cross-level interactions between fragments and residues
  3. Naturally represents modified cyclic peptides
  4. Facilitates learning of permeability-relevant features

2. Theoretical Foundations

2.1 Information-Theoretic Analysis

Definition 2.1 (Serial Hierarchical Decomposition)

A serial hierarchy processes information through a single intermediate level:

$$\text{Atoms } (A) \to \text{Hierarchy } (H) \to \text{Molecule } (M)$$

Definition 2.2 (Parallel Hierarchical Decomposition)

A parallel hierarchy processes information through multiple independent levels:

$$\text{Atoms } (A) \to \{\text{Fragments } (F), \text{Residues } (R)\} \to \text{Molecule } (M)$$

Theorem 2.1 (Information Preservation)

Parallel decomposition preserves at least as much information as any serial decomposition.

Proof :

Let $I(A; M)$ be the mutual information between atoms and molecular properties. For an intermediate level $X$ that is a function of $A$, the conditional mutual information $I(A; M|X)$ measures the information about $M$ that is present in the atoms but lost at $X$.

For serial decomposition: $H$ is a function of $A$, so $M \to A \to H$ is a Markov chain and the data processing inequality gives:

$$I(H; M) \leq I(A; M) \leq H(A)$$

For parallel decomposition: Using the chain rule of mutual information:

$$\begin{aligned} I(A; M|F,R) &= I(A; M,F,R) - I(A; F,R) \\ &= I(A; M) + I(A; F,R|M) - I(A; F,R) \end{aligned}$$

Since $F$ and $R$ are both functions of $A$:

$$I(A; F,R) = H(F,R) - H(F,R|A) = H(F,R)$$

Likewise $I(A; F,R|M) = H(F,R|M)$, so the information lost is $I(A; M|F,R) = I(A; M) - I(F,R; M)$.

Because conditioning reduces entropy, $H(M|F,R) \leq \min(H(M|F), H(M|R))$, and $H(M|A,F,R) = H(M|A)$, so:

$$I(A; M|F,R) \leq \min(I(A; M|F), I(A; M|R))$$

For a serial hierarchy whose intermediate level $H$ is either $F$ or $R$ alone:

$$I(A; M|F,R) \leq I(A; M|H)$$

Therefore, parallel decomposition loses no more information than the serial one; that is, it preserves at least as much information.
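A small numeric check can make Theorem 2.1 concrete. The sketch below is illustrative only: the toy distribution (four equally likely atom configurations), the views `F = a // 2` and `R = a % 2`, and the property `M = a` are assumptions chosen for the example and are not part of the framework.

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """I(X; Y) in bits, treating the (x, y) samples as equally likely."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Toy assumption: four equally likely atom configurations a = 0..3.
atoms = [0, 1, 2, 3]
frag = {a: a // 2 for a in atoms}   # hypothetical fragment view F
res = {a: a % 2 for a in atoms}     # hypothetical residue view R
prop = {a: a for a in atoms}        # hypothetical molecular property M

i_f = mutual_information([(frag[a], prop[a]) for a in atoms])
i_r = mutual_information([(res[a], prop[a]) for a in atoms])
i_fr = mutual_information([((frag[a], res[a]), prop[a]) for a in atoms])

print(i_f, i_r, i_fr)
```

In this toy case each view alone carries 1 bit about `M`, while the pair `(F, R)` carries 2 bits, so the joint view retains at least as much information as either single view.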

2.2 Graph-Theoretic Framework

Definition 2.3 (Molecular Partition)

A partition $\pi$ of the atomic set $A$ groups atoms into disjoint subsets.

Definition 2.4 (Fragment Partition)

$\pi_F$ groups atoms by synthetic building blocks (e.g., BRICS decomposition).

Definition 2.5 (Residue Partition)

$\pi_R$ groups atoms by biological units (amino acid residues).

Lemma 2.2 (Partition Refinement)

The meet (intersection) of two partitions provides finer granularity than either partition alone.

Proof :

For partitions $\pi_F$ and $\pi_R$, their meet $\pi_F \wedge \pi_R$ is defined as:

$$\pi_F \wedge \pi_R = \{B_i \cap B_j \mid B_i \in \pi_F, B_j \in \pi_R, B_i \cap B_j \neq \emptyset\}$$

The entropy of a partition with blocks $B_i$ over $n$ atoms is $H(\pi) = -\sum_i (|B_i|/n) \log(|B_i|/n)$.

Since each block in $\pi_F \wedge \pi_R$ is a subset of one block of $\pi_F$ and of one block of $\pi_R$, the meet refines both partitions:

$$|\text{blocks}(\pi_F \wedge \pi_R)| \geq \max(|\text{blocks}(\pi_F)|, |\text{blocks}(\pi_R)|)$$

Splitting a block never decreases partition entropy, therefore:

$$H(\pi_F \wedge \pi_R) \geq \max(H(\pi_F), H(\pi_R))$$

The finer partition captures at least as much structural information.
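The meet and the entropy comparison in Lemma 2.2 can be sketched in a few lines. The six-atom example and its two groupings are hypothetical, chosen only so that the fragment and residue boundaries do not coincide.

```python
from math import log

def meet(pi_f, pi_r):
    """Meet of two partitions: all non-empty pairwise block intersections."""
    return [bf & br for bf in pi_f for br in pi_r if bf & br]

def partition_entropy(pi):
    """H(pi) = -sum_i (|B_i|/n) log(|B_i|/n), using the natural logarithm."""
    n = sum(len(block) for block in pi)
    return -sum((len(b) / n) * log(len(b) / n) for b in pi)

# Hypothetical 6-atom molecule; atom indices and groupings are made up.
pi_f = [{0, 1, 2}, {3, 4, 5}]       # fragment partition
pi_r = [{0, 1}, {2, 3}, {4, 5}]     # residue partition
pi_meet = meet(pi_f, pi_r)          # [{0, 1}, {2}, {3}, {4, 5}]

print(len(pi_meet), partition_entropy(pi_meet))
```

Here the meet has four blocks, more than either input partition, and its entropy is at least as large as that of `pi_f` and `pi_r`.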

2.3 Attention-Based Long-Range Interactions

Definition 2.6 (Graph Distance)

In a molecular graph $G$, $d(i,j)$ is the length of the shortest path between atoms $i$ and $j$.

Definition 2.7 (Attention Mechanism)

Attention computes pairwise relevance scores:

$$\text{Attention}(i,j) = \text{softmax}(Q_i \cdot K_j^T / \sqrt{d})$$
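As a minimal sketch of Definition 2.7, the function below computes scaled dot-product attention weights in plain Python. It assumes the softmax is normalized over $j$ for each query $i$ and that $d$ in the scaling factor is the key dimension; the definition does not state either explicitly. The three 2-dimensional feature vectors are hypothetical.

```python
from math import exp, sqrt

def attention_weights(queries, keys):
    """Row-wise softmax of Q K^T / sqrt(d) for lists of equal-length vectors."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(d) for k in keys]
        top = max(scores)                    # subtract the max for numerical stability
        exps = [exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical features for three atoms, used as both queries and keys.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = attention_weights(x, x)
```

Each row of `w` sums to 1, and every atom receives a weight from every other atom in a single step, regardless of graph distance.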

Theorem 2.3 (Information Flow Complexity)

Message passing requires $O(\text{diameter}(G))$ steps for global information flow, while attention requires $O(1)$ steps.

Proof :

In message passing with $k$ iterations:

  • Information from node $i$ reaches nodes within distance $k$
  • Full propagation requires $k = \text{diameter}(G)$ iterations
  • Time complexity: $O(\text{diameter}(G) \times |E|)$

With attention mechanism:

  • All pairs compute attention scores simultaneously
  • Information flows directly between any pair
  • Time complexity: $O(|V|^2)$ but depth $O(1)$

For cyclic peptides, $\text{diameter}(G) \approx n/2$ for $n$ residues at the residue level, so a single attention layer replaces roughly $n/2$ rounds of message passing for long-range interactions, at the cost of $O(|V|^2)$ pairwise scores.
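The depth argument can be checked on a residue-level ring. The sketch below models a cyclic peptide as a simple cycle graph with one node per residue (an assumption that ignores side chains and cross-links) and uses breadth-first search to count the message-passing rounds needed for full propagation.

```python
from collections import deque

def ring_graph(n):
    """Adjacency list of an n-node cycle, one node per residue."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def eccentricity(adj, source):
    """Largest BFS hop count from source to any other node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """Number of message-passing rounds needed for global information flow."""
    return max(eccentricity(adj, s) for s in adj)

print(diameter(ring_graph(10)))  # 5
print(diameter(ring_graph(7)))   # 3
```

For a ring of $n$ nodes the diameter is $\lfloor n/2 \rfloor$, which matches the $n/2$ estimate used in the proof.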

3. The Parallel Hierarchical Framework

3.1 Formal Construction

Definition 3.1 (Parallel Hierarchical Molecular Graph)

A tuple $(A, F, R, M, \varphi_F, \varphi_R, \psi)$ where:

  • $A$ = atomic features $\in \mathbb{R}^{n \times d_a}$
  • $F$ = fragment features $\in \mathbb{R}^{k_f \times d_f}$
  • $R$ = residue features $\in \mathbb{R}^{k_r \times d_r}$
  • $M$ = molecular features $\in \mathbb{R}^{d_m}$
  • $\varphi_F: A \to F$ (fragment assignment)
  • $\varphi_R: A \to R$ (residue assignment)
  • $\psi: F \times R \to M$ (molecular composition)
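One way to picture the tuple in Definition 3.1 is as a small container that derives the fragment and residue levels from shared atom features. The sketch below is an assumption-laden illustration, not the framework's prescribed implementation: the class name `ParallelHierarchy`, the use of mean pooling for $\varphi_F$ and $\varphi_R$, and concatenation of level averages for $\psi$ are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]

def mean_pool(atom_feats, assignment, n_groups):
    """Average atom feature vectors within each group (fragment or residue)."""
    dim = len(atom_feats[0])
    sums = [[0.0] * dim for _ in range(n_groups)]
    counts = [0] * n_groups
    for feat, g in zip(atom_feats, assignment):
        counts[g] += 1
        for j, value in enumerate(feat):
            sums[g][j] += value
    return [[s / c for s in row] for row, c in zip(sums, counts)]

def column_mean(rows):
    """Average a list of equal-length vectors component-wise."""
    return [sum(col) / len(rows) for col in zip(*rows)]

@dataclass
class ParallelHierarchy:
    atoms: List[Vector]       # A: n x d_a
    frag_of_atom: List[int]   # phi_F, stored as a fragment index per atom
    res_of_atom: List[int]    # phi_R, stored as a residue index per atom

    def fragments(self):      # F, here k_f x d_a under mean pooling
        return mean_pool(self.atoms, self.frag_of_atom, max(self.frag_of_atom) + 1)

    def residues(self):       # R, here k_r x d_a under mean pooling
        return mean_pool(self.atoms, self.res_of_atom, max(self.res_of_atom) + 1)

    def molecule(self):       # psi, here the concatenated averages of F and R
        return column_mean(self.fragments()) + column_mean(self.residues())

# Hypothetical 4-atom example; every group index is assumed to be used.
h = ParallelHierarchy(
    atoms=[[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]],
    frag_of_atom=[0, 0, 1, 1],
    res_of_atom=[0, 1, 1, 1],
)
```

Both levels read from the same atom features, so the two views stay structurally coupled while grouping the atoms differently.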

3.2 Cross-Level Interactions

Definition 3.2 (Fragment-Residue Interaction Tensor)

$T \in \mathbb{R}^{k_f \times k_r \times d_c}$ captures interactions between fragments and residues.

For cyclic peptides, this captures critical relationships:

  • N-methylation (fragment) on specific residues affects permeability
  • D-amino acids (residue property) influence backbone conformation
  • Aromatic fragments participate in π-π stacking across residues
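The text does not specify how the tensor $T$ is populated, so the following is only one possible sketch: entry $(f, r)$ accumulates the features of atoms that belong to both fragment $f$ and residue $r$, with $d_c$ set equal to the atom feature dimension. The function name `interaction_tensor` and the four-atom example are hypothetical.

```python
def interaction_tensor(atom_feats, frag_of_atom, res_of_atom, k_f, k_r):
    """T[f][r] = sum of feature vectors of atoms in both fragment f and residue r."""
    d_c = len(atom_feats[0])
    T = [[[0.0] * d_c for _ in range(k_r)] for _ in range(k_f)]
    for feat, f, r in zip(atom_feats, frag_of_atom, res_of_atom):
        for j, value in enumerate(feat):
            T[f][r][j] += value
    return T

# Hypothetical example: 4 atoms, 2 fragments, 2 residues, 2-d atom features.
atom_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
T = interaction_tensor(atom_feats, [0, 0, 1, 1], [0, 1, 1, 1], k_f=2, k_r=2)
```

Under this construction a fragment such as an N-methyl group produces a non-zero entry only for the residue that carries it, which is the kind of fragment-residue pairing the list above describes. A learned model would more likely compute $T$ from fragment and residue embeddings.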

3.3 Permeability-Relevant Features

The parallel structure naturally captures permeability determinants:

Fragment Level:

Residue Level:

Cross-Level:

4. Theoretical Properties

4.1 Representational Capacity

Theorem 4.1 (Capacity Bound)

The parallel hierarchy has higher representational capacity than any single hierarchy.

Proof :

Here representational capacity is measured, as a rough proxy, by the dimension of the feature space.

Single hierarchy:

$$\dim(H) = k_h \times d_h$$

Parallel hierarchy:

$$\dim(F \times R) = k_f \times d_f + k_r \times d_r + k_f \times k_r \times d_c$$

The cross-term $k_f \times k_r \times d_c$ represents additional capacity from interactions.

For cyclic peptides:

  • $k_f \approx$ number of modification types
  • $k_r \approx$ number of residues
  • Cross-term grows as $O(k_f \times k_r)$

Therefore:

$$\dim(F \times R) > \dim(H) \text{ for any } H \text{ with comparable complexity}$$

Under this measure, the parallel hierarchy has higher representational capacity.
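A worked count makes the size of the cross-term visible. The sizes below (4 modification types, 6 residues, 32-dimensional level features, 8-dimensional interaction features) are hypothetical values picked for illustration.

```python
def single_dim(k_h, d_h):
    """Feature dimension of a single hierarchy: k_h * d_h."""
    return k_h * d_h

def parallel_dim(k_f, d_f, k_r, d_r, d_c):
    """Feature dimension of the parallel hierarchy, including the cross-term."""
    return k_f * d_f + k_r * d_r + k_f * k_r * d_c

# Hypothetical sizes, chosen only to illustrate the formula.
print(single_dim(6, 32))               # 192
print(parallel_dim(4, 32, 6, 32, 8))   # 128 + 192 + 192 = 512
```

In this example the cross-term alone (192) matches the entire residue-only hierarchy, and it grows with the product $k_f \times k_r$.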

4.2 Gradient Flow Properties

Theorem 4.2 (Gradient Robustness)

Parallel pathways reduce gradient vanishing probability.

Proof :

Let $p$ be the probability of gradient vanishing through a single path.

Serial hierarchy:

$$P(\text{gradient reaches atoms}) = 1 - p$$

Parallel hierarchy with independent pathways:

$$\begin{aligned} P(\text{gradient vanishes}) &= P(\text{both paths fail}) = p^2 \\ P(\text{gradient reaches atoms}) &= 1 - p^2 \end{aligned}$$

Since $p \in (0,1)$:

$$1 - p^2 > 1 - p$$

The parallel structure provides gradient flow redundancy.
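The redundancy argument can be checked both in closed form and by simulation. The sketch below relies on the same independence assumption as the proof, which real networks with shared atom-level parameters may not satisfy; the function names, trial count, and seed are arbitrary choices.

```python
import random

def reach_serial(p):
    """Probability the gradient survives a single path."""
    return 1 - p

def reach_parallel(p):
    """Probability at least one of two independent paths survives."""
    return 1 - p ** 2

def simulate_parallel(p, trials=20000, seed=0):
    """Monte Carlo estimate, assuming the two paths fail independently."""
    rng = random.Random(seed)
    reached = sum(
        1 for _ in range(trials)
        if not (rng.random() < p and rng.random() < p)
    )
    return reached / trials

for p in (0.1, 0.5, 0.9):
    print(p, reach_serial(p), reach_parallel(p))
```

With `p = 0.5`, the closed form gives 0.5 for the serial path and 0.75 for the parallel one, and the simulated estimate should land close to 0.75.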

4.3 Compositional Learning

Definition 4.1 (Compositional Rules)

Learned mappings that generalize to novel combinations:

  • Fragment compatibility: P(f_i connects to f_j)
  • Residue compatibility: P(r_i follows r_j)
  • Cross-level compatibility: P(f_i compatible with r_j)

Conjecture 4.1: The parallel hierarchy learns compositional rules that generalize to novel modified peptides.

Supporting argument (not a proof): The architecture separates:

  1. Local chemical rules (fragments)
  2. Sequential patterns (residues)
  3. Interaction rules (cross-level)

This factorization encourages learning reusable components rather than memorizing complete structures.

5. Application to Cyclic Peptide Permeability

5.1 Why Parallel Hierarchy Suits Cyclic Peptides

Cyclic peptides have an inherently dual nature:

Biological organization (Residues)

CYCLO(Arg-D-Phe-Pro-NMe-Val-Leu)

Synthetic organization (Fragments)

Guanidine-Phenyl-Pyrrolidine-NMethyl-Isopropyl-Isobutyl

Neither view is complete; both are necessary.
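The two views of the example peptide can be held side by side in a simple data structure. In the sketch below, reading the sequence as five residues (with `D-Phe` and `NMe-Val` each counted as one residue) and the assignment of each fragment to a residue index are illustrative assumptions; the helper names are hypothetical.

```python
# Residue view of CYCLO(Arg-D-Phe-Pro-NMe-Val-Leu), in ring order.
residues = ["Arg", "D-Phe", "Pro", "NMe-Val", "Leu"]

# Fragment view, each fragment tagged with the index of the residue carrying it.
# The fragment-to-residue assignment is an illustrative assumption.
fragments = [
    ("Guanidine", 0),
    ("Phenyl", 1),
    ("Pyrrolidine", 2),
    ("NMethyl", 3),
    ("Isopropyl", 3),
    ("Isobutyl", 4),
]

def fragments_by_residue(residues, fragments):
    """Group fragment names under the residue that carries them."""
    grouped = {name: [] for name in residues}
    for frag, idx in fragments:
        grouped[residues[idx]].append(frag)
    return grouped

def ring_neighbors(i, n):
    """Indices adjacent to residue i in a cyclic sequence of length n."""
    return ((i - 1) % n, (i + 1) % n)

print(fragments_by_residue(residues, fragments)["NMe-Val"])
print(ring_neighbors(0, len(residues)))
```

The N-methylated valine maps to two fragments, which a residue-only view would merge and a fragment-only view would detach from its sequence position. The ring closure appears as `Arg` and `Leu` being neighbors.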

5.2 Permeability Feature Extraction

The parallel hierarchy naturally extracts permeability-relevant features:

Theorem 5.1 (Feature Completeness)

The parallel decomposition captures all first-order permeability determinants.

Proof sketch: Permeability determinants include:

  1. Size/lipophilicity → captured by fragment features
  2. Hydrogen bonding → fragment-residue interactions
  3. Flexibility → residue sequence patterns
  4. Charge distribution → both levels contribute

Each determinant maps to architectural components:

  • Fragments encode chemical properties
  • Residues encode conformational preferences
  • Cross-attention captures shielding effects

The parallel structure spans the space of permeability features.

5.3 Modified Residue Representation

Example: N-methylated leucine

Traditional hierarchy struggles:

Parallel hierarchy:

This factorization enables generalization to novel modifications.

6. Theoretical Implications

6.1 Connection to Multi-View Learning

The parallel hierarchy implements a form of multi-view learning where views are:

  1. Structurally coupled (share atomic foundation)
  2. Semantically distinct (capture different chemical principles)
  3. Mutually informative (cross-level interactions)

This relates to co-training and multi-kernel learning theory.

6.2 Inductive Biases

The architecture encodes strong but complementary biases:

Fragment bias: Local chemical environment determines properties

Residue bias: Sequence patterns determine structure

By maintaining both biases simultaneously, the model avoids premature commitment to a single view.

6.3 Theoretical Limitations

Limitation 1: Not all molecules benefit from dual representation

Limitation 2: Optimal decomposition is task-dependent

Limitation 3: Computational overhead

7. Experimental Directions

While this work focuses on theoretical foundations, the framework suggests several testable hypotheses:

7.1 Compositional Generalization

Hypothesis: Models using parallel hierarchy should better generalize to:

7.2 Interpretability

Hypothesis: Learned attention patterns should reveal:

7.3 Transfer Learning

Hypothesis: The factorized representation should enable:

8.1 Connection to Category Theory

The parallel decomposition can be viewed as a span in the category of molecular representations:

$$F \leftarrow A \rightarrow R$$

with the molecular level as the pushout combining both views.

8.2 Connection to Information Geometry

The parallel hierarchy defines a product manifold:

$$\text{M}_\text{molecule} = \text{M}_\text{fragment} \times \text{M}_\text{residue}$$

with the Riemannian metric incorporating both geometric structures.

8.3 Connection to Tensor Decomposition

The framework performs an implicit tensor factorization:

$$\text{T}_\text{molecule} \approx \text{T}_\text{atom} \times_1 \text{U}_\text{fragment} \times_2 \text{U}_\text{residue} \times_3 \text{U}_\text{interaction}$$

similar to Tucker decomposition but with chemical constraints.

9. Conclusions

9.1 Summary of Contributions

We presented a theoretical framework for molecular embeddings based on parallel hierarchical decomposition. Key theoretical results include:

  1. Proof that parallel decomposition preserves at least as much information as serial hierarchies
  2. Analysis of gradient flow properties showing improved robustness
  3. Framework for representing modified cyclic peptides with dual organization
  4. Connection to permeability prediction requirements

9.2 Implications for Drug Discovery

The parallel hierarchy framework:

9.3 Future Theoretical Work

Open questions include:

  1. Optimal decomposition: How to choose the best fragment/residue partitions?
  2. Theoretical guarantees: Can we prove generalization bounds?
  3. Extension to other modalities: Can this framework incorporate 3D structure?
  4. Automated decomposition: Can we learn the hierarchical structure from data?

The parallel hierarchical framework provides a theoretically grounded approach to molecular representation that aligns with the dual nature of cyclic peptides, offering a principled foundation for permeability prediction and molecular design.

References

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2006.

[2] G. C. Rota, “On the foundations of combinatorial theory I. Theory of Möbius functions,” Zeitschrift für Wahrscheinlichkeitstheorie, vol. 2, no. 4, pp. 340-368, 1964.

[3] D. I. Shuman et al., “The emerging field of signal processing on graphs,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83-98, 2013.

[4] P. G. Dougherty et al., “Understanding Cell Penetration of Cyclic Peptides,” Chemical Reviews, vol. 119, no. 17, pp. 10241-10287, 2019.

[5] A. Furukawa et al., “Passive Membrane Permeability in Cyclic Peptomer Scaffolds,” ACS Chemical Biology, vol. 15, no. 10, pp. 2633-2640, 2020.

[6] M. R. Naylor et al., “Cyclic peptide natural products chart the frontier of oral bioavailability,” Current Opinion in Chemical Biology, vol. 47, pp. 117-126, 2018.

[7] J. Gilmer et al., “Neural Message Passing for Quantum Chemistry,” in Proc. ICML, 2017.

[8] K. Xu et al., “How Powerful are Graph Neural Networks?” in Proc. ICLR, 2019.

[9] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proc. COLT, 1998.

[10] C. Xu et al., “A survey on multi-view learning,” arXiv preprint arXiv:1304.5634, 2013.


This theoretical framework is part of ongoing research in cyclic peptide drug discovery. A complete implementation with empirical validation is forthcoming.