Parallel Chemical Hierarchies: A Multi-Perspective Embedding Strategy for Cyclic Peptide Drug Discovery

May 28, 2025

A theoretical framework for molecular embeddings that preserves both synthetic and biological organizational principles through parallel hierarchical decomposition, with applications to cyclic peptide permeability prediction.

Abstract

We present a theoretical framework for molecular embeddings based on parallel hierarchical decomposition, motivated by the unique challenges of cyclic peptide drug discovery. Unlike traditional nested hierarchies, our approach maintains two parallel organizational views, synthetic fragments and biological residues, that capture complementary chemical principles. We show that this parallel structure preserves at least as much information as a serial hierarchy built on either view alone, that attention provides constant-depth modeling of long-range interactions, and that the dual view naturally represents modified cyclic peptides. The framework is particularly suited for permeability prediction, where both synthetic accessibility and biological activity contribute to drug-like properties.

1. Introduction

1.1 The Cyclic Peptide Permeability Challenge

Cyclic peptides occupy a unique space in drug discovery—large enough to target “undruggable” protein-protein interactions, yet potentially permeable enough to be orally bioavailable. The key challenge is predicting and optimizing cell permeability, which depends on a complex interplay of factors:

Molecular size and lipophilicity (traditional drug-like properties)
Intramolecular hydrogen bonding (shielding polar groups)
Conformational flexibility (ability to adopt permeable conformations)
Non-canonical modifications (N-methylation, D-amino acids, unusual residues)

These properties emerge from both the peptide sequence (biological view) and chemical modifications (synthetic view), neither of which fully captures the complete picture.

1.2 The Representation Problem

Traditional molecular representations force a choice:

Option 1: Chemical graph representation

Treats all atoms equally
Loses peptide sequence information
Computationally expensive for large molecules

Option 2: Sequence-based representation

Preserves biological information
Loses chemical modification details
Fails for non-canonical residues

Option 3: Hierarchical representation

Atoms → Functional groups → Molecule
Creates information bottleneck
Forces single organizational view

For modified cyclic peptides—with N-methylations, D-amino acids, and non-natural residues—none of these approaches is sufficient.

1.3 Our Contribution

We propose a parallel hierarchical decomposition that maintains both synthetic and biological views simultaneously:

        Atoms
       /      \
  Fragments  Residues
       \      /
       Molecule

This structure:

Preserves at least as much information as a serial hierarchy through either view alone (Theorem 2.1)
Enables cross-level interactions between fragments and residues
Naturally represents modified cyclic peptides
Facilitates learning of permeability-relevant features

2. Theoretical Foundations

2.1 Information-Theoretic Analysis

Definition 2.1 (Serial Hierarchical Decomposition)

A serial hierarchy processes information through a single intermediate level:

$\text{Atoms } (A) \to \text{Hierarchy } (H) \to \text{Molecule } (M)$

Definition 2.2 (Parallel Hierarchical Decomposition)

A parallel hierarchy processes information through multiple independent levels:

$\text{Atoms } (A) \to \{\text{Fragments } (F), \text{Residues } (R)\} \to \text{Molecule } (M)$

Theorem 2.1 (Information Preservation)

Parallel decomposition preserves at least as much information about the molecular property as a serial decomposition through either of its branches alone.

Proof:

Let $M$ denote the molecular property of interest and $I(X; M)$ the mutual information between a representation $X$ and $M$.

For serial decomposition: $H$ is a function of $A$, so by the data processing inequality:

$I(H; M) \leq I(A; M)$

For parallel decomposition: Using the chain rule of mutual information:

$I(F,R; M) = I(F; M) + I(R; M \mid F) = I(R; M) + I(F; M \mid R)$

Since conditional mutual information is non-negative:

$I(F,R; M) \geq \max(I(F; M), I(R; M))$

Since $F$ and $R$ are both functions of $A$, the data processing inequality also gives $I(F,R; M) \leq I(A; M)$, so the pair never carries more than the atoms do.

If the serial hierarchy $H$ is either $F$ or $R$ alone:

$I(F,R; M) \geq I(H; M)$

Therefore, parallel decomposition preserves at least as much information.

■
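As a hedged numerical illustration of Theorem 2.1, the sketch below compares the information that each view alone, and the pair of views together, carries about a property $M$. The toy distribution is an assumption of this example, not data from the framework: $F$ and $R$ are independent fair bits and $M$ is their XOR, a case where neither view alone says anything about $M$ while the pair determines it.

```python
from collections import defaultdict
from itertools import product
from math import log2

def mutual_information(joint):
    """I(X; Y) in bits for a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Toy assumption: F and R are independent fair bits, M = F XOR R.
samples = [(f, r, f ^ r) for f, r in product((0, 1), repeat=2)]
weight = 1 / len(samples)

def joint_of(view):
    """Joint distribution of (view(f, r), m) over the toy samples."""
    joint = defaultdict(float)
    for f, r, m in samples:
        joint[(view(f, r), m)] += weight
    return dict(joint)

i_f = mutual_information(joint_of(lambda f, r: f))        # fragment view only
i_r = mutual_information(joint_of(lambda f, r: r))        # residue view only
i_fr = mutual_information(joint_of(lambda f, r: (f, r)))  # both views
print(i_f, i_r, i_fr)
```

In this extreme case the single-view quantities are zero and the joint quantity is one bit. Real fragment and residue views overlap heavily, so the gap would be smaller, but the inequality still holds.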

2.2 Graph-Theoretic Framework

Definition 2.3 (Molecular Partition)

A partition $\pi$ of atomic set $A$ groups atoms into disjoint subsets.

Definition 2.4 (Fragment Partition)

$\pi_F$ groups atoms by synthetic building blocks (e.g., BRICS decomposition).

Definition 2.5 (Residue Partition)

$\pi_R$ groups atoms by biological units (amino acid residues).

Lemma 2.2 (Partition Refinement)

The meet (common refinement) of two partitions is at least as fine as either partition alone.

Proof:

For partitions $\pi_F$ and $\pi_R$, their meet $\pi_F \wedge \pi_R$ is defined as:

$\pi_F \wedge \pi_R = \{B_i \cap B_j \mid B_i \in \pi_F, B_j \in \pi_R, B_i \cap B_j \neq \emptyset\}$

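The entropy of a partition is $H(\pi) = -\sum_i (|B_i|/n) \log(|B_i|/n)$, where $n = |A|$.

Since each block in $\pi_F \wedge \pi_R$ is contained in one block of $\pi_F$ and one block of $\pi_R$, the meet refines both partitions:

$|\text{blocks}(\pi_F \wedge \pi_R)| \geq \max(|\text{blocks}(\pi_F)|, |\text{blocks}(\pi_R)|)$

Refining a partition never decreases its entropy: splitting a block of weight $p$ into parts of weight $p_1 + p_2 = p$ gives $-p_1 \log p_1 - p_2 \log p_2 \geq -p \log p$. Therefore:

$H(\pi_F \wedge \pi_R) \geq \max(H(\pi_F), H(\pi_R))$

The finer partition captures at least as much structural information.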

■
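The lemma can be checked on a small case. The following is a minimal sketch, assuming each partition is stored as one block label per atom; the 8-atom labelling is hypothetical and chosen only so that the two partitions cut the molecule differently.

```python
from collections import Counter
from math import log2

def partition_entropy(labels):
    """H(pi) = -sum_i (|B_i|/n) log2(|B_i|/n), given one block label per atom."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def meet(labels_a, labels_b):
    """Meet of two partitions: atoms share a block iff they share both labels."""
    return list(zip(labels_a, labels_b))

# Hypothetical 8-atom toy molecule; entries are block ids, one per atom.
pi_f = [0, 0, 0, 1, 1, 2, 2, 2]   # fragment partition
pi_r = [0, 0, 1, 1, 1, 1, 2, 2]   # residue partition
pi_meet = meet(pi_f, pi_r)

h_f, h_r, h_meet = map(partition_entropy, (pi_f, pi_r, pi_meet))
print(len(set(pi_f)), len(set(pi_r)), len(set(pi_meet)))
print(h_f, h_r, h_meet)
```

Here the meet has five blocks while each input partition has three, and its entropy exceeds both.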

2.3 Attention-Based Long-Range Interactions

Definition 2.6 (Graph Distance)

In a molecular graph $G$, $d(i,j)$ is the length (number of bonds) of the shortest path between atoms $i$ and $j$.

Definition 2.7 (Attention Mechanism)

Attention computes pairwise relevance scores:

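$\text{Attention}(i,j) = \text{softmax}_j\left(Q_i K_j^T / \sqrt{d_k}\right)$

where $d_k$ is the key dimension (distinct from the graph distance $d(i,j)$) and the softmax normalizes over $j$.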
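A minimal sketch of these scaled dot-product attention weights in plain Python follows. The three 2-dimensional embeddings are hypothetical, and using the same vectors as queries and keys is a simplification of this example (a trained model would apply learned projections first).

```python
from math import exp, sqrt

def attention_weights(queries, keys):
    """Row-wise softmax of Q K^T / sqrt(dim) for lists of equal-length vectors."""
    dim = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(dim) for k in keys]
        peak = max(scores)                      # subtract the max for stability
        exps = [exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three hypothetical 2-dimensional atom embeddings used as both Q and K.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = attention_weights(x, x)
for row in w:
    print([round(v, 3) for v in row])
```

Each row sums to one, and every atom receives a nonzero weight from every other atom regardless of how many bonds separate them.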

Theorem 2.3 (Information Flow Complexity)

Message passing requires $O(\text{diameter}(G))$ steps for global information flow, while attention requires $O(1)$ steps.

Proof:

In message passing with $k$ iterations:

Information from node $i$ reaches nodes within distance $k$
Full propagation requires $k = \text{diameter}(G)$ iterations
Time complexity: $O(\text{diameter}(G) \times |E|)$

With attention mechanism:

All pairs compute attention scores simultaneously
Information flows directly between any pair
Time complexity: $O(|V|^2)$ but depth $O(1)$

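For a cyclic peptide with $n$ residues, the residue-level ring has $\text{diameter}(G) = \lfloor n/2 \rfloor$ (the atom-level graph is deeper still), so attention reaches distant ring positions in a single layer, at the price of $O(|V|^2)$ pairwise scores.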
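The depth argument can be made concrete with a breadth-first search. This sketch assumes a residue-level ring with one node per residue; `ring_graph` and `rounds_to_reach_all` are names introduced here for illustration.

```python
from collections import deque

def ring_graph(n):
    """Adjacency list of an n-node cycle, a stand-in for a cyclic backbone."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def rounds_to_reach_all(adj, source):
    """Message-passing rounds until `source` has influenced every node (BFS depth)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

for n in (6, 10, 14):   # residue-level ring sizes
    print(n, rounds_to_reach_all(ring_graph(n), 0))
```

The number of rounds grows as half the ring size, whereas a single attention layer connects every pair directly.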

■

3. The Parallel Hierarchical Framework

3.1 Formal Construction

Definition 3.1 (Parallel Hierarchical Molecular Graph)

A tuple $(A, F, R, M, \varphi_F, \varphi_R, \psi)$ where:

$A$ = atomic features $\in \mathbb{R}^{n \times d_a}$
$F$ = fragment features $\in \mathbb{R}^{k_f \times d_f}$
$R$ = residue features $\in \mathbb{R}^{k_r \times d_r}$
$M$ = molecular features $\in \mathbb{R}^{d_m}$
$\varphi_F: A \to F$ (fragment assignment)
$\varphi_R: A \to R$ (residue assignment)
$\psi: F \times R \to M$ (molecular composition)
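One way to read Definition 3.1 as a data structure is sketched below. Several choices are assumptions of this example rather than part of the definition: $\varphi_F$ and $\varphi_R$ are stored as index lists, pooling is a plain mean (so $d_f = d_r = d_a$), and $\psi$ concatenates the averaged views.

```python
from dataclasses import dataclass

def mean_pool(features, assignment, num_groups):
    """Average atom feature vectors within each group (one row per group)."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for vec, g in zip(features, assignment):
        counts[g] += 1
        for j, value in enumerate(vec):
            sums[g][j] += value
    # Assumes every group index in range(num_groups) owns at least one atom.
    return [[s / c for s in row] for row, c in zip(sums, counts)]

@dataclass
class ParallelHierarchy:
    atoms: list   # A: n atom feature vectors of length d_a
    phi_f: list   # fragment index of each atom
    phi_r: list   # residue index of each atom

    def fragments(self):
        return mean_pool(self.atoms, self.phi_f, max(self.phi_f) + 1)

    def residues(self):
        return mean_pool(self.atoms, self.phi_r, max(self.phi_r) + 1)

    def molecule(self):
        """psi: concatenated column means of both views (an assumption)."""
        def column_means(rows):
            return [sum(col) / len(rows) for col in zip(*rows)]
        return column_means(self.fragments()) + column_means(self.residues())

# Hypothetical 4-atom example with 2-dimensional atom features.
g = ParallelHierarchy(
    atoms=[[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]],
    phi_f=[0, 0, 1, 1],
    phi_r=[0, 1, 1, 1],
)
print(g.fragments(), g.residues(), g.molecule())
```

The same atoms feed both branches, and the two branches group them differently, which is the property the rest of the framework relies on.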

3.2 Cross-Level Interactions

Definition 3.2 (Fragment-Residue Interaction Tensor)

$T \in \mathbb{R}^{k_f \times k_r \times d_c}$ captures interactions between fragments and residues.

For cyclic peptides, this captures critical relationships:

N-methylation (fragment) on specific residues affects permeability
D-amino acids (residue property) influence backbone conformation
Aromatic fragments participate in π-π stacking across residues
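As a hedged illustration of the tensor's shape, the sketch below fills $T$ with a purely structural quantity: the number of atoms that a fragment and a residue share, with $d_c = 1$. This overlap count is an assumption of the example. In a trained model the entries would be learned, and the overlap might serve only as a mask or prior.

```python
def overlap_tensor(phi_f, phi_r):
    """T[f][r] = [number of atoms shared by fragment f and residue r] (d_c = 1)."""
    k_f, k_r = max(phi_f) + 1, max(phi_r) + 1
    tensor = [[[0.0] for _ in range(k_r)] for _ in range(k_f)]
    for f, r in zip(phi_f, phi_r):
        tensor[f][r][0] += 1.0
    return tensor

# Hypothetical assignments for a 6-atom toy molecule.
phi_f = [0, 0, 1, 1, 2, 2]   # three fragments
phi_r = [0, 0, 0, 1, 1, 1]   # two residues
T = overlap_tensor(phi_f, phi_r)
for row in T:
    print(row)
```

Fragment 1 straddles both residues here, which is the kind of fragment-residue relationship (for example an N-methyl group attached to a specific residue) that the interaction tensor is meant to expose.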

3.3 Permeability-Relevant Features

The parallel structure naturally captures permeability determinants:

Fragment Level:

Lipophilic groups (permeability enhancers)
Polar functional groups (permeability barriers)
Hydrogen bond donors/acceptors

Residue Level:

Sequence patterns (e.g., Pro-Pro for rigidity)
D/L stereochemistry
N-methylation patterns

Cross-Level:

Intramolecular hydrogen bonds (polar group shielding)
Aromatic-aromatic interactions
Modification-sequence compatibility

4. Theoretical Properties

4.1 Representational Capacity

Theorem 4.1 (Capacity Bound)

The parallel hierarchy has higher representational capacity than a single hierarchy of comparable size.

Proof:

Representational capacity is measured here by the dimension of the feature space.

Single hierarchy:

$\dim(H) = k_h \times d_h$

Parallel hierarchy:

$\dim(F \times R) = k_f \times d_f + k_r \times d_r + k_f \times k_r \times d_c$

The cross-term $k_f \times k_r \times d_c$ represents additional capacity from interactions.

For cyclic peptides:

$k_f \approx$ number of modification types
$k_r \approx$ number of residues
Cross-term grows as $O(k_f \times k_r)$

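Therefore, whenever the single hierarchy is no larger than one branch of the parallel one, i.e. $k_h \times d_h \leq \max(k_f \times d_f, k_r \times d_r)$:

$\dim(F \times R) > \dim(H)$

Under this dimension-counting measure, the parallel hierarchy has higher representational capacity.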

■
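The dimension count is easy to evaluate for a concrete case. The sizes below are hypothetical (a cyclic hexapeptide with four fragment types and arbitrary feature widths) and serve only to show the relative weight of the cross-term.

```python
def serial_dim(k_h, d_h):
    """Feature-space dimension of a single hierarchy."""
    return k_h * d_h

def parallel_dim(k_f, d_f, k_r, d_r, d_c):
    """Fragment branch + residue branch + fragment-residue cross-term."""
    return k_f * d_f + k_r * d_r + k_f * k_r * d_c

# Hypothetical sizes: 4 fragment types, 6 residues, feature widths 32/32/8.
k_f, d_f, k_r, d_r, d_c = 4, 32, 6, 32, 8
single = serial_dim(k_r, d_r)                     # residue-only hierarchy
parallel = parallel_dim(k_f, d_f, k_r, d_r, d_c)
print(single, parallel, parallel - single)
```

With these numbers the residue-only hierarchy has 192 dimensions and the parallel one 512, of which 192 come from the cross-term. Dimension is only a proxy for capacity, so the comparison says nothing by itself about generalization.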

4.2 Gradient Flow Properties

Theorem 4.2 (Gradient Robustness)

Parallel pathways reduce gradient vanishing probability.

Proof:

Let $p$ be the probability of gradient vanishing through a single path.

Serial hierarchy:

$P(\text{gradient reaches atoms}) = 1 - p$

Parallel hierarchy with independent pathways:

$\begin{aligned} P(\text{gradient vanishes}) &= P(\text{both paths fail}) = p^2 \\ P(\text{gradient reaches atoms}) &= 1 - p^2 \end{aligned}$

Since $p \in (0,1)$:

$1 - p^2 > 1 - p$

The parallel structure provides gradient flow redundancy.

■
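The redundancy argument can be checked numerically. The sketch below evaluates the closed form and a seeded Monte Carlo estimate under the same independence assumption as the proof; the value $p = 0.3$ is arbitrary, and correlated failures between the two branches (which share the atom level) would shrink the gain.

```python
import random

def reach_probability(p, paths):
    """P(at least one of `paths` independent paths passes the gradient)."""
    return 1 - p ** paths

def simulate(p, paths, trials=100_000, seed=0):
    """Monte Carlo estimate under the same independence assumption."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() >= p for _ in range(paths)) for _ in range(trials)
    )
    return hits / trials

p = 0.3   # arbitrary per-path vanishing probability
print(reach_probability(p, 1), reach_probability(p, 2))
print(simulate(p, 1), simulate(p, 2))
```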

4.3 Compositional Learning

Definition 4.1 (Compositional Rules)

Learned mappings that generalize to novel combinations:

Fragment compatibility: P(f_i connects to f_j)
Residue compatibility: P(r_i follows r_j)
Cross-level compatibility: P(f_i compatible with r_j)
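As one hedged example of such a rule, residue compatibility can be estimated by counting adjacent pairs, with the last residue wrapping around to the first because the peptide is cyclic. The count-based estimator and the two training sequences are assumptions of this sketch; a learned model would parameterize these probabilities rather than tabulate them.

```python
from collections import Counter, defaultdict

def follow_probabilities(cyclic_sequences):
    """Estimate P(next residue | current residue) from cyclic sequences."""
    counts = defaultdict(Counter)
    for seq in cyclic_sequences:
        for i, current in enumerate(seq):
            nxt = seq[(i + 1) % len(seq)]   # wrap around: the ring closes
            counts[current][nxt] += 1
    return {
        cur: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for cur, ctr in counts.items()
    }

# Hypothetical training sequences (residue tokens only).
sequences = [
    ["Arg", "Phe", "Pro", "Val", "Leu"],
    ["Leu", "Pro", "Val", "Ala"],
]
probs = follow_probabilities(sequences)
print(probs["Pro"])
print(probs["Leu"])
```

The same counting scheme applies to fragment-fragment and fragment-residue pairs once their adjacency is defined.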

Conjecture 4.1: The parallel hierarchy learns compositional rules that generalize to novel modified peptides.

Supporting argument (not a proof): The architecture separates:

Local chemical rules (fragments)
Sequential patterns (residues)
Interaction rules (cross-level)

This factorization encourages learning reusable components rather than memorizing complete structures.

5. Application to Cyclic Peptide Permeability

5.1 Why Parallel Hierarchy Suits Cyclic Peptides

Cyclic peptides have an inherently dual nature:

Biological organization (Residues)

CYCLO(Arg-D-Phe-Pro-NMe-Val-Leu)

Sequence determines backbone conformation
Residue properties affect recognition

Synthetic organization (Fragments)

Guanidine-Phenyl-Pyrrolidine-NMethyl-Isopropyl-Isobutyl

Functional groups determine lipophilicity
Modifications affect permeability

Neither view is complete; both are necessary.
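The two views above can be derived from the same notation string. The following is a minimal sketch, assuming the `CYCLO(...)` notation with `D-` and `NMe-` written as prefixes of the residue they modify; the side-chain lookup table is hypothetical, and a real pipeline would derive fragments from the structure (for example by a BRICS-style decomposition).

```python
import re

# Hypothetical side-chain fragment lookup for the residues in the example.
SIDE_CHAIN_FRAGMENT = {
    "Arg": "Guanidine", "Phe": "Phenyl", "Pro": "Pyrrolidine",
    "Val": "Isopropyl", "Leu": "Isobutyl",
}

def parse_cyclic(notation):
    """Split 'CYCLO(...)' into residues with D- and NMe- prefixes attached."""
    body = re.fullmatch(r"CYCLO\((.*)\)", notation).group(1)
    residues, pending = [], []
    for token in body.split("-"):
        if token in ("D", "NMe"):
            pending.append(token)           # modifier for the next residue
        else:
            residues.append({"name": token, "modifiers": pending})
            pending = []
    return residues

def fragment_view(residues):
    """List fragments in sequence order, adding N-methyl groups where present."""
    fragments = []
    for res in residues:
        if "NMe" in res["modifiers"]:
            fragments.append("NMethyl")
        fragments.append(SIDE_CHAIN_FRAGMENT[res["name"]])
    return fragments

peptide = parse_cyclic("CYCLO(Arg-D-Phe-Pro-NMe-Val-Leu)")
print([r["name"] for r in peptide])
print("-".join(fragment_view(peptide)))
```

Note that the D configuration of Phe survives only in the residue view: the fragment string is identical for L-Phe and D-Phe, which is one concrete sense in which neither view is complete.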

5.2 Permeability Feature Extraction

The parallel hierarchy naturally extracts permeability-relevant features:

Theorem 5.1 (Feature Completeness)

The parallel decomposition captures each of the first-order permeability determinants listed below.

Proof sketch: Permeability determinants include:

Size/lipophilicity → captured by fragment features
Hydrogen bonding → fragment-residue interactions
Flexibility → residue sequence patterns
Charge distribution → both levels contribute

Each determinant maps to architectural components:

Fragments encode chemical properties
Residues encode conformational preferences
Cross-attention captures shielding effects

Each listed determinant is therefore covered by at least one component of the parallel structure.

5.3 Modified Residue Representation

Example: N-methylated leucine

Traditional hierarchy struggles:

Is it a modified leucine? (residue view)
Is it a specific chemical structure? (fragment view)

Parallel hierarchy:

Residue level: Leucine-like backbone position
Fragment level: N-methyl modification, isobutyl side chain
Cross-level: N-methylation at this position affects backbone flexibility

This factorization enables generalization to novel modifications.
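The generalization claim can be phrased as a coverage check: a modified residue that never appears as a whole may still be built entirely from parts that have been seen. The sketch below assumes a hypothetical `ResidueToken` record and a two-item training vocabulary; it illustrates the factorization only, not how a model would use it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResidueToken:
    backbone: str          # residue-level identity, e.g. "Leu"
    fragments: frozenset   # fragment-level parts attached at this position

def components(token):
    """Factor a token into independently reusable parts."""
    return {("residue", token.backbone)} | {("fragment", f) for f in token.fragments}

# Hypothetical vocabulary seen during training.
seen = [
    ResidueToken("Leu", frozenset({"Isobutyl"})),
    ResidueToken("Val", frozenset({"Isopropyl", "NMethyl"})),
]
known_parts = set().union(*(components(t) for t in seen))

# N-methylated leucine never appears as a whole token above ...
nme_leu = ResidueToken("Leu", frozenset({"Isobutyl", "NMethyl"}))
is_novel_whole = nme_leu not in seen
# ... yet every one of its parts is already known.
is_covered = components(nme_leu) <= known_parts
print(is_novel_whole, is_covered)
```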

6. Theoretical Implications

6.1 Connection to Multi-View Learning

The parallel hierarchy implements a form of multi-view learning where views are:

Structurally coupled (share atomic foundation)
Semantically distinct (capture different chemical principles)
Mutually informative (cross-level interactions)

This relates to co-training and multiple kernel learning theory.

6.2 Inductive Biases

The architecture encodes strong but complementary biases:

Fragment bias: Local chemical environment determines properties

Matches medicinal chemistry intuition
Enables fragment-based drug design reasoning

Residue bias: Sequence patterns determine structure

Matches peptide chemistry knowledge
Enables sequence-based optimization

By maintaining both biases simultaneously, the model avoids premature commitment to a single view.

6.3 Theoretical Limitations

Limitation 1: Not all molecules benefit from dual representation

Small molecules may not have meaningful residue structure
Proteins might need additional hierarchical levels

Limitation 2: Optimal decomposition is task-dependent

Different partitions might suit different properties
No universal “best” decomposition exists

Limitation 3: Computational overhead

Maintains multiple feature sets
Requires cross-level attention computation

7. Experimental Directions

While this work focuses on theoretical foundations, the framework suggests several testable hypotheses:

7.1 Compositional Generalization

Hypothesis: Models using the parallel hierarchy should generalize better to:

Novel combinations of known modifications
Longer/shorter peptide sequences
Different cyclization patterns
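Testing the first item requires a split in which the test set contains unseen combinations of modifications that were each seen individually. One possible sketch is given below; the record layout (`id`, `mods`) and the held-out combinations are hypothetical.

```python
def split_by_combination(peptides, held_out):
    """Hold out chosen modification combinations, keeping only test items
    whose individual modifications all occur somewhere in training."""
    train = [p for p in peptides if p["mods"] not in held_out]
    candidates = [p for p in peptides if p["mods"] in held_out]
    seen = set()
    for p in train:
        seen |= p["mods"]
    test = [p for p in candidates if p["mods"] <= seen]
    return train, test

# Hypothetical records: each peptide carries a set of modification labels.
peptides = [
    {"id": "p1", "mods": frozenset({"NMe"})},
    {"id": "p2", "mods": frozenset({"D"})},
    {"id": "p3", "mods": frozenset({"NMe", "D"})},
    {"id": "p4", "mods": frozenset({"NMe", "peptoid"})},
]
held_out = {frozenset({"NMe", "D"}), frozenset({"NMe", "peptoid"})}
train, test = split_by_combination(peptides, held_out)
print([p["id"] for p in train], [p["id"] for p in test])
```

Peptide `p4` is dropped from the test set because its `peptoid` modification never occurs in training, so it would probe a novel modification rather than a novel combination.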

7.2 Interpretability

Hypothesis: Learned attention patterns should reveal:

Which modifications affect permeability
Critical fragment-residue interactions
Structural motifs for permeability

7.3 Transfer Learning

Hypothesis: The factorized representation should enable:

Transfer between peptide families
Knowledge sharing across modification types
Few-shot learning for novel residues

8. Related Theoretical Frameworks

8.1 Connection to Category Theory

The parallel decomposition can be viewed as a span in the category of molecular representations:

$F \leftarrow A \rightarrow R$

with the molecular level as the pushout combining both views.

8.2 Connection to Information Geometry

The parallel hierarchy defines a product manifold:

$\text{M}_{\text{molecule}} = \text{M}_{\text{fragment}} \times \text{M}_{\text{residue}}$

with the Riemannian metric incorporating both geometric structures.

8.3 Connection to Tensor Decomposition

The framework performs an implicit tensor factorization:

$\text{T}_{\text{molecule}} \approx \text{T}_{\text{atom}} \times_1 \text{U}_{\text{fragment}} \times_2 \text{U}_{\text{residue}} \times_3 \text{U}_{\text{interaction}}$

similar to Tucker decomposition but with chemical constraints.
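Written element-wise under the standard mode-$n$ product convention, with the index names $i, j, k, a, b, c$ introduced here only for illustration, the factorization reads:

$(\text{T}_{\text{molecule}})_{ijk} \approx \sum_{a,b,c} (\text{T}_{\text{atom}})_{abc} \, (\text{U}_{\text{fragment}})_{ia} \, (\text{U}_{\text{residue}})_{jb} \, (\text{U}_{\text{interaction}})_{kc}$

In this reading $\text{T}_{\text{atom}}$ plays the role of the Tucker core and each $\text{U}$ is a factor matrix acting on one mode.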

9. Conclusions

9.1 Summary of Contributions

We presented a theoretical framework for molecular embeddings based on parallel hierarchical decomposition. Key theoretical results include:

Proof that parallel decomposition preserves at least as much information as a serial hierarchy through either view alone
Analysis of gradient flow properties showing improved robustness
Framework for representing modified cyclic peptides with dual organization
Connection to permeability prediction requirements

9.2 Implications for Drug Discovery

The parallel hierarchy framework:

Addresses the unique challenges of modified cyclic peptides
Preserves both synthetic and biological information
Enables learning of compositional rules
Facilitates permeability prediction and optimization

9.3 Future Theoretical Work

Open questions include:

Optimal decomposition: How to choose the best fragment/residue partitions?
Theoretical guarantees: Can we prove generalization bounds?
Extension to other modalities: Can this framework incorporate 3D structure?
Automated decomposition: Can we learn the hierarchical structure from data?

The parallel hierarchical framework provides a theoretically grounded approach to molecular representation that aligns with the dual nature of cyclic peptides, offering a principled foundation for permeability prediction and molecular design.


This theoretical framework is part of ongoing research in cyclic peptide drug discovery. A complete implementation with empirical validation is forthcoming.
