Since 2013, generative neural networks have been used for tasks such as generating audio or image data. However, no publication has yet used their capabilities for de novo ligand and/or protein design. This work introduces a generative neural network, the PG-VUGAN (progressively growing variational U-NET generative adversarial network), which is intended to fill this knowledge gap.
The PG-VUGAN consumes a rich molecular image (RMI) of either the ligand or the pocket and generates its complementary counterpart. This paper demonstrates the approach in practice for de novo ligand design. The RMI is a new image-based format for molecular structures, designed specifically for efficient processing by convolutional neural networks. Its suitability is demonstrated by developing a state-of-the-art binding-affinity regressor. In summary, this work takes a first step towards artificially generated ligands and proteins via generative neural networks.
Protein-ligand interactions control cellular processes and are therefore essential for all living beings. Hence, generating complementary ligands for a protein structure or, vice versa, predicting complementary protein structures for ligands is a desirable scientific goal. Possible use cases for de novo ligand and protein design can be found in all fields of biotechnology and range from drug discovery and individualized medicine to the creation of artificial enzymes.
Designing these molecules from scratch is challenging, and the technology for de novo design is still in its early stages. The reason is that existing tools rely on expert assumptions and on mathematical approximations that can only partly simulate the real physical nature of these molecules. Artificial neural networks promise to overcome these limitations.
Table of Contents
Introduction
Overview
Basics
1. Biological background and terms
1.1 Proteins
1.2 The key lock principle
1.2.1 Drugs and receptors
1.2.2 Intermolecular interactions
1.3 Enzyme, theozyme and theosite
2. Data formats for molecular structures
2.1 1D – Arrays
2.1.1 Atom list / rich atom list
2.1.2 SMILES
2.1.3 Descriptors
2.1.4 Fingerprints
2.2 2D-matrix
2.2.1 Adjacency matrix
2.2.1.1 Coulomb matrix
2.2.1.2 Contact map and coevolutionary analysis
2.2.2 Images of a visualization tool
2.3 3D-Matrix
2.3.1 Voxel representations
2.3.1.1 Rich voxel
2.3.1.2 Wavelet
2.3.2 GRID maps (3D - pharmacophore)
Drug and protein design
3. Drug design
3.1 Structure based drug design
3.1.1 Docking and virtual high throughput screening
3.1.2 Scoring functions
3.1.2.1 Assisted model building with energy refinement
3.1.3 Incremental construction docking tools and FlexX
3.1.4 Evolutionary algorithms and Autodock 4.2
3.1.5 Shape-based docking
3.2 Ligand based drug design
3.2.1 Library search
3.2.2 Quantitative-structure-activity relationships models
3.3 De novo drug design via molecular modeling
3.3.1 Incremental construction algorithms
3.3.1.1 LUDI
3.3.1.2 FlexNovo
3.3.2 Evolutionary algorithms
4. Protein design
4.1 Directed evolution
4.2 Rational design
4.3 De novo protein design
4.3.1 Rosetta Commons
4.3.1.1 Rosetta (ab initio) structure prediction
4.3.1.2 Rosetta Match
4.3.1.3 RosettaDesign
4.3.2 ScaffoldSelection
Deep learning
5. Recent architectural enhancements of deep models and new architectures
5.1 Deep residual learning
5.2 Inception modules & InceptResNet v2
5.3 Attention modules
5.3.1 Filter-generating network
5.3.2 Squeeze-and-Excitation block
5.3.3 Spatial transformer
5.3.4 Residual attention module
5.4 3D convolutional neural networks
5.5 Multi-view networks
5.6 Graph convolutional networks
5.7 Tree-LSTM
5.7.1 LSTM - cell
5.7.2 N-ary Tree-LSTM cell
6. Generative neural networks
6.1 Generative adversarial network
6.2 Autoencoders
6.2.1 Variational autoencoder
6.2.2 Adversarial autoencoder
6.3 VAEGAN
7. Recent proceedings in generative neural networks
7.1 Common issues of training GANs and how to deal with them
7.1.1 Mini batch discrimination
7.1.2 Feature matching
7.1.3 Historical averaging
7.1.4 Noisy labels
7.1.5 Semi-Supervised GAN
7.1.6 Least squares GAN
7.1.7 Wasserstein GAN
7.1.7.1 WGAN with gradient penalty
7.2 Architectures recently proven useful
7.2.1 U-NET
7.2.2 Variational U-NET
7.2.3 Patch networks
7.2.4 Discovery GAN
7.2.5 BicycleGAN
7.2.6 StackGAN
7.2.7 Progressively growing GAN
8. Deep learning for drug discovery
8.1 De novo drug design via deep learning
8.1.1 SMILES variational autoencoder
8.1.2 Wavelet autoencoder
8.1.3 druGAN
8.2 Feature regressors for molecular properties
8.2.1 KDEEP: a 3D convolutional network
8.2.2 SchNET: a graph convolutional network
9. Dealing with dataset limitations
9.1 Data augmentation
9.2 Transfer learning
9.3 Multitask learning
De novo 3D ligand and protein design via deep learning
10. Overview: applied de novo design
11. Data preparation
11.1 Datasets
11.1.1 PDBbind
11.1.1.1 Demarcation to Binding MOAD
11.1.1.2 Machine learning based methods for binding affinity regression in comparison
11.2 Test complexes: Ibuprofen, HIV-Integrase and 3-dehydroquinate dehydratase
11.3 Pose normalization
Explorative phase
12. RAL based approaches
12.1 SchNET variations
12.2 VAE and double VAE(GAN) for protein-complexes with SchNET
12.3 Ligand autoencoding with RAL based VAEs
12.3.1 Strategies to tackle the sparsity problem
12.4 Conclusion RAL based approach
13. Rich molecular image based approaches
13.1 Rich molecular image
13.2 VAEGANs for ligands represented as rich molecular image
13.2.1 PDBbind analysis and dataset reduction
13.2.2 RMI based VAEGAN on the filtered dataset
13.3 Conclusions for RMI based VAE and VAEGAN approaches
14. VUNET and the VUGAN for de novo design
14.1 VUNET
14.2 VUGAN
14.2.1 VUGAN trained on the RV format
14.3 How to use the VUNET and VUGAN for de novo protein design
14.4 Additional use-cases
15. Summary and conclusion of the explorative phase
Refinement phase
16. PG-VUGAN for de novo design
16.1 Loss functions, penalties, and output variations
17. Improving the rich molecular image format
17.1 Rich molecular image with atomic radius
17.2 Min-max scaling
17.3 RMI for complexes
17.3.1 Ligand vs. complex PCA based pose normalization
17.3.2 Comparing representations for complexes
18. Designing a binding affinity regressor
18.1 Convolutional architectures in comparison
18.2 Designing a binding affinity regressor
18.3 Multi-view networks
19. Compensating the limitations of the PDBbind dataset
19.1 Multi-task learning
19.2 Network-based transfer learning
19.3 Data augmentation
20. Abridgement of the engagements towards increased binding affinity regression performance
20.1 MV-DilSEption: a model with beneficial contributions
21. Rethinking the PG-VUGAN method
21.1 Architecture
21.2 Reducing the output channels of the rich molecular image
21.3 Transfer learning
21.4 Image resizing
21.5 Loss contributions
21.6 Growing procedure
21.6.1 Initiation criterion
21.6.2 Layer fade-in
21.7 Stabilizing the adversarial training
21.7.1 Least-squares GAN
21.7.2 Semi-supervised learning
21.7.3 Mini batch discrimination
21.7.4 Feature matching
21.7.4.1 Activity penalty for the discriminator's feature matching layer
21.7.5 Using a latent feature regressor
21.7.6 Training balancing
21.7.6.1 Data balancing
21.7.6.2 Loss normalization
21.7.6.3 Generator / discriminator training ratio balancing
21.8 The approach as pseudo code
21.9 Result
22. Summary and conclusion
Research Goals and Topics
This Master's thesis aims to develop a novel computational approach for de novo 3D ligand and protein design using generative artificial neural networks. The core research question addresses whether it is possible to realize an in silico system that generates a complementary ligand for a specific protein binding-site (and vice versa) by leveraging deep learning architectures, thereby overcoming the limitations of existing fragment-based or heuristic-driven design methods.
- Application of Generative Adversarial Networks (GANs) and Autoencoders for molecular structure generation.
- Development of the Rich Molecular Image (RMI) format for efficient processing by Convolutional Neural Networks.
- Design and evaluation of the PG-VUGAN (Progressively growing variational U-NET generative adversarial network) for ligand-protein complex modeling.
- Comparison of various deep learning architectures and data representations for binding affinity regression.
- Strategies to mitigate dataset limitations and the "sparsity problem" in molecular 3D representations.
Excerpt from the Book
3.1 Structure based drug design
The cardinal technique of structure-based drug design (SBDD) is to perform a virtual high-throughput screening (vHTS) with a docking tool. More than 60 tools exist, which can be categorized as incremental construction algorithms (most popular: FlexX), evolutionary algorithms (most popular: AutoDock), and shape-based algorithms (most popular: Dock) [46]. The docking procedure itself can then be performed rigidly or flexibly. The most popular method of each category is summarized in this section.
In vHTS, a database of compounds is screened for possible target interaction by docking the molecules into the pocket [47]. The term “docking” means that the compound is placed into the binding pocket and that its free energy (a pseudo binding affinity) is calculated with a scoring function. The purpose of a docking tool is to minimize this free energy, e.g. by repeatedly updating the molecule’s orientation and position in the pocket. The top-scoring compounds are taken as possible leads, filtered for druggability, and processed further in the drug design pipeline.
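The search loop described above can be illustrated with a minimal sketch. It is not the algorithm of FlexX, AutoDock, or Dock: the score is an assumed Lennard-Jones-like toy term, the pose has only a translation and a rotation about the z-axis, and the names `toy_score`, `transform`, and `dock` as well as all coordinates are hypothetical.

```python
import math
import random

def toy_score(ligand, pocket):
    """Assumed toy pseudo-energy: 12-6 term over all ligand/pocket atom pairs (lower is better)."""
    total = 0.0
    for atom in ligand:
        for site in pocket:
            r = max(math.dist(atom, site), 0.5)  # clamp to avoid a numerical blow-up
            total += (3.0 / r) ** 12 - 2.0 * (3.0 / r) ** 6
    return total

def transform(ligand, shift, angle):
    """Rotate the ligand about the z-axis through the origin, then translate it by shift."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in ligand]

def dock(ligand, pocket, steps=2000, seed=0):
    """Random-perturbation search: keep a new pose only if it lowers the score."""
    rng = random.Random(seed)
    shift, angle = [0.0, 0.0, 0.0], 0.0
    start = best = toy_score(transform(ligand, shift, angle), pocket)
    for _ in range(steps):
        new_shift = [v + rng.gauss(0.0, 0.2) for v in shift]
        new_angle = angle + rng.gauss(0.0, 0.1)
        score = toy_score(transform(ligand, new_shift, new_angle), pocket)
        if score < best:
            shift, angle, best = new_shift, new_angle, score
    return (shift, angle), start, best

# Made-up coordinates for illustration only.
pocket = [(4.0, 0.0, 0.0), (6.0, 2.0, 0.0), (6.0, -2.0, 0.0)]
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
final_pose, start_score, best_score = dock(ligand, pocket)
print(f"score before: {start_score:.3f}, after: {best_score:.3f}")
```

Real docking tools replace both parts of this sketch: the scoring function with one of the families discussed next, and the random search with incremental construction, evolutionary, or shape-based strategies.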
The core of each docking tool is its scoring function, which tries to calculate the free energy. Scoring functions are grouped into force fields, knowledge-based methods, empirical methods, and machine-learning-based methods. A force field consists of equations that express real physical structure interactions, as described in section 1.2.2, by mathematical approximations. Knowledge-based methods add terms to the force field that consider pairwise energy potentials extracted from known ligand-receptor complexes [48].
Empirical scoring functions add terms to the equation that take the number of occurrences of various interaction types, such as hydrogen bonds, into account; the individual coefficients of each term are usually determined by fitting regression models. Machine learning methods replace all “handcrafted” mathematical equations with an approximation learned directly from data [49]. Note that the term force field is sometimes used to embrace all types of scoring functions and that scoring functions can indeed be freely mixed.
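As a rough illustration of how an empirical scoring function obtains its coefficients, the following sketch fits a linear model to interaction counts by ordinary least squares. The choice of features (hydrogen bonds, hydrophobic contacts, rotatable bonds), the weights, and the data are synthetic assumptions made for this example and are not taken from any published scoring function; `fit_linear` and `empirical_score` are hypothetical names.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations, solved by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

def empirical_score(weights, features):
    """Weighted sum of interaction counts; the leading feature 1 carries the intercept."""
    return sum(w * f for w, f in zip(weights, features))

# Synthetic training set: [1, hydrogen bonds, hydrophobic contacts, rotatable bonds].
X = [[1, 2, 5, 1], [1, 4, 3, 2], [1, 1, 8, 0], [1, 3, 6, 4], [1, 0, 2, 3]]
true_w = [-1.0, -0.6, -0.2, 0.3]          # assumed "ground truth" used to generate labels
y = [empirical_score(true_w, row) for row in X]

fitted = fit_linear(X, y)
print([round(w, 6) for w in fitted])
```

Because the labels here are generated without noise, the regression recovers the assumed weights; with measured affinities the fit would only approximate them.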
Summary of Chapters
Biological background and terms: This chapter provides a brief introduction to the essential concepts of proteins, the key-lock principle, and intermolecular interactions relevant for understanding protein-ligand binding.
Data formats for molecular structures: This section details how molecular data is encoded for computational use, covering 1D arrays, 2D matrices, and 3D voxel-based representations.
Drug design: This chapter reviews traditional structure-based and ligand-based drug design techniques, including docking, evolutionary algorithms, and QSAR models.
Protein design: This chapter outlines methodologies for designing new protein functions or structures, contrasting directed evolution and rational design with de novo protein design.
Recent architectural enhancements of deep models and new architectures: An overview of modern deep learning building blocks, such as residual connections, inception modules, and attention mechanisms, and their applicability to 3D data.
Generative neural networks: This chapter introduces the foundational concepts of GANs, autoencoders, and their variants, serving as the basis for the proposed design models.
Recent proceedings in generative neural networks: A discussion on common challenges in training generative models, such as mode collapse and non-convergence, and architectural solutions like U-NETs and progressively growing architectures.
Deep learning for drug discovery: A survey of existing deep learning-based approaches for drug design and property regression, highlighting the shift toward generative modeling.
Dealing with dataset limitations: This chapter explores techniques to overcome data scarcity in deep learning, specifically focusing on data augmentation, transfer learning, and multitask learning.
Overview: applied de novo design: A structural guide illustrating how the different components of the proposed system (data prep, model, loss function) are composed for de novo design.
Data preparation: Describes the datasets used in this work (e.g., PDBbind, QM9) and the preprocessing steps, including pose normalization, for molecular structure generation.
RAL based approaches: Details the first explorative experiments using RAL (rich atom list) representations and the challenges encountered with data sparsity.
Rich molecular image based approaches: Introduces the Rich Molecular Image (RMI) format as a solution to represent 3D molecular data for image-based deep learning models.
VUNET and the VUGAN for de novo design: Explains the development and implementation of the VUNET and VUGAN architectures, which combine U-NET capabilities with variational inference for generative tasks.
Summary and conclusion of the explorative phase: A recap of the insights gained from early prototypes and the transition to refined methodologies.
PG-VUGAN for de novo design: Describes the final architectural refinement, introducing progressive growing to improve the resolution and generation quality of the proposed models.
Improving the rich molecular image format: Investigates optimizations to the RMI format, including atomic radii and normalization, to enhance the regression performance and structural detail.
Designing a binding affinity regressor: Documents the development of models to estimate binding affinity, leading to the identification of effective architectures like DilSEption.
Compensating the limitations of the PDBbind dataset: Assesses the impact of multitask learning and transfer learning on model performance and generalization.
Abridgement of the engagements towards increased binding affinity regression performance: A summary of the efforts made to improve the regression models during the refinement phase.
Rethinking the PG-VUGAN method: Reviews the final architectural revisions and training stabilization techniques applied to the PG-VUGAN.
Summary and conclusion: Concludes the thesis by evaluating the achievements and pointing out potential directions for future research in de novo ligand and protein design.
Keywords
De novo drug design, Protein design, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Deep Learning, Bioinformatics, Binding affinity, Rich Molecular Image (RMI), PDBbind, Protein-ligand interaction, Convolutional Neural Networks, Molecular modeling, Structure-based drug design, Transfer learning, Multitask learning
Frequently Asked Questions
What is the primary focus of this thesis?
The thesis focuses on realizing de novo 3D ligand and protein design in silico using artificial neural networks, specifically exploring how to generate complementary structures for a given protein binding-site.
Which molecular representation formats are evaluated?
The work evaluates various formats, including 1D arrays (RAL, SMILES), 2D matrices (adjacency matrices, images), and 3D tensors (voxels, wavelets, RMI), ultimately identifying the RMI format as particularly performant.
What is the core objective of the research?
The primary goal is to find a deep learning architecture that can generate 3D molecular structures by explicitly considering the target pocket, thereby bypassing traditional, error-prone computational chemistry approximations.
Which scientific methodology is primarily applied?
The work employs deep generative modeling, specifically using VAEGAN and PG-VUGAN (Progressively growing variational U-NET GAN) architectures, trained through an explorative and refinement phase using PDBbind data.
What are the main components of the proposed architecture?
The proposed system integrates U-NET-based encoders/decoders, spatial attention mechanisms, and feature regressors, optimized with progressive growing techniques and custom loss functions like the modified binary cross-entropy (mBCE).
How does the work address data scarcity?
To mitigate the limitations of small datasets, the research investigates multitask learning, network-based transfer learning, and data augmentation through rotational sampling of molecular structures.
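Augmentation by rotational sampling can be sketched as follows. The sketch assumes uniformly random rotations about the molecule's centroid, which may differ from the sampling scheme used in the thesis; `random_rotation` and `augment` are hypothetical helper names, and the coordinates are made up.

```python
import math
import random

def random_rotation(rng):
    """Uniform random 3D rotation matrix built from a normalized Gaussian quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(v * v for v in q))
    w, x, y, z = (v / norm for v in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def augment(coords, rng):
    """Return a copy of the atom coordinates, randomly rotated about their centroid."""
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    rot = random_rotation(rng)
    rotated = []
    for c in coords:
        d = [c[i] - centroid[i] for i in range(3)]
        rotated.append(tuple(
            sum(rot[r][k] * d[k] for k in range(3)) + centroid[r] for r in range(3)
        ))
    return rotated

rng = random.Random(42)
molecule = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.3)]  # made-up coordinates
samples = [augment(molecule, rng) for _ in range(4)]
```

Each sample shows the same molecule in a new orientation, so interatomic distances are unchanged while the input seen by the network differs.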
How does the RMI format improve upon traditional representations?
The Rich Molecular Image format maps 3D coordinate data into 2D matrices that can be processed by established, high-performance image-processing CNNs while retaining necessary conformational and depth information.
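The general idea of mapping 3D coordinates to 2D matrices while keeping depth can be illustrated with a simple orthographic projection. This is explicitly not the RMI definition from the thesis, whose channels and encoding are not specified here; the two channels (atomic number and z coordinate), the grid size, and the function name `depth_image` are assumptions chosen for illustration.

```python
import math

def depth_image(atoms, size=8, extent=4.0):
    """Project atoms (x, y, z, atomic_number) onto a size x size grid over [-extent, extent).

    Each pixel keeps the atom with the largest z (closest to the viewer):
    channel 0 holds its atomic number, channel 1 its z coordinate.
    """
    element = [[0] * size for _ in range(size)]
    depth = [[0.0] * size for _ in range(size)]
    scale = size / (2 * extent)
    for x, y, z, number in atoms:
        col = math.floor((x + extent) * scale)
        row = math.floor((y + extent) * scale)
        if 0 <= row < size and 0 <= col < size:
            # An empty pixel is marked by atomic number 0.
            if element[row][col] == 0 or z > depth[row][col]:
                element[row][col] = number
                depth[row][col] = z
    return element, depth

# Made-up atoms: a carbon, an oxygen in front of it, and a nitrogen in a corner.
atoms = [(0.0, 0.0, 1.0, 6), (0.2, 0.1, 2.0, 8), (-3.5, 3.5, 0.0, 7)]
element, depth = depth_image(atoms)
```

The resulting channels form an ordinary multi-channel image, which is what allows standard 2D convolutional architectures to be applied to molecular structures.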
What are the main findings regarding the effectiveness of the proposed design?
The results show that while the PG-VUGAN could successfully learn coarse spatial arrangements and ligand-like shapes, the generation of chemically valid, high-resolution structures remains a significant challenge, necessitating further architectural optimizations.
- Quote paper
- Matthias Rieger (Author), 2019, Steps towards de Novo 3D Ligand and Protein Design via Deep Learning, Munich, GRIN Verlag, https://www.grin.com/document/926236