Steps towards de Novo 3D Ligand and Protein Design via Deep Learning

Master's Thesis, 2019, 167 Pages, Grade: 1,3

Author: Matthias Rieger

Computer Science - Bioinformatics

Since 2013, generative neural networks have been used for tasks such as generating audio or image data. However, no publication has yet used their capabilities for de novo ligand or protein design. This work introduces a generative neural network, the PG-VUGAN (progressively growing variational U-NET generative adversarial network), which is intended to fill this knowledge gap.

The PG-VUGAN takes a rich molecular image (RMI) of either the ligand or the pocket as input and generates its complementary counterpart. This is demonstrated in practice for de novo ligand design. The RMI is a new image-based format for molecular structures, designed specifically for efficient processing by convolutional neural networks. Its suitability is demonstrated by developing a state-of-the-art binding-affinity regressor. In sum, this work takes a first step towards artificially generated ligands and proteins via generative neural networks.

Protein-ligand interactions control cellular processes and are therefore essential for all living beings. Hence, generating complementary ligands for a protein structure or, vice versa, predicting complementary protein structures for ligands is a desirable scientific goal. Possible use cases for de novo ligand and protein design can be found in all fields of biotechnology and range from drug discovery and personalized medicine to the creation of artificial enzymes.

Designing these molecules from scratch is challenging, and the technology for de novo design is still in its early stages. The reason is that existing tools rely on expert assumptions and on mathematical approximations that can only partly simulate the real physical nature of these molecules. Artificial neural networks promise to overcome these limitations.

Table of Contents

Introduction

Overview

Basics

1. Biological background and terms

1.1 Proteins

1.2 The key lock principle

1.2.1 Drugs and receptors

1.2.2 Intermolecular interactions

1.3 Enzyme, theozyme and theosite

2. Data formats for molecular structures

2.1 1D – Arrays

2.1.1 Atom list / rich atom list

2.1.2 SMILES

2.1.3 Descriptors

2.1.4 Fingerprints

2.2 2D-matrix

2.2.1 Adjacency matrix

2.2.1.1 Coulomb matrix

2.2.1.2 Contact map and coevolutionary analysis

2.2.2 Images of a visualization tool

2.3 3D-Matrix

2.3.1 Voxel representations

2.3.1.1 Rich voxel

2.3.1.2 Wavelet

2.3.2 GRID maps (3D - pharmacophore)

Drug and protein design

3. Drug design

3.1 Structure based drug design

3.1.1 Docking and virtual high throughput screening

3.1.2 Scoring functions

3.1.2.1 Assisted model building with energy refinement

3.1.3 Incremental construction docking tools and FlexX

3.1.4 Evolutionary algorithms and Autodock 4.2

3.1.5 Shape-based docking

3.2 Ligand based drug design

3.2.1 Library search

3.2.2 Quantitative-structure-activity relationships models

3.3 De novo drug design via molecular modeling

3.3.1 Incremental construction algorithms

3.3.1.1 LUDI

3.3.1.2 FlexNovo

3.3.2 Evolutionary algorithms

4. Protein design

4.1 Directed evolution

4.2 Rational design

4.3 De novo protein design

4.3.1 Rosetta Commons

4.3.1.1 Rosetta (ab initio) structure prediction

4.3.1.2 Rosetta Match

4.3.1.3 RosettaDesign

4.3.2 ScaffoldSelection

Deep learning

5. Recent architectural enhancements of deep models and new architectures

5.1 Deep residual learning

5.2 Inception modules & InceptResNet v2

5.3 Attention modules

5.3.1 Filter-generating network

5.3.2 Squeeze-and-Excitation block

5.3.3 Spatial transformer

5.3.4 Residual attention module

5.4 3D convolutional neural networks

5.5 Multi-view networks

5.6 Graph convolutional networks

5.7 Tree-LSTM

5.7.1 LSTM - cell

5.7.2 N-ary Tree-LSTM cell

6. Generative neural networks

6.1 Generative adversarial network

6.2 Autoencoders

6.2.1 Variational autoencoder

6.2.2 Adversarial autoencoder

6.3 VAEGAN

7. Recent proceedings in generative neural networks

7.1 Common issues of training GANs and how to deal with them

7.1.1 Mini batch discrimination

7.1.2 Feature matching

7.1.3 Historical averaging

7.1.4 Noisy labels

7.1.5 Semi-Supervised GAN

7.1.6 Least squares GAN

7.1.7 Wasserstein GAN

7.1.7.1 WGAN with gradient penalty

7.2 Architectures recently proven useful

7.2.1 U-NET

7.2.2 Variational U-NET

7.2.3 Patch networks

7.2.4 Discovery GAN

7.2.5 BicycleGAN

7.2.6 StackGAN

7.2.7 Progressively growing GAN

8. Deep learning for drug discovery

8.1 De novo drug design via deep learning

8.1.1 SMILES variational autoencoder

8.1.2 Wavelet autoencoder

8.1.3 druGAN

8.2 Feature regressors for molecular properties

8.2.1 KDEEP: a 3D convolutional network

8.2.2 SchNET: a graph convolutional network

9. Dealing with dataset limitations

9.1 Data augmentation

9.2 Transfer learning

9.3 Multitask learning

De novo 3D ligand and protein design via deep learning

10. Overview: applied de novo design

11. Data preparation

11.1 Datasets

11.1.1 PDBbind

11.1.1.1 Demarcation to Binding MOAD

11.1.1.2 Machine learning based methods for binding affinity regression in comparison

11.2 Test complexes: Ibuprofen, HIV-Integrase and 3-dehydroquinate dehydratase

11.3 Pose normalization

Explorative phase

12. RAL based approaches

12.1 SchNET variations

12.2 VAE and double VAE(GAN) for protein-complexes with SchNET

12.3 Ligand autoencoding with RAL based VAEs

12.3.1 Strategies to tackle the sparsity problem

12.4 Conclusion RAL based approach

13. Rich molecular image based approaches

13.1 Rich molecular image

13.2 VAEGANs for ligands represented as rich molecular image

13.2.1 PDBbind analysis and dataset reduction

13.2.2 RMI based VAEGAN on the filtered dataset

13.3 Conclusions for RMI based VAE and VAEGAN approaches

14. VUNET and the VUGAN for de novo design

14.1 VUNET

14.2 VUGAN

14.2.1 VUGAN trained on the RV format

14.3 How to use the VUNET and VUGAN for de novo protein design

14.4 Additional use-cases

15. Summary and conclusion of the explorative phase

Refinement phase

16. PG-VUGAN for de novo design

16.1 Loss functions, penalties, and output variations

17. Improving the rich molecular image format

17.1 Rich molecular image with atomic radius

17.2 Min-max scaling

17.3 RMI for complexes

17.3.1 Ligand vs. complex PCA based pose normalization

17.3.2 Comparing representations for complexes

18. Designing a binding affinity regressor

18.1 Convolutional architectures in comparison

18.2 Designing a binding affinity regressor

18.3 Multi-view networks

19. Compensating the limitations of the PDBbind dataset

19.1 Multi-task learning

19.2 Network-based transfer learning

19.3 Data augmentation

20. Abridgement of the engagements towards increased binding affinity regression performance

20.1 MV-DilSEption: a model with beneficial contributions

21. Rethinking the PG-VUGAN method

21.1 Architecture

21.2 Reducing the output channels of the rich molecular image

21.3 Transfer learning

21.4 Image resizing

21.5 Loss contributions

21.6 Growing procedure

21.6.1 Initiation criterion

21.6.2 Layer fade-in

21.7 Stabilizing the adversarial training

21.7.1 Least-squares GAN

21.7.2 Semi-supervised learning

21.7.3 Mini batch discrimination

21.7.4 Feature matching

21.7.4.1 Activity penalty for the discriminator's feature matching layer

21.7.5 Using a latent feature regressor

21.7.6 Training balancing

21.7.6.1 Data balancing

21.7.6.2 Loss normalization

21.7.6.3 Generator / discriminator training ratio balancing

21.8 The approach as pseudo code

21.9 Result

22. Summary and conclusion

Research Goals and Topics

This Master's thesis aims to develop a novel computational approach for de novo 3D ligand and protein design using generative artificial neural networks. The core research question addresses whether it is possible to realize an in silico system that generates a complementary ligand for a specific protein binding-site (and vice versa) by leveraging deep learning architectures, thereby overcoming the limitations of existing fragment-based or heuristic-driven design methods.

  • Application of Generative Adversarial Networks (GANs) and Autoencoders for molecular structure generation.
  • Development of the Rich Molecular Image (RMI) format for efficient processing by Convolutional Neural Networks.
  • Design and evaluation of the PG-VUGAN (Progressively growing variational U-NET generative adversarial network) for ligand-protein complex modeling.
  • Comparison of various deep learning architectures and data representations for binding affinity regression.
  • Strategies to mitigate dataset limitations and the "sparsity problem" in molecular 3D representations.

Excerpt from the Book

3.1 Structure based drug design

The cardinal technique of structure-based drug design (SBDD) is to perform a virtual high-throughput screening (vHTS) with a docking tool. Over 60 tools exist, which can be categorized as incremental construction algorithms (most popular: FlexX), evolutionary algorithms (most popular: AutoDock) and shape-based algorithms (most popular: Dock) [46]. The docking procedure itself can then be performed rigidly or flexibly. The most popular method of each category is summarized in this section.

In vHTS, a database of compounds is screened for possible target interaction by docking the molecules into the pocket [47]. The term “docking” means that the compound is placed into the binding pocket and that its free energy (a pseudo binding affinity) is calculated with a scoring function. The purpose of a docking tool is to minimize this free energy, e.g. by repeatedly updating the molecule’s orientation and position in the pocket. The top-scoring compounds are taken as possible leads, filtered for druggability and processed further in the drug design pipeline.
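The docking loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the pose parameterization (a translation plus one rotation angle about the z axis), the toy Lennard-Jones-like score, and the names `transform`, `toy_score` and `dock` are hypothetical and do not reproduce FlexX, AutoDock, Dock or any scoring function from the thesis.

```python
import math
import random

def transform(ligand, pose):
    """Apply a rigid pose (tx, ty, tz, angle about z) to ligand coordinates."""
    tx, ty, tz, angle = pose
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in ligand]

def toy_score(ligand, pocket, r0=3.5):
    """Toy pairwise score (assumption): 12-6 term with its minimum at distance r0."""
    total = 0.0
    for a in ligand:
        for b in pocket:
            d = max(math.dist(a, b), 0.5)  # clamp to keep the repulsive term finite
            q = (r0 / d) ** 6
            total += q * q - 2.0 * q
    return total

def dock(ligand, pocket, steps=2000, seed=0):
    """Greedy random search: keep a perturbed pose only if it lowers the score."""
    rng = random.Random(seed)
    pose = (0.0, 0.0, 0.0, 0.0)
    best = toy_score(transform(ligand, pose), pocket)
    for _ in range(steps):
        candidate = tuple(p + rng.gauss(0.0, 0.2) for p in pose)
        score = toy_score(transform(ligand, candidate), pocket)
        if score < best:
            pose, best = candidate, score
    return pose, best

# Hypothetical coordinates in angstroms, chosen only to exercise the code.
pocket = [(4.0, 0.0, 0.0), (-4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, -4.0, 0.0)]
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
start_score = toy_score(ligand, pocket)
final_pose, final_score = dock(ligand, pocket)
```

Real docking tools use far richer search strategies and scoring terms; the sketch only shows the shared pattern of proposing a pose, scoring it, and keeping improvements.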

The core of each docking tool is its scoring function, which tries to calculate the free energy. Scoring functions are grouped into force-field, knowledge-based, empirical and machine-learning-based methods. A force-field consists of equations that try to express real physical interactions between structures, as described in section 1.2.2, by mathematical approximations. Knowledge-based methods add terms to the force-field that consider pair-wise energy potentials extracted from known ligand-receptor complexes [48].

Empirical scoring functions add terms to the equation that take the number of interactions of various types, such as hydrogen bonds, into account; the individual coefficients of each term are usually determined by fitting regression models. Machine learning methods directly replace all “handcrafted” mathematical equations with an approximation learned from data [49]. Note that the term force-field is sometimes used to embrace all types of scoring functions, and that scoring functions can indeed be freely mixed.
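To make the idea of an empirical scoring function concrete, the following sketch fits the coefficients of a linear score by ordinary least squares. The two interaction types (hydrogen bonds and hydrophobic contacts), the synthetic training values and the helper names `fit_linear` and `empirical_score` are assumptions for illustration; the data are generated from known coefficients and are not measurements.

```python
def fit_linear(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [list(f) + [1.0] for f in features]  # append an intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def empirical_score(counts, weights):
    """Weighted sum of interaction counts plus the intercept (last weight)."""
    return sum(w * c for w, c in zip(weights, counts)) + weights[-1]

# Synthetic toy data: (hydrogen bonds, hydrophobic contacts) per complex.
features = [(1, 4), (2, 3), (3, 8), (0, 5), (4, 1)]
# Targets generated from assumed coefficients, so the fit can be checked.
targets = [-1.0 * h - 0.3 * c + 0.5 for h, c in features]
weights = fit_linear(features, targets)
predicted = empirical_score((2, 6), weights)
```

Because the targets are generated from the assumed coefficients without noise, the fit recovers them up to rounding; with experimental affinities, the residual error would indicate how well the chosen interaction terms explain the data.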

Summary of Chapters

Biological background and terms: This chapter provides a brief introduction to the essential concepts of proteins, the key-lock principle, and intermolecular interactions relevant for understanding protein-ligand binding.

Data formats for molecular structures: This section details how molecular data is encoded for computational use, covering 1D arrays, 2D matrices, and 3D voxel-based representations.

Drug design: This chapter reviews traditional structure-based and ligand-based drug design techniques, including docking, evolutionary algorithms, and QSAR models.

Protein design: This chapter outlines methodologies for designing new protein functions or structures, contrasting directed evolution and rational design with de novo protein design.

Recent architectural enhancements of deep models and new architectures: An overview of modern deep learning building blocks, such as residual connections, inception modules, and attention mechanisms, and their applicability to 3D data.

Generative neural networks: This chapter introduces the foundational concepts of GANs, autoencoders, and their variants, serving as the basis for the proposed design models.

Recent proceedings in generative neural networks: A discussion on common challenges in training generative models, such as mode collapse and non-convergence, and architectural solutions like U-NETs and progressively growing architectures.

Deep learning for drug discovery: A survey of existing deep learning-based approaches for drug design and property regression, highlighting the shift toward generative modeling.

Dealing with dataset limitations: This chapter explores techniques to overcome data scarcity in deep learning, specifically focusing on data augmentation, transfer learning, and multitask learning.

Overview: applied de novo design: A structural guide illustrating how the different components of the proposed system (data prep, model, loss function) are composed for de novo design.

Data preparation: Describes the datasets used in this work (e.g., PDBbind, QM9) and the preprocessing steps, including pose normalization, for molecular structure generation.

RAL based approaches: Details the first explorative experiments using RAL (Rich Atomlist) representations and the challenges encountered with data sparsity.

Rich molecular image based approaches: Introduces the Rich Molecular Image (RMI) format as a solution to represent 3D molecular data for image-based deep learning models.

VUNET and the VUGAN for de novo design: Explains the development and implementation of the VUNET and VUGAN architectures, which combine U-NET capabilities with variational inference for generative tasks.

Summary and conclusion of the explorative phase: A recap of the insights gained from early prototypes and the transition to refined methodologies.

PG-VUGAN for de novo design: Describes the final architectural refinement, introducing progressive growing to improve the resolution and generation quality of the proposed models.
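As a hedged illustration of progressive growing, the sketch below shows the layer fade-in commonly used in progressively growing GANs: while a new, higher-resolution layer is introduced, its output is blended with the upsampled output of the previous resolution, and the blend weight `alpha` is ramped from 0 to 1. Whether the PG-VUGAN uses exactly this blending is an assumption; the function names and the plain nested-list images are hypothetical.

```python
def upsample2x(image):
    """Nearest-neighbour upsampling of a 2D list by a factor of two."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fade_in(low_res, high_res, alpha):
    """Blend the upsampled old output with the new layer's output."""
    up = upsample2x(low_res)
    return [[(1.0 - alpha) * u + alpha * h for u, h in zip(up_row, high_row)]
            for up_row, high_row in zip(up, high_res)]

low = [[1.0]]                      # output of the existing low-resolution layer
high = [[0.0, 2.0], [4.0, 6.0]]    # output of the newly added layer
blended = fade_in(low, high, 0.5)
```

At `alpha = 0` the network behaves as before the new layer was added, and at `alpha = 1` only the new layer contributes, which avoids a sudden shock to the already trained lower-resolution layers.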

Improving the rich molecular image format: Investigates optimizations to the RMI format, including atomic radii and normalization, to enhance the regression performance and structural detail.

Designing a binding affinity regressor: Documents the development of models to estimate binding affinity, leading to the identification of effective architectures like DilSEption.

Compensating the limitations of the PDBbind dataset: Assesses the impact of multitask learning and transfer learning on model performance and generalization.

Abridgement of the engagements towards increased binding affinity regression performance: A summary of the efforts made to improve the regression models during the refinement phase.

Rethinking the PG-VUGAN method: Reviews the final architectural revisions and training stabilization techniques applied to the PG-VUGAN.

Summary and conclusion: Concludes the thesis by evaluating the achievements and pointing out potential directions for future research in de novo ligand and protein design.

Keywords

De novo drug design, Protein design, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Deep Learning, Bioinformatics, Binding affinity, Rich Molecular Image (RMI), PDBbind, Protein-ligand interaction, Convolutional Neural Networks, Molecular modeling, Structure-based drug design, Transfer learning, Multitask learning

Frequently Asked Questions

What is the primary focus of this thesis?

The thesis focuses on realizing de novo 3D ligand and protein design in silico using artificial neural networks, specifically exploring how to generate complementary structures for a given protein binding-site.

Which molecular representation formats are evaluated?

The work evaluates various formats, including 1D arrays (RAL, SMILES), 2D matrices (adjacency matrices, images), and 3D tensors (voxels, wavelets, RMI), ultimately identifying the RMI format as particularly performant.

What is the core objective of the research?

The primary goal is to find a deep learning architecture that can generate 3D molecular structures by explicitly considering the target pocket, thereby bypassing traditional, error-prone computational chemistry approximations.

Which scientific methodology is primarily applied?

The work employs deep generative modeling, specifically using VAEGAN and PG-VUGAN (Progressively growing variational U-NET GAN) architectures, trained through an explorative and refinement phase using PDBbind data.

What are the main components of the proposed architecture?

The proposed system integrates U-NET-based encoders/decoders, spatial attention mechanisms, and feature regressors, optimized with progressive growing techniques and custom loss functions like the modified binary cross-entropy (mBCE).

How does the work address data scarcity?

To mitigate the limitations of small datasets, the research investigates multitask learning, network-based transfer learning, and data augmentation through rotational sampling of molecular structures.
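Rotational sampling can be sketched as follows, assuming that a structure is given as a list of 3D atom coordinates and that each augmented copy is the centered structure under a uniformly random rotation. The sampling scheme (a random unit quaternion) and the names `random_rotation` and `augment` are illustrative choices, not the procedure documented in the thesis.

```python
import math
import random

def random_rotation(rng):
    """Uniform random 3D rotation matrix built from a random unit quaternion."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    x = a * math.sin(2 * math.pi * u2)
    y = a * math.cos(2 * math.pi * u2)
    z = b * math.sin(2 * math.pi * u3)
    w = b * math.cos(2 * math.pi * u3)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def augment(coords, n_copies, seed=0):
    """Return n_copies randomly rotated versions of the centered structure."""
    rng = random.Random(seed)
    center = [sum(p[i] for p in coords) / len(coords) for i in range(3)]
    centered = [tuple(p[i] - center[i] for i in range(3)) for p in coords]
    copies = []
    for _ in range(n_copies):
        rot = random_rotation(rng)
        copies.append([tuple(sum(rot[i][j] * p[j] for j in range(3)) for i in range(3))
                       for p in centered])
    return copies

# Hypothetical three-atom structure, used only to exercise the code.
molecule = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3)]
rotated_copies = augment(molecule, n_copies=4)
```

Rotations preserve all interatomic distances, so every copy describes the same molecule in a different orientation, which is what makes this a label-preserving augmentation.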

How does the RMI format improve upon traditional representations?

The Rich Molecular Image format maps 3D coordinate data into 2D matrices that can be processed by established, high-performance image-processing CNNs while retaining necessary conformational and depth information.
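The general idea of mapping 3D coordinates into 2D matrices while keeping depth information can be sketched with a simple orthographic projection. This is not the RMI definition from the thesis: the two channels (atomic number and z-depth), the grid size, the rule of keeping the atom with the largest z per pixel, and the name `depth_image` are all assumptions chosen to show the principle.

```python
import math

def depth_image(atoms, size=8, extent=8.0):
    """Project atoms (x, y, z, atomic_number) onto a size x size grid.

    Returns two channels: the atomic number and the z-depth of the atom
    with the largest z that falls into each pixel (illustrative only).
    """
    element = [[0.0] * size for _ in range(size)]
    depth = [[0.0] * size for _ in range(size)]
    occupied = [[False] * size for _ in range(size)]
    scale = size / (2.0 * extent)
    for x, y, z, number in atoms:
        col = math.floor((x + extent) * scale)
        row = math.floor((y + extent) * scale)
        if not (0 <= row < size and 0 <= col < size):
            continue  # atom lies outside the imaged box
        if not occupied[row][col] or z > depth[row][col]:
            element[row][col] = float(number)
            depth[row][col] = z
            occupied[row][col] = True
    return element, depth

# Hypothetical atoms: a carbon and an oxygen sharing a pixel, plus a nitrogen.
atoms = [(0.0, 0.0, 1.0, 6), (0.0, 0.0, 2.5, 8), (3.0, -2.0, 0.0, 7)]
element_channel, depth_channel = depth_image(atoms)
```

The resulting channels are ordinary 2D arrays, so they could be stacked and passed to a standard image CNN; how the actual RMI encodes atom properties and depth is described in the thesis itself.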

What are the main findings regarding the effectiveness of the proposed design?

The results show that while the PG-VUGAN could successfully learn coarse spatial arrangements and ligand-like shapes, the generation of chemically valid, high-resolution structures remains a significant challenge, necessitating further architectural optimizations.


Details

Title: Steps towards de Novo 3D Ligand and Protein Design via Deep Learning
College: University of Tubingen (Faculty of Science / Department of Bioinformatics)
Grade: 1,3
Author: Matthias Rieger
Publication Year: 2019
Pages: 167
Catalog Number: V926236
ISBN (eBook): 9783346294548
Language: English
Tags: Drug design, Protein design, Enzyme design, de novo drug design, generative adversarial networks, GAN, Progressively growing GAN, New datastructures for molecules, Protein database, U-NET, Rich molecular image, Rich smiles, Binding affinity prediction, Drug-Target interaction, KDEEP, Survey, StackGAN, Wasserstein GAN, Binding affinity regression, Multi-view networks
Product Safety: GRIN Publishing GmbH
Quote paper
Matthias Rieger (Author), 2019, Steps towards de Novo 3D Ligand and Protein Design via Deep Learning, Munich, GRIN Verlag, https://www.grin.com/document/926236