Study and Analysis of Knowledgebase of Molecular Systems and to Develop Model for Prediction of Molecular Structure

Doctoral Thesis / Dissertation, 2010

227 Pages










I Introduction
1.1 Introduction
1.2 The Research Area, Problem Domain and Literature Survey
1.3 Relevance of research
1.4 Details of Remaining Chapters
1.5 References

II Computational Techniques, Tools and Technologies to support Bioinformatics
2.1 Introduction
2.2 ACD/ChemSketch
2.2.1 Introduction
2.2.2 ACD/ChemSketch includes
2.2.3 Structure Representation
2.2.4 IUPAC International Chemical Identifier
2.3 NMRPrediction
2.3.1 Introduction
2.4 ArgusLab
2.4.1 Introduction
2.4.2 Building of Benzene
2.5.1 Main Feature
2.5.2 Sequence Analysis
2.5.3 Codon Frequency
2.5.4 Nonsynonymous codon substitution
2.6 Jemboss
2.6.1 Introduction
2.6.2 Local and Remote File Manager
2.6.3 Jemboss Alignment Editor
2.6.4 Sequence List
2.6.5 Jemboss Alignment Editor
2.7 Chemical Markup Language (CML)
2.7.1 Introduction
2.7.2 Reading XML Documents
2.7.3 Examples of the molecules with CML
2.8.1 Introduction
2.8.2 Canonicalization
2.8.3 SMILES Specification Rules: Atoms, Bonds, Branches, Cyclic Structures, Disconnected Structures
2.10 References

III Alignment of Pairs and Multiple Sequences and Phylogenetic Analysis
3.1 Introduction
3.2 Sequence Description
3.3 Pair wise Sequence Alignment
3.3.1 Local versus Global Alignment
3.3.2 Methods of Sequence Alignment
3.4 Multiple Sequence Alignment
3.4.1 Methods of Multiple Sequence Alignment
3.4.2 Application of Multiple Alignments
3.5 Phylogenetic Analysis
3.5.1 Methods of Phylogenetic Analysis
3.5.2 Computational Considerations
3.6 References

IV Similarities Search and Sequence Alignment
4.1 FASTA Algorithm
4.1.2 FASTA Implementation
4.2 BLAST Algorithm
4.2.1 BLAST Output
4.2.2 BLAST Services
4.2.4 FASTA and BLAST Algorithms Comparison
4.3 References

V Protein Structure and Cheminformatics
5.1 Introduction
5.2 Different Levels of Protein Structure
5.3 Prediction Methods
5.4 Secondary Structure Prediction
5.5 The Protein Folding Problem
5.6 Cheminformatics
5.6.1 Introduction
5.6.2 Challenges of Drug Design
5.6.3 The Drug Discovery Pipeline
5.6.4 Computer-Aided Drug Design (CADD)
5.6.5 Difficulties Implementing Denovo Design
5.7 References

VI Conformational Study of Molecules using Tools
6.1 Introduction
6.2 Experimental Work
6.2.1 Activity No.-1
6.2.2 Activity No.-2
6.2.3 Activity No.-3: Sequence Analysis Using Jemboss; Nucleotide Sequence Using DAMBE; Protein Sequence Using Jemboss
6.3 Data Analysis and Experimental Outcome
6.4 Conclusions and Future Scope of Research


Research is to see what everybody else has seen, and to think what nobody else has thought. This work is also no exception. It is my pleasure to convey my gratitude to all those who have directly or indirectly contributed to make this work successful.

First and foremost, this dissertation represents a great deal of time and effort not only on my part, but also on the part of my supervisor, Dr. N. N. Jani, Ex. Prof. & Head, Computer Science Dept., Saurashtra University, Rajkot. I express my profound gratitude to Dr. Jani for his endless encouragement throughout my research work. He has helped me shape my research from day one, pushed me to get through the inevitable research setbacks, and encouraged me to achieve to the best of my ability. A person with great concern for his students, he will remain an exemplar for me in the future.

I take this opportunity to express my deep sense of gratitude to Dr. V. S. Patel, Director, SICARD, Sardar Patel Centre for Science and Tech., Vallabh Vidyanagar, Anand, Gujarat, for arranging my visit to the research centre. There I had the chance to work with sophisticated instruments such as the X-Ray Diffractometer (XRD) and the Inductively Coupled Plasma Spectrometer (ICP).

I express my respectful gratitude to Dr. M. M. Patel, Ex. Director, ISTAR, Vallabh Vidyanagar, Anand, Dr. D. J. Desai, Principal, V.P. & R.P.T.P Science College, Vallabh Vidyanagar, and Dr. O. S. Srivastava, HOD, MCA Dept., ISTAR, for providing me with all kinds of facilities and moral support so that I could complete my research work on time.

Last but not least, I am thankful to my family members for their moral support and for constantly motivating and encouraging me to complete my work successfully.




1.1 Introduction

This research work aims to analyze experimental data about biochemical properties and their corresponding kinetics. In this research, an attempt has been made to analyze protein and DNA structure using tools such as DAMBE and Jemboss. Some molecular visualization and analysis tools already exist that read, analyze, and cross-correlate experimental information useful to chemists, organic chemists, biochemists and pharmacists.

In this research, different chemical and biochemical substances, including drugs, have been analyzed using tools such as ACD/ChemSketch and NMR Prediction. The information obtained from this analysis supports an in-depth understanding of structures and makes the quantification of new chemical structures possible.

In this research, compounds have been stored in databases using ACD/ChemSketch, and their SMILES codes (Simplified Molecular Input Line Entry System) have been generated. A SMILES string defines a molecule as a chain of alphanumeric characters. The chemical shift of every carbon atom of a molecule has been displayed using NMR Prediction.
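To illustrate how a SMILES string encodes a molecule as a character chain, the following is a minimal sketch that counts the heavy atoms in a SMILES string. It is not the ACD/ChemSketch implementation: the function name `count_heavy_atoms` and the regular expression are assumptions made for illustration, and only the common organic subset plus simple bracket atoms is handled.

```python
import re
from collections import Counter

# Two-letter symbols (Cl, Br) are listed before one-letter ones so that
# "Cl" is not read as carbon followed by an unknown character.
# Lower-case letters are aromatic atoms; [...] is a bracket atom.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def count_heavy_atoms(smiles: str) -> Counter:
    """Count non-hydrogen atoms in a SMILES string.

    Hydrogens are implicit in SMILES and are ignored here. Bond symbols,
    branch parentheses and ring-closure digits are skipped.
    """
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        if token.startswith("["):
            # Bracket atom such as [NH4+] or [13CH4]: skip an optional
            # isotope number, then read the element symbol.
            match = re.match(r"\[\d*([A-Z][a-z]?|[a-z])", token)
            if match is None:
                continue  # unsupported bracket content (illustrative limit)
            symbol = match.group(1)
        else:
            symbol = token
        counts[symbol.capitalize()] += 1
    return counts


if __name__ == "__main__":
    for smiles in ("CCO", "c1ccccc1", "NC(CCC(N)=O)C(O)=O"):
        print(smiles, dict(count_heavy_atoms(smiles)))
```

The three strings in the usage example are ethanol, benzene and 2,5-diamino-5-oxopentanoic acid. A full SMILES parser would also have to build the bond graph from branches and ring closures, which this sketch deliberately omits.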

CML codes of molecules have been developed and used to capture molecular information such as symmetry and atom and bond attributes. Multiple observations of the same molecule, such as conformational analysis and NMR prediction, have also been performed.
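As an illustration of how atom and bond attributes can be read from a CML document, the following sketch parses a small hand-written CML fragment for water with Python's standard XML library. The fragment and the helper name `read_cml` are assumptions made for this example; real CML files usually declare an XML namespace, which is omitted here for brevity.

```python
import xml.etree.ElementTree as ET

# Hand-written CML fragment (illustrative; no namespace declared).
CML_WATER = """
<molecule id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""


def read_cml(text):
    """Return (atoms, bonds) from a namespace-free CML molecule.

    atoms maps atom id -> element symbol;
    bonds is a list of (atom id, atom id, bond order) tuples.
    """
    root = ET.fromstring(text.strip())
    atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    bonds = []
    for bond in root.iter("bond"):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, bond.get("order")))
    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = read_cml(CML_WATER)
    for first, second, order in bonds:
        print(f"{atoms[first]}-{atoms[second]} (order {order})")
```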

Using PubChem/NCBI, additional information has been analyzed, such as bioactivity analysis by structure and activity similarity, and revised compound selection after the addition of similar compounds.

Geometry optimization of molecules, visualization of chemical structures and calculation of their electronic absorption spectra have been performed using the ArgusLab tool. Single-point energy calculations, molecular orbital calculations on grids for plotting the HOMO and LUMO, and ESP-mapped density calculations have also been performed.

Different types of analysis, such as prediction of protein secondary structure and isoelectric point calculation, have been performed on nucleotide and protein sequences using the DAMBE and Jemboss tools.

The aim of this research work is to develop a model for the prediction of molecular structure. Both bioinformatics and cheminformatics approaches to molecules are covered. An integrated bioinformatics and cheminformatics approach is discussed that enables retrieval and visualization of biological relationships across heterogeneous data sources. It is becoming increasingly important to integrate biological information on large molecules and their interaction networks with chemical information on small drug molecules.

Bioinformaticians and cheminformaticians have so far worked independently in their respective fields. Now, however, small-molecule drugs and small molecules with known properties are being used to study the functions of large networks of biological molecules in the field of chemical biology.

The objective of this research work is to assist organic chemists and biochemists in each step of the synthesis planning process for the prediction of molecular structure. This research work provides a series of methods and tools for chemical and biochemical applications. Built-in catalogs of fine chemicals and biochemicals provide suitable starting materials for a synthesis or molecular structure prediction target. The connection between the target compound and the available starting materials has been established using similarity searches or substructure searches.

This research work aims to identify strategic bonds in the target molecule for synthesis procedures. Structural criteria for each bond within the query molecule are also taken into account. Data mining tools have been used to predict the physical properties of structures. The knowledgebase of molecular systems has been analyzed, and a model has been developed that uses this information to make decisions and suggest new strategies for problems in chemistry and biochemistry.

The knowledgebase molecular system has three components:

A. Knowledgebase as Chemical Memory: An attempt has been made to concentrate on knowledge-based data covering an increasing number of chemical systems. By taking advantage of data sharing, each calculation increases the level of ‘experience’ of the expert system, extending the knowledge base from which new hypotheses and chemical concepts are derived.

B. Data Mining: One way of increasing chemical knowledge is to extract chemically meaningful data from large-scale chemical simulations with minimum human effort. The challenge lies in distinguishing data that are irrelevant to the specific question under investigation from those that are important. To carry out this task, an attempt has been made to concentrate on a knowledgebase system that processes molecular orbitals and traces changes and similarities between molecules. Visualization techniques have been used to enlarge the scope of the analysis.

C. Towards Artificial Chemical Intelligence: The final part of this research is to formulate hypotheses based on the data provided by molecules. An attempt has been made to predict molecular structure. Finally, the results of the research work have been collected, analyzed using analysis tools, and evaluated.

1.2 The Research Area, Problem Domain and Literature Survey

Bioinformatics and the management of scientific data are critical to supporting life science discovery. As computational models of proteins, cells and organisms become increasingly realistic, much biology research has migrated from the wet lab to the computer. Successfully accomplishing the translation of biology in silico, however, requires access to a huge amount of information from across the research community. Much of this information is currently available from publicly accessible data sources, and more is being added daily. Unfortunately, scientists are not currently able to identify and exploit this information easily because of the variety of semantics, interfaces and data formats used by the underlying data sources. Providing biochemists, medical researchers and computer scientists with integrated access to all the information they need in a consistent format requires overcoming a large number of technical, social and political challenges.

In the last decade, biologists have experienced a fundamental revolution from traditional research and development (R&D), consisting of discovering and understanding genes, metabolic pathways and cellular mechanisms, to large-scale computer-based R&D that simulates the disease, the physiology, the molecular mechanism and the pharmacology. This represents a shift away from life science’s empirical roots, in which it was an interactive process. Today it is systematic, thematic and predictive, with genomics, informatics and automation all playing a role. This fusion of biology and information science is expected to continue and expand for the foreseeable future. The first consequence of this revolution is the explosion of available data that biomolecular researchers have to exploit. For example, an average pharmaceutical company currently uses information from at least 40 databases [1], each containing large amounts of data (e.g. as of June 2002, GenBank [2,3] provided access to 20,649,000,000 bases in 17,471,000 sequences) that can be analyzed using a variety of complex tools such as FASTA and BLAST.

Over the past several years, bioinformatics has become both an all-encompassing term for everything relating to computer science and biology and an ever trendy one. There are a variety of reasons for this, including: (1) as computational biology evolves and expands, the need for solutions to the data integration problems it faces increases; (2) the media are beginning to understand the implications of the genomics revolution that has been going on for the last 15 or more years; (3) the recent headlines and debates surrounding the cloning of animals and humans; and (4) to appear cutting edge, many companies have relabeled the work they are doing as bioinformatics instead of genetics, biology or computer science.

The analysis of data sets is one of the most important tasks in the investigation of the properties of chemical or biochemical compounds. Especially in drug design, methods are used to characterize complete sets of chemical or biochemical compounds instead of describing individual molecules. Data mining, i.e. the exploration of large amounts of data in search of consistent patterns, correlations or other systematic relationships, can be a helpful tool for evaluating “hidden” information in a set of molecules. Finding adequate information for the representation of new chemical structures is one of the most important problems in chemical data mining.

With the progressive specialization of services and the extensive use of computational methods, the steady increase in data is barely manageable even by a team of scientists. As a result, interest in specific information is pushed into the background, while global information about complete sets of data is becoming more and more important. Thus, the recognition of higher-level information for complete data sets becomes one of the most important tasks for information management in science.

In chemistry and biochemistry, the investigation of molecular structures and their properties is one of the most important areas. Chemistry has its own language and namespace for molecules, which is still under development. With the growth of computational information processing, several conventions and formats for chemical information have been developed.

However, in one of the most important communication media of modern times, the Internet, this chemical language has been used in only a few applications. While a couple of databases are accessible via the WWW, no service exists that allows data mining of chemical data sets using this specific language.

The task of data mining in a chemical or biochemical context is to evaluate “hidden” information in a set of chemical data. One of the differences between data mining and conventional database queries is the production of new information that is used to characterize chemical data in a more general way. Generally, it is not possible to hold all of the potentially required information in a data set of chemical structures. Thus, the extraction of relevant information and the production of reliable secondary information are important.

Assessing the similarity of two compounds with respect to their biological activity is one of the central tasks in the development of pharmaceutical products. A typical application is the retrieval of structures with a defined biological activity from a database. Biological activity is of special interest in the development of drugs. The diversity of structures in a data set of drugs is of interest for the synthesis of new compounds. As the variety of a data set increases, so does the chance of finding a new synthesis route for a compound with a similar biological property.
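One common way to make such a similarity retrieval concrete is the Tanimoto coefficient computed on sets of structural features. The sketch below is an illustration under stated assumptions, not the method used in this thesis: the feature sets are hypothetical placeholders for fingerprints that a cheminformatics tool would normally generate, and the function names are invented for the example.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two feature sets."""
    if not features_a and not features_b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(features_a & features_b) / len(features_a | features_b)


def rank_by_similarity(query: set, database: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, feats)) for name, feats in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical substructure features, for illustration only.
    database = {
        "compound_1": {"carboxyl", "amine", "amide"},
        "compound_2": {"carboxyl", "amine"},
        "compound_3": {"phenyl", "hydroxyl"},
    }
    query = {"carboxyl", "amine", "amide"}
    for name, score in rank_by_similarity(query, database):
        print(f"{name}: {score:.2f}")
```

A retrieval step would then keep only the compounds whose score exceeds a chosen threshold; the threshold itself is a design decision that depends on the fingerprint used.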

Therefore, finding the adequate information for representation of chemical structures is one of the basic problems in chemical data mining. Several methods have been developed in the last decades for the description of molecules including their chemical or biochemical properties.

1.3 Relevance of the Research

Data Mining Service Chemistry (DMSC) [4] is a project for the development and exploration of chemical data sets. With this service it is possible to analyze chemical or biochemical data sets for molecular patterns and systematic relationships using methods such as statistical analyses and neural networks on individual molecules.

The System for Drug Discovery (QIS D2) [5] is a unique adaptive learning system designed to predict potential large-scale drug characteristics such as toxicity and efficacy. BioSpice is a set of software tools designed to represent and simulate cellular processes.

A computer program, GRINSP (geometrically restrained inorganic structure prediction) [6], has been developed that allows the exploration of the possible occurrence of 3-, 4-, 5- and 6-connected three-dimensional networks.

A global optimization method [7] is presented for predicting the minimum energy structure of small protein-like molecules. This method begins by collecting a large number of molecular conformations, each obtained by finding a local minimum of a potential energy function from a random starting point. The information from these conformers is then used to form a convex quadratic global underestimating function for the potential energy of all known conformers.

GenomeThreader [8] implements several data types in a reusable manner. Compared to its predecessor GeneSeqer, it is considerably faster, easier to maintain, and extensible. It is widely used for gene structure prediction.

The general approach [9] to the prediction of possible crystal structures consists of the global exploration of the energy landscape of the chemical system, with typical methods being simulated annealing or genetic algorithms. In the case of simulated annealing, combinations of model potentials and ab initio calculations for the energy evaluation are state of the art.
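The idea of exploring an energy landscape by simulated annealing can be sketched on a one-dimensional toy problem. The code below is an illustrative sketch only: the double-well function `double_well` stands in for a real model potential or ab initio energy, and the cooling schedule, step size and parameter names are assumptions rather than values taken from reference [9].

```python
import math
import random


def double_well(x: float) -> float:
    """Toy energy with two minima; a stand-in for a real potential."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def simulated_annealing(energy, start, steps=5000, t_start=1.0,
                        t_end=1e-3, step_size=0.5, seed=0):
    """Minimize `energy` with the Metropolis criterion and geometric cooling.

    Returns the best position found and its energy.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    x, e = start, energy(start)
    best_x, best_e = x, e
    for i in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        candidate = x + rng.uniform(-step_size, step_size)
        e_new = energy(candidate)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-dE / T) so the walk can leave local minima.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            x, e = candidate, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


if __name__ == "__main__":
    position, value = simulated_annealing(double_well, start=2.0)
    print(f"best x = {position:.3f}, energy = {value:.3f}")
```

For a crystal structure the single coordinate would be replaced by atomic positions and cell parameters, and each energy evaluation would be far more expensive, which is why cheap model potentials are often combined with ab initio refinement.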

Reference [10] describes the characteristics of a free web-based spectral database for the chemical research community, containing 13C NMR spectral data from more than 4000 natural compounds, a number that is continuously increasing. This database allows flexible searching by chemical structure, substructure, name, and family of compounds, as well as by spectral features such as chemical shift, allowing the structural elucidation of known and unknown compounds by comparison of 13C NMR data.

In this research work, a plan has been made to provide centralized access to a wide variety of data mining methods, such as statistical processing and prediction of molecular structure. With this service it is possible to submit data sets or to compile a data set by extracting structures from chemical databases via the Internet. For submitted or compiled data sets, descriptors have been calculated with an extensive set of options. On the basis of these descriptors, several methods of data analysis have been performed on the data set.

1.4 Details of Remaining Chapters

This thesis is meant to be a major step in my personal interest in prediction of molecular structure.

The second chapter of this thesis provides an overview of tools such as ACD/ChemSketch, NMR Prediction, ArgusLab, DAMBE and Jemboss. ACD/Labs is used to develop molecular structures, reactions, and schematic diagrams and to calculate chemical properties of different substances (chemical and biochemical). The NMR Prediction tool is used to estimate the 1H-NMR and 13C-NMR spectra of different substances. The ArgusLab tool is used to build chemical structures and optimize their geometry. The DAMBE tool is used to manipulate and analyze molecular sequence data. Jemboss can perform activities on sequences such as predicting protein secondary structure. CML is designed to represent molecular information. SMILES (Simplified Molecular Input Line Entry System) is a line notation for entering and representing molecules.

The third chapter of this thesis provides an overview of pairwise sequence alignment and multiple sequence alignment. In this chapter the alignment score and gap penalty between sequences have been calculated. Multiple sequence alignment is useful for finding patterns in nucleotide sequences and for identifying structural and functional domains in protein families. The method of converting an MSA to a phylogenetic tree has been used to reduce the problem of multiple alignment to an iterative process of pairwise alignments.
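How an alignment score depends on the gap penalty can be shown with a small global alignment (Needleman-Wunsch style) score calculation. This is a minimal sketch under assumed scoring values (match +1, mismatch -1, linear gap penalty -2); the values and the function name are illustrative and are not taken from the thesis.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b with a linear gap penalty.

    Only the score is returned; two rows of the dynamic programming
    table are kept in memory at any time.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))  # 7 matches -> 7
    print(global_alignment_score("ACGT", "ACT"))         # 3 matches, 1 gap -> 1
```

Making the gap penalty more negative favors alignments with mismatches over alignments with gaps, which is the trade-off the choice of penalty controls.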

The fourth chapter of this thesis provides an overview of sequence alignment tools such as BLAST and FASTA. Their working methods and the syntax used by these tools are discussed. FASTA uses an algorithm that searches for similarities between one sequence and any group of sequences of the same type (nucleic acid or protein) as the query sequence. BLAST uses a heuristic algorithm that seeks local as opposed to global alignments and is therefore able to detect relationships among sequences that share only isolated regions of similarity.
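Both tools start from short exact word matches between the query and a database sequence before extending them into local alignments. The sketch below shows only that seeding idea, in simplified form: the function names and the word length `k` are assumptions, and the real FASTA and BLAST programs add scoring, extension and statistics that are not reproduced here.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits: list):
    """Return (diagonal, hit_count) for the diagonal with the most hits.

    A diagonal is subject_pos - query_pos; many hits on one diagonal
    suggest a region of local similarity without gaps.
    """
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    hits = word_hits("ACGTAC", "TTACGTACGG", k=3)
    print(hits)
    print(best_diagonal(hits))
```

In the usage example the query occurs in the subject starting at position 2, so the hits cluster on diagonal 2.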

The fifth chapter of this thesis provides an overview of protein structure and cheminformatics. The subunits of a protein are amino acids. The primary structure is the sequence of residues in the polypeptide chain. Secondary structure is a local, regularly occurring structure in proteins and is mainly formed through hydrogen bonds between backbone atoms. Tertiary structure describes the packing of alpha-helices, beta-sheets and random coils with respect to each other at the level of one whole polypeptide chain. Ab initio and heuristic methods have been used for protein structure prediction.

The sixth chapter of this thesis shows the strong interaction between representation and the methods used for data analysis: molecular representations need to capture relevant information and be compatible with the statistical methods used to analyze the data. The chapter reviews molecular representations and focuses on model validation using statistics, visualization methods, and standardization approaches.

1.5 References

[1] M. Peitsch. “From Genome to Protein Space.” Presentation at the Fifth Annual Symposium in Bioinformatics, Singapore, October, 2000.

[2] D. Benson, I. Karsch-Mizrachi, D. Lipman. “GenBank.” Nucleic Acids Research 31, no. 1 (2003): 23-27.

[3] “Growth of GenBank.” (2003):

[4] CRC NCRC Institute for Information technology Artificial Intelligence subject index:

[5] Ying Zhao, Charles Zhou, Ian Oglesby, Cliff Zhou, Quantum Intelligence, Inc., 3375 Scott Blvd, Suite 100, Santa Clara, CA 95054.

[6] Université du Maine, Laboratoire des Oxydes et Fluorures, CNRS UMR 6010, Avenue O. Messiaen, 72085 Le Mans Cedex 9, France.

[7] K.A. Dill, A.T. Phillips, and J.B. Rosen, Molecular Structure Prediction by Global Optimization,

[8] Gordon Gremme, Volker Brendel, Michael E. Sparks, Stefan Kurtz, Engineering a software tool for gene structure prediction in higher organisms, Information and Software Technology 47 (2005) 965-978.

[9] K. Doll, J. C. Schön and M. Jansen, Structure prediction based on ab initio simulated annealing, Max-Planck-Institute for Solid State Research, Heisenbergstr. 1, D-70569 Stuttgart, Germany.

[10] Kochev, N., Monev, V., Bangov, I.: Searching Chemical Structures. In: Chemoinformatics: A textbook. Wiley-VCH (2003) 291-318


2.1 Introduction

In this chapter, tools such as ACD/ChemSketch, NMR Prediction, ArgusLab, DAMBE and Jemboss are discussed.

ACD/Labs [1] has been used to develop molecular structures, reactions, and schematic diagrams and to calculate chemical properties of different substances (chemical and biochemical). The NMR Prediction [2] tool has been used to estimate the 1H-NMR and 13C-NMR spectra of different substances. This tool invokes the proton shift estimation program and writes the result to the drawing. Sometimes the drawing is changed to allow the display of certain shifts. The ArgusLab [3] tool has been used to build chemical structures and optimize their geometry. It is also used for visualization of the frontier π molecular orbitals of a chemical structure.

The DAMBE [4] tool has been used to manipulate and analyze molecular sequence data. DAMBE is used for the calculation of genetic distances and for phylogenetic reconstruction. Jemboss [5] has been used for interactively editing sequence alignments. Different activities on sequences have been performed with this tool, such as Editing Functions, Locking Sequences, Trim Sequences, Colour Schemes, Scoring Matrix, Consensus Sequence, Identity Table and Consensus Plot.
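Two of the alignment editor functions named above, the consensus sequence and the identity table, can be illustrated with a simplified calculation. The sketch below is an assumption-laden illustration rather than the Jemboss algorithm: it takes the most frequent non-gap residue per column as the consensus and computes percent identity over gap-free positions, whereas the actual tool also uses a scoring matrix and plurality settings.

```python
from collections import Counter


def consensus(alignment: list, gap: str = "-") -> str:
    """Most frequent non-gap residue in each column of an alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != gap)
        # A column that contains only gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of identical residues over positions where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    aligned = ["ACGT-A", "ACGTTA", "ATGT-A"]
    print(consensus(aligned))
    for first in aligned:
        print([round(percent_identity(first, second)) for second in aligned])
```

The nested loop in the usage example prints a small identity table, one row per sequence.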

2.2 ACD/ChemSketch

2.2.1 Introduction

ACD/ChemSketch is an all-purpose chemical drawing and graphics package from ACD/Labs, developed to help chemists quickly and easily draw molecular structures, reactions, and schematic diagrams, calculate chemical properties, and design professional reports and presentations. ACD/Labs is dedicated to building integrated solutions that enable data transfer and connection within chemical organizations.

ChemBasic is a simple, convenient, and functionally rich programming language for the presentation and manipulation of objects related to molecular structure and of all the contents of ACD/Labs' current and future programs. ChemBasic is founded on, and fully integrated with, ACD/Labs' existing functionality. At the same time, ChemBasic has all of the things a programming language should have: numeric and string variables, arrays, flow control and conditional operators, input and output procedures, etc.

ChemBasic inherits from generic BASIC and some of its extensions, most evidently Microsoft's Visual Basic for Applications (VBA). ChemBasic is designed as an object-oriented language. This means that all chemistry-related things are described as objects, that is, specific data structures which correspond to molecules, conformations, etc. Multi-item input forms can be designed in ChemBasic programs using ACD/Forms Manager.

2.2.2 ACD/ChemSketch includes

- Structure mode for drawing chemical structures and calculating their properties.
- Draw mode for text and graphics processing.
- Additional modules that extend the capabilities of ChemSketch (most of them must be purchased separately).

Structure mode: general information

In Structure mode, the following actions can be performed:

- Chemical structures can be drawn using the buttons located on the Structure toolbar, Atoms toolbar and References toolbar.
- For the selected structure, the molar refractivity, molar volume, parachor, index of refraction, surface tension, density, and some other physicochemical properties can be calculated.
- Chemical structures can be found by their systematic or non-systematic names, therapeutic category or inhibited enzyme using the integrated ACD/Dictionary.
- The most favorable tautomeric forms of the drawn structure can be checked, and the structure can be corrected automatically, using the integrated Tautomeric Forms function on the Structure toolbar.
- An optimized 3D model of a 2D structure can be obtained.

In Draw mode, the following actions can be performed:

- Graphical objects such as lines, arrows, rectangles, ellipses, arcs, polylines, and polygons can be drawn using the Drawing toolbar buttons.
- Objects can be manipulated.
- The location of objects on the page can be controlled with a ruler and gridlines.

2.2.3 Structure Representation

ACD/ChemSketch supports antialiasing, which displays chemical structures drawn with smooth lines. Antialiasing is a computer rendering technique that blurs hard edges and adds shaded pixels to create the appearance of smoothness. It addresses a common issue with printers and computer monitors: because of their relatively low resolution, tilted lines appear “stairlike” instead of as smooth straight lines or curves. For example, compare the two pictures below:

[Figure not included in this excerpt]

Figure 2.1: Stairlike curves

Treat some bonds to metal atoms as coordination bonds

ACD/Labs supports the use of a special coordination bond to represent the specific bonding between a ligand and a metal center in coordination structures. Such a bond indicates a connection but does not affect the valence of the corresponding atoms. Often, however, a regular single bond is used to represent coordination, which leads to a formal violation of valence rules. Such a violation is marked in ACD/ChemSketch by “crossed atoms”.

[Figure not included in this excerpt]

Figure 2.2: Coordination Bond Representation

2.2.4 IUPAC International Chemical Identifier (InChI)

The IUPAC International Chemical Identifier (InChI™) is a non-proprietary identifier enabling unambiguous identification of chemical substances for electronic handling of chemical structural information. InChI codes significantly expand the use of InChI encoding for structure specification and searching over the Internet. For example:

[Figure not included in this excerpt]

Figure 2.3: 2,5-diamino-5-oxopentanoic acid

InChI generation options include an option for InChIKey generation:

[Figure not included in this excerpt]

Figure 2.4: InChIKey Option

For quick access to InChI generation, a special button, "Generate InChI", has been added to the top toolbar:

[Figure not included in this excerpt]

Figure 2.5: Generate InChIKey Button

2.3 NMRPrediction

2.3.1 Introduction

This software performs different estimations for a structure. It estimates the 1H-NMR spectrum: it invokes the proton shift estimation program and writes the results to the drawing. Sometimes the drawing is changed to allow the display of certain shifts. It also estimates the 13C-NMR spectrum: it invokes the carbon-13 shift estimation program and writes the results to the drawing. The Show Protocol command displays detailed information about the most recently invoked shift estimation. The Calculate 3D Coordinates command shows the currently drawn structure as a 3D display in its own window. The molecule can be rotated by moving the mouse.

2.3.2 Example: Glutamyl

[Figure not included in this excerpt]

Figure 2.6: Example of Glutamyl

[Figure not included in this excerpt]

Figure 2.7: 1H-NMR spectra of Glutamyl

Table 2.1: Shift Prediction Protocol

[Table not included in this excerpt]

2.4 ArgusLab

2.4.1 Introduction

ArgusLab provides the following capabilities:

- Build a chemical structure and optimize its geometry.
- Visualize the frontier π molecular orbitals of a chemical structure.
- Calculate the electronic absorption spectra of a chemical structure.
- Use a surface to visualize the spin density in a molecule with unpaired spins.
- Make a surface that maps the electrostatic potential onto the electron density.
- Use surfaces to see what happens to the electron density when a molecule absorbs light.

2.4.2 Building of Benzene

A benzene structure can be built from scratch and its geometry optimized. After atoms have been added in the editor, the benzene molecule is generated and the bonds are made automatically. The resulting structure is shown below.

[Figure not included in this excerpt]

Figure 2.8: Building of Benzene

[Figure not included in this excerpt]

Figure 2.9: Bonds in Benzene Structure

Visualize the Building of Benzene

ArgusLab with generated MO grid files.

[Figure not included in this excerpt]

Figure 2.10: Building of Benzene Visualization

Visualization of MOs of Benzene

[Figure not included in this excerpt]

Figure 2.11: Visualization of MOs of Benzene

Calculating the electronic UV/Visible absorption spectrum of Benzene

The electronic excited states of benzene can be calculated using the semi-empirical ZINDO method, which is parameterized for low-energy excited states of organic and organometallic molecules.

Calculating the ZINDO Electronic Spectra of a Molecule

The calculation consists of a ground-state closed-shell SCF calculation followed by a configuration interaction calculation, using singly excited configurations, to solve for the excited states. Currently, only singlet excited states can be calculated.

[Figure not included in this excerpt]

Figure 2.12: Calculating the ZINDO Electronic Spectra of a Molecule

Making an electrostatic potential-mapped electron density surface

ArgusLab can generate mapped surfaces. These are surfaces where one property is mapped onto a surface created by another property. The most popular example is to map the electrostatic potential (ESP) onto a surface of the electron density. In an ESP-mapped density surface, the electron density gives the shape of the surface, while the value of the ESP on that surface gives the colors.

The electrostatic potential is the potential energy felt by a positive "test" charge at a particular point in space. If the ESP is negative, this is a region of stability for the positive test charge. Conversely, if the ESP is positive, this is a region of relative instability for the positive test charge. Thus, an ESP-mapped density surface can be used to show regions of a molecule that might be more favorable to nucleophilic or electrophilic attack, making these types of surfaces useful for qualitative interpretations of chemical reactivity.
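As a rough illustration of the definition above, the ESP at a point can be approximated by summing Coulomb terms over atomic point charges. The following Python sketch makes strong simplifying assumptions: ArgusLab derives the ESP from the calculated electron density rather than from point charges, and the partial charges and coordinates used here for a formaldehyde-like molecule are illustrative placeholders, not ArgusLab output.

```python
import math

# Hypothetical partial charges (units of e) and coordinates (angstrom)
# for a formaldehyde-like molecule; the values are illustrative only.
ATOMS = [
    ("O", -0.40, (0.000, 0.000, 1.20)),
    ("C", 0.30, (0.000, 0.000, 0.00)),
    ("H", 0.05, (0.000, 0.940, -0.59)),
    ("H", 0.05, (0.000, -0.940, -0.59)),
]

def esp_at(point, atoms=ATOMS):
    """Point-charge approximation of the ESP: sum of q_i / r_i (e per angstrom)."""
    total = 0.0
    for _symbol, charge, position in atoms:
        r = math.dist(point, position)
        if r == 0.0:
            raise ValueError("the probe point coincides with an atom")
        total += charge / r
    return total

# Probe points on the molecular axis, beyond each end of the molecule.
near_oxygen = esp_at((0.0, 0.0, 2.7))
near_hydrogens = esp_at((0.0, 0.0, -2.0))
print(f"ESP beyond the oxygen end:   {near_oxygen:+.4f}")
print(f"ESP beyond the hydrogen end: {near_hydrogens:+.4f}")
```

With these placeholder charges the potential is negative beyond the oxygen end (favorable for a positive test charge) and positive beyond the hydrogen end, which is the qualitative pattern an ESP-mapped surface encodes as color.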

Steps for calculating the following surface:

(Figure not included in this excerpt)

Figure 2.13: The surface in a mesh rendering to make it easier to see the underlying molecular structure

This is an ESP-mapped density surface of formaldehyde. The colors give the value of the ESP at the points on the electron density surface. The color map is given on the left. There is a large red region around the oxygen end of the molecule, where the electron density is enhanced. The red color indicates the most negative regions of the electrostatic potential, where a positive test charge would have a favorable interaction energy. The hydrogen end of the molecule, shown in magenta, marks regions where the interaction energy is relatively unfavorable for a positive test charge.

Making the Surface: Generate the grid data

All surfaces are constructed from grid data that is generated from a calculation. To generate the grid data, a single-point energy calculation of formaldehyde can be run.

(Figure not included in this excerpt)

Figure 2.14: Generate the Grid Data

Seeing the lone pairs on the oxygen

Some of the surface's settings can be altered to visualize the lone pair electron density on the oxygen.

(Figure not included in this excerpt)

Figure 2.15: The lone pairs on the oxygen

Using surfaces to see change in the electron density when a molecule absorbs light

Here the first excited state of the simple molecule formaldehyde (CH2O) is examined. The highest occupied molecular orbital (HOMO) of formaldehyde is a non-bonding MO that lies in the plane of the molecule. The lowest unoccupied molecular orbital (LUMO) is a π-type MO perpendicular to the plane of the molecule. The first excited state of formaldehyde is an n->π* transition that is composed almost exclusively of the HOMO->LUMO transition.

Calculate the electronic absorption spectra of formaldehyde

(Figure not included in this excerpt)

Figure 2.16: Visualizing the frontier MOs

(Figure not included in this excerpt)

Figure 2.17: Visualizing the frontier MOs (Diagram)

Electron Density Difference

A difference surface can be made to show the excited-state electron density minus the ground-state electron density.

(Figure not included in this excerpt)

Mapping the ESP difference onto the electron density

(Figure not included in this excerpt)

Figure 2.19: Mapping the ESP difference onto the electron density


2.5 DAMBE

DAMBE stands for Data Analysis in Molecular Biology and Evolution. It is an integrated software package for retrieving, organizing, manipulating, aligning and analyzing molecular sequence data. Allele frequency data can also be used by DAMBE for calculating genetic distances or for phylogenetic reconstruction.

2.5.1 Main Feature

DAMBE's main features can be classified into the following five categories:

1 Database and network functions:

(a) Molecular sequences can be directly read from GenBank or other networked computers;
(b) Specific sequences from GenBank sequences can be extracted by using information contained in the FEATURES table of GenBank sequence files;

2 Sequence conversion and manipulation utilities:

(a) The input file format can be detected automatically, and sequences can be converted to the 18 most commonly used molecular data formats;
(b) Complementary sequences can be obtained;
(c) Protein-coding nucleotide sequences can be translated into amino acid sequences, with 12 different genetic codes implemented;
(d) Sequences can be aligned;
(e) Site-wise unresolved nucleotide, amino acid or codon sites can be deleted;
(f) Particular sites can be extracted, e.g., first, second or third codon positions, for particular analyses;

3 Sequence analysis focusing on factors affecting the frequency parameters in substitution models:

(a) Nucleotide and Dinucleotide frequencies
(b) Codon frequencies
(c) Amino acid frequencies
(d) Amino acid properties can be plotted along the sequence; with the following properties implemented:

- Polarity
- Polar requirement
- Chemical composition of the side chain
- Volume
- Hydropathy
- Isoelectric point
- Aromaticity

4 Basic comparative sequence analysis focusing on factors affecting the rate ratio parameters in substitution models:

(a) Nucleotide substitution pattern
(b) Codon substitution pattern
(c) Amino acid substitution pattern
(d) Substitution saturation

5 Advanced comparative sequence analysis:

(a) Phylogenetic reconstruction based on the distance, maximum parsimony and maximum likelihood methods
(b) Reconstruction of ancestral sequences
(c) Testing the molecular clock hypothesis
(d) Evaluating relative statistical support of alternative phylogenetic hypotheses (e.g., alternative phylogenetic trees)
(e) Fitting probability distributions to substitution data over sites.
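
Some of the sequence manipulation utilities listed under category 2 above can be sketched in a few lines of Python. This is only an illustration and not DAMBE's implementation: the function names are hypothetical, only the standard genetic code is included (DAMBE implements 12 codes), and the complement is returned in reversed orientation, which is one common convention.

```python
from itertools import product

# Standard genetic code in TCAG order (assumption: standard code only).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complementary strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate a coding sequence; unresolved codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def codon_position(seq, position):
    """Extract all sites at codon position 1, 2 or 3."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    return seq[position - 1::3]

seq = "ATGGCCAAATGA"
print(reverse_complement(seq))  # TCATTTGGCCAT
print(translate(seq))           # MAK*
print(codon_position(seq, 3))   # GCAA
```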

2.5.2 Sequence Analysis

This command computes the nucleotide and dinucleotide frequencies.

A part of a sample output (for one sequence) is shown below:

(Figure not included in this excerpt)

The output has two parts for each sequence. The first part lists the nucleotide frequencies, with "Other" standing for all characters that are not in "acgtu", e.g., "-?.". The second part lists the dinucleotide frequencies together with the frequencies expected when there is no association or repulsion between nucleotides (i.e., when the probability of two nucleotides sitting next to each other depends entirely on their individual frequencies). The dinucleotides are counted from the beginning to the end of the sequence, with the nucleotides in the left column being the first, and those in the top row being the second, nucleotide of each dinucleotide. From the first part of the output, it can be seen that A is used more frequently than the other nucleotides.
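
The two parts of this output can be reproduced approximately with the Python sketch below. It is not DAMBE's code: the function names are hypothetical, and skipping dinucleotides that contain an unresolved character is an assumption about how such sites are handled. The expected frequency of a dinucleotide is taken as the product of the frequencies of its two nucleotides.

```python
from collections import Counter

RESOLVED = "acgtu"

def nucleotide_frequencies(seq):
    """Relative frequency of each nucleotide; anything else counts as 'Other'."""
    seq = seq.lower()
    counts = Counter(ch if ch in RESOLVED else "Other" for ch in seq)
    return {symbol: n / len(seq) for symbol, n in counts.items()}

def dinucleotide_frequencies(seq):
    """Map each observed dinucleotide to (observed, expected) frequencies.

    Expected = f(first) * f(second), i.e. no association between neighbours.
    Assumption: pairs containing an unresolved character are skipped.
    """
    seq = seq.lower()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)
             if seq[i] in RESOLVED and seq[i + 1] in RESOLVED]
    resolved = [ch for ch in seq if ch in RESOLVED]
    single = {s: n / len(resolved) for s, n in Counter(resolved).items()}
    observed = Counter(pairs)
    return {pair: (n / len(pairs), single[pair[0]] * single[pair[1]])
            for pair, n in observed.items()}

freqs = nucleotide_frequencies("AACGTA")
table = dinucleotide_frequencies("AACGTA")
for pair, (obs, exp) in sorted(table.items()):
    print(f"{pair}: observed {obs:.3f}, expected {exp:.3f}")
```

For the toy sequence AACGTA, the dinucleotide aa is observed once among five overlapping pairs (0.200), against an expected frequency of 0.5 * 0.5 = 0.250.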

2.5.3 Codon Frequency

This opens a dialog box for computing codon frequencies and codon usage bias. A part of a default sample output, based on a segment of the Influenza A viruses, is shown below:

Output from sequences in file C:\MS\virus\virus.rst on

Sequence length = 969 (After excluding '?', '-' and 'n'.) Number of codons = 323

(Figure not included in this excerpt)

The codon usage table is based on the following sequences:

(Figure not included in this excerpt)

The output is in two parts. The first is a table of codon frequencies categorized into codon families, and the second lists nucleotide frequencies separately for each of the three codon positions designated as CodSite in the output.
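
A minimal Python sketch of the two parts of this output follows. It is an illustration under stated assumptions rather than DAMBE's procedure: the function names are hypothetical, codons containing '?', '-' or 'n' are dropped whole, and the grouping of codons into codon families is omitted.

```python
from collections import Counter

UNRESOLVED = set("?-N")

def split_codons(seq):
    """Split a coding sequence into codons, dropping codons with unresolved sites."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [codon for codon in codons if not UNRESOLVED & set(codon)]

def codon_site_frequencies(codons):
    """Nucleotide frequencies at codon positions 1, 2 and 3."""
    result = {}
    for site in range(3):
        counts = Counter(codon[site] for codon in codons)
        result[site + 1] = {base: n / len(codons) for base, n in counts.items()}
    return result

codons = split_codons("ATGGCCGCCN-ATAA")
codon_counts = Counter(codons)          # part 1: codon frequencies
site_freqs = codon_site_frequencies(codons)  # part 2: per-position frequencies

print(len(codons), "codons:", dict(codon_counts))
for site, freqs in site_freqs.items():
    print("CodSite", site, freqs)
```

In the sample output above, a resolved sequence length of 969 corresponds to 969 / 3 = 323 codons, which matches the reported number of codons.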

2.5.4 Nonsynonymous codon substitution

The sequence pairs available for selection in the left list depend on the input file format being used. If the input format is not the RST format, then the number of possible sequence pairs is simply N*(N-1)/2, where N is the number of sequences. A partial sample output for a set of elongation factor 1-α sequences (for only one pair-wise comparison between two chelicerate species) is shown below:

(Figure not included in this excerpt)

Pair-wise comparisons along the tree are either between internal nodes, or between an internal node and a terminal node. This information is shown at the beginning of each pair-wise comparison. The first column shows the sequential numbering of codons along the DNA sequences (after deleting unresolved codons). The second and third columns show which codon pairs are involved in the substitution, and the fourth and fifth columns show the corresponding amino acids.
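
The pair count and the column layout described above can be illustrated with a short Python sketch for the non-RST case. The function and sequence names are hypothetical, the standard genetic code is assumed, and the input is assumed to be aligned, upper-case and free of unresolved codons.

```python
from itertools import combinations, product

# Standard genetic code in TCAG order (assumption: standard code only).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codon_substitutions(seq_a, seq_b):
    """Rows of (codon number, codon A, codon B, amino acid A, amino acid B, kind)."""
    rows = []
    length = min(len(seq_a), len(seq_b))
    for number, start in enumerate(range(0, length - 2, 3), start=1):
        codon_a = seq_a[start:start + 3]
        codon_b = seq_b[start:start + 3]
        if codon_a == codon_b:
            continue
        aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
        kind = "synonymous" if aa_a == aa_b else "nonsynonymous"
        rows.append((number, codon_a, codon_b, aa_a, aa_b, kind))
    return rows

# Toy aligned sequences; names and contents are placeholders.
sequences = {"sp1": "ATGGCCAAA", "sp2": "ATGGCTAGA", "sp3": "ATGGCCAAA"}
pairs = list(combinations(sequences, 2))
n = len(sequences)
print(len(pairs), "pairs; N*(N-1)/2 =", n * (n - 1) // 2)

for name_a, name_b in pairs:
    for row in codon_substitutions(sequences[name_a], sequences[name_b]):
        print(name_a, name_b, *row)
```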

2.6 Jemboss

2.6.1 Introduction

Jemboss is developed by the EMBOSS team and is a graphical interface to the European Molecular Biology Open Software Suite, EMBOSS. Jemboss incorporates the 200+ applications of both the EMBOSS and EMBASSY packages.

The job manager is used to monitor the status of batch processes, i.e., EMBOSS applications that are computationally intensive. Instead of waiting for the results, these processes are submitted as batch jobs, which frees the interface for other analyses to be carried out. This product includes code licensed from RSA Data Security.

2.6.2 Local and Remote File Manager

The user's local and remote file systems can be displayed. The local files are those stored on the computer on which Jemboss is running. The remote files are the user's files located on the server machine that runs the EMBOSS applications.

The activities performed by file manager are:

- Drag and Drop Files
- Transferring Files
- Refresh File Manager
- Multiple File Selection

2.6.3 Jemboss Results Manager

Applications in Jemboss can be run 'interactively' or in 'batch' mode. Interactive applications wait for the process to finish, and the results then pop up on the screen. Batch processes run in the background so that other tasks can be performed in Jemboss while the application is running. In both cases the results are stored on the server machine and can be retrieved at any time.

2.6.4 Sequence List

This window allows users to store their commonly used sequences.

An EMBOSS list file contains "references" to sequences; for example, the file may contain entries such as opsd_abyko.fasta, sw:opsd_xenla, sw:opsd_c* and @another_list.

The sequence length can be calculated with 'Calculate sequence attributes' under the 'Tools' menu. The sequence start and end positions are also displayed.
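
The kinds of reference found in such a list file can be told apart with a few string checks, as in the Python sketch below. This is a deliberately simplified reading of the reference syntax and not part of Jemboss or EMBOSS: the function name is hypothetical, and the rule "a colon means database:entry" is an assumption that would, for instance, misclassify a Windows path containing a drive letter.

```python
def classify_reference(entry):
    """Classify one list-file entry as a nested list, a database entry or a file."""
    entry = entry.strip()
    if entry.startswith("@"):
        return ("list", entry[1:])
    if ":" in entry:
        database, _, name = entry.partition(":")
        return ("database", f"{database.strip()}:{name.strip()}")
    return ("file", entry)

# The example entries from the text, one reference per line.
LIST_FILE = """\
opsd_abyko.fasta
sw:opsd_xenla
sw:opsd_c*
@another_list
"""

references = [classify_reference(line)
              for line in LIST_FILE.splitlines() if line.strip()]
for kind, value in references:
    print(f"{kind:9s} {value}")
```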

2.6.5 Jemboss Alignment Editor

The Jemboss Alignment Editor can be used interactively to edit a sequence alignment (read in FASTA or MSF format). It can also be used from the command line to produce image files of the alignment (e.g. within a script).

The following activities can be performed with the alignment editor:

- Loading Sequences
- Editing Functions
- Locking Sequences
- Trim Sequences
- Colour Schemes
- Scoring Matrix
- Consensus Sequence

2.7 Chemical Markup Language (CML)

2.7.1 Introduction

CML is a variety of XML [6] designed to represent molecular information. It is used to store chemical formulas and to display molecules in graphical formats.

CML [7] has been developed to carry molecules, crystallographic data and reactions in an XML language. It offers a universal, platform- and application-independent format for storing and exchanging chemical information. CML outlines a variety of general purpose 'data-holder' elements and a smaller number of more specifically chemical elements (e.g. <molecule>, <reaction>, <crystal>) used to indicate chemical 'objects'. For example, a <molecule> will contain a <list> of <atom>s, which in turn have three <float>s specifying Cartesian coordinates for each atom.

CML provides no default conventions for labeling data elements and puts few restrictions on element ordering. The design of CML contains minimal preconceptions as to the type of chemical information that can be stored using it.

2.7.2 Reading XML Documents [8]

Here is an example from the CML Schema:

<cml>
<molecule id="m1">
<atom elementType="N"/>
<atom elementType="O"/>
</molecule>
</cml>

The first tag is <cml>. This is the top-level tag. The next tag is <molecule id="m1">. The CML Schema reference says that the <molecule> tag is "a container for atoms, bonds and submolecules".
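
A document of this shape can be read with any XML parser. The Python sketch below uses the standard library's xml.etree.ElementTree; the embedded CML text is a minimal well-formed version of the example, assumed here to carry no XML namespace, and real CML files may nest atoms in further container elements.

```python
import xml.etree.ElementTree as ET

# Minimal CML-like document (assumption: no namespace declaration).
CML_TEXT = """<cml>
  <molecule id="m1">
    <atom elementType="N"/>
    <atom elementType="O"/>
  </molecule>
</cml>"""

root = ET.fromstring(CML_TEXT)
print("top-level tag:", root.tag)

molecules = {}
for molecule in root.iter("molecule"):
    # iter() also finds atoms nested inside intermediate containers.
    elements = [atom.get("elementType") for atom in molecule.iter("atom")]
    molecules[molecule.get("id")] = elements

print(molecules)
```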




Study and Analysis of Knowledgebase of Molecular Systems and to Develop Model for Prediction of Molecular Structure
Saurashtra University  (Computer Science Dept.)
Dr. Binod Kumar is Director & Professor at JSPM Jayawant Institute of Computer Applications , affiliated to Savitribai Phule Pune University, India. He worked as Associate Professor at School of Engineering and Computer Technology, Quest International University, MALAYSIA . He is Senior Member of IEEE Computer Society as well Association for Computing Machinery (ACM, USA).He is reviewer of Journals like Elsevier, SpringerPlus and TPC of various IEEE sponsored conferences. He is Editorial Board member of nearly 45 International Journals. He has done PhD(CS) , M.Phil(CS) and MCA (NIT).
study, analysis, knowledgebase, molecular, systems, develop, model, prediction, structure
Quote paper
Dr. Binod Kumar (Author), 2010, Study and Analysis of Knowledgebase of Molecular Systems and to Develop Model for Prediction of Molecular Structure, Munich, GRIN Verlag,

