A Case Study of Oryza sativa. Annotation of Plant Genome

Scientific Study, 2020

60 Pages

IDSAsr Study (Author)



List of Tables

List of Figures



About Oryza Sativa Japonica

Review Of Literature

Methods And Methodology

Selection of Sample Sequence
Tabulating the Sequence
Performing Multiple Sequence Alignment (MSA)
Establishing Phylogenetic Relationship
Prediction of Structure
Constructing a 3D Model
Scanning of Motif
Mapping of Motifs
Creating Protein-Protein Interaction Prediction

Results And Discussion

Multiple Sequence Alignment (MSA) Output



List of Tables

Table 1: 10 Relevant Sequences Producing Significant Alignments for HCP (Under nr Database)

Table 2: 10 Relevant Sequences Producing Significant Alignments for RR (Under pdb Database)

Table 3: 10 Relevant Sequences Producing Significant Alignments for HCP (Under nr Database)

Table 4: 4 Putative Conserved Domains Detected

Table 5: 10 Relevant Sequences for RR Producing Significant Alignments (Under pdb Database)

Table 6: 2 Putative Conserved Domains Detected

Table 7: 10 Groups with Maximum Alignment Score (Among 59 Homolog Sequences of HCP)

Table 8: Representation of Motif Characters

Table 9: Residue Conformation

Table 10: Motif Site Distribution

Table 11: Representation of Motif Characters

Table 12: Residue Conformation

Table 13: Motif Site Distribution

Table 14: Interactions and Total Stabilizing Energy for HCP

Table 15: Potential Hydrogen Bonds for HCP

Table 16: Potential Hydrophobic Interactions for HCP

Table 17: Potential Salt Bridges for HCP

Table 18: Potential Favourable Electrostatic Interactions for HCP

Table 19: Potential Unfavourable Electrostatic Interactions for HCP

Table 20: Potential Short Contacts for HCP

Table 21: Interactions and Total Stabilizing Energy for RR

Table 22: Potential Hydrogen Bonds for RR

Table 23: Potential Hydrophobic Interactions for RR

Table 24: Potential Salt Bridges for RR

Table 25: Potential Favourable Electrostatic Interactions for RR

Table 26: Potential Unfavourable Electrostatic Interactions for RR

Table 27: Potential Short Contacts for RR

List of Figures

Fig 1: Phylogeny Tree of 59 Homologs Similar Sequence of HCP

Fig 2: Phylogeny Tree of 59 Homologs Similar Sequence of RR

Fig 3: Residue Representation for HCP

Fig 4: Variation of Residue for HCP

Fig 5: Residue Representation for RR

Fig 6: Variation of Residue for RR

Fig 7: Swiss-Model Graph Result for HCP

Fig 8: Swiss-Model Graph Result for RR

Fig 9: Swiss-Model Graph Results

Fig 10: Logo of the Motif Found After MEME Analysis for HCP

Fig 11: Logo of the Motif Found After MEME Analysis for RR

Fig 12: 1YVI Cartoon Structure

Fig 13: 4q7e Cartoon Structure

Fig 14: Relevant Information Transferred from Other Organisms

Fig 15: Network Stats

Fig 16: Gene Co-Expression

Fig 17: Interaction Relationship Shown with Various Coloured Line

Fig 18: Tabular Description to Define Colour and Figure Arrangements of Fig 17

Fig 19: Gene Co-Occurrence Description


BAC: Bacterial Artificial Chromosome

DNA: Deoxyribonucleic Acid

FN: False Negative

FP: False Positive

GFF: General Feature Format

GO: Gene Ontology

HCP: Histidine-containing phosphotransfer protein

IRGSP: International Rice Genome Sequencing Project

MDR: Mathematically Defined Repeat

MYA: Million Years Ago

NTI: Nuclear Threat Initiative

PAC: P1-derived Artificial Chromosome

RAP: Rice Annotation Project

RNA: Ribonucleic Acid

RR: Two-component response regulator

SNP: Single Nucleotide Polymorphism

TE: Transposable Element

WGD: Whole-Genome Duplication

WGS: Whole-Genome Shotgun

WH: With Homolog in Arabidopsis


Rice is a major staple crop of the world. Besides meeting dietary energy needs, rice also supports the world's trade economy. Because the crop is of such crucial importance, studying it at the genome level can help to increase its production and quality to meet human demand. The senescence genes of rice govern the duration of the plant's life. Understanding their properties from every angle will therefore help us to modify or alter their function in a beneficial way.

Using in silico analysis, the present study examines the characteristics and conformation of senescence-causing genes in rice. Two genes were chosen, HCP (histidine-containing phosphotransfer protein 1) and RR (two-component response regulator), because the interaction between them leads to the onset of senescence in rice. Understanding their molecular and structural properties will bring us closer to making successful adjustments. Moreover, their specific properties are responsible for the specific interaction that generates the signals triggering senescence. This analysis therefore aimed to understand the features of the two genes, as well as their interaction, by means of computational techniques.

Understanding the features, function and flow of a gene will let us establish effective measures for obtaining a beneficial outcome when altering its characters. Because direct structural data for the selected genes are not available, we first searched for the most similar homologs of the query sequences, based on sequence homology, using a local alignment tool. Further analysis was then carried out on the conformation of the most relevant homologs (structure/sequence) found.

We analyzed the query gene sequences with various dry-lab analysis tools to explore their structural and molecular features, with the aim of contributing some knowledge to further studies on delaying senescence in the rice plant in order to increase grain productivity.

Keywords: senescence, interaction, signals, plant genome annotation, plant ontologies, plant gene family databases, genome annotation pipelines, functional annotation, annotation of repetitive sequences, functional description.

Annotation is the process of identifying and describing the regions of biological interest within a genome. The location and structure of protein-coding genes is the most common form of annotation, but other types of important sequence annotation include the identification of noncoding RNAs (tRNAs, rRNAs, snoRNAs, miRNAs, siRNAs), repetitive sequences such as transposable elements, and the location of genetic markers. Annotation of plant genomic sequences can be separated into structural and functional annotation. Structural annotation is the foundation of all genomics because, without accurate gene models, understanding gene function or the evolution of genes across taxa can be impeded. It is dependent on sensitive, specific computational programs and deep experimental evidence to identify gene features within genomic DNA.

Functional annotation describes the biological context of gene sequences. It is highly dependent on sequence similarity to other known genes or proteins as the majority of initial "first-pass" functional annotation on a genomic scale is transitive. Coupling structural and functional annotation across genomes in a comparative manner promotes more accurate annotation as well as an understanding of gene and genome evolution. With the increasing availability of plant genome sequence data, the value of comparative annotation will increase. As with any new field, methodologies are evolving for genome annotation and will improve in the future.

Genome annotation is generally a long and recursive process, the difficulty of which increases with the size and complexity of the genome. It relies on a successive combination of software, algorithms, and methods, as well as the availability of accurate and updated sequence databanks. To manage the large amount of data generated by sequencing projects for genomes larger than 1 Gb, sequence annotation needs to be automated, i.e., performed through a pipeline that combines all the different programs and minimizes subsequent manual curation, which is long and laborious. Four categories of pipelines are available to support plant genome annotation, as follows:

(1) Simple commercial software such as Vector NTI and DNASTAR. Usually, these pipelines are not available on the web and they are not free of charge, even for academic research. Most importantly, they cannot be easily customized for specific needs.
(2) Suites of scripts that generate computational evidence for further manual curation. For example, DAWGPAWS (Estill and Bennetzen, 2009) was developed for annotating wheat BAC contigs and works as a series of command-line programs that produce GFF output files. This type of pipeline is not available on the web and can only be used by skilled bioinformaticians.
(3) “In-house” pipelines. A number of these have been developed by communities to annotate model plant genomes, e.g., rice (Ouyang and Buell, 2004; International Rice Genome Sequencing Project, 2005) or by major genomic resource centers such as the DOE/JGI, the MIPS, Gramene (Liang et al., 2009), GenBank, and EBI (Curwen et al., 2004). Although these pipelines are of high quality and are generally based on massive informatics resources, they are not directly accessible to users from outside. In general, these genomic and bioinformatics platforms have their own projects and priorities.
(4) Automated annotation pipelines available on the web. The first pipeline of this kind, Rice GAAS (Sakata et al., 2002), was developed originally for the annotation of the rice genome. Since then a few others have been established, such as DNA Subway (iPlant, USA), FPGP (Amano et al., 2010) and MAKER (Cantarel et al., 2008). They all have user-friendly web interfaces; however, online access limits the capacity to annotate large genomes within a reasonable time. Thus, until now, none of the publicly available online pipelines enables a thorough annotation of large genome sequences.

Almost all genome annotation is performed using semi-automated computational pipelines and is subject to some degree of interpretation and error. Therefore, researchers must understand the methods used to create annotation in order to assess the quality of that annotation.

About Oryza Sativa Japonica

The majority of the world’s population depends on cereal crops as their primary source of carbohydrate. Among the cultivated cereal crops, rice makes up to 20 per cent of the total calorific intake for the human population as a whole (http://www.irri.org/science/ricestat/index.asp). In order to cope with increasing global demand for food and because of its importance as a staple, many agro-biological studies have been performed with the aim of developing more efficient rice cultivars.

Rice has been cultivated since ancient times, and oryza is a classical Latin word for rice. Sativa means "cultivated". Oryza sativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the non-sticky, long-grained indica variety. Japonica varieties are usually cultivated in dry fields (in Japan they are cultivated mainly submerged), in temperate East Asia, upland areas of Southeast Asia, and high elevations in South Asia, while indica varieties are mainly lowland rices, grown mostly submerged, throughout tropical Asia. Rice occurs in a variety of colors, including white, brown, black, purple, and red rices. Black rice (also known as purple rice) is a range of rice types, some of which are glutinous rice. Varieties include Indonesian black rice and Thai jasmine black rice.

A third subspecies, which is broad-grained and thrives under tropical conditions, was identified based on morphology and initially called javanica, but is now known as tropical japonica. Examples of this variety include the medium-grain 'Tinawon' and 'Unoy' cultivars, which are grown in the high-elevation rice terraces of the Cordillera Mountains of northern Luzon, Philippines.

Glaszmann (1987) used isozymes to sort O. sativa into six groups: japonica, aromatic, indica, aus, rayada, and ashina.

Garris et al. (2004) used simple sequence repeats to sort O. sativa into five groups: temperate japonica, tropical japonica and aromatic comprise the japonica varieties, while indica and aus comprise the indica varieties.

Oryza sativa Japonica (rice) is the staple food for 2.5 billion people. It is the grain with the second highest worldwide production after Zea mays. Oryza sativa (rice) is a monocotyledonous flowering plant of the family Poaceae and is one of the most important crop plants in the world, providing the principal food source for half of the world's population. Oryza sativa subsp. japonica is one of three major sub-species of rice, the others being indica and javanica. Oryza sativa subsp. japonica is short-grained and high in amylopectin so that the grains stick together when cooked, which distinguishes it from subsp. indica which is long grained and not sticky. Oryza sativa subsp. japonica is grown in dry fields, mainly in temperate or colder climates.

In addition to its agronomic importance, rice is an important model species for monocot plants and cereals such as maize, wheat, barley and sorghum. O. sativa has a compact diploid genome of approximately 500 Mb (n=12) compared with the multi-gigabase genomes of maize, wheat and barley.

Oryza sativa has a haploid chromosome number of 12, containing 370 Mb with about 36,000 protein-coding genes. Rice was an obvious choice for the first whole genome sequencing of a cereal crop. It is the smallest of the major cereal crop genomes and is the easiest to transform genetically. The cultivar sequenced from the japonica subspecies was Nipponbare (www.thericejournal.com/content/6/1/4/abstract).

Since 2002, when genome assemblies of the two major rice varieties (Oryza sativa L. ssp. indica 93-11 and Oryza sativa L. ssp. japonica Nipponbare) were published, efforts to construct better rice reference genomes have continued to this date. Comparative analyses on rice and other large plant genomes have been promoting the application of genomic research activities to agricultural practice, such as marker-assisted breeding for the improvement of biotic and abiotic stress resistances. Although a number of databases and web servers have been constructed for rice and related plant genomes, a comprehensive database or knowledgebase for general rice genomic information is still necessary, especially as data are still being generated at a fast rate for this much-treasured crop.

With the completion of the rice genome (Oryza sativa L. ssp. japonica cultivar Nipponbare) by the international consortium on rice genome sequencing (International Rice Genome Sequencing Project 2005), it has become possible to elucidate the layers of information encoded by the sequence. Analyses of rice full-length cDNAs and the rice proteome are in progress (Kikuchi et al. 2003; Komatsu et al. 2004; Komatsu and Tanaka 2005). Additionally, construction of integrative annotations for the rice genome, transcriptome, and proteome is being undertaken. In order to standardize the annotation of the genome data for Nipponbare, we organized an international consortium for rice genome annotation, the Rice Annotation Project (RAP), with the aim of allowing more efficient analysis of genomic information and accelerating post-sequencing research activities. It is also hoped that the annotation will provide a comparative data resource for cereal genomics researchers working on other species and contribute to their endeavors.

To cope with the enormous amount of information produced by large-scale sequencing, several automated annotation methods have been developed for the purpose of efficient data processing. However, it is acknowledged that automated annotation alone tends to result in a high proportion of erroneous annotations, and therefore annotation data results should be carefully curated by experts before any public release in order to cut down on the amount of these erroneous annotations. Currently, manual curation remains a necessary process for developing an accurate biological database (Misra et al. 2002; Camon et al. 2003). With this in mind, we brought together a large group of specialists to curate the results of our automated rice gene functional assignment. By bringing individuals from complementary disciplines together, the amount of time required to perform the manual curation was significantly reduced.

There are a large number of full-length cDNAs and expressed sequence tags (ESTs) available for rice and other cereals (Fernandes et al. 2002; Wu et al. 2002; Kikuchi et al. 2003; Gardiner et al. 2004; Lai et al. 2004; Zhang et al. 2004; Jantasuriyarat et al. 2005). This wealth of information is a boon for genome annotation because it provides excellent support for transcribed regions, which, in turn, allows more precise predictions than current ab initio methods can provide. By using the annotation data set based on our curation and mapping of cDNAs to the genome, we were able to approximate the number of genes in the rice genome, classify transcribed sequences by probable function, and identify other features pertinent to the rice genome.

Arabidopsis thaliana is one of the most well-studied model organisms. Comparison of rice with the dicotyledon may assist in developing a greater understanding of intrinsic mechanisms among cereals at the molecular level. Use of knowledge accumulated about A. thaliana genes to quantify their counterparts in rice is one example of such a comparative study (Izawa et al. 2003; Yamaguchi et al. 2006). Additionally, clues as to the evolution of these two flowering plants could be obtained. Here we describe a comparative analysis of the genomes and protein sequences of O. sativa and A. thaliana on the basis of manually curated data. This analysis focuses on the number of genes, composition of functional domains, and patterns of gene duplication.


Scientists from the MSU Rice Genome Annotation Project (MSU), the International Rice Genome Sequencing Project (IRGSP) and the Rice Annotation Project Database (RAP-DB) generated a unified assembly of the 12 rice pseudomolecules of Oryza sativa Japonica Group cv. Nipponbare.

The pseudomolecule for each chromosome was constructed by joining the nucleotide sequences of each PAC/BAC clone based on the order of the clones on the physical map. Overlapping sequences were removed and physical gaps were replaced with Ns. Updated pseudomolecules were constructed based on the original IRGSP sequence data in combination with a BAC-optical map and error correction using 44-fold coverage next-generation sequencing reads. The nucleotide sequences of seven new clones mapped on the euchromatin/telomere junctions were added to the new genome assembly. In addition, several clones in the centromere region of chromosome 5 were improved and one gap on chromosome 11 was closed.
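The assembly rule described above (join clones in physical-map order, drop overlapping sequence, fill physical gaps with Ns) can be sketched in a few lines of Python. This is only an illustrative sketch, not the IRGSP pipeline: the function name `build_pseudomolecule`, the input layout, the toy sequences and the gap length are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def build_pseudomolecule(
    clones: List[Tuple[str, Optional[int]]],
    gap_length: int = 10,
) -> str:
    """Join ordered clone sequences into one pseudomolecule.

    Each entry is (sequence, overlap), where `overlap` is the number of
    leading bases already covered by the previous clone. `overlap=None`
    marks a physical gap before the clone (hypothetical convention).
    """
    parts: List[str] = []
    for index, (sequence, overlap) in enumerate(clones):
        if index == 0:
            # The first clone starts the pseudomolecule as-is.
            parts.append(sequence)
        elif overlap is None:
            # Physical gap: fill with Ns, then append the whole clone.
            parts.append("N" * gap_length)
            parts.append(sequence)
        else:
            # Remove the part that overlaps the previous clone.
            parts.append(sequence[overlap:])
    return "".join(parts)


if __name__ == "__main__":
    # Toy clones, not real rice sequence data.
    toy_clones = [("ACGTAC", None), ("TACGGA", 3), ("TTTT", None)]
    print(build_pseudomolecule(toy_clones, gap_length=5))
```

In a real assembly the overlap lengths would come from aligning neighbouring clones, and the gap size would follow the convention of the assembly; both are simply supplied by hand here.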

Kawahara et al. (2013) describe the integrated Os-Nipponbare-Reference-IRGSP-1.0 pseudomolecules, also known as MSU7. Gene loci, gene models and associated annotations were independently created by each group, but can be easily compared using the common reference.


International Rice Genome Sequencing Project (IRGSP) gene models were imported from the Rice Annotation Project (RAP-DB). The most recent update was from its 26th November 2018 release. This version corrected gene models through manual curation and deprecated some bad models. In total, 35,666 protein-coding genes were included. Feature annotation and comparative analysis pipelines have been run and variations have been projected from the old annotation to the new one.

MSU-7 gene models were also loaded for visual comparison to the IRGSP set. Cross references between the two gene sets provided by RAP-DB allow searching and querying using either identifier space, but only the IRGSP/RAPDB models are used in our gene trees.

Review Of Literature

Achieving a robust structural and functional genome sequence annotation is essential to provide the foundation for further relevant biological studies. Genome annotation consists of identifying and attaching biological information to sequence features. It represents one of the most difficult tasks in genome sequencing projects (Elsik et al., 2006), particularly today, when the advent of high-throughput next-generation sequencing (NGS) technologies enables genome sequences to be produced at a high pace. The reality at present is that new genomes are being sequenced at a faster rate than they are being fully and correctly annotated (Cantarel et al., 2008). It took about 7 years and a large community effort to sequence and fully annotate the Arabidopsis thaliana (The Arabidopsis Genome Initiative, 2000) and rice genomes (International Rice Genome Sequencing Project, 2005) at a quality that none of the genomes sequenced since has yet reached. In the past 5 years, the production of plant genome sequences has grown exponentially (for a review, see Feuillet et al., 2011). In August 2011, the NCBI Entrez Genome Project web site listed 135 land plant genome sequencing projects including 36 completed or assembled genomes and 101 in progress. Out of the 36 sequenced genomes, 23 have been released in the past 2 years. Among those, only two genomes larger than 1 Gb, maize (Schnable et al., 2009) and soybean (Schmutz et al., 2010), have been sequenced and annotated.

Rice (Oryza sativa), commonly known as Asian rice, is a member of the grass family. It is among the most widely preferred energy-giving cereal crops, with world paddy production of about 758.8 million tonnes (503.6 million tonnes, milled basis) in 2017. According to the FAO, India produced 165.5 million tonnes (110.4 million tonnes, milled basis) in 2017, which is about 1 per cent more than the 2016 all-time high.

Rice is a distant member of the grass family, and the plant consists mainly of paddy and leaf, pigmented throughout by chlorophyll. Photosynthesis uses water (H2O), carbon dioxide (CO2) and minerals from the soil to convert energy from sunlight into the edible organic components of the grain, releasing oxygen (O2). Relative to its morphology, rice has a high requirement for water and carbon dioxide to produce its complex-carbohydrate grain. Plants have undergone long natural acclimatization and evolved into present-day crops, which convert energy only for as long as they live. Hence, the production of food is limited: crops have a specific age of fruiting and a limited time in which to carry out photosynthesis, as programmed in their genomes over a long course of evolution.

In the rice plant, numerous factors influence the response of leaf photosynthesis to light. The first factor is the elevated leaf temperature that accompanies high irradiance, which causes metabolic imbalances (Pastenes and Horton, 1996b), has an injurious effect on thylakoid function (Pastenes and Horton, 1996a), heightens photoinhibition (Fuse et al., 1993), and raises photorespiration (Leegood and Edwards, 1996). The second factor is leaf angle, which has been found to affect the level of light saturation in the upper leaves (Yoshida, 1981b).

Rice production is also affected by leaf senescence. Leaf senescence in rice begins in the lowest leaves and moves upward as the plant grows and develops. The decline in assimilatory efficiency as leaf senescence proceeds contributes to limited grain yield (Nooden, 1988b), and delaying leaf senescence may raise crop productivity. Leaf senescence, which finally leads to the death of the organ, is an endogenously controlled degenerative process. Moreover, this age-dependent process is affected by a complex interaction of factors related to developmental age, which are categorised as internal and external factors (Nooden, 1988a; Smart, 1994). At the outset of leaf senescence, photosynthetic activity decreases more quickly than chlorophyll content; as a result, the rate of oxygen evolution determined on a chlorophyll basis falls with lowering leaf position. The chlorophyll a/b ratio also falls as senescence advances, and the composition of the photosynthetic antenna system is affected by senescence as well. Moreover, it has been shown repeatedly that chlorophyll a is lost sooner than chlorophyll b during senescence, which reduces the chlorophyll a/b ratio (Youn and Ota 1973, Patterson and Moss 1979, Jenkins et al. 1981, Maunders and Brown 1983). Oxygen evolution rates determined on the basis of chlorophyll a bound to the reaction centre complexes stay stable throughout senescence. Hence, the shutdown of photosynthesis is closely related to the loss of the reaction centre complexes during leaf senescence of rice seedlings. This implies that the decrease in photosynthesis results from the loss of functional units of photosynthesis or from a reduction in the number of whole chloroplasts.
[Relationship between Photosynthesis and Chlorophyll Content during Leaf Senescence of Rice Seedlings; Mariko Kura-Hotta, Kazuhiko Satoh and Sakae Katoh] Breeding efforts over the last 50 years have also shown that delaying leaf senescence and extending the period of active photosynthesis increased total photoassimilation and enhanced overall grain yield (Richards 2000; Long et al. 2006).

Rice yield arises from grain filling, which depends on sources of carbon (CO2) and nitrogen: those supplied by photosynthetically active leaves and those remobilized from the vegetative tissues (Yang and Zhang 2006). In small-grained cereals such as rice, 10%–40% of the final grain weight derives from pre-anthesis photoassimilation (Gebbing and Schnyder 1999; Yang and Zhang 2006). Leaf senescence can, however, either increase or decrease the final grain mass. A recent modelling study, which used nine years of satellite records of wheat growth in northern India to monitor rates of wheat senescence after exposure to temperatures higher than 34 °C, showed a consistent acceleration of senescence under high heat (Lobell et al. 2012). Thus, acclimating crops to extreme high temperatures will be a requirement of future studies of crop success. Various stay-green varieties with delayed leaf senescence show a number of positive effects, including extra root growth, extra carbon supply, and a shorter period between anthesis and silking, as reviewed by Davies et al. (2011). Hence, the timing of the onset of leaf senescence plays a major role in crop yield.

There is a complex relationship between the onset of leaf senescence and the utilization of nitrogen by the plant (Chardon et al. 2010; Masclaux-Daubresse et al. 2010; Masclaux-Daubresse and Chardon 2011). Nevertheless, an uncontrolled and unmonitored delay in senescence can reduce grain filling. The effect of delayed leaf senescence on grain protein concentration and on grain mass depends on the availability of nitrogen during the post-anthesis period (Bogard et al. 2011). Thus, variation in post-anthesis leaf senescence should be managed with close attention to genetic and management control.

The two-component His kinase (HK) system, or two-component system (TCS), is known to play important roles in the regulation of prokaryotic and lower eukaryotic cellular responses to environmental stimuli (Grebe and Stock, 1999; Stock et al., 2000; Urao et al., 2000a, 2000b; Besant et al., 2003). These include antibiotic resistance, chemotaxis, differentiation, and nitrogen metabolism. The existence of a bacterial-type HK in plants was first reported by Chang et al. (1993). Since then, many plants have been documented to possess genes encoding two-component regulators, and their participation in the perception and integration of various extracellular and intracellular signals has been shown (Lohrmann and Harter, 2002; Oka et al., 2002; Grefen and Harter, 2004; Hass et al., 2004; for review, see Stock et al., 2000; Foussard et al., 2001).

In bacterial systems, the typical machinery, such as that of Escherichia coli EnvZ protein (an osmosensor), consists of a membrane-localized HK that senses the input signal and a response regulator (RR) that contains a conserved regulatory domain. In this system, the signaling is initiated when the HK, modulated by the environmental stimulus, autophosphorylates the conserved His residue. The phosphoryl group is then transferred to a conserved Asp residue in the RR, which results in modulation of its activity (Fig. 1A). Autophosphorylation of the HK is a bimolecular reaction between homodimers in which one HK monomer catalyzes the phosphorylation of the His residue in the second monomer (Pan et al., 1993; Surette et al., 1996). Besides directing the forward phosphorylation reaction, many HKs possess a phosphatase activity, enabling them to dephosphorylate their cognate partner (Keener and Kustu, 1988; Lois et al., 1993). The phosphotransfer pathways, which need to be shut down quickly, operate with such bifunctional HKs. More elaborate HKs, also known as hybrid kinases, are typical of eukaryotes and a few prokaryotes and contain multiple phosphodonor and phosphoacceptor sites. Signaling via these pathways uses multiple His phosphotransfers (Hpt) and receiver domains (RDs) or proteins that connect to final RRs or other signaling outputs (Fig. 1B). Examples of such HKs include the SLN1/YPD1/SSK1 phosphorelay in yeast (Saccharomyces cerevisiae), which controls the HOG pathways in response to osmotic stress (Wurgler-Murphy and Saito, 1997). Such a multistep phosphorelay reaction is believed to have the advantage of providing multiple regulatory checkpoints for signal cross talk or negative regulation by certain phosphatases. The existence of multiple phosphorelay reactions in both prokaryotes and eukaryotes suggests that similar mechanisms might be more widely used (Urao et al., 2000a).

Methods And Methodology

Selection of Sample Sequence

To understand the process of senescence in Oryza sativa, two query protein sequences were selected: the histidine-containing phosphotransfer protein (HCP) and the response regulator (RR), whose interaction carries out the process of senescence. The analysis was carried out on these two protein sequences to understand their characteristics. The two sequences are as follows:


[Figure not included in this excerpt]


[Figure not included in this excerpt]

Both sequences were retrieved from UniProt (The UniProt Consortium, UniProt: the universal protein knowledgebase, Nucleic Acids Res. 45: D158-D169 (2017)) as required for the analysis.

Performing Sequence Similarity Search:

A homology search was performed using BLAST (Basic Local Alignment Search Tool). BLAST is based on a heuristic search algorithm and is used to perform sequence similarity searches, that is, to find homologs of the query sequence. BLAST offers several programs; we chose blastp, which searches a protein query against a protein database. This search provides the most similar homologs of the query sequence, assessed by query coverage, percent identity and E-value. We used the default search parameters: the database was nr (non-redundant protein sequences), the algorithm was blastp, the matrix was BLOSUM62, the threshold value was 21 and the expect value was 10. The results with identity >= 49% and query coverage >= 93% were tabulated. We then repeated the blastp search with the database switched from nr to PDB (Protein Data Bank), keeping the same algorithm (blastp), matrix (BLOSUM62), threshold value (21) and expect value (10).
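The tabulation step above (keeping hits with identity >= 49% and query coverage >= 93%) can be sketched as follows. The sketch assumes the hits were exported as tab-separated text with five columns (query, subject, percent identity, query coverage, E-value); that layout, the function name `filter_hits` and the hit names are hypothetical, since the study does not state how the results were exported.

```python
import csv
import io
from typing import Dict, List, Union

Hit = Dict[str, Union[str, float]]


def filter_hits(
    tsv_text: str,
    min_identity: float = 49.0,
    min_coverage: float = 93.0,
) -> List[Hit]:
    """Keep hits that meet both cutoffs, sorted by ascending E-value."""
    kept: List[Hit] = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        query, subject, identity, coverage, evalue = row[:5]
        hit: Hit = {
            "query": query,
            "subject": subject,
            "identity": float(identity),
            "coverage": float(coverage),
            "evalue": float(evalue),
        }
        if hit["identity"] >= min_identity and hit["coverage"] >= min_coverage:
            kept.append(hit)
    # Smaller E-values indicate more significant alignments.
    kept.sort(key=lambda h: h["evalue"])
    return kept


if __name__ == "__main__":
    # Made-up hits for illustration only.
    example = (
        "query\thitA\t98.5\t100\t1e-90\n"
        "query\thitB\t48.9\t99\t2e-30\n"
        "query\thitC\t75.0\t80\t5e-40\n"
        "query\thitD\t49.0\t93\t3e-20\n"
    )
    for hit in filter_hits(example):
        print(hit["subject"], hit["identity"], hit["coverage"], hit["evalue"])
```

Both cutoffs are inclusive, matching the ">=" thresholds stated in the text.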

Tabulating the Sequence

For further analysis, the sequences of all BLAST hits were copied and arranged. Using the accession numbers from the BLAST search results as a reference, we tabulated the sequences from databases such as UniProt and NCBI (https://www.ncbi.nlm.nih.gov/nuccore/). After collecting the sequences, we moved on to further analysis.

Performing Multiple Sequence Alignment (MSA)

The sequences from the BLAST search were tabulated and checked, so that repeated sequences could be eliminated and the remainder organized in a table for readability. MSA was then performed using ClustalW, which builds a progressive alignment of three or more sequences guided by a tree computed from pairwise distances. The tool was run with its default settings, which are sufficient for this analysis. The pairwise alignment parameters for the fast/approximate mode were K-tuple (word) size: 1, window size: 5, gap penalty: 3, number of top diagonals: 5 and scoring method: percent; for the slow/accurate mode they were gap open penalty: 10.0, gap extension penalty: 0.1 and weight matrix: BLOSUM (for PROTEIN). The multiple alignment parameters were gap open penalty: 10, gap extension penalty: 0.05 and weight matrix: BLOSUM (for PROTEIN).
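The elimination of repeated sequences before alignment could be done along the following lines. This is a minimal sketch, not the procedure actually used in the study: the helper names `parse_fasta` and `drop_duplicates` and the toy FASTA records are assumptions made for illustration, and "repeated" is taken here to mean an identical residue string.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen, unique = set(), []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Toy records, for illustration only.
fasta_text = """>seq1 hypothetical
MKTAYIAK
QRQISFVK
>seq2 hypothetical duplicate of seq1
MKTAYIAKQRQISFVK
>seq3 hypothetical
MSTNPKPQ
"""

unique_records = drop_duplicates(parse_fasta(fasta_text))
print([header.split()[0] for header, _ in unique_records])  # ['seq1', 'seq3']
```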

Establishing Phylogenetic Relationship

To relate the sequences to one another, we performed a phylogenetic analysis of all the tabulated sequences using the web interface Phylogeny.fr. This analysis traces the phylogenetic development among the sequences and establishes their evolutionary relationships.

Prediction of Structure

When dealing with a protein, most attention goes to its structure. It is well known that a protein's functionality, specificity, mode of operation and working efficiency are grounded in its structure, and every protein is unique in its conformation. To understand and characterize the proteins, we therefore performed several structure predictions.

To study the protein structure, we first performed blastp against the pdb database (step 2 above). PDB (Protein Data Bank) contains structural information about proteins whose structures have been determined by either NMR or X-ray crystallography, so this search returns results on the basis of structural similarity. The search was run with the default parameters: the database was PDB (Protein Data Bank), the matrix was BLOSUM62, the threshold value was 21 and the expected value was 10. The results with ident (>=44%) and query cover (>=93%) were recorded.

A secondary structure prediction was performed using GOR4, which reports the structural detail of the amino acid sequence in terms of Alpha helix (Hh), 3-10 helix (Gg), Pi helix (Ii), Beta bridge (Bb), Extended strand (Ee), Beta turn (Tt), Bend region (Ss), Random coil (Cc) and Ambiguous states (?), together with the count of each state in the sequence.
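The per-state counts in such a report can be reproduced from a prediction string with one state letter per residue. The sketch below assumes that format; the helper `summarize_states` and the toy prediction string are hypothetical, and only the state codes and names are taken from the list above.

```python
from collections import Counter

# State codes as listed above; lookup is case-insensitive.
STATE_NAMES = {
    "H": "Alpha helix",
    "G": "3-10 helix",
    "I": "Pi helix",
    "B": "Beta bridge",
    "E": "Extended strand",
    "T": "Beta turn",
    "S": "Bend region",
    "C": "Random coil",
    "?": "Ambiguous states",
}


def summarize_states(prediction):
    """Return {state name: (count, percent)} for every listed state."""
    counts = Counter(prediction.upper())
    total = len(prediction)
    summary = {}
    for code, name in STATE_NAMES.items():
        n = counts.get(code, 0)
        percent = round(100.0 * n / total, 2) if total else 0.0
        summary[name] = (n, percent)
    return summary


# Toy prediction string (20 residues), for illustration only.
toy_prediction = "cccchhhhhhhhcceeeecc"
for name, (n, percent) in summarize_states(toy_prediction).items():
    if n:
        print(f"{name}: {n} ({percent}%)")
```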

Constructing a 3D Model

Defining a structure requires a similar structure (a homolog) whose conformation has already been described. To accomplish this, we performed 3D modelling of the sample protein sequences using SWISS-MODEL, an online server for protein structure homology modelling. It reads the query sequence, finds homologous templates for it and, with reference to a protein structure database, creates a model of the query protein sequence.

Scanning of Motif

Motifs are highly important in protein structure: a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is supposed to have, biological significance. Motif scanning of the most relevant blastp hits (against pdb) was performed using MEME (Multiple EM for Motif Elicitation). MEME discovers novel, ungapped motifs (recurring, fixed-length patterns) in the input sequences and splits variable-length patterns into two or more separate motifs.

Mapping of Motifs

We performed the mapping using PyMOL. PyMOL is a molecular visualization tool that supports animations, high-quality rendering, crystallography and other common molecular graphics activities.

Among the detected motifs, the most relevant ones according to their characteristics were marked and coloured on the most relevant protein structures collected from pdb, chosen on the basis of structural homology (blastp). The two structures were 1yvi (for HCP), which is Chain A, X-RAY STRUCTURE OF PUTATIVE HISTIDINE-CONTAINING PHOSPHOTRANSFER PROTEIN FROM RICE, with 99% query cover and 71% ident, and 5lxu (for RR), which is Chain A, Structure of The Dna-binding Domain of Lux Arrhythmo, with 8% query cover and 68% ident. Afterward, the structure within 6 Å of each selected motif was selected and marked in a different colour to show the interaction of the motif with its neighbouring structure. Polar contacts within 6 Å of the selected motif were also mapped.
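The idea behind the 6 Å neighbourhood selection can be illustrated independently of PyMOL as a simple distance test. The following is a sketch under assumptions: the helper `residues_within`, the residue numbers and the coordinates are all hypothetical and do not come from 1yvi or 5lxu.

```python
import math


def residues_within(motif_atoms, other_atoms, cutoff=6.0):
    """Return sorted residue ids from other_atoms that have at least one
    atom within `cutoff` angstroms of any motif atom.

    Each atom is a (residue_id, (x, y, z)) tuple.
    """
    near = set()
    for res_id, coord in other_atoms:
        if any(math.dist(coord, m_coord) <= cutoff for _, m_coord in motif_atoms):
            near.add(res_id)
    return sorted(near)


# Hypothetical coordinates in angstroms, for illustration only.
motif_atoms = [(10, (0.0, 0.0, 0.0)), (11, (3.8, 0.0, 0.0))]
other_atoms = [
    (12, (7.6, 0.0, 0.0)),     # 3.8 A from residue 11
    (40, (0.0, 5.0, 0.0)),     # 5.0 A from residue 10
    (41, (0.0, 9.0, 0.0)),     # farther than 6 A from both motif atoms
    (75, (20.0, 20.0, 20.0)),  # far away
]

print(residues_within(motif_atoms, other_atoms))  # [12, 40]
```

A brute-force comparison of every atom pair is adequate for a single chain; a spatial index would only be worth adding for much larger structures.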

Creating Protein-Protein Interaction Prediction


To show the interaction between HCP and RR we used STRING. STRING critically assesses and integrates protein-protein interactions, including direct (physical) as well as indirect (functional) associations. Its coverage of more than 2000 organisms has necessitated scalable algorithms for transferring interaction information between organisms. It also has a completely redesigned prediction pipeline for inferring protein-protein associations from co-expression data, an API interface for the R computing environment, and improved statistical analysis for enrichment tests in user-provided networks.

To predict an interaction, it accepts different types of input, such as a protein name, an identifier, or the FASTA sequence of the query protein. It generates its results with reference to a non-redundant protein database.

For our work, we chose the multiple protein sequence input option and pasted both sequences (HCP and RR) into the input form. The organism selected was Oryza sativa japonica.

By using PPCheck

PPCheck is a web server that can be used to quantify various aspects of the interactions between any two given proteins or chains. The strength of the interactions is analyzed using standard energy calculations for non-bonded interactions: van der Waals, electrostatic and hydrogen-bond interactions. The sum of these interactions is described as a pseudo-energy, whose ranges have been standardized using known sets of protein-protein complexes (Anshul and Sowdhamini, Molecular BioSystems, 2013).

A zipped PDB file of the selected structure was downloaded and uploaded to PPCheck to check all possible interactions.

Figure not included in this excerpt

Results And Discussion

1) Sequence Similarity Search outcome.

The search was first performed using blastp against the nr (non-redundant protein sequences) database. The motive for using the nr database was to obtain results on the basis of sequence homology. blastp was then performed again with the database switched from nr to pdb (Protein Data Bank). The motive for using the pdb database was to obtain results on the basis of structural homology.

Conserved domains were also detected in the searches against both databases (nr and pdb), and they were found to support a common inference. The remaining output sequences, which are not listed below, were tabulated and checked for repeated sequences, which were eliminated if present.

The tables in green contain the output of the nr database, the tables in blue represent the pdb database, and those in orange represent the conserved-domain results, which are common to the searches in both databases.


After performing blastp against the nr database, we obtained 100 BLAST hits on 100 subject sequences. Of those 100 hits, 13 have an alignment score >=200 and the rest range between 80 and 200. We tabulated 10 outputs, which have query cover >=91% and ident >=73%.

For RR

After performing blastp against the nr database, we obtained 100 BLAST hits on 100 subject sequences, and all 100 hits have an alignment score >=200. We tabulated 10 outputs, which have query cover >=99% and ident >=69%.


We obtained 6 BLAST hits on 6 subject sequences. Of those 6 hits, 1 has an alignment score >=200 and the remaining 5 range between 80 and 200. We tabulated all 6 outputs, which have query cover >=93% and ident >=44%.
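Counts of hits per alignment-score range, as reported above, can be tallied as in the following sketch. The range boundaries follow the score ranges of the BLAST graphical summary; the helper names `score_bin` and `tally_scores` and the example scores are hypothetical and are not the scores obtained in this study.

```python
from collections import Counter


def score_bin(score):
    """Map an alignment score to its score-range label."""
    if score >= 200:
        return ">=200"
    if score >= 80:
        return "80-200"
    if score >= 50:
        return "50-80"
    if score >= 40:
        return "40-50"
    return "<40"


def tally_scores(scores):
    """Count how many hits fall into each score range."""
    return Counter(score_bin(s) for s in scores)


# Hypothetical scores for six hits, for illustration only.
example_scores = [310.0, 199.9, 150.0, 120.3, 95.2, 81.0]
print(dict(tally_scores(example_scores)))  # {'>=200': 1, '80-200': 5}
```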

Table 1: - 10 Relevant Sequences Producing Significant Alignments for HCP (Under nr Database)

Figure not included in this excerpt

The histidine phosphotransfer domain participates in signalling through a two-component system. In this system an autophosphorylating histidine protein kinase serves as the phosphoryl donor to a response regulator protein, and phosphorylation and dephosphorylation of a conserved aspartic acid residue modulate the response regulator protein. Two-component proteins are abundant in most eubacteria. About 62 two-component proteins participate in a number of processes, including chemotaxis, metabolism, osmoregulation and transport. They are also found in Gram-positive and Gram-negative pathogenic bacteria, where they regulate basic housekeeping functions and also control the expression of toxins and other proteins relevant to pathogenesis. In archaea and eukaryotes, however, two-component pathways make up only a very small number of all signalling systems. In fungi they mediate environmental stress responses and, in pathogenic yeast, hyphal development. In Dictyostelium and in plants, they participate in crucial processes such as osmoregulation, cell growth and differentiation. To date, two-component proteins have not been identified in animals. In most prokaryotic systems the RR, which functions as a transcription factor, directly effects the output response, whereas in eukaryotic systems two-component proteins are always found at the onset of signalling pathways, where they interface with more conventional eukaryotic signalling schemes such as MAP kinase and cyclic nucleotide cascades.

Pssm-ID: 238041 Cd Length: 94 Bit Score: 54.31 E-value: 2.74e-10

Figure not included in this excerpt

The histidine-containing phosphotransfer (HPt) domain is a novel protein module with an active histidine residue that mediates phosphotransfer reactions in two-component signalling systems. A multistep phosphorelay involving the HPt domain has been suggested for these signalling pathways. The crystal structure of the HPt domain of the anaerobic sensor kinase ArcB has been determined. The HPt domain consists of six alpha helices containing a four-helix bundle fold. The pattern of sequence similarity between the HPt domain of ArcB and components of other signalling systems can be interpreted in light of the three-dimensional structure and supports the conclusion that HPt domains have a common structural motif in both prokaryotes and eukaryotes. In S. cerevisiae Ypd1p, this domain has been shown to contain a binding surface for Ssk1p (response regulator receiver domain containing protein, pfam00072).

Pssm-ID: 307655 Cd Length: 84 Bit Score: 53.51 E-value: 4.69e-10

Figure not included in this excerpt

Signal receiver domain: originally thought to be unique to bacteria (CheY, OmpR, NtrC and PhoB), it has now also been detected in eukaryotes, for example in ETR1 of Arabidopsis thaliana. This domain receives the signal from the sensor partner in a two-component system. It contains a phosphoacceptor site that is phosphorylated by histidine kinase homologs, is usually found N-terminal to a DNA-binding effector domain, and forms homodimers.

Myb-like DNA-binding domain, SHAQKYF class: this model describes a DNA-binding domain restricted to (but common in) plant proteins, many of which also contain a response regulator domain. The domain appears to be related to the Myb-like DNA-binding domain described by pfam00249. It is distinguished in part by a well-conserved motif, SH[AL]QKY[RF], at the C-terminal end of the motif.
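The conserved motif SH[AL]QKY[RF] can be read directly as a regular expression, where each bracket lists the residues allowed at that position. The sketch below scans a sequence for it; the helper `find_motif` and the toy sequence are hypothetical and are not one of the query sequences of this study.

```python
import re

# The motif quoted above, written as a regular expression.
SHAQKYF_PATTERN = re.compile(r"SH[AL]QKY[RF]")


def find_motif(sequence, pattern=SHAQKYF_PATTERN):
    """Return (1-based start position, matched residues) for each
    non-overlapping occurrence of the motif in the sequence."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence.upper())]


# Toy sequence containing two variants of the motif, for illustration only.
toy_sequence = "MAAASHAQKYFGGSHLQKYRTT"
print(find_motif(toy_sequence))  # [(5, 'SHAQKYF'), (14, 'SHLQKYR')]
```

A pattern scan of this kind only locates occurrences of an already known motif; it does not replace de novo motif discovery with a tool such as MEME.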

Pssm-ID: 130620 Cd Length: 57 Bit Score: 93.24 E-value: 1.54e-22

Figure not included in this excerpt

Multiple Sequence Alignment (MSA) Output.

After eliminating repeated sequences and those whose sequences were not found in the database (among the blastp hits), we were left with 59 sequences with which to carry out MSA. The results were copied down as text. Among those 58 groups of results, the 10 groups with the highest alignment scores were chosen, from both the HCP and the RR MSA calculations.


Quote paper
IDSAsr Study (Author), 2020, A Case Study of Oryza sativa. Annotation of Plant Genome, Munich, GRIN Verlag, https://www.grin.com/document/925069

