Nuclear transport of proteins is a basic cellular mechanism that precedes many biological processes. In the classical transport mechanism, karyopherins import nuclear proteins into the nucleus and export them from it. Karyopherins typically recognize nuclear transport signals in the protein sequence. Research on nuclear protein transport focuses on three main types of nuclear localization signals (NLS): monopartite, bipartite and PY-NLS. Studies on nuclear export signals (NES) often investigate the specific class of leucine-rich signals.
The first goal of this thesis was to update NLSdb, a database containing 114 experimental and 194 potential NLS, to the current state of available data. To this end, a set of 2452 novel signals with published experimental evidence was extracted from the literature and used as the development set. An in silico mutagenesis approach was applied to this set to detect 4301 novel potential NLS in nuclear proteins. We matched these potential NLS against protein sequences without annotated subcellular localization to identify nuclear proteins. For proteins predicted in this way, we were able to confirm the nuclear localization in the literature.
In addition to collecting the data, we performed an extensive analysis of protein sequences containing NLS and NES to provide insights into the subcellular localization of these proteins and their occurrence in various organisms. Clustering the NLS sequences separated the signals into distinct sub-groups, each with a clearly defined consensus sequence. Aligning potential NLS against the sub-groups refined the consensus sequences.
The results of this study reflect scientific progress, extend knowledge in the field of nuclear transport and highlight the usefulness of bioinformatics methods for gaining new insights in biology. Nuclear transport is relevant to many areas of research, for example allergic reactions, cancer and other diseases. The outcome of this work provides a solid foundation for other studies of nuclear transport signals.
Table of Contents
1. Introduction
1.1. Cellular compartmentalization
1.2. Nuclear localization signal (NLS)
1.2.1. Monopartite NLS
1.2.2. Bipartite NLS
1.2.3. PY-NLS
1.3. Nuclear export signal (NES)
1.4. NLSdb - Database of nuclear localization signals 1.0
1.5. Motivation
2. Materials and Methods
2.1. Collection of experimentally verified nuclear transport signals
2.1.1. NLSs
2.1.1.1. The database NLSdb1.0
2.1.1.2. Publication of Lange et al.
2.1.1.3. Prediction tool SeqNLS
2.1.1.4. The Swiss-Prot database
2.1.1.5. PY-NLS sources
2.1.1.6. Others
2.1.2. NESs
2.1.2.1. The database ValidNESs
2.1.2.2. The NESdb
2.1.2.3. NESbase database
2.1.2.4. The Swiss-Prot database
2.1.2.5. The prediction tool NESMapper
2.1.2.6. Others
2.1.3. Test set – unannotated Swiss-Prot proteins
2.2. In silico mutagenesis
2.2.1. Sets of nuclear and non-nuclear proteins
2.2.2. Mutagenesis approach
2.3. Data analysis
2.3.1. Data pre-processing tools
2.3.2. Protein function and NLS prediction tools
3. Results and Discussion
3.1. Experimental development dataset
3.2. Sequence properties of nuclear localization signals and their proteins
3.2.1. Signal length
3.2.2. Organism of origin
3.2.3. Sequence similarity
3.2.4. Subcellular localization
3.2.5. Clustering of signals
3.3. 4301 novel potential NLSs through mutagenesis
3.3.1. Characterization of potential NLSs
3.3.2. Increasing coverage from 9% to 43%
3.4. Benchmark - NLSdb1.0 vs. NLSdb2.0
3.4.1. 38% of proteins with novel potential NLSs in NLSdb1.0
3.4.2. 100% overlap between NLSdb1.0 and NLSdb2.0
4. Conclusion
5. Outlook
Objectives & Topics
This thesis aims to update the NLSdb database by integrating newly collected experimental data and generating novel potential nuclear localization signals (NLSs) through in silico mutagenesis, thereby improving the coverage and predictive capability for identifying nuclear proteins. Furthermore, the study performs an extensive analysis of sequence properties and sub-group classification of transport signals to provide biological insights.
- Updating the NLSdb database with current experimental data
- Application of in silico mutagenesis to discover novel potential NLSs
- Analysis of sequence properties, signal length, and organism distribution of transport signals
- Refinement of consensus sequences through clustering and alignment of signal sub-groups
- Benchmarking the updated database (NLSdb2.0) against the previous version and evaluating predictive coverage
Excerpt from the book
2.2.2. Mutagenesis approach
The development set of 2452 experimentally verified NLSs was used as the training set for the iterative in silico mutagenesis approach. The algorithm was divided into three main steps:
First, the development set was reduced to the experimental NLSs that can be found in proteins with annotated nuclear localization in Swiss-Prot. Only signals that did not occur in any protein sequence of the non-nuclear set were kept. These signals were then tested for occurrence in the protein sequences of the nuclear dataset.
Second, we performed a mutational step, using the signals of the reduced development set as input. Figure 2 visualizes the in silico mutation with an example. Every signal was mutated at each position into all 20 amino acids. All possible mutations of every signal were tested again for their occurrence in the protein sequences of the non-nuclear and the nuclear dataset.
The last step iterated over the mutated signals. Only mutated signals that matched in the nuclear proteins, but not in the non-nuclear proteins, were added to the result set and shortened by one position at the end of the signal. Shortened signals that still matched exclusively in the sequences of the nuclear protein set were shortened further. This was repeated until the resulting sequence matched either neither or both of the two protein sets. All resulting signals formed the set of potential NLSs.
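The three steps above can be sketched in a few lines of Python. This is a minimal illustration and not the thesis implementation. The function names, the `min_length` guard, the use of plain substring matching, the choice to skip the identity substitution, and the toy sequences are all assumptions made for the example. Keeping every shortened intermediate in the result set is one reading of the description.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def occurs_in(signal, proteins):
    """True if the signal is a substring of at least one protein sequence."""
    return any(signal in seq for seq in proteins)


def is_nuclear_specific(signal, nuclear, non_nuclear):
    """True if the signal matches nuclear proteins but no non-nuclear protein."""
    return occurs_in(signal, nuclear) and not occurs_in(signal, non_nuclear)


def point_mutants(signal):
    """Yield every single-position substitution of the signal.

    Assumption: the identity substitution is skipped, so the unchanged
    signal itself is never yielded.
    """
    for i, original in enumerate(signal):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield signal[:i] + aa + signal[i + 1:]


def potential_nls(experimental, nuclear, non_nuclear, min_length=1):
    """Hypothetical sketch of the three-step in silico mutagenesis."""
    # Step 1: keep only experimental signals that are nuclear-specific.
    reduced = [s for s in experimental
               if is_nuclear_specific(s, nuclear, non_nuclear)]

    potential = set()
    for signal in reduced:
        # Step 2: mutate every position into the other amino acids.
        for mutant in point_mutants(signal):
            # Step 3: keep nuclear-specific candidates and shorten them from
            # the end until they stop being nuclear-specific.
            candidate = mutant
            while (len(candidate) >= min_length
                   and is_nuclear_specific(candidate, nuclear, non_nuclear)):
                potential.add(candidate)
                candidate = candidate[:-1]
    return potential


# Toy data, invented for illustration only.
nuclear = ["MAPKKKRKVLE", "GSPKKKRKLAA"]
non_nuclear = ["MSPKKKAKVGG"]
result = potential_nls(["PKKKRKV"], nuclear, non_nuclear)
print(sorted(result))
```

In this toy run the only mutant found in the nuclear set is the V-to-L substitution at the last position. It is shortened twice before the next truncation, `PKKK`, also matches the non-nuclear sequence and the loop stops. The `min_length` guard is not part of the description above. It only ensures termination when the non-nuclear set is empty.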
Summary of Chapters
1. Introduction: This chapter covers fundamental cellular concepts, defining nuclear localization and export signals (NLS/NES) and presenting the motivation for updating the NLSdb database.
2. Materials and Methods: This section details the data collection from literature and databases, the criteria for reliable evidence, the in silico mutagenesis algorithm, and the bioinformatics tools used for sequence analysis and clustering.
3. Results and Discussion: This central chapter presents the comprehensive sequence analysis of transport signals, the generation of 4301 new potential NLSs, and benchmarking results showing the improved coverage of the updated NLSdb2.0.
4. Conclusion: The conclusion summarizes the successful update of the database and highlights the utility of the generated data for predicting nuclear localization and understanding protein transport mechanisms.
5. Outlook: This section discusses future directions, including the planned analysis of nuclear export signals (NESs) and potential improvements to the database user interface.
Keywords
Nuclear transport, Protein localization, NLSdb, Bioinformatics, Monopartite NLS, Bipartite NLS, PY-NLS, In silico mutagenesis, Sequence analysis, Consensus sequence, Protein sequences, Subcellular localization, Karyopherins, Swiss-Prot, Database update
Frequently Asked Questions
What is the primary focus of this research?
The research focuses on the bioinformatics analysis of nuclear transport signals and the update of the NLSdb database to improve the identification of proteins imported into the nucleus.
What are the central themes of this work?
The central themes include the categorization of nuclear localization signals (monopartite, bipartite, and PY-NLS), in silico signal discovery, and the statistical analysis of protein sequences containing these signals.
What is the main objective of the thesis?
The primary objective is to update the 2003 version of NLSdb to incorporate recent research, resulting in a more comprehensive database that increases the coverage for detecting nuclear proteins.
Which computational methods are employed in this study?
The study utilizes in silico mutagenesis, sequence clustering via UPGMA, pattern matching, PSI-Blast for homology inference, and redundancy reduction using Cd-hit and Uniqueprot.
What is covered in the main section of the work?
The main section covers the collection of experimental data, the algorithmic discovery of 4301 new potential NLSs, the analysis of signal properties, and the benchmarking of the new database version against the old one.
Which keywords best characterize this work?
Key terms include bioinformatics, NLSdb, nuclear localization signal, in silico mutagenesis, sequence clustering, and protein transport.
How does the in silico mutagenesis process define potential signals?
Potential signals are generated by mutating experimental signals at every position and iteratively shortening the mutated signals for as long as they still match exclusively within the sequences of a verified nuclear protein dataset.
What does the validation using randomly chosen proteins demonstrate?
The validation confirms that proteins predicted to have an NLS via the updated database often have known nuclear functions or a nuclear localization reported in the literature, providing support for the predictive accuracy of NLSdb2.0.
Why are viruses a significant topic in the context of protein transport signals?
The analysis revealed a high frequency of NLS-containing proteins of viral origin, which relates to the biological necessity of viruses to enter the host cell nucleus for replication.
- Silvana Wolf (Author), 2015, Analysis of Nuclear Transport Signals, Munich, GRIN Verlag, https://www.grin.com/document/365482