Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Automated pattern matching, the ability of a program to compare known patterns and determine their degree of similarity, forms the basis in bioinformatics for automated sequence analysis, modeling of protein structures, locating homologous genes, data mining, and Internet search engines. Data mining relies on algorithmic pattern matching to locate patterns in online and local databases, using a variety of technologies, from simple keyword matching to rule-based expert systems and artificial neural networks.
In this dissertation, the basic problems related to pattern recognition and pattern matching for nucleotide and protein sequence alignment are discussed. The main techniques used to solve these problems are described, together with a comprehensive survey of the most influential algorithms proposed during the last decade.
Table of Contents
1 Biological Prospective of Data mining
1.1 Introduction
1.2 Scope of Thesis
1.3 Biology Primer
1.4 Data Mining Operations for Knowledge Discovery Process
2 Data Mining and Neural Network
2.1 Introduction
2.2 Neural Networks Overview
2.3 Feed forward Neural Networks
2.4 Time Delay Neural Networks
2.5 Bi-Directional Neural Networks
2.6 Recurrent Neural Networks
2.7 Back-Propagation Through Time
2.8 Constructive Neural-Network Learning Algorithms for Pattern Classification
2.8.1 Introduction
2.8.2 Preliminaries
3 Sequence Alignment
3.1 Introduction
3.2 Sequence Description
3.3 Sequence Alignment
3.3.1 Fundamentals of Sequence Alignment
3.3.1.1 Pair wise Sequence Alignment
3.3.1.2 Local versus Global Alignment
3.3.1.3 Multiple Sequence Alignment
3.3.2 Alignment Algorithms
4 Sequence Similarity Identification
4.1 Introduction
4.2 Existing Methods
4.2.1 FASTA
4.2.2 BLAST
4.2.3 Dynamic Programming
5 Fuzzy Logic in Pattern Recognition
5.1 Introduction
5.2 Unsupervised Clustering
5.3 Fuzzy c-Means Algorithm
5.3.1 Cluster Validity
5.3.1.1 Membership-based Validity Measures
5.3.1.2 Geometry-based Validity
5.4 Knowledge-based Pattern Recognition
5.5 Hybrid Pattern Recognition System
5.6 Fuzzy Hidden Markov Models
5.6.1 Hidden Markov Models
5.6.2 Fuzzy Measures and Fuzzy Integrals
6 Conclusion
Research Objectives and Themes
The primary objective of this dissertation is to examine how computational methods, specifically data mining and pattern recognition techniques, can be utilized to extract biologically useful information from massive datasets of genetic and protein sequences. The research addresses the challenge of identifying sequence similarities and functional homologies that are otherwise difficult to discern due to the vast complexity and rapid accumulation of biological data.
- Biological data mining and sequence analysis
- Advanced neural network architectures (TDNN, Bi-directional, Recurrent)
- Sequence alignment algorithms and sequence similarity identification
- Fuzzy logic applications in pattern recognition and clustering
- Development of novel fuzzy Hidden Markov Models for multiple sequence alignment
Excerpt from the Book
3.3.1.1 Pair wise Sequence Alignment
Pair wise sequence alignment involves the matching of two sequences, one pair of elements at a time. The challenge in pair wise sequence alignment is to find the optimum alignment of two sequences with some degree of similarity. This optimum condition is based on a score that reflects the number of paired characters in the two sequences and the number and length of gaps required to adjust the sequences so that the maximum number of characters are in alignment. For example, consider the ideal case of identical nucleotide sequences (A) and (B):
A) ATTCGGCATTCAGTGCTAGA
B) ATTCGGCATTCAGTGCTAGA
Assuming that the alignment scoring algorithm counts one point per pair of aligned characters, the score is one point for each of the 20 pairs, or 20 points. Now, consider the case when several of the character pairs aren't aligned:
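The one-point-per-aligned-pair rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's code: the function name `identity_score` and the mismatched sequence `seq_c` are hypothetical, and gap penalties, which the excerpt names as part of the score, are ignored here.

```python
def identity_score(seq_a: str, seq_b: str, gap: str = "-") -> int:
    """Count one point per aligned pair of identical, non-gap characters.

    Both inputs are assumed to be already aligned, i.e. of equal length,
    with `gap` marking positions where one sequence has no character.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != gap)


# Sequences (A) and (B) from the excerpt: identical, 20 characters each.
seq_a = "ATTCGGCATTCAGTGCTAGA"
seq_b = "ATTCGGCATTCAGTGCTAGA"
print(identity_score(seq_a, seq_b))  # 20

# Hypothetical variant (not from the book): two positions differ from (A).
seq_c = "ATTCGGCATTGAGTGCTACA"
print(identity_score(seq_a, seq_c))  # 18
```

A scoring scheme that also charges for mismatches and gaps would subtract penalties from this count rather than only adding matches.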
Summary of Chapters
1 Biological Prospective of Data mining: Discusses the characteristics of living entities, DNA/protein structures, and the iterative process of data mining for knowledge discovery.
2 Data Mining and Neural Network: Examines various neural network architectures, including feedforward, time-delay, and recurrent networks, and their application to pattern classification.
3 Sequence Alignment: Explores the fundamentals of pair wise and multiple sequence alignment, detailing alignment scores, gap penalties, and various algorithms for identifying patterns.
4 Sequence Similarity Identification: Describes methodologies for similarity searching, specifically focusing on FASTA, BLAST, and dynamic programming approaches for sequence alignment.
5 Fuzzy Logic in Pattern Recognition: Details unsupervised clustering, the Fuzzy c-Means algorithm, and introduces novel fuzzy Hidden Markov Models for multiple sequence alignment.
6 Conclusion: Summarizes the dissertation's contributions, highlighting the utility of data mining and advanced computational models in solving pattern matching problems in bioinformatics.
Keywords
Data mining, Bioinformatics, Neural networks, Sequence alignment, Pattern recognition, Fuzzy logic, Fuzzy c-Means, Hidden Markov Models, Sequence similarity, Biological sequences, Computational biology, Pattern matching, Clustering, Genome projects, Evolutionary biology
Frequently Asked Questions
What is the core focus of this dissertation?
The dissertation focuses on applying computational data mining and pattern recognition techniques to extract significant biological information, particularly sequence similarities, from large genetic and protein databases.
What are the central thematic areas covered?
The work covers bioinformatics, neural network architectures, sequence alignment algorithms, and the integration of fuzzy logic into pattern recognition systems.
What is the primary research goal?
The main goal is to explore and develop automated means, such as data mining and fuzzy Hidden Markov Models, to reduce the complexity of biological databases and discover meaningful patterns and relationships.
Which scientific methods are primarily employed?
The study utilizes advanced machine learning methods, including Artificial Neural Networks (ANNs), Fuzzy c-Means (FCM) clustering, and innovative fuzzy Hidden Markov Models (HMMs).
What topics are discussed in the main body?
The main body covers the biological foundation of sequences, neural network designs for time series and pattern classification, algorithms for sequence alignment, and fuzzy logic-based pattern recognition techniques.
How would you characterize this work with keywords?
Key characterizations include data mining, bioinformatics, neural networks, sequence alignment, and fuzzy logic applications.
What is the significance of the fuzzy Hidden Markov Model described in Chapter 5?
The fuzzy HMM generalizes classical stochastic HMMs by relaxing independence assumptions, providing a novel approach to multiple sequence alignment with potentially improved performance and computation time.
How does dynamic programming contribute to sequence alignment?
Dynamic programming provides a structured recursive approach to calculate optimal alignments by recording intermediate results, thereby solving computationally intensive alignment problems efficiently.
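As a rough sketch of how recording intermediate results yields an optimal alignment score, the following Python function fills a score table in the style of the Needleman-Wunsch global alignment algorithm. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the formulation in the dissertation itself may differ.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Return the optimal global alignment score of sequences a and b.

    table[i][j] stores the best score for aligning the first i characters
    of a with the first j characters of b, so each cell is computed once
    from three previously recorded neighbours.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per character.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap inserted in b
                table[i][j - 1] + gap,       # gap inserted in a
            )
    return table[-1][-1]


print(global_alignment_score("ACGT", "AGT"))  # 2: three matches, one gap
```

The table has (len(a) + 1) × (len(b) + 1) cells, so the work grows with the product of the sequence lengths instead of with the number of possible alignments.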
- Cite this work
- Dr Binod Kumar (Author), 2006, Data Mining for Pattern Recognition and Pattern Matching in Bioinformatics, Munich, GRIN Verlag, https://www.grin.com/document/537265