Data Mining for Pattern Recognition and Pattern Matching in Bioinformatics


Master's Thesis, 2006

79 Pages


Excerpt


CONTENTS

Declaration i Acknowledgement ii List of Figures v Abstract

1 Biological Prospective of Data mining
1.1 Introduction
1.2 Scope of Thesis
1.3 Biology Primer
1.4 Data Mining Operations for Knowledge Discovery Process

2 Data Mining and Neural Network
2.1 Introduction
2.2 Neural Networks Overview
2.3 Feed forward Neural Networks
2.4 Time Delay Neural Networks
2.5 Bi-Directional Neural Networks
2.6 Recurrent Neural Networks
2.7 Back-Propagation Through Time
2.8 Constructive Neural-Network Learning Algorithms for Pattern Classification
2.8.1 Introduction
2.8.2 Preliminaries

3. Sequence Alignment
3.1 Introduction
3.2 Sequence Description
3.3 Sequence Alignment
3.3.1 Fundamentals of Sequence Alignment
3.3.1.1 Pair wise Sequence Alignment
3.3.1.2 Local versus Global Alignment
3.3.1.3 Multiple Sequence Alignment
3.3.2 Alignment Algorithms

4. Sequence Similarity Identification
4.1 Introduction
4.2 Existing Methods
4.2.1 FASTA
4.2.2 BLAST
4.2.3 Dynamic Programming

5. FuzzyLogic in Pattern Recognition
5.1 Introduction
5.2 Unsupervised Clustering
5.3 Fuzzy c-Means Algorithm
5.3.1 Cluster Validity
5.3.1.1 Membership- based Validity Measures
5.3.1.2 Geometry-based Validity
5.4 Knowledge-based Pattern Recognition
5.5 Hybrid Pattern Recognition System
5.6 Fuzzy Hidden Markov Models
5.6.1 Hidden Markov Models
5.6.2 Fuzzy Measures and Fuzzy Integrals

Conclusion

References

Abbildung in dieser Leseprobe nicht enthalten

First, and foremost , I am very much grateful to my guide Prof. Dr. O.S. Srivastava for his guidance throughout my dissertation studies and all the time the effort he put in the development of my work and me. His help, encouragement and involvement in my study resulted in this work. I am very much indebt to Prof. Dr. O.S. Srivastava.

I also thank to the Head of the Department of Computer Science, Madurai Kamaraj University for creating an opportunity.

I also thank the Management, Principal and authorities of ISTAR for providing an opportunity to complete this course in time.

Last but not least, I am very mush thankful to my family members for the constant support and encouragement they have given during my study.

BINOD KUMAR

List of Figures

1.1 Evolution of a primordial protein into its present-day instances

1.2 Amino acids and formation of proteins via peptide bonds

1.3 The chemical make up of the amino acids Alanine and Leucine

1.4 Formation of DNA Polymeric chain

1.5 Three dimensional representation of the DNA double helix

1.6 Data mining operations in context of a Knowledge Discovery Process

1.7 Data Mining Methods

2.1 A Multi-layer Feed forward Neural Network with l layers of units

2.2 A Time Delay Neural Network (TDNN) Representation

2.3 Outline of the signal flow in the bi-directional computing architecture for time series prediction

2.4 A Bi-directional Neural Network Model

2.5 Elman RNN, Jordan RNN and a Fully RNN

2.6 MPyramid-real network

3.1 Local (top) versus Global (bottom) Alignment

3.2 Example of Multiple sequence alignment

4.1 Example of FASTA table for two sequences

4.2 Distribution of FASTA scores

4.3 Dynamic Programming Problem

4.4 Solution Matrix for MaxValue

5.1 2D data sets containing different clusters

5.2 An example profile HMM with four match states

ABSTRACT

Data mining , the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectation.

Automated pattern matching- the ability of a program to compare known patterns and determine the degree of similarity –forms the basis for automated sequence analysis, modeling of protein structures, locating homologous genes , data mining, Internet search engines etc. in bioinformatics. Data mining relies on algorithm pattern matching to locate patterns in online and local databases, using a variety of technologies, from simple keyword matching to rule based expert system and artificial neural networks.

In this dissertation, the basic problems related to pattern reorganization and pattern matching for nucleotide and protein sequence alignment are discussed. The main techniques used to solve these problems and a comprehensive survey of most influential algorithms that were proposed during the last decay is described.

In Chapter 1 general characteristics of a “living” entity is discussed like metabolism, growth and reproduction. The most elementary unit exhibiting these properties is the cell. Then internal structure of Protein and DNA structure are discussed. The DNA of all organisms suffers a number of mutations. Such mutations can occur at level of single nucleotides or at the level of entire chromosomes. They can occur during the DNA duplication process. In second part of this chapter Data Mining Operations for Knowledge Discovery Process are discussed. Data mining is an iterative process in which preceding processes are modified to support new hypothesis suggested by the data.

In Chapter 2 different types of Neural Network are discussed. A Constructive Neural- Network Learning Algorithms for Pattern Classification is discussed. Artificial neural networks have been successfully applied to problems in pattern classification, function approximation, optimization, and pattern matching and associative memories.

In Chapter 3 Pair wise Sequence Alignment and Multiple Sequence Alignment is discussed. In this chapter alignment score and Gap Penalty between sequences is calculated. The gap penalty formula can be extended to include a penalty for alignments for the gaps at the end of a sequence of equal length. Multiple sequence alignment is useful in finding patterns in nucleotide sequences and for identifying structural and functional domains in protein families.

In Chapter 4 an approach for identifying sequence similarity between a query sequence and a database of proteins is discussed. Two well established tools for comparing query sequences with a database: FASTA and BLAST are discussed. Example of Global Sequence Alignment using Dynamic Programming is discussed.

In Chapter 5 supervised pattern reorganization and knowledge-based pattern reorganization using fuzzy logic has been described. Finally a novel method for aligning multiple genomic or proteomic sequences using a fuzzy Hidden Markov Model (HMM) has been described. In this chapter, the fuzzy HMM model for Multiple Sequence Alignment (MSA) is mathematically defined. New fuzzy algorithms are described for building and training fuzzy HMMs, as well as for their use in aligning multiple sequences.

Chapter 1

Biological Prospective of Data mining

1.1 Introduction

Molecular Biology studies the composition and interactions of life's agents, namely the various molecules (e.g. DNA, proteins) sustaining the living process. Recent advances in sequencing technology have allowed the rapid accumulation of DNA and protein data. As a result a gap has been created on the one side there is a rapidly growing collection of data containing all the information upon which life is built; and on the other side we are currently unable to keep up with the study of this data, impaired by the limits of existing analysis tools.

In this work we examine how computational methods can help in extracting the information contained in collections of biological data. In particular, we investigate how sequence similarity among various macromolecules (e.g. proteins) can be exploited towards the extraction of biologically useful information.

1.2 Scope of the Thesis

My work aims to examine how mining massive data sets of genetic sequences can reveal hidden information of substantial biological importance. My work aims to look for (the object of the mining process) is sequence similarities, i.e. stretches of residues which are shared by several different sequences. Similarities of an unexpected nature are then used to draw relationships between the sequences containing them. Ever increasing store of sequence and protein data from several genome projects, data mining the sequences has become a major research focus in bioinformatics. The aim of this work is to explore data mining techniques as an automated means of reducing the complexity of data in large bioinformatics databases and of discovery meaningful, useful patterns and relationships in data.

My work aims to examine how mining massive data sets of genetic sequences can reveal hidden information of substantial biological importance. My work aims to look for (the object of the mining process) is sequence similarities, i.e. stretches of residues which are shared by several different sequences. Similarities of an unexpected nature are then used to draw relationships between the sequences containing them.

In bimolecular sequences (DNA, RNA or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity. Analysis methodologies exploit the sequence homology between compared biosequences in order to infer useful information about them. The principle here is that of “information transfer”. If, for example, a protein Q under examination is sufficiently similar to another protein A of known functionality then by using the fact quoted above, one can transfer the functionality of A to Q and guess that Q must also play a related (if not the same) biological role ( such as BLAST and FASTA).

The dominant theory in Biology today maintains that the species evolve by diversification. This process occurs through mutations accumulated on the DNA of individual members of a species and the inheritance of such mutations from one generation to another. Mutations in coding regions of the DNA transmit (through the transcription/translation mechanism) to the proteins that the regions code for. As a result, it is possible to describe the evolutionary history of a given protein strain as a tree (Figure 1.1). At the leaves of that tree we find all the proteins of the given strain surviving today across various organisms; these are the proteins forming a protein family like the hemoglobin proteins mentioned above. The root of the tree is occupied by that primordial protein (the mother protein) which gave rise to all the proteins of the tree. Intermediate nodes contain the various forms of the protein through evolution. The branches originate from a node indicate diversification events where mutations gave rise to new proteins (the children of the node) 1. The result of this process (Figure 1.1) is an ability to find remnants of the mother protein in the descendant sequences. Although mutations happen randomly over the entire length of a protein, the results of individual mutations can vary wildly.

Abbildung in dieser Leseprobe nicht enthalten

F igure 1.1: The evolution over time of a primordial protein (the “mother protein” occupying the root of the tree), through a series of mutations, into its present day instances (the leaves of the tree). Mutated amino acids are indicated by lower case letters. The amino acids which are vital for the operation of the protein are marked in bold face. Mutations that affect these amino acids are shown boxed, indicating that the resulting proteins do not survive.

1.3 Biology Primer

Nowadays, the main approach is to characterize life by its properties. In particular, a system is thought to be “living” if it has the following three general characteristics:

- metabolism
- growth
- reproduction

The most elementary unit exhibiting these properties is the cell. Cells come in two basic varieties. Procaryotic cells are the simplest form, composed of a single compartment which is protected from the outside with a membrane called the cell wall. Eucaryotic cells have a central, all surrounding compartment which contains many smaller compartments. The most important among these compartments is the nucleus, which contains the genetic material of the eucaryotic cells. Procaryotic cells are found only in bacteria while higher organisms are composed of one or more eucaryotic cells.

Most of the cell (about 90%) is composed of water. The remaining 10% contains two types of molecules:

Elementary molecules: these are small molecules created by a variety of chemical reactions in the cell. Most important among them are the nucleotides and the amino acids. Elementary molecules provide the building material for polymeric molecules

Polymeric molecules: these are the structural components of the cell. They are formed when elementary molecules bind together in to long chains. The polymeric molecules that we will be focusing on are (i) the nucleic acids (DNA, RNA) and (ii) the proteins. Nucleic acids are polymeric assemblies of nucleotides while the proteins are chains of amino acid. Nucleic acids and proteins are sometimes collectively referred to as macromolecules or biosequences or biological sequences.

Proteins

Proteins are the most important macromolecules. They are responsible for almost the entire collection of biochemical reactions taking place inside the cell. Proteins come in many flavors and with a variety of functionalities. Some of them include:

- Structural proteins: They are the building blocks of the various tissues.
- Enzymes: they catalyze vital chemical reactions that would otherwise take too long to complete.
- Transporters: they carry chemical elements form one part of the organism to another (e.g. hemoglobins carry oxygen).

Proteins are chains of amino acids. Their length can range from just a few tens of amino acids up to several thousand; a typical size is between three and four hundred amino acids. There are 20 different amino acids. All of them have the basic structure depicted in Figure 1.2. (a).

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.2: (a) All amino acids have the same underlying structure. There is a central carbon atom, called alpha-Carbon (symbolized here as Cα) and 4 chemical groups attached to Cα: a hydrogen atom, the nitrogen of an amino group, the carbon of a carboxyl group and an amino acid-specific side chain (symbolized here with the letter R).

(b) A protein formed by n amino acids bound together with peptide bonds. Peptide bonds are formed when the carboxyl group of one amino acids interacts with the amino group of another amino acid. A by-product of this interaction is the release of a water molecule (the carboxyl looses an oxygen and a hydrogen while the amino group contributes a hydrogen). Peptide bonds are shown in the figure with solid gray lines.

More specifically there is a central carbon atom (denoted as Cα and called the alpha-carbon) to which four chemical groups attached via covalent bonds: a hydrogen atom (H), an amino group (NH2), a carboxyl group (COOH) and a side chain (R). Figure 1.3 shows some amino acids with their corresponding side chains.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.3: The chemical make up of the amino acids Alanine and Leucine.

Proteins are formed by amino acids binding together to form polymeric chains. Binding occurs when the carboxyl group of an amino acid interacts with the amino group of the next amino acid. The result of this process is (i) the formation of a bond between the two amino acids and (ii) the generation of a water molecule from atoms released from the carboxyl group and the amino group (see Figure 1.2.(b)). The bond holding together the two amino acids is called a peptide bond. For this reason, a stretch of amino acids is often called a polypeptide.

DNA

The second macromolecule of interest is the deoxyribonucleic acid, better known as DNA. DNA molecules are very long polymeric chains of nucleotides: a single DNA molecule can be millions of bases long (nucleotides are alternatively called bases). There are four different nucleotides called Adenine (A), Guanine (G), Cytocine (C) and Thymine (T).

Long chains are formed when series of nucleotides bind together through sugar-phosphate bonds. These bonds hold together the basic parts of successive nucleotides, in the same way that peptide bonds hold together the amino acids of a protein (Figure 1.4.(a)). While proteins have only one chain, DNA comprises two interconnected parallel strands of nucleotides (Figure 1.4. (b)).

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.4: (a) Polymeric chains of nucleotides are formed when the basic units of nucleotides are bound together through sugar-phosphate bonds. The Ri indicates the side groups of the nucleotides.

(b) The actual DNA molecule is comprised of two parallel chains, connected through hydrogen bonds between side groups. Not all possible nucleotide pair combinations are possible: the rule is that A can only bind with T and C can only bind with G.

These strands are connected through hydrogen bonds between the side groups of facing nucleotides.An important point is that there are rules deciding which pairs of nucleotides can face each other in the DNA double strand: in particular, Adenine is always paired with Thymine while Guanine is always paired with Cytocine. Because of these rules, either of the two strands uniquely defines the other and can be used to describe the corresponding DNA molecule. As a result, a DNA molecule can be represented as a single string over the alphabet of the nucleotides.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.5: Three dimensional representation of the DNA double helix. The two parallel DNA strands from a right-handed helix, held together by the hydrogen bonds between complementary nucleotides.

Finally, Figure 1.5 gives a three dimensional picture of the arrangement of the DNA strands in space.

The entire DNA of an organism comprises that organism's genome. The human genome, for example, has an estimated 3 × 109 base pairs.

RNA (ribonucleic acid) is another polymer, also made up of nucleotides. Although RNA and DNA are chemically very similar, there are a few differences. First, RNA uses the nucleotide Uracil (U) where DNA would have used Thymine (T). Second, the backbone-forming basic unit of the RNA nucleotides is slightly different from the corresponding basic unit of the DNA nucleotides. Finally, RNA is single stranded. The RNA used for gene copying is called messenger RNA (or mRNA for short).

Mutations

The DNA of all organisms suffers a number of mutations. Such mutations can occur at level of single nucleotides or at the level of entire chromosomes. They can occur during the DNA duplication process. Translocations, for example, occur when DNA is interchanged between chromosomes. Another type of mutation occurs when a part of DNA turn over its orientation relative to the chromosome it belongs to. Such problems arise because of the mechanics of the DNA replication process. Because of the level at which they happen, macroscopic mutations are usually called chromosome rearrangements. The results of such wide ranging mutations can be shocking: entire genes can be lost or become non-functional. In such cases, the result is that the new cell that comes out of the cell division process dies shortly after it is created.

A second type of mutation involves changes of only one or just a few bases. Such mutations can be categorized as follows:

Replacements which occur when one DNA base is changed into another. Mutations of this type are also called point mutations.

Insertions can introduce one or more nucleotides between any pair of successive DNA bases.

Deletions remove consecutive bases from a DNA molecule.

The mutations described above can happen either during the replication of DNA because of mistakes made by DNA polymerases, or, as a result of normal cellular operations which sometimes create such mutations as a side effect.

1.4 Data Mining Operations for Knowledge Discovery Process

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.6: Data mining operations in context of a Knowledge Discovery Process

Data mining is an iterative process in which preceding processes are modified to support new hypothesis suggested by the data. In the given figure a data warehouse involve following process:

1. Selection and sampling of the appropriate data from the databases
2. Preprocessing and cleaning of the data to remove redundancies and errors
3. Transforming and reducing data to a format more suitable for the data mining
4. Data Mining Methods
5. Evaluation of mined data
6. Visualization of the evaluation results
7. Designing new data queries to test new hypotheses and returning to step 1.

Selection and Sampling: Initial data mining is restricted to computationally justifiable samples of holding in an entire data warehouse because of millions of records involved. The evaluation of the relationships that are exposed in these samples can be used to determine which relationship can be mined further using complex data warehouse.

Prepr ocessing and Cleaning: The major preparatory activities, listed in Table, are normally performed in the creation of data warehouse.

Table 1: Data Mining Preparatory Activities

Abbildung in dieser Leseprobe nicht enthalten

1. Data Characterization involves creating a high level description of the nature and the content of the data to be mined. It provides a form of documentation that can be referred to by those who may not be familiar with underlying biology represented by data.
2. Consistency Analysis is the process of determining the variability in the data, independent of domain. It is a statistical assessment of data, base on data values.
3. Domain Analysis involves validating the data values in the larger context of biology. It is statistically consistent with other data on the same parameter, to ensure that it makes sense in the context of biology.
4. Data Enrichment involves drawing from multiple data sources to minimize the limitation of a single data source.
5. Frequency and Distribution Analysis places weights on the values as a function of their frequency of occurrences. The effect is to maximize the contribution of common findings while minimizing the effect of rare occurrences on the conclusion made from data mining output.
6. The Normalization process involves transforming data values from one representation to another using predefined range of final values.
7. The final preprocessing and cleaning activity, missing value analysis, involves detecting, characterizing and dealing with missing data values.

Transformation and Reduction: In the transformation and reduction phase of the knowledge discovery process, data sets are reduced to the minimum size possible through sampling or summary statistics. For example, tables of data may be replaced by descriptive statistics, such as mean and standard deviation.

Data Mining Methods: The process of data mining is concerned with extracting patterns from the data, using classification, regression, link analysis or segmentation.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.7: Data Mining Methods. Classification- Mapping to class or group. Regression- Statistical Analysis. Link Analysis- Correlation of Data. Segmentation- Similarity function

1. Classification involves mapping data into predefined or newly discovered classes. In the figure, there are two groups or classes of data, (A) and (B). The classification rule may specify minimum closeness to the centre of a particular group, as defined by numerical ranges.
2. Data mining based on regression methods involves assigning data a continuous numerical variable based on statistical methods. In the figure, the extrapolation formula is a simple linear function of the form: Where x and y are coordinates on the plot, m is slope of the line, and c is constant.
3. Link analysis evaluates links between data in the database. In the figure, two pairs of data points are apparently linked, in that value of one data element in the pair can be predicted by the value of other data point in the pair.
4. Data mining based on segmentation identifies classes or groups, of data that behave similarly, according to some metric. Segmentation is link analysis applied to groups of data instead of individual data points. In Figure groups (A) and (B) behave similarly.

These methods of data mining are used in combination with each other, either in parallel or part of sequential operation. For example segmentation requires classes to be defined through classification process.

Evaluation: In this phase of knowledge discovery, the patterns identified by the data mining analysis are interpreted. Typical evaluation ranges from simple statistical analysis and complex numerical analysis of sequences and structures.

Visualization: Visualization of evaluation results is an optional stage in knowledge discovery process. Visualization can range from converting tabular listings of data summaries to pie charts and similar business graphics, to using real-time data to 3D virtual reality displays that can be manipulated by controllers.

Designing New Queries: Data mining is an iterative continual activity, in that there are always new hypotheses to test. Sometimes the new hypotheses are suggested by the data returned by mining process.

Chapter 2

Time Series Modeling with Artificial Neural Networks

2.1 Introduction

Human beings are capable of gathering vast amounts of sensory data from their environment which in turn enables them to formulate logical decisions [13]. This data can be represented as a time series which the brain organizes and performs complex operations on that allow us to predict and classify sequences in nature. Artificial neural networks (ANNs) are simple mathematical models devised in an attempt to emulate some of these human functions. Recurrent neural networks (RNNs) are ideally suited for modeling dynamical systems with hidden states. The outputs of such processes are typically recorded in the form of time series.

Time series modelling is concerned with the analysis of temporal relationships in a sequence of values [1]. The goals of time series modelling include analysis and simulation in order to understand the underlying dynamics of the series, recognition, pattern classification, dynamic process control and prediction of future values in the series.

Various machine learning methods have been applied to time series prediction and classification tasks. For example, electroencephalogram classification [3] and dynamic gesture recognition [3]. Time can be continuous or discrete. We give a brief introduction to neural networks, followed by various neural network architectures that have been applied to time series modelling tasks. We conclude with an explanation of the long-term dependency problem that is inherent in traditional RNNs.

2.2 Neural Networks Overview

There are two major classes of ANNs, namely feed forward neural networks (FNNs) and recurrent neural networks (RNNs). In feed forward networks, activation is “piped” through the network from input units to output units. They are also referred to as static networks [4]. FNNs contain no explicit feedback connections [5]. Conventional FNNs are able to approximate any finite function as long as there are enough hidden nodes to accomplish this [6]. RNNs however, are dynamical networks with cyclic path of synaptic connections which serves as the memory elements for handling time-dependent problems.

ANNs have the capability to learn from their environment, and to improve their performance through learning. Learning is achieved by the ANN through an iterative process of adjustments applied to its synaptic weight and bias level.

There are diverse varieties of learning algorithms for the design of ANNs. They differ from each other in the way in which the adjustment to a synaptic weight of a neuron is formulated. Learning algorithms can be described as a prescribed set of well-defined rules for the solution of a learning problem [6]. Basic learning algorithms include error-correction learning, memory-based learning, Hebbian learning, competitive learning, and Boltzmann learning.

Two fundamental learning paradigms exist. Learning paradigms for ANNs are described in terms of the way and manner by which the interconnected neurons relate to its environment. There exist supervised or associative learning and unsupervised or self-organizing learning paradigms. The former requires an input pattern along with matching output patterns which is given by an external teacher whereas the latter requires input patterns from which it develops its own representation of the input stimuli.

2.3 Feed Forward Neural Networks

A feedforward network has a layered structure (Figure 2.1). Each layer consists of processing units (or neurons). The layers are arranged linearly with weighted connections between each layer. All connections point in the same direction – a previous layer feeds into the current layer and that layer feeds into the next layer. Information flows from the first, or input, layer to the last, or output, layer. The input layer is distinguished by its lack of incoming connections. Similarly, the output layer is deprived of outgoing connections. The remaining layers possess both types of connectivity and are referred to as hidden layers. Hidden layers contain hidden units which are able to extract higher-order statistics [6]. This is particularly valuable when the size of the input layer is large. Common activation functions in the network may be those whose output is a nonlinear differential function, e.g. sigmoid function, of its input and hence, suitable for gradient descent learning. The error back propagation learning algorithm and the generalized delta rule - both algorithms are examples of error-correction learning algorithms - are common gradient descent approaches that are used to train these static networks. The back propagation algorithm is defined according to [7] as follows:

Training examples are presented to a neural network in the form (x, t) where x is a vector of network input data, and t is the vector of desired network output signals.

- We construct a feedforward neural network with nin inputs, nhidden units, and nout output units.
- We initialize the networks synaptic weights to a small random value
- We repeat the following steps until the termination condition is met:

Abbildung in dieser Leseprobe nicht enthalten

Figure2.1: A Multi-layer Feed forward Neural Network with l layers of units

The input layer Ni is distinguished by its lack of incoming connections. Similarly, the output layer No is deprived of outgoing connections. The remaining layers possess both types of connectivity and are referred to as hidden layers.

– For each (x, t) in the training set, propagate the input forward through the network
– Back propagate the error through the network:

- For each network output unit k, calculate its error term δk

Abbildung in dieser Leseprobe nicht enthalten

- For each hidden unit h, calculate its error term δh

Abbildung in dieser Leseprobe nicht enthalten

- We then update each network weight wji

Abbildung in dieser Leseprobe nicht enthalten

Where

Abbildung in dieser Leseprobe nicht enthalten

It is important to note that two basic methods exist for back propagation learning [6]. These are sequential or on-line or stochastic mode (presented above) and batch mode back propagation. The former learning approach updates weights after each training example is presented to the network, while the latter approach requires that the entire training set be presented to the network followed by the weight updates.

2.4 Time Delay Neural Networks

Time delay neural network (TDNN) is a popular neural network that uses time delays to perform temporal processing. TDNN is a multilayer feedforward network whose hidden neurons and output neurons are replicated across time (see Figure 2.2).TDNN was specifically developed for speech recognition. The purpose of the architecture is to have a network that contains context which is able to represent sequences.

The number of steps that a TDNN is able to process depends entirely on the time window that stores the input sequence. [4] Noted that the buffer size limits the length of longest sequence which can successfully be differentiated by these networks. When dealing with time series problems, we must have a good idea of what the size of the longest sequence in the dataset will be. The reason for this is simply that, in TDNNs, a fixed input window size has to be selected based on the longest sequence length. The number of input to hidden layer weight increases with increasing window size. The increased number of parameters increases the time complexity of the learning algorithm.

The input layer of a TDNN consists of a sliding window whose weight vector is shared amongst other inputs. The output of the activation units is computed by taking a weighted sum of the nodes residing in the input window over a time period and then applying a squashing function to it. TDNNs are trained with the conventional error back propagation learning algorithm.

[...]

Excerpt out of 79 pages

Details

Title
Data Mining for Pattern Recognition and Pattern Matching in Bioinformatics
Author
Year
2006
Pages
79
Catalog Number
V537265
ISBN (eBook)
9783346162083
Language
English
Keywords
bioinformatics, data, matching, mining, pattern, recognition
Quote paper
Dr Binod Kumar (Author), 2006, Data Mining for Pattern Recognition and Pattern Matching in Bioinformatics, Munich, GRIN Verlag, https://www.grin.com/document/537265

Comments

  • No comments yet.
Look inside the ebook
Title: Data Mining for Pattern Recognition and Pattern Matching in Bioinformatics



Upload papers

Your term paper / thesis:

- Publication as eBook and book
- High royalties for the sales
- Completely free - with ISBN
- It only takes five minutes
- Every paper finds readers

Publish now - it's free