S H s earch: a method for fas t remote
M. B addar
A b s trac t--R emote homology detection is the problem of detecting homology in cas es of low s equence s imilarity. It is a hard
computational problem with no approach that works well in all cas es . Methods bas ed on profile hidden Markov models (HMM) often
exhibit relatively higher s ens itivity for detecting remote homologies than commonly us ed approaches . However, calculating s imilarity
s cores in profile HMM methods is computationally intens ive as they us e dynamic programming algorithms . In this paper we
introduce S Hs earch: a new method for remote protein homology detection. O ur method is implemented as a modification of
HHs earch: a remote protein homology detection method bas ed on comparing two profile HMMs . T he motivation for modification was
to reduce the run time of HHs earch s ignificantly with minimal s ens itivity loss . S Hs earch focus es on comparing the important
s ubmodels of the query and databas e HMMs ins tead of comparing the complete models . Hence, S Hs earch achieves a s ignificant
s peedup over HHs earch with minimal los s in s ens itivity. O n S C O P 1.63, S Hs earch achieved 88X s peedup with 8.2% loss in
s ens itivity with res pect to HHs earch at error rate of 10% , which deemed to be an acceptable tradeoff.
Index Terms --biological s equence clas s ification, hidden Markov models
NT R O D UC T IO N
Analysis of large scale sequence data has become an important task in computational biology and comparative
genomics, inspired in part by numerous scientific and technological applications such as the biomedical
literature analysis or the analysis of biological sequences. Protein homology detection has attracted particular
interest due to its critical role in many biological applications. One of these applications is protein function
prediction  which is an important task in drug design  .The process of drug design can be divided into
two phases. First phase is searching for a target protein whose molecular function is to be moderated, in many
cases blocked, by a drug molecule binding to it. Second phase is selecting a suitable drug that binds to the
protein tightly, is easy to synthesize, is bio-accessible and has no adverse effects such as toxicity. The
knowledge of protein function can be of significant help in both phases. However, protein annotation with
functional information lags behind in the rapidly increasing amount of sequence data resulting from the
numerous ongoing genome sequencing projects. Consecutively, the manual analysis and annotation of protein
function via laboratory experimental procedures, which is a low throughput process, became no longer
effective. The availability of a large amount of protein sequences data and the vitality of the problem motivated
the development of high throughput computational methods for protein function prediction.
A family of state-of-the-art protein function prediction approaches relies on gathering information about
protein functions from different sources . These sources of information include protein homology  ,
gene expression analysis , protein interaction networks analysis , phylogenic trees and profiles analysis
, and literature text mining . One of the most effective means of inferring the function of a newly
sequenced protein is to detect what functions are performed by homologous proteins . Two proteins are
homologous if they share a common ancestor. Since the actual sequence of the common ancestor is unavailable,
sequence homology can only be inferred by statistical means.
Dynamic programming algorithms, such as the Needleman-Wunsch  and Smith-Waterman algorithms
, or related heuristic algorithms, such as BLAST  and FASTA , can be used to assign to each
sequence in the database a score indicating the likelihood that this sequence is homologous to the query based
on their similarity score. However, these approaches have low sensitivity in remote homology detection .
Consequently, other approaches have been developed targeting the detection of remote homologies. For
example, motif based approaches  are developed based on comparing query and database sequences'
motifs. Other approaches works by building a profile for a query sequence which is constructed from a
multiple alignment of sequences closely related to the query . Then this profile is scored against all
database sequences. A significant advance is achieved by comparing query profiles to database profiles 
  . Database profiles are built from the multiple alignments of clusters of closely related sequences in
a database. This type of profile-profile comparison approaches gives relatively higher sensitivity in detecting
homologies than other methods in the literature .
The rest of the paper is organized as follows. In section 2 we review background information and related
work. In section 3 we propose our method. In section 4 we illustrate the experimental setup and performance
measures. In section 5 we show and discuss results. In Section 6 we present our conclusion.
B A C K G R O UND S
R E L AT E D
WO R K
In this section, the background of homology detection methods is reviewed. We illustrate the evolution of
different methods for solving different issues related to homology detection.
A S IC HO MO L O G Y DE T E C T IO N ME T HO DS
Over the past decades, various methods have been proposed for the task of homology detection. The key idea
behind most of these methods is to calculate a score to measure the similarity between a given query and all
databases sequences. The query and database sequences are classified as homologous if the score exceeds some
The most commonly used score for comparing two sequences is the Smith-Waterman score . It is
calculated as the score of the local optimal pairwise alignment of two sequences. For retrieving sequences
similar to a newly sequenced protein (a query), Smith-Waterman algorithm , BLAST  and FASTA 
are the most commonly used methods. These methods use the Smith-Waterman score to measure the similarity
between sequences in different ways. BLAST and FASTA are heuristic algorithms which use certain
assumptions and approximations. Both programs first identify very short exact matches between the query and
database sequences. Next, the best short hits from the first step are extended to look for longer stretches of
similarity. Finally, the best hits are optimized with some form of dynamic programming. On the other hand,
the Smith-Waterman approach is a completely dynamic programming tool which effectively makes all possible
pairwise comparisons to all of the database sequences. Hence, it is a much more sensitive technique as
compared to BLAST and FASTA, but it is much more computationally expensive and slower than any of them.
However, all mentioned methods' sensitivity degrades significantly in remote homology detection as they
depend only on pairwise similarity score between the query and database sequence .
2.2 R emote homolog y detec tion methods
For the remote protein homology detection task, numerous methods have been developed to achieve relatively
higher sensitivity than the methods mention in Sec. 2.1. For example, the method proposed in  is based on
the presence of sequence motifs. The motif content of a pair of sequences is used to define a similarity that is
used as a kernel for a classifier. Motifs represent limited, highly conserved regions of proteins . By focusing
on comparing motifs, important clues to a protein's function can often be revealed even if it is not globally
similar to any known protein in databases. Hence, this approach works well in remote homology detection as it
alleviates depending on low global pairwise similarity score and focus on conserved segments which usually
have relatively higher similarity in distantly related sequence.
An improvement in remote homology detection is achieved by developing methods based on profile-
sequence comparison . A profile is usually built from query sequence(s) either from the multiple alignment
of the family of closely related sequences to the query or directly from an input multiple alignment. The profile
allows one to distinguish between conserved positions that are important for defining members of this family
and non-conserved positions that are variable among the members of the family. Furthermore, it describes
exactly what variation in amino acids is possible at each position by recording the probability for the
occurrence of each amino acid along the multiple alignment. Therefore, a profile contains more information
about the sequence's family of closely related sequences than a single sequence. Hence, profile-sequence
methods provide higher sensitivity for detecting remote homologies than BLAST, FASTA and similar methods
based on comparing query and database sequences.
A typical example of profile-sequence methods is PSI-BLAST . Given a query sequence, PSI-BLAST first
performs a quick BLAST search to retrieve database sequences closely related to the query. Then a multiple
alignment is created from this initial search's hits. After that, a Position Specific-Scoring Matrix (PSSM) is
generated from this alignment. This PSSM is used as a profile which is scored against all database sequences
and a new PSSM is created by combining search results to original PSSM. The last step is repeated until no
significant new search results are retrieved. A PSI-BLAST variant; Cascade PSI-BLAST detects remote
homologies by performing cascade propagation of PSI-BLAST search results through all hits identified for a
query. This approach allows effective detection of remote homologies for that query sequence.
2.3 P rofile hidden Mark ov model proc edures for remote homolog y detec tion
A significant breakthrough in profile-sequence methods is achieved by using a hidden Markov model (HMM)
as a profile for modeling a family of related sequences instead of a PSSM or other types of standard profiles
 . SAM  and HMMer   are the most popular software packages that use HMMs as profiles.
Krogh introduced an HMM architecture well suited for representing a profile for multiple sequence alignments
. For each consensus column in the multiple alignment, a `match' state models the distribution of residues
allowed in this column. An `insert' state and `delete' state at each column allow for insertion of one or more
residues between this column and the next, or for deleting the consensus residue. An emission probability
distribution is associated with each match and insert state to model the residues emitted from that state. Also, a
transition probability is assigned to transit form a state to another based on the model architecture.
A profile HMM can be considered as an abstraction model which by following its states' emission and
transition probabilities, a set of sequences is generated that are closely related to the sequences used to build
the model. Figure 1a shows a sample multiple sequence alignment, and a profile HMM built form this
alignment is shown in figure 1b.
(a) P rotein multiple s equence alignment
(b) P rofile HMM architecture
F ig. 1. T he P rofile HMM in (b) is built from the multiple s equence alignment s hown in (a) with three cons ens us
columns . T hes e three cons ens us columns are modeled by three match s tates (rectangles : M1-M3), each has
20-res idue emis s ion probabilities vectors . Ins ertions in the multiple alignment are modeled by ins ert s tates
(diamonds : I0-I3) which als o have 20-res idue emis s ion probabilities vectors each. D elete s tates (circles : D 1-D 3)
are s ilent s tates with no emis s ion probabilities. T rans itions are repres ented by arrows with a trans ition
probability as s igned for each arrow, for example
. T he model length is meas ured by the number of match
s tates ; hence this model is of length 3.
A profile HMM has several advantages over a PSSM and other types of standard profiles . It has a
formal probabilistic basis and a consistent theory behind match, insertion and deletion scores, in contrast to
other types of profiles which use heuristic methods. Also, a profile HMM applies a statistical method (Henikoff
and Henikoff ) to estimate the true frequency of a residue at a given position in the alignment from its
observed frequency. These estimated frequencies are used to build match and insert states' emission
probabilities vectors. On the other hand a PSSM uses the observed frequency itself to assign an emission
probability for that residue .This means that a profile HMM derived from a relatively small number of
sequences can be of equivalent quality to a PSSM created from a larger number of aligned sequences .
Hence a profile HMM provides better modeling capability and homology detection sensitivity.
However, the methods relying on profile HMMs are computationally intensive as dynamic programming
algorithms are used for building a HMM and scoring its similarity against database sequences. These
algorithms called Forward-Backward (for building model) and Viterbi (for scoring a model against a sequence)
have a worst case time complexity of
, where is the number of sequences , is the number of
HMM states and
is sequence length . Another approach is used to build a profile HMM from a
multiple alignment called Maximum A-Posteriori algorithm (MAP)  is of time complexity
is the number of columns in the multiple alignment.
Many variants to HMMer have been proposed to either increase computational speed or sensitivity. One
variant is proposed in  which extracts submodels from a query profile HMM, called sub-HMMs. These sub-
HMMs represent parts of the HMM that model the most conserved parts (motifs) of the sequences used to
build the original HMM. Sub-HMMs are shorter, information rich models with the same architecture of a
profile HMM. Based on that definition, extracting sub-HMMs works as follows. First the Kull-back-Leibler
divergence (KL-divergence or relative entropy)  is calculated for each mach state, then a series of
normalization and smoothing steps is performed and the most information rich HMM regions are excised from
the original profile HMMs. KL-divergence is calculated as in (1)
where is the KL-divergence value of match state
is the estimated emission probability of residue
in the emission distribution vector of
. is the background distribution of residue . KL-divergence
measures the amount of information carried by this state's emission distribution relative to the background
distribution. The result of these extraction steps is an ordered set of extracted sub-HMMs. These extracted
submodels are then scored against each database sequence. This set of submodels is ordered from left to right
based on their position in the original model. Figure 2 shows sample sub-HMMs and sample emission vectors.
F ig. 2. (a) T wo s ample s ub-HMMs (
) extracted from a profile HMM with match s tates belonging to
that s ub model highlighted in red. (b) S ample protein res idues emis s ion dis tributions for two mach s tates
dis tribution on the left) and
(the dis tribution on the right).
has high K L -divergence value hence it belongs to
a s ub-HMM. O n the other hand
has low K L -divergence value so it doesn't belong to a sub-HMM.
The target of this sub-HMM based homology detection approach is to extend the use of profile-HMMs to be
used in highly localized sequence similarity searches that focus on shorter conserved parts of sequence rather
than entire domains or global similarities. Another variant to HMMer is HMMERHEAD  which filters a
query profile to the most important parts for homology detection. HMMERHEAD initially generates and
identify significant "words". This step consists of identifying ungapped four residue words from a profile
HMM's match state emission vectors that possess a probability above some threshold. A word score is
calculated as the sum of the log-odds emission probabilities of the word's residues, as determined from the
profile-HMM. These words are then identified in the database sequences using a deterministic finite
automaton. After that, each word identified in a database sequence is the seed for an ungapped alignment
between the sequence and the profile-HMM. By focusing on important short words, HMMERHEAD achieves a
20X speed over HMMer with 4% loss in sensitivity.
A major advance in profile based remote homology detection methods was achieved by comparing profiles
to profiles. Profile-profile comparison approaches can be considered as an extension to profile-sequence
approaches which leverages profiles' advantages in similarity search for both database and query sequences.
Several programs for homology detection have recently been developed based on this idea: PROF_SIM ,
COMPASS , LAMA  and HHsearch . These programs were shown to be significantly more sensitive
than PSI-BLAST and have been applied for identifying evolutionary links between protein families previously
thought to be unrelated   . HHsearch is based on comparing two profile HMMs by calculating the
co-emission probability (
)  for these two models. Co-emission probability is the probability that two
HMMs independently generate the same sequence, that is for models
over and alphabet we compute
according to (2)
is the probability that
generates sequence . A detailed dynamic programming
algorithm for computing
is introduced in .
expresses how two HMMs are similar which in turn
reflects the similarity of the groups of sequences represented by these two HMMs . By combining the
advantages of profile-profile methods and profile-HMMs, HHsearch achieves relatively higher sensitivity in
detecting remote homologies than other homology detection methods. In the following section we present how
we modify HHsearch for achieving higher computational speed with minimal loss in sensitivity.
3 P ropos ed Method
The experimental results in  show that HHsearch's sensitivity for detecting remote homologies is
relatively higher than other homology detection methods in the literature. However, HHsearch has a high
computational complexity and takes significant run time since it uses a dynamic programming algorithm
similar to the Smith-Waterman algorithm for scoring the similarity of two HMMs. The high computational
complexity of HHsearch was the main motivation for developing our method.
The main idea behind the modification is focusing on scoring short "key" submodels of database profile
HMMs against submodels extracted from the query profile HMM, instead of scoring two complete models.
These key submodels are the most important submodels for homology detection, as they carry most of the
information needed for classifying database profile HMMs as homologous or non-homologous to a given
query. The work we introduce in this paper is similar to HMMERHEAD (see Sec. 2.2) in an effort to reduce
search time significantly with minimal loss in sensitivity.
Since the time complexity for scoring two HMMs depends on the lengths of these models (will be discussed
in Sec. 3.3), our modification significantly reduces the run time of HHsearch with a minimal loss in sensitivity.
Also by focusing on selecting the most important submodels to use in comparison for homology detection, the
sensitivity loss becomes minimal as shown in results in Sec. 4. In section 3.1 we start with rules and definitions.
We describe the details of SHsearch method in section 3.2.
3.1 R ules and definitions
In this section we present rules and definitions used in SHsearch method. In rule 1 we present conditions for
aligning two sets of submodels extracted from two HMMs. In definition 1 we introduce : a new score for
measuring the similarity of two profile HMMs. In definition 2 we introduce a novel relevancy score that is used
for extracting key submodels. We present a formal definition for hierarchal clustering in databases of
sequences and homology and non-homology clustering levels in definition 3
Rule 1 (submodels alignment):
be two different profile hidden Markov models, and
be the lengths (number of match states)of these models. Let
) be the co-emission probability
of the two HMMs as described in . A similarity score of the two HMMs based on
introduced in 
and is calculated as shown as in (3)
is normalized i.e.
. Recall from Sec. 2 that sub-HMM is a shorter model with the same
architecture and parameters set of a profile HMM. Hence, can be used to score pairs of sub-HMMs and to
construct an alignment for the two sets of extracted submodels. Let
be a sub model extracted from
with index , then the alignment rules are as follows:
1. Every submodel in the first set is aligned with a submodel in the other set iff both have the highest
score amongst all possible pairings. This rule can be stated formally as follows:
is aligned with
, where , is the number of extracted submodels from
2. No cross alignment is allowed, that is:
For any submodels:
is aligned with
can only be aligned with
Figure 3a shows an alignment of two sets of submodels that satisfies the two conditions, while the
alignment in figure 3b has a cross alignment and hence violates condition 2.
Excerpt out of 23 pages
- Quote paper
- Mohamed Baddar (Author)Noha Yousri (Author), 2014, SHsearch. A Method for Fast Remote Homology Detection, Munich, GRIN Verlag, https://www.grin.com/document/375278