Excerpt

## Abstract

*DNA microarray aims at extracting useful information that can be applied in medical and bio- logical studies. Clustering is one of the most utilized data mining technique to analyze the gene expres- sion data. Generaly, there are subsets of genes that have similar behave or under subsets of con- ditions. Therefore, biclustering term is introduced to identify the subgroups of genes and subgroups of conditions by performing simultaneous clustering of both rows and columns of the gene expression ma- trix, rather than clustering these two dimensions separately. In this comprehensive survey, we tend to analyze a large number of existing methods to biclustering, and classify them in accordance with the methods used to perform the search.*

Keywords: Biclustering, microarray, genes, optimization, expression data.

## 1. INTRODUCTION

DNA microarray technologies have made it feasi- ble to monitor transcription levels of tens of thou- sands of genes in a single expriement. A typical DNA microarray experiment involves a multistep procedure: fabrication of microarrays by fixing properly designed oligonucleotides representing specific genes; hybridization of cDNA populations onto the microarray; scanning hybridization sig- malization of data; and analyzing data to identify differentially expressed genes as well as sets of genes that are co regulated^{1}. A gene expression data set from a microarray experiment can be represented by a real-valued expression matrix. Rows and columns represent genes and conditions of the expression matrix respectively. Conditions may belong to different timepoints. Each entry nals and image analysis; transformation and nor- represents the expression level of a gene under a condition. The expression levels for a gene across different experimental conditions are cumulatively called the gene expression profile, and the expres- sion levels for all genes under an experimental condition are cumulatively called the sample ex- pression profile. Figure 1 shows the gene expres- sion matrix.

Fig. 1 Gene expression matrix

illustration not visible in this excerpt

At the beginning of the process, clustering algo- rithms have often been used to reduce the com- plexity of humongous expression data^{2}. Cluster- ing techniques have proven to be helpful to under- stand gene function, gene regulation, cellular processes, and subtypes of cells. The traditional clustering algorithms are not suitable for all appli- cations, especially gene expression data analysis. These clustering algorithms group the genes over all the conditions, whereas cellular processes are active only under a subset of conditions. Also, a single gene may belong to more than one group as a gene may be involved in more than one biologi- cal process. Consequently, biclustering algorithms have been offered as another approach to standard clustering techniques to identify local patterns from gene expression data sets. Biclustering refers to the “simultaneous clustering” of both rows and columns of a data matrix. Let *G* be a set of genes, *C* a set of conditions, and *A(G, C)* the expression matrix, where * G={1,2, … ,m}* and * C={1,2, … ,n}*. The element *GExi,j* of *A(G, C)* represents the ex- pression level of gene *‘ i ’* under condition *‘ j ’*. The aim of biclustering is to extract the sub-matrix *A(G', C')* of *A(G, C)* meeting homogeneity criteria, which is identified by gene subset *G'* of *G* and condition subset *C'* of *C*.

## 2. BICLUSTERING METHODS

At present, there are plenty of biclustering tech- niques available for gene expression data analysis. We make out two main classes of biclustering algo- rithms: systematic search approaches and stochastic search approaches, also called metaheuristic algo- rithms. In general, the heuristic search algorithms are used to approximate the problem by finding sub-optimal solutions. Metaheuristics are stochastic optimization it finds a solution in a reasonable time. Metaheuristics have usually an iterative behavior.

## 2. Systematic Biclustering Algorithms

#### 2.1.1 Divide and Conquer Approach

Generally, a divide and conquer approach works by recursively breaking down a problem into two or more sub-problems of the same (or related) type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the origi- nal problem. With this approach, initially a biclus- ter representing the whole data matrix then it di- vides into two submatrices to obtain two biclus- ters. Then reiterate recursively this process until to get a certain number of biclusters verifying a spe- cific set of properties.

The biclustering algorithm called direct clustering is proposed by^{3} was one of the first works ever published on biclustering, although it was not ap- plied to gene expression data. The algorithm is based on the use of a divide and conquers ap- proach, in which the input matrix is iteratively partitioned into a set of sub-matrices, until k ma- trices are obtained, where k is an input parameter for the number of desired biclusters.

The Bimax algorithm is a method for finding sub- groups of 1 value in a binary matrix^{4}. All values above the threshold will be set to one, all those below to zero. The discretization scheme defines if only down or up-regulated genes (or both) will be considered. Although being very fast, however, its biggest disadvantage is that it may ignore good biclusters by partitioning them before identifying them.

Zhao et al presented a new geometric biclustering algorithm based on the Hough Transform^{5}. Based on the linear structures in column-pair spaces and divide them into different patterns us- ing the Additive and Multiplicative Pattern Plot (AMPP). This method reduces the computational complexity considerably and makes it possible to analyze large-scale microarray data.

Yang et al proposed a novel transform technique based on Singular Value Decomposition (SVD) to extract significant biclusters^{6}. The whole process includes three steps: property of SVD, the Bidirectional Mixed Clustering algorithm and Lift algorithm. Even though extracting negative corre- lation gene pairs, however biggest issue is that it may ignore good biclusters by partitioning them and quality of biclusters fully depends on the more number of initial parameter value.

#### 2.1.2 Greedy Iterative Search Approach

This approach builds a solution in a step-by-step way using a given quality measure. Decisions made at each step are based on information at hand without worrying about the effect these deci- sions may have in the future. Moreover, once a decision is taken, it becomes permanent and is never reconsidered. By applying this approach to the biclustering problem, at each iteration, subma- trices of the data matrix are constructed by add- ing/removing a row/column to/from the current submatrix that maximizes/minimizes a certain function. This process reiterate until no other row/column can be added/removed to/from any submatrix.

Ben-Dor et al defined a bicluster as an Order- Preserving Sub-Matrix (OPSM)^{7}. This way, a sub-matrix is said to be order preserving if there is a permutation of its columns under which the se- quence of values in every row is strictly increas- ing. This algorithm can also be used to discover more than one bicluster in the same dataset, even when they are overlapped. However, in this model concerns only the order of values and thus makes the model quite restrictive.

An Iterative Signature Algorithm (ISA) provides a definition of biclusters as transcription modules to be retrieved from the expression data proposed by ^{8}. It starts with a set of input seeds and the fixed points find corresponding to each seed through iterations. These distinct fixed points are collect in order to decompose the expression data into mod- ules. The structure of this decomposition depends on the choice of thresholds. The first normaliza- tion step in ISA may cause increased overlap de- gree. Because the range of expression values after normalization becomes narrower with increased overlapping, the differences between normal and significant expression values blur and are more difficult to separate.

To address random masking of the values in the data matrix issue and to further accelerate the bic- lustering process, the authors presented a new model of bicluster to incorporate null values. Yang et al proposed an algorithm named FLexible Over- lapped biclustering (FLOC) able to discover a set of k possibly overlapping^{9} biclusters simulta- neously based on probabilistic moves. Bicluster volume is taken into account within the possible actions, where bigger biclusters are preferred, and the variance is used to reject constant biclusters. The whole process ends when no action that im- proves the overall quality can be found.

Liu & Wang proposed the Maximum Similarity Bicluster (MSB) algorithm^{10}. It starts by con- structing a similarity matrix based on a reference gene. Then a process of iteratively remove the row or column in the bicluster with the worst similarity score is perform, until there is one element left in the bicluster. MSB performs well for overlapping biclusters and works well for additive biclusters. But, it works for the special case of approximately squares biclusters. In order to overcome this issue, an extension algorithm named Randomized MSB Extension (RMSBE) algorithm is also presented. DiMaggio et al presented an approach based on the Optimal RE-Ordering of the rows and columns of a data matrix so as to globally minimize dissi- milarity metric^{11}. Converse to OPSM, this ap- proach allows for monotonicity violations in the reordering, but penalize their contributions ac- cording to a selected objective function. However, the main drawback of this method is that objective function extracts trivial solutions.

Angiulli et al. presented a biclustering algorithm based on a greedy technique enriched with a local search strategy to escape poor local minima named Random Weight Biclustering (RWB) which produces one bicluster at a time^{12}. In order to avoid getting trapped into poor local mi- nima, the algorithm executes random moves ac- cording to a probability given by the user. Moreo- ver, degree of overlapping rate is controlled for genes and conditions independently by using two different frequency thresholds.

Li et al presented as a QUalitative BIClustering algorithm (QUBIC), in which the input data ma- trix is first represented as a matrix of integer val- ues, either in a qualitative or semi-qualitative manner^{13}. Edge weights are computed in the base of the similarity level between the two cor- responding rows. After the graph has been created, biclusters are identified one by one, starting for each bicluster with the heaviest unused edge as a seed. However it delivers approximative solutions without optimality guarantees.

Ayadi et al presented an algorithm BicFinder for extracting biclusters from microarray data which constructs a Directed Acyclic Graph (DAG) to combine a subset of genes under a subset of condi- tions iteratively^{14}, by adopting the evaluation function Average Correspondence Similarity In- dex. BicFinder, do not require fixing a minimum or a maximum number of genes or conditions, enabling a generation of diversified biclusters. However, it places restrictive constraints on the structure of the biclustering solutions.

**[...]**

^{1} D.J. Lockhart, and E.A. Winzeler, “Genomics, gene expression and DNA arrays,” Nature, Vol. 405, 2000, pp. 827-836.

^{2} D. Jiang, C. Tang and A. Zhang, “Cluster Analysis for Gene Expression Data: A Survey”, IEEE Transactions on Know- ledge & Data Engineering, Vol. 16, No. 11, 2004, pp. 1370- 1386.

^{3} J. A. Hartigan, “Direct clustering of a data matrix”, Journal of the American Statistical Association, Vol. 67, No. 337, 1972, pp. 123-129.

^{4} A. Prelic, S. Bleuler, P. Zimmermann, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler, “A sys- tematic comparison and evaluation of biclustering me- thods for gene expression data”, Bioinformatics, Vol. 22, No. 9, 2006, 1122-1129, 2006.

^{5} H. Zhaoa, A.W. Liewb, X. Xie and H. Yan, “A new geometric biclusteringe algorithm based on the Hough transform for analysis of large-scale microarray data”, Journal of Theoritical Biology, Vol. 251, 2008, pp. 264- 274.

^{6} W.H. Yang, D.Q. Dai, H. and Yan H, “Finding corre- lated biclusters from gene expression data”, IEEE Transactions on Knowledge and Data Engineering, Vol. 23, 2011, pp. 568-584

^{7} A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini, “Disco- vering local structure in gene expression data: The or- der-preserving submatrix problem,” Journal of Compu- tational Biology, Vol. 10, No. 4, 2003, pp. 373-384.

^{8} S. Bergmann, J. Ihmels, and N. Barkai. “Iterative signature algo- rithm for the analysis of large-scale gene expression data,” *Physics Review E*, Vol. 67, 2003, pp. 1-18.

^{9} J. Yang, H. Wang, W. Wang, and P. Yu, “Enhanced biclustering on expression data”, In Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngi- neering, Washington, DC, USA, 2003, pp. 321-327.

^{10} X. Liu, and L. Wang, “Computing the maximum simi- larity bi-clusters of gene expression data,” Bioinformat- ics, Vol. 23, No. 1, 2007, pp. 50-56.

^{11} P. DiMaggio, S. McAllister, C. Floudas, X. Feng, J. Rabinowitz, and H. Rabitz, “Biclustering via optimal re- ordering of data matrices in systems biology: rigorous methods and comparative studies,” Bioinformatis, Vol. 9, No. 1, 2008, pp. 458-467.

^{12} F. Angiulli, E. Cesario, and C. Pizzuti, “Random walk biclustering for microarray data”, Journal of Informa- tion Sciences, Vol. 178, No. 6, 2008, pp. 1479-1497.

^{13} G. Li, Q. Ma, H. Tang, A.H. Paterson, and Y. Xu, “Qubic: a qualitative biclustering algorithm for analyses of gene expression data”, Nucleic acids research, Vol. 37, No. 15, e. 101.

^{14} W. Ayadi, M. Elloumi, and J.K. Hao, “Bicfinder: a bic- lustering algorithm for microarray data analysis. Know- ledge and Information Systems “, Vol. 30, No. 2, 2009, pp 341-358

- Quote paper
- Rengeswaran Balamurugan (Author), 2017, Biclustering Algorithms for Microarray Data, Munich, GRIN Verlag, https://www.grin.com/document/376903

Publish now - it's free

Comments