Genetic-enabled EM motif-Finding Algorithm

INTRODUCTION: Motif discovery in biological sequence analysis remains a challenge in computational biology. The Expectation Maximization (EM) algorithm is one of the most popular methods used in motif discovery. However, EM heavily depends on initialization and suffers from local optima. There are several techniques developed to ameliorate the situation. Here are some examples among others: (i) a quick remedy would be to run EM multiple times; (ii) preselect a set of smart seeds by word matching in the given sequences as leveraged in MEME; (iii) integrate EM with intelligent computing methods such as genetic algorithms (GA) [1]; (iv) provide a stochastic version as studied in Monte Carlo EM [3]. In addition, Markov chain Monte Carlo methods are often used to deal with the issue [4].

GEMFA: The genetic algorithm (GA) is one of the prevalent intelligent computing methods able to tackle the local optimal issue. Notably, GA is a global optimization procedure that performs adaptive (heuristic) search. It is well suited for finding optimal or near-optimal solutions in large scale optimization problems with multiple local maxima and/or minima. In fact, GA was previously tried in motif-finding, see a review in [1]. The GEMFA algorithm, originally presented in the IEEE CIBCB'07 conference [1], is an EM variant iteratively driven by GA. In each iteration, GA generates a population of new local alignments as start seeds, and then uses these seeds as input to the EM motif algorithms and produces refined local alignments. Obviously, GEMFA tightly integrates GA with EM, and cooperatively evolves a population of alignments towards a global optimal solution. GEMFA is a de novo motif-finder designed to perform multiple local alignment of DNA or protein sequences. In this web version, GEMFA adds the WEM motif-finder (very fast than regular EM, i.e. DEM) as an option. For details on WEM, refer to [2,4].

Author: Chengpeng Bi, Ph.D. (comments to cbi at cmh dot edu)
Computer languages: C++ (in EM) and Perl (in GA)
Defaults: DEM algorithm, 10 generations, a population of 20 chromosomes
Output: online or email text to users
Last modified: January 1, 2010
Copyright: (C) 2007-2010 Children's Mercy Hospitals, USA, All Rights Reserved

[1] A genetic-based EM motif-finding algorithm for biological sequence analysis. Proceedings of 2007 IEEE Symposium on Computational Intelligence in Bioinformatics & Computational Biology, pp. 275-282 (2007).
[2] Data augmentation algorithms for detecting conserved domains in protein sequences. Journal of Proteome Research, 7, 192-201 (2008).
[3] A Monte Carlo EM algorithm for motif discovery in biomolecular sequences. IEEE Transactions on Computational Biology and Bioinformatics, 6 (3), 370-386 (2009)
[4] Comparison of optimization techniques for sequence pattern discovery by maximum-likelihood. Pattern Recognition Letters (in press)
[5] Determinisitc local alignment methods improved by a simple genetic algorithm. Neurocomputing, 73, 2394-2406 (2010).

Last modified: Jan 1, 2010 by C. Bi