Information for Instant Help
This research is supported by an internal grant from the Children's Mercy Hospitals.
Please cite the website and the IEEE paper as:
1. A genetic-based EM motif-finding algorithm for biological
sequence analysis. Proc. 2007 IEEE CIBCB, pp. 275-282.
2. Web server for GEMFA: http://gemfa.cmh.edu
The email address is NOT required to run the server.
However, a valid email address is needed if one wants to send the output link via email.
DNA sequences in FASTA format
Here is an example extracted from CRP binding sites (partly),
>CE1CG 17 61
>trn9cat 1 84
Protein sequences in FASTA format
Here is an example (one set from BAliBASE, partly shown),
Each FASTA-formatted sequence record (DNA or protein) consists of two parts. The first part starts with ">" followed by
the sequence name and other annotations related to the sequence, such as genomic location or
motif start position. The first part must fit on a single line.
The second part is the actual sequence (DNA, RNA or protein), which may span multiple lines.
Note that insertion of blank lines is allowed in FASTA sequences.
On this server, the longest single sequence allowed in a submission is 5000 residues,
and the maximum number of sequences allowed is 2000.
Therefore, the maximum total length of a sequence data set is 10,000,000 residues.
Although these limits should suit most submissions, they can be relaxed upon users' request.
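The record layout and size limits described above can be checked before submission. The following is a minimal Python sketch, not the server's own code: the function names parse_fasta and check_limits are hypothetical, and only the two limits (5000 residues per sequence, 2000 sequences) come from this page.

```python
MAX_SEQ_LEN = 5000    # longest single sequence allowed (from this page)
MAX_NUM_SEQS = 2000   # maximum number of sequences allowed (from this page)


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text.

    Hypothetical helper: the header is the single line after ">",
    and the sequence may span several lines. Blank lines are skipped.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank lines are allowed
            continue
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # one more line of sequence
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_limits(records):
    """Return a list of problems; the list is empty if all limits are met."""
    problems = []
    if len(records) > MAX_NUM_SEQS:
        problems.append(f"{len(records)} sequences exceeds {MAX_NUM_SEQS}")
    for name, seq in records:
        if len(seq) > MAX_SEQ_LEN:
            problems.append(f"{name}: {len(seq)} residues exceeds {MAX_SEQ_LEN}")
    return problems
```

For instance, calling parse_fasta on a made-up two-record text and passing the result to check_limits returns an empty list when both limits are respected.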
The web input is organized into two panels: (1) motif-finding
input and (2) genetic algorithms input.
Motif-finding Input: an email address is requested first,
but it is NOT required. A valid address is needed only if the output link is to be
sent via email.
GEMFA is able to align either DNA or protein sequences, so specify the type in
the molecular type field. Alternatively, the server provides a function
capable of automatically detecting the sequence type.
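This page does not say how the automatic detection works. One common heuristic, shown here only as an assumption, is to call a sequence DNA when nearly all of its letters are nucleotide codes. The function name guess_molecule_type and the 0.9 threshold are hypothetical choices for illustration.

```python
def guess_molecule_type(sequence, threshold=0.9):
    """Guess "dna" or "protein" from the letter composition.

    Assumed heuristic, not GEMFA's documented method: a sequence is
    treated as DNA when at least `threshold` of its letters are in
    the set A, C, G, T, U, N.
    """
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_count = sum(c in "ACGTUN" for c in letters)
    if nucleotide_count / len(letters) >= threshold:
        return "dna"
    return "protein"
```

A heuristic like this can misclassify short or unusual protein sequences, which is one reason to set the molecular type explicitly when it is known.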
Two EM-based motif-finding algorithms (DEM or WEM) are available, and DEM is the
GEMFA sorts the final alignments in descending order and
outputs the top-k motifs; k can be set from 1 to 200.
For DNA sequences, one can choose to align them on both strands, or on the forward or reverse
strand only. Protein sequences, however, can only be aligned in the forward
direction, which is enforced by the server.
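Searching the reverse strand of a DNA sequence amounts to searching its reverse complement. The sketch below illustrates that idea in Python. The names reverse_complement and strands_to_search, and the mode strings "both", "forward" and "reverse", are hypothetical and are not taken from the server.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.translate(_COMPLEMENT)[::-1]


def strands_to_search(dna, mode="both"):
    """Return a list of (strand_symbol, sequence) pairs to scan.

    Hypothetical helper: "+" is the sequence as given, "-" is its
    reverse complement.
    """
    if mode == "forward":
        return [("+", dna)]
    if mode == "reverse":
        return [("-", reverse_complement(dna))]
    if mode == "both":
        return [("+", dna), ("-", reverse_complement(dna))]
    raise ValueError(f"unknown strand mode: {mode!r}")
```

Protein sequences have no complementary strand, so only the forward case applies to them.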
The motif width must be specified before each run.
The sequence data can be either uploaded or directly copied and pasted into a text area.
The FASTA format is required for the input sequences.
The following limits are imposed: the motif width must be from 4 to 120 residues;
the number of top motifs must be from 1 to 200; and at most 2000 sequences can be input.
Genetic algorithms Input: the standard GA parameters are given
as default values: the population size
is 20 chromosomes, the number of generations is 10, the mutation rate (probability)
is 0.01, and both the crossover and tournament selection rates are
0.75. Nevertheless, one can use a different scheme. Note that
more generations mean a longer running time. A larger population may
facilitate a more thorough search of the alignment space, but it
also takes longer to finish a task. A tougher motif-finding problem
requires a larger population and more generations.
The following limits are imposed: the population size must be from 10 to 50 (an even number is enforced);
the number of generations must be from 5 to 20.
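The numeric limits stated on this page (motif width 4 to 120, top motifs 1 to 200, population size 10 to 50 and even, generations 5 to 20) can be checked before a run. This is a minimal Python sketch. The function name validate_parameters and its argument names are hypothetical, and the server may enforce the limits differently, for example by adjusting a value instead of rejecting it.

```python
def validate_parameters(motif_width, top_k, population_size, generations):
    """Return a list of error messages; empty if all values are in range.

    The ranges are the ones stated on this page; everything else
    (names, messages) is an assumption for illustration.
    """
    errors = []
    if not 4 <= motif_width <= 120:
        errors.append("motif width must be from 4 to 120")
    if not 1 <= top_k <= 200:
        errors.append("number of top motifs must be from 1 to 200")
    if not 10 <= population_size <= 50:
        errors.append("population size must be from 10 to 50")
    elif population_size % 2 != 0:
        errors.append("population size must be an even number")
    if not 5 <= generations <= 20:
        errors.append("number of generations must be from 5 to 20")
    return errors
```

With the default GA settings and the motif width from the example output, validate_parameters(22, 10, 20, 10) returns an empty list.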
Comparison notice: When comparing GEMFA with other motif finders, one has to pay attention to
the number of runs (multiple runs). For example, GEMFA with 20 chromosomes and 10 generations is
equivalent to 200 runs of MEME or other tools. In other words, running MEME 300 or 100 times is NOT a
fair comparison with GEMFA. To claim that a new algorithm outperforms an old one, this important rule should be followed:
make every effort to compare them fairly on the same ground.
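The run accounting used in this notice is simply population size multiplied by the number of generations. A short Python sketch of that arithmetic follows. The function name equivalent_runs is hypothetical, and the formula reflects only the accounting stated here, not a general result about genetic algorithms.

```python
def equivalent_runs(population_size, generations):
    """Number of single runs a GA setting counts as, per this page's rule."""
    return population_size * generations


# The default setting from this page: 20 chromosomes, 10 generations.
default_runs = equivalent_runs(20, 10)
print(default_runs)  # prints 200
```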
Results can be obtained online after the submitted job is done, or one can ask for the output link
to be sent by email, provided that a valid email address is entered.
Output: The output webpage from the GEMFA server consists of
a summary of input parameters and a hyperlink to a text file
containing the alignment results. Click the link to download the
results: the top-k motif alignments are displayed in ascending order
of likelihoods, and each alignment first shows its likelihood and then
the full local alignment where each row consists of sequence name,
motif start location, strand and motif sequence.
Here is an example of the output:
Results from GEMFA Web Server at CMH, Missouri, USA
Time generated: Mon Aug 10 16:34:45 CDT 2009
Number of sequences: 18
Total sequence length: 1890
Motif Alignment Input
Motif-finding algorithm: DEM
Aligned sequence: dna
Motif aligned on: both strands
Motif length: 22
Output top 10 motifs
Genetic Algorithms Input
Number of generations: 10
Population size: 20
Crossover prob.: 0.75
Tournament prob.: 0.75
Mutation prob.: 0.01
OUTPUT: seq_name, motif_start, strand, motif_seq
ce1cg 17 61 , 59, +, TTTTTTTGATCGTTTTCACAAA
ara 17 55 , 53, +, ATTATTTGCACGGCGTCACACT
bglr1 76 , 76, -, ACTGTGAGCATGGTCATATTTT
crp 63 , 63, -, TATGCAAAGGACGTCACATTAC
cya 50 , 48, +, AAGGTGTTAAATTGATCACGTT
deop2 7 60 , 5, +, ATTATTTGAACCAGATCGCATT
gale 42 , 42, -, AATTTATTCCATGTCACACTTT
ilv 39 , 39, -, AACGTGATCAACCCCTCAATTT
lac 9 80 , 9, -, AATGTGAGTTAGCTCACTCATT
male 14 , 12, +, ATTCTGTAACAGAGATCACACA
malk 29 61 , 59, +, ATTTCGTGATGTTGCTTGCAAA
malt 41 , 39, +, GAATTGTGACACAGTGCAAATT
ompa 48 , 46, +, TATGCCTGACGGAGTTCACACT
tnaa 71 , 71, -, ATTGTGATTCGATTCACATTTA
uxu1 17 , 17, -, GTTGTGATGTGGTTAACCCAAT
pbr322 53 , 51, +, GCGGTGTGAAATACCGCACAGA
trn9cat 1 84, 82, +, AAAATGAGACGTTGATCGGCAC
tdc 78 , 78, -, TTTGTGAGTGGTCGCACATATC
ce1cg 17 61 , 61, +, TTTTTGATCGTTTTCACAAAAA
ara 17 55 , 55, +, TATTTGCACGGCGTCACACTTT
bglr1 76 , 76, +, ACTGTGAGCATGGTCATATTTT
*Note that each output will be kept on our server for only one day.
Users should download their results as soon as possible.
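Since the results are only available for a short time, it can be useful to load the downloaded text file into a script. The sketch below parses one alignment row of the form seq_name, motif_start, strand, motif_seq, as shown in the example output. It assumes that rows are comma-separated and that only the sequence name may contain spaces. The function name parse_output_row is hypothetical.

```python
def parse_output_row(row):
    """Split one alignment row into its four fields.

    Assumption: the last three commas separate motif_start, strand and
    motif_seq, so the row is split from the right and the sequence name
    (which may contain spaces) keeps everything before them.
    """
    name, start, strand, motif = (f.strip() for f in row.rsplit(",", 3))
    return {
        "seq_name": name,
        "motif_start": int(start),
        "strand": strand,
        "motif_seq": motif,
    }


# A row copied from the example output on this page.
row = parse_output_row("ce1cg 17 61 , 59, +, TTTTTTTGATCGTTTTCACAAA")
print(row["seq_name"], row["motif_start"], row["strand"])
```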
The CRP binding sequences are in the text area as the default sequences, and
one can simply click on the "Submit" button located between the two
panels to start a trial run.
Last modified: August 12, 2009 by C. Bi