Information for Instant Help

This research is supported by an internal grant from the Children's Mercy Hospitals. Please cite the website and the IEEE paper as:

1. A genetic-based EM motif-finding algorithm for biological 
   sequence analysis. Proc. 2007 IEEE CIBCB, pp. 275-282.
2. Web server for GEMFA: http://gemfa.cmh.edu

Email address
The email address is NOT required to run the server. However, a valid email address is needed if one wants to send the output link via email.

DNA sequences in FASTA format
Here is an example extracted from the CRP binding-site data set (partly shown),

>CE1CG 17 61 
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGT
TTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG
>trn9cat 1 84
CTGTGACGGAAGATCACTTCGCAGAATAAATAAATCCT
GGTGTCCCTGTTGATACCGGGAAGCCCTGGGCCAACTT
TTGGCGAAAATGAGACGTTGATCGGCACG
>ECOARABOP 
GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTT
GCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG

Protein sequences in FASTA format
Here is an example (one set from BAliBASE, partly shown),

>grhr_claga 1
msgnttlllsnptnvldnssvlnvsvsppvlkwetptfttaaRFRVAATLV
LFVFRAASNLSVLLSVTrgrgrrlashlRPLIASLASADLVMTFVVMPLDAVWNvtvqwyagdam
CKLMCFLKLFAMHSAAFILVVVSLDRHHAILhpldtldagr ...

>v1ar_rat 1
msfprgsqdrsvgnsspwwplttegsngsqeaarlgegdsplgdvrneelaKLE
IAVLAVIFVVAVLGNSSVLLALHrtprktsrmHLFIRH
LSLADLAVAFFQVLPQLCWDitssfrgpdwlCRVVKHLQVFAMFASAYMLVVMTADRYIAVCh
plktlqqparrsRLMIATSWVLSFILSTpqyfifsvieievnn
gtktqdcwatfiqpwgtraYVTWMTSGVFVAPVVVLGTCYGFICyhi ...

>oxyr_mouse 1
megtpaanwsieldlgsgvppgaegnltagpprrnealaRVEVAVLCLILF ...

Each FASTA-formatted sequence record (DNA or protein) consists of two parts. The first part is a single header line that starts with ">" followed by the sequence name and other annotations related to the sequence properties, such as genomic location or motif start position. The second part is the actual sequence (DNA, RNA or protein), which may span multiple lines. Note that blank lines are allowed in FASTA sequences.

On this server, the longest single sequence allowed in a submission is 5000 residues, and the maximum number of sequences allowed is 2000. Therefore, the maximum total length of a sequence data set is 10,000,000 residues. Although these limits should work for most submissions, they can be relaxed upon users' request.
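The record layout and size limits described above can be checked locally before a submission. The following is only a minimal sketch, not the server's own code: the function names parse_fasta and check_limits are hypothetical, and the only values taken from this page are the two limits (5000 residues per sequence, 2000 sequences).

```python
MAX_SEQ_LEN = 5000    # longest single sequence, as stated on this page
MAX_NUM_SEQS = 2000   # maximum number of sequences, as stated on this page


def parse_fasta(text):
    """Return a list of (header, sequence) pairs parsed from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are allowed in the input
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_limits(records):
    """Raise ValueError if a limit is exceeded; return the total length."""
    if len(records) > MAX_NUM_SEQS:
        raise ValueError(f"too many sequences: {len(records)} > {MAX_NUM_SEQS}")
    for header, seq in records:
        if len(seq) > MAX_SEQ_LEN:
            raise ValueError(f"sequence '{header}' is longer than {MAX_SEQ_LEN}")
    return sum(len(seq) for _, seq in records)


example = ">seqA 17 61\nTAATGTTTGT\n\nGCTGG\n>seqB\nGACAAAAACG\n"
records = parse_fasta(example)
print(len(records), check_limits(records))
```

The header line is kept whole (name plus annotations) because the annotation fields are free-form; splitting them further would require knowing the convention of each data set.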

Web Input
The web input is organized into two panels: (1) motif-finding input and (2) genetic algorithms input.

Motif-finding Input: one is first asked for an email address, which is NOT required. A valid address is needed only if the output link is to be sent via email. GEMFA is able to align either DNA or protein sequences, so specify the type in the molecular type field. The server also provides a function that can detect the sequence type automatically. Two EM-based motif-finding algorithms (DEM and WEM) are available, and DEM is the default.

GEMFA sorts the final alignments in descending order of likelihood and outputs the top-k motifs, where k can be set from 1 to 200. For DNA sequences, one can choose to align them on both strands, or on the forward or reverse strand only. Protein sequences can only be aligned on the forward strand, which is enforced by the server. The motif width must be specified before each run. The sequence data can be either uploaded or directly copied and pasted into a text area. The FASTA format is required for the input sequences. The following limits are imposed: the motif width must be between 4 and 120 residues; the number of top motifs must be between 1 and 200; the maximum number of input sequences is 2000.
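The motif-finding limits above can be expressed as a small pre-submission check. This is a sketch under assumptions: the function name, the parameter names and the option spellings ("dna", "protein", "both", "forward", "reverse") are hypothetical and are not the server's actual form fields. Only the numeric ranges and the forward-strand rule for proteins come from this page.

```python
def normalize_motif_input(mol_type, strand, width, top_k):
    """Check the motif-finding settings and return them as a dict."""
    if mol_type not in ("dna", "protein"):
        raise ValueError("molecular type must be 'dna' or 'protein'")
    if strand not in ("both", "forward", "reverse"):
        raise ValueError("strand must be 'both', 'forward' or 'reverse'")
    if not 4 <= width <= 120:
        raise ValueError("motif width must be between 4 and 120")
    if not 1 <= top_k <= 200:
        raise ValueError("number of top motifs must be between 1 and 200")
    if mol_type == "protein":
        # Proteins are aligned on the forward strand only.
        strand = "forward"
    return {"mol_type": mol_type, "strand": strand,
            "width": width, "top_k": top_k}


print(normalize_motif_input("dna", "both", 22, 10))
print(normalize_motif_input("protein", "both", 15, 5)["strand"])
```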

Genetic algorithms Input: the standard GA parameters are given as default values: the population size is 20 chromosomes, the number of generations is 10, the mutation rate (probability) is 0.01, and both the crossover and tournament selection rates are 0.75. Nevertheless, one can use a different scheme. Note that more generations mean a longer running time. A larger population may facilitate a more thorough search of the alignment space, but it also takes longer to finish a task. A harder motif-finding problem requires a larger population and more generations. The following limits are imposed: the population size must be between 10 and 50 (an even number is enforced); the number of generations must be between 5 and 20.
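The default values and limits above can be captured in a small settings object. This is an illustrative sketch only: the class and field names are hypothetical, and rejecting an odd population size is an assumption (the server may adjust the value instead). The range check on the three probabilities is a general sanity check, not a limit stated on this page.

```python
from dataclasses import dataclass


@dataclass
class GAParams:
    # Default values as listed on this page.
    population_size: int = 20
    generations: int = 10
    mutation_prob: float = 0.01
    crossover_prob: float = 0.75
    tournament_prob: float = 0.75

    def validate(self):
        """Raise ValueError if a setting is outside the documented limits."""
        if not 10 <= self.population_size <= 50:
            raise ValueError("population size must be between 10 and 50")
        if self.population_size % 2 != 0:
            # Assumption: odd sizes are rejected rather than rounded.
            raise ValueError("population size must be an even number")
        if not 5 <= self.generations <= 20:
            raise ValueError("number of generations must be between 5 and 20")
        for name in ("mutation_prob", "crossover_prob", "tournament_prob"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be a probability in [0, 1]")
        return self


defaults = GAParams().validate()
print(defaults)
```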

Comparison notice: When comparing GEMFA with other motif-finders, one has to pay attention to the number of runs (multiple runs). For example, GEMFA with 20 chromosomes and 10 generations is equivalent to 200 runs of MEME or another motif-finder. In other words, if MEME is run 300 or 100 times, that is NOT a fair comparison with GEMFA. To show that a new algorithm outperforms an old one, this important rule should be followed: do the best to make a fair comparison on the same ground.
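The arithmetic behind this notice is simply population size times number of generations. A minimal sketch follows; the function names are hypothetical.

```python
def equivalent_runs(population_size, generations):
    """Run count the notice treats as equivalent to one GEMFA job."""
    return population_size * generations


def is_matched_budget(other_runs, population_size=20, generations=10):
    """True if another motif-finder was given the same number of runs."""
    return other_runs == equivalent_runs(population_size, generations)


print(equivalent_runs(20, 10))   # 200
print(is_matched_budget(300))    # False
```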

Web Output
Options: Results can be obtained online after the submitted job is done, or one can ask for the output link to be sent by email, provided that a valid email address is entered.

Output: The output webpage from the GEMFA server consists of a summary of the input parameters and a hyperlink to a text file containing the alignment results. Click the link to download the results. The top-k motif alignments are displayed in descending order of likelihood; each alignment first shows its likelihood and then the full local alignment, where each row consists of the sequence name, motif start location, strand and motif sequence. Here is an example of the output:

Results from GEMFA Web Server at CMH, Missouri, USA
Time generated: Mon Aug 10 16:34:45 CDT 2009

Seqfile: LtO20090810163436.fa
Number of sequences: 18
Total sequence length: 1890

Motif Alignment Input
----------------------------
Motif-finding algorithm: DEM
Aligned sequence: dna
Motif aligned on: both strands
Motif length: 22
Output top 10 motifs

Genetic Algorithms Input
----------------------------
Number of generations: 10
Population size: 20
Crossover prob.: 0.75
Tournament prob.: 0.75
Mutation prob.: 0.01

OUTPUT: seq_name, motif_start, strand, motif_seq
Top*1    	-376.227
ce1cg 17 61 ,    59, +,	TTTTTTTGATCGTTTTCACAAA
ara 17 55   ,    53, +,	ATTATTTGCACGGCGTCACACT
bglr1 76    ,    76, -,	ACTGTGAGCATGGTCATATTTT
crp 63      ,    63, -,	TATGCAAAGGACGTCACATTAC
cya 50      ,    48, +,	AAGGTGTTAAATTGATCACGTT
deop2 7 60  ,     5, +,	ATTATTTGAACCAGATCGCATT
gale 42     ,    42, -,	AATTTATTCCATGTCACACTTT
ilv 39      ,    39, -,	AACGTGATCAACCCCTCAATTT
lac 9 80    ,     9, -,	AATGTGAGTTAGCTCACTCATT
male 14     ,    12, +,	ATTCTGTAACAGAGATCACACA
malk 29 61  ,    59, +,	ATTTCGTGATGTTGCTTGCAAA
malt 41     ,    39, +,	GAATTGTGACACAGTGCAAATT
ompa 48     ,    46, +,	TATGCCTGACGGAGTTCACACT
tnaa 71     ,    71, -,	ATTGTGATTCGATTCACATTTA
uxu1 17     ,    17, -,	GTTGTGATGTGGTTAACCCAAT
pbr322 53   ,    51, +,	GCGGTGTGAAATACCGCACAGA
trn9cat 1 84,    82, +,	AAAATGAGACGTTGATCGGCAC
tdc 78      ,    78, -,	TTTGTGAGTGGTCGCACATATC

Top*2    	-376.233
ce1cg 17 61 ,    61, +,	TTTTTGATCGTTTTCACAAAAA
ara 17 55   ,    55, +,	TATTTGCACGGCGTCACACTTT
bglr1 76    ,    76, +,	ACTGTGAGCATGGTCATATTTT
.
.
.
*Note that each output is kept on our server for only one day. Users should download their results as soon as possible.
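A downloaded result file can be read back into a program for further analysis. The sketch below assumes the layout seen in the example output above: a "Top*k" line carrying the likelihood, followed by comma-separated rows of sequence name, motif start, strand and motif sequence. The function name parse_gemfa_output is hypothetical, and the parser has not been checked against other GEMFA outputs.

```python
def parse_gemfa_output(text):
    """Return a list of motifs, each with rank, likelihood and site rows."""
    motifs = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Top*"):
            label, score = line.split(None, 1)
            current = {"rank": int(label[len("Top*"):]),
                       "likelihood": float(score),
                       "sites": []}
            motifs.append(current)
        elif current is not None and line.count(",") >= 3:
            # Split from the right so the sequence name stays in one piece.
            name, start, strand, seq = (f.strip() for f in line.rsplit(",", 3))
            if start.isdigit() and strand in ("+", "-"):
                current["sites"].append((name, int(start), strand, seq))
    return motifs


sample = (
    "OUTPUT: seq_name, motif_start, strand, motif_seq\n"
    "Top*1    \t-376.227\n"
    "ce1cg 17 61 ,    59, +,\tTTTTTTTGATCGTTTTCACAAA\n"
    "bglr1 76    ,    76, -,\tACTGTGAGCATGGTCATATTTT\n"
    "\n"
    "Top*2    \t-376.233\n"
    "ce1cg 17 61 ,    61, +,\tTTTTTGATCGTTTTCACAAAAA\n"
)
for motif in parse_gemfa_output(sample):
    print(motif["rank"], motif["likelihood"], len(motif["sites"]))
```

Lines before the first "Top*" entry, including the column legend, are ignored, so the whole file can be passed in unchanged.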

Submission
The CRP binding sequences are loaded in the text area as the default sequences, and one can simply click the "Submit" button located between the two panels to start a trial run.


Last modified: August 12, 2009 by C. Bi