Background
The discovery of cis-regulatory motifs remains a challenging task even though the number of sequenced genomes is constantly growing.

[...] frequencies near the transcription or translation start sites, but are not strictly position dependent (e.g. T/A rich motifs), the length of a significant stretch is recorded as a “gap”. The script pSUMscan v.3.4.0 was written to automatically catalogue the peak, valley and gap data, with the default threshold set to 4 standard deviations (SD) above or below the background average; gaps were set to a default length of 25 bp. The entire automated analysis performed by the pSUMscan script can also be done manually in common spreadsheet programs.

Cis-regulatory Elements (CREs)
Known CREs were retrieved from the well-maintained databases PLACE ([56]; 450 elements) for plant motifs and SCPD ([57]; 51 elements) for motifs from Saccharomyces. To gather CREs for Drosophila or Caenorhabditis, literature describing various eukaryotic cis-elements was collected and catalogued into a list of 87 elements. This list carries the original description of each cis-element, the IUPAC conversion used in this work and the original reference source describing the element. Except for palindromes, we constructed motif distribution curves for the sense and antisense orientations of the motifs independently. The list of cis-elements used from PLACE, SCPD and other eukaryotes is found in Additional File 5.

Motif permutations
All possible permutations of the hexamers TATAAA and TATATA for each nucleotide were generated with the MAIN_AllCombinations v.3.4.1 script. Antisense motifs were generated by the MAIN_List_w_RevComps v.1.2 routine and all redundant motifs were removed with the script MAIN_ClearRedundancy v.1.2.

Shared motifs
Shared motifs were identified by compiling text file lists of overrepresented motifs for each dataset.
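The list-handling steps named in this part of the Methods (generating motif variants, adding antisense motifs, removing redundant entries and comparing lists between datasets) could be sketched roughly as follows. This is not the Motif Mapper code: all function names are hypothetical, and reading "all possible permutations ... for each nucleotide" as every single-position substitution of the seed hexamer is an assumption made only for illustration.

```python
# Illustrative sketch only; names and the single-substitution reading
# of "permutations" are assumptions, not the published scripts.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def single_site_variants(motif):
    """Return the seed motif plus every motif that differs from it
    at exactly one position (assumed meaning of 'permutations')."""
    variants = {motif}
    for i in range(len(motif)):
        for base in "ACGT":
            variants.add(motif[:i] + base + motif[i + 1:])
    return variants


def reverse_complement(motif):
    """Return the antisense (reverse-complement) form of a motif."""
    return motif.translate(COMPLEMENT)[::-1]


def remove_redundancy(motifs):
    """Drop repeated motifs while keeping first-seen order."""
    return list(dict.fromkeys(motifs))


def with_reverse_complements(motifs):
    """Follow each motif with its antisense form, then drop repeats
    (a palindrome such as TATATA is therefore listed only once)."""
    expanded = []
    for motif in motifs:
        expanded.append(motif)
        expanded.append(reverse_complement(motif))
    return remove_redundancy(expanded)


def shared_motifs(*motif_lists):
    """Return the motifs present in every one of the given lists."""
    return set.intersection(*(set(motifs) for motifs in motif_lists))
```

Under these assumptions, `single_site_variants("TATAAA")` contains 19 motifs: the seed plus three alternative bases at each of six positions. `shared_motifs` accepts any number of per-dataset lists, so the same call would cover a two-list or a four-organism comparison.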
The list content was compared using the SetGrouping script to identify shared motifs in both lists. Details can be found in Additional File 6.

Abbreviations
TSS, transcription start site; ATG, translation start site; CRE, cis-regulatory element; SD, standard deviation; At, Arabidopsis thaliana; Sc, Saccharomyces cerevisiae; Dm, Drosophila melanogaster; Ce, Caenorhabditis elegans.

Authors’ contributions
All authors have read and approved the manuscript. DW and KS initiated the development of frequency-distribution analysis of cis-element distribution, KS and KB developed algorithms for analysis, KB programmed the Motif Mapper open source package, and KH provided continuous support and research space for completion of this work.

Supplementary Material
Additional File 1: Distribution curves of the TATA-box motifs TATAAA and TATATA. The motif distribution curves of the TATA-box hexanucleotides TATAAA and TATATA were constructed on automatically assembled datasets of the Arabidopsis, Caenorhabditis, Drosophila and Saccharomyces genome sequences. The relative number of motifs per site (in percent) was mapped to their respective position [see Additional file 9]. The grey box indicates the region used to calculate the background average and its SD. (131K, pdf)

Additional File 2: Size distribution of 5′ UTRs. The 5′ UTR lengths from each TSS dataset were plotted against their relative frequency. (62K, pdf)

Additional File 3: Mononucleotide frequencies at the transition initiation sites TSS and ATG. The motif distribution curves of the four nucleotides were constructed on automatically assembled datasets of the Arabidopsis, Caenorhabditis, Drosophila and Saccharomyces genome sequences. The relative number of nucleotides per site (in percent) was mapped to their respective position. (184K, pdf)

Additional File 4: Motif Mapper analysis flowchart.
A flow chart of the sequence extraction and analysis procedure using Motif Mapper. (282K, pdf)

Additional File 5: List of cis-regulatory elements and references. List of all cis-regulatory elements, their IUPAC version used in this work and their references. (285K, xls)

Additional File 6: Hexanucleotides positioned at the transcription or the translation start site and shared between the four model organisms. The standard deviation fold difference for the most significant peak is shown for each shared motif of the transcription or the translation start site datasets. Motif distribution curves were constructed and analyzed for all possible hexanucleotide motifs with frequency disequilibria exceeding 6-fold SD from the background average. (95K, xls)

Additional File 7: Raw data for Table 1. Contains the .psum data for all of the motifs used in the D. melanogaster dataset comparisons between Ohler et al. (2002, [21]) and GenBank (2005). (451K, xls)

Additional File 8: Peak analysis for.