ERIC Newsletter

Please e-mail us if you would like to be added to our mailing list.

In this issue...

Easier Access to ERIC-BRC Annotations

Did you know that you can use ERIC to generate lists of orthologs shared across genomes? Or download sequences or annotations in multiple formats? Links to these and many other useful analysis tools are available through ASAP, ERIC's online annotation database.

ERIC annotations can also be accessed through the Genome Tools menu on the ERIC site. Quickly find your favorite gene (e.g., yopH) through the Search ASAP search box, or find annotations based on more detailed criteria such as keyword or genome region using the Annotations link. If you are interested in exporting sequences or annotations for later viewing and analysis, follow the Downloads link for access to DNA or protein sequences, lists of annotations, and much more.

In addition to well-curated annotations, ERIC's ASAP also provides powerful tools for Comparative Genomics. The Comparative Genomics tool generates a table of orthologs shared across genomes. These comparisons may be qualified by relationship, feature type, and/or curation status. The table may be downloaded as displayed, or in a database-friendly format.

The ERIC Annotations database provides annotations that are accurate, detailed, up-to-date and consistent across genomes. It is open for contribution by the research community to encourage annotation by domain experts, and applying for an annotation account takes only a few minutes. The entry page to ASAP is pictured below.
Figure 1. ERIC-ASAP offers enhanced access to Annotations.

New Mauve Portlet for Access to Whole Genome Alignments of Enteropathogens

Whole genome multiple alignment is an important tool for understanding and exploiting the variation between organisms. ERIC now offers a set of webpages that enable easy access to the pre-computed whole genome alignments constructed using the recently improved Mauve version 2.0. This "Portlet" is online and pictured below.
Figure 2. View of the Mauve Portlet.

This portlet provides filtering to rapidly search for alignments containing a specific genome, as well as online viewing of a selected multiple genome alignment. Users preferring to reference alignments repeatedly or to work offline may download alignments for import into the desktop version of Mauve. Whether you prefer to download alignments or work with them online, Mauve still provides valuable hyperlinks from genomic features to related annotations in ERIC's database. The Mauve Portlet also provides instructions on how to create new alignments to provide comparative genomic insights into your research.

EnteroFams - Protein Families for Enterobacteria

As the number of genome sequences from enterobacteria spirals upwards, "EnteroFams" provide a valuable resource, specifically tuned for this family of bacteria, to aid genome annotation and analysis. EnteroFams are a collection of protein families derived from the predicted proteomes of enterobacteria with complete genome sequences. Each family serves as a point of unification for information that can be projected across all members of a protein family. The families are defined by an initial "seed alignment" of proteins from representative genomes of enterobacteria and are represented by profile hidden Markov models (HMMs) that can be used to identify additional members of a family from new genome sequences.

The first release of the EnteroFams is a collection of 1,564 protein families that are nearly ubiquitous in enterobacteria. Each family is represented by an HMM constructed from an alignment of putative orthologous full-length proteins from eight "seed" genomes (see Figure 3). Orthology among proteins was determined by analysis of reciprocal best BLAST matches and additional manual curation.
The eight genomes include two members each of four genera of enterobacteria, selected to sample diverse lineages of this family and to include two different representative genomes for each ERIC pathogen lineage.
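The reciprocal best match criterion mentioned above can be illustrated with a short Python sketch. It is not ERIC's pipeline: the hit tuples, the protein identifiers and the use of bit score to rank hits are assumptions made for the example, and the manual curation step is not modeled.

```python
from typing import Dict, Iterable, List, Tuple

# One BLAST hit, reduced to (query_id, subject_id, bit_score).
# This layout is an assumption for the sketch, not a BLAST output format.
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return the highest-scoring subject for each query.

    On a tie, the hit encountered first is kept.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
a_vs_b = [
    ("ecoA_rpsT", "styB_rpsT", 170.0),
    ("ecoA_rpsT", "styB_x", 40.0),
    ("ecoA_yopH", "styB_x", 55.0),
]
b_vs_a = [
    ("styB_rpsT", "ecoA_rpsT", 168.0),
    ("styB_x", "ecoA_rpsT", 41.0),
]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
# [('ecoA_rpsT', 'styB_rpsT')]
```

Only the rpsT pair is reported, because the best match of `styB_x` points back to a different protein than the one that selected it.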
Figure 3. Seed Alignment for EnteroFam0000082 (RpsT), 30S ribosomal subunit protein S20.

The HMMs were used to scan the eight seed genomes and eleven additional genomes for new members. The threshold for inclusion of a protein in a family was defined as the lowest score obtained for a protein from one of the eight seed genomes (see Table 1 for an example). The complete alignment of all members of the resulting family is shown in Figure 4. All members of an EnteroFam have a link to the associated EnteroFam page, which contains alignments and annotations for each family (for example, see EnteroFam0000082), including the cutoff thresholds. Annotations were selected to be appropriate for all species so that they can be applied to all members of the family, enabling the propagation of high-quality annotations across features in related enterobacteria.
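The inclusion rule described above (the family cutoff is the lowest score obtained by any seed protein) can be sketched in Python as follows. The protein names and scores are invented for illustration and are not the values in Table 1; treating a score equal to the cutoff as a member is an assumption, chosen so that the lowest-scoring seed protein is itself included.

```python
from typing import Dict, List


def inclusion_threshold(seed_scores: Dict[str, float]) -> float:
    """Family cutoff: the lowest HMM score among the seed proteins."""
    if not seed_scores:
        raise ValueError("at least one seed score is required")
    return min(seed_scores.values())


def family_members(scan_scores: Dict[str, float],
                   threshold: float) -> List[str]:
    """Proteins scoring at or above the cutoff (inclusive by assumption)."""
    return sorted(p for p, score in scan_scores.items() if score >= threshold)


# Hypothetical scores for eight seed proteins (not taken from Table 1).
seed_scores = {
    "seed1": 182.4, "seed2": 181.9, "seed3": 180.2, "seed4": 179.5,
    "seed5": 176.0, "seed6": 171.3, "seed7": 168.8, "seed8": 150.1,
}
# Hypothetical scores from scanning additional genomes with the same HMM.
scan_scores = {"new1": 175.0, "new2": 98.3, "new3": 150.1}

cutoff = inclusion_threshold(seed_scores)
print(cutoff)                               # 150.1
print(family_members(scan_scores, cutoff))  # ['new1', 'new3']
```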
Table 1. Scores and E-values for seed proteins of EnteroFam0000082 (RpsT), 30S ribosomal subunit protein S20.
Figure 4. Complete Alignment for EnteroFam0000082 (RpsT), 30S ribosomal subunit protein S20.

This initial release of EnteroFams covers a substantial fraction (approximately one third) of the proteins that we expect to find in any given genome from an enterobacterium, but does not cover proteins that are present at lower frequency among enterobacteria. We anticipate augmenting the initial set of EnteroFams with additional families that cover some of these less common proteins. Since there are proteins that are unique to particular genomes, we do not expect to generate EnteroFam HMMs that cover every protein in every genome, and will focus new family construction on more widely conserved proteins. Plans are in place to allow downloading of EnteroFams to enable more effective use of these family models by scientists in the research community.

Insertion Sequence Annotations in ERIC-ASAP

Background: why it is important to study mobile DNA elements

Insertion sequences (ISs) are small units of mobile DNA that are able to insert at many sites in a genome (1). They occur throughout the eubacteria, and IS-free bacterial genomes have been found only very rarely. All of the IS families are represented in the enterobacteria (2). Mobile elements have had a significant impact on the evolution of pathogenicity, and have played an important role in bacterial evolution generally (3). ISs provide mobility for drug resistance genes, virulence genes and islands, or genes encoding advantageous phenotypes such as the ability to utilize a particular nutritional substrate or to resist host defenses. Genes and the regulation of their expression may be disrupted by insertions, and these effects multiply when particularly active ISs proliferate throughout a genome (as in Shigella). ISs mediate genomic rearrangements by providing sequence substrates for homologous recombination.
All of the above contribute to the strain variations used in diagnostic DNA analysis, such as mapping of restriction fragment length polymorphisms (RFLPs), multi-locus PCR or multi-locus sequence typing (MLST). Computational analyses of IS occurrence and contribution to evolution depend critically on accurate annotations.

IS structure and families

The simplest IS families are 700 - 2000 bp long and consist of a gene encoding a transposase enzyme, usually flanked by 15-20 bp long inverted repeats (IRs). Not all families have IRs. In some IS families the transposase is encoded by two overlapping reading frames and expression employs a programmed translational frameshift, thought to be a mechanism for regulation of expression. Transposases are classified by their biochemical mechanism and by the amino acid motifs that form their active sites. Many but not all create direct repeats (DRs) of the target sequence upon transposition, whose lengths vary among elements or even among copies of the same element. More complex families exist: some encode three genes, while others include composite structures such as transposons and Insertion Sequence Common Regions (ISCRs) that contain genes whose functions are unrelated to transposition, frequently encoding antibiotic resistance (4). Less complex entities are also found, such as miniature inverted-repeat transposable elements (MITEs), which appear to be derivatives of ISs (2). Their small size makes them difficult to identify.

IS annotations

The status of IS annotations in genomes as they are published is uneven at best and sometimes very poor, as noted in two recent reviews (2,3). Transposase genes may not be correctly identified, and many published annotations do not include the IS end-coordinates that include the IRs. Frequently, none of the IS fragments are annotated, although even an isolated IR sequence is important because it may be part of a composite element.
Fragments are also important as these may be scars of insertions or recombinations that indicate genomic changes or strain differences. Some families of elements can be difficult to delineate due to the absence of DRs and IRs, and require much human attention.

In ERIC-ASAP we have embarked on the task of comprehensive, uniform IS annotation for completed genomes, including end coordinates, transposase genes and fragments. A core set of 20 complete genomes was chosen for the primary effort, including strains from each of the four groups E. coli, Shigella, Salmonella and Yersinia. Elements are identified by sequence similarity with the members of the definitive database of IS elements, ISFinder (5). RepeatMasker (6) is used rather than BLASTN to compare genome sequences to the IS database, as it was specifically designed to locate repeated sequences within a genome, whereas BLAST performs poorly at this task, especially with shorter match lengths. The resulting list of matches is sorted into full-length hits and fragments. Full-length elements, or those with few differences from a database entry, can often be annotated using the coordinates in the output, which are usually accurate. If the genome has existing IS annotations, the newly derived RepeatMasker-defined coordinates are compared with the published annotation and any discrepancies are resolved by visual inspection of individual sequence alignments with the full IS from the database. Various systematic errors (e.g. off-by-one) have been corrected among published annotations. Delineation of fragments also requires inspection of individual sequence alignments; different alignment algorithms produce different results, which can only be sorted out by assessment in the context of the flanking sequences and features. RepeatMasker is not always correct in the case of fragments or degenerated elements.

The feature types used in published IS annotations are quite variable (e.g. "misc_feature", "repeat_unit"), sometimes varying within a single genome. In ERIC, the variant forms are replaced by a uniform format that allows retrieval of all the ISs in a single database query without also capturing non-IS features. Transposons and other complex arrangements are identified by human inspection. They may be "flagged" by drug-resistance genes, but may instead or also carry other sequences of unpredictable length and type. Transposase genes are identified by BLAST or from published annotations, which are updated to include the correct product field and Gene Ontology (GO) terms, and information on the biochemical mechanism where possible. Many mutated transposase genes and gene fragment annotations will be updated to "pseudogene" features, a complex but important task. To complete the mobile DNA annotations, identification of recently described related elements such as integrons, MITEs and ISCRs should also be undertaken, though these will no doubt offer a different set of challenges. More information on ERIC's IS element annotation methods is available in ERIC's online SOP.

Distribution of elements in ERIC genomes

All of the genomes in the ERIC database that have been analyzed contain IS elements. E. coli and Salmonella strains have relatively few ISs (<50), while Shigella, and to a lesser extent Yersinia pestis, have undergone a relatively recent explosion of insertions, with up to several hundred per genome. In the extreme case of the Shigella flexneri virulence plasmids, the IS elements account for nearly 50% of the 210 kb DNA sequence. Table 2 shows the numbers of elements annotated in complete genomes. Annotation of IS elements in draft genomes is in progress.
In summary, IS annotations provide information that is important for the design of diagnostic and therapeutic DNA-based reagents. Patterns of elements can be used to distinguish closely related pathogenic strains. Design of therapeutics must take into account that ISs can be mobile and can disseminate or exchange unwanted material among other bacteria in the environment.

References