ERIC Newsletter

August 2007


Please e-mail us if you would like to be added to our mailing list.





Comparative Genomics Using Mauve 2.0

Nicole Perna, Ph.D.
Nicole Perna is head of the Genome Evolution Laboratory at the University of Wisconsin-Madison and one of the three co-investigators for the ERIC-BRC project.

Multiple genome alignments are now available for each species of enterobacteria represented in ERIC-BRC. These alignments were constructed using a newly released version of a progressive alignment tool that dramatically improves alignment in regions conserved among subsets of genomes, a particularly important feature for recognition of genomic islands. Coupled with improved visualization and navigational tools, Mauve 2.0 provides a powerful new mechanism for comparative genomics of enterobacteria.

You can access the software and alignments through ERIC-BRC to use in your own research and to browse the ERIC-ASAP database from a comparative perspective.

These pre-computed genome alignments are currently available:
  • Six complete Escherichia genomes
  • Ten complete Escherichia and Shigella genomes
  • Five complete Salmonella genomes
  • Seven complete Yersinia genomes
Each alignment includes all complete published genome sequences available at the time of construction, and can be accessed from any annotation page of the relevant genomes. As new genomes become available, these alignments will be updated.

New Functionality in Mauve 2.0
ERIC-BRC curators constructed these alignments using the newly released Mauve 2.0 software application, developed at the University of Wisconsin. The Mauve web site includes extensive documentation, including information on the algorithm and Mauve 2.0 improvements. Key to the utility of Mauve 2.0 for ERIC-BRC organisms is the new ability to identify and align regions shared among a subset of the genomes under comparison, where earlier versions (1) focused on regions shared by all genomes. This is enabled through a progressive approach that adds genomes to the alignment in an order prescribed by a guide tree inferred from overall genome similarity. Many of the genes that contribute to pathogenicity and other niche adaptations lie in genomic islands, and many of these islands are found in only a subset of the sequenced genomes.
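The idea of adding genomes in an order prescribed by a guide tree can be illustrated with a minimal sketch. This is not Mauve's implementation: the tree topology and the genome labels below are hypothetical placeholders, and the sketch only lists the order in which groups of genomes would be merged, without performing any alignment.

```python
# Hypothetical guide tree written as nested tuples; leaves are genome labels.
# In Mauve 2.0 the tree is inferred from overall genome similarity; here it is
# simply assumed for illustration.
GUIDE_TREE = ((("O157_a", "O157_b"), ("UPEC_a", "UPEC_b")), ("K12_a", "K12_b"))


def leaves(node):
    """Return the genome labels below a tree node, left to right."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def merge_order(node, steps=None):
    """Post-order walk: both children are handled before their parent,
    so closely related genomes are merged first."""
    if steps is None:
        steps = []
    if isinstance(node, str):
        return steps
    left, right = node
    merge_order(left, steps)
    merge_order(right, steps)
    steps.append((leaves(left), leaves(right)))
    return steps


if __name__ == "__main__":
    for number, (group_a, group_b) in enumerate(merge_order(GUIDE_TREE), 1):
        print(f"step {number}: align {group_a} with {group_b}")
```

Because each step merges only the genomes below one node, a segment shared by just two close relatives can be aligned at the step that joins them, before genomes that lack the segment enter the alignment.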

Figure 1. Partial screenshot of the ERIC-BRC Mauve 2.0 six E. coli alignment. This region shows the junction between a segment conserved among the two O157:H7 genomes, the well-characterized LEE pathogenicity island (blue), and one found in all six aligned E. coli genomes (pink). The popup menu showing links to ERIC-ASAP and NCBI appears when users click on an annotated gene (white blocks).

Mauve 2.0 also facilitates visualization of these important lineage-specific regions by providing an option to colorize an alignment according to "multiplicity" to show the distribution of homologous segments across genomes. The original color scheme gave a different color to each collinear segment, to facilitate visualizing global genome rearrangements. The new scheme makes it easy to see local transitions between shared and lineage-specific segments. Users can zoom all the way in to the nucleotide sequence alignment, or all the way out to see large-scale events like inversions. (See Figure 2 below.) Mauve 2.0 also includes an integrated search function allowing users to query ERIC's annotations and jump to the corresponding region of the alignment.
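One way to think about "multiplicity" is as a record of which genomes contain a given segment. The sketch below encodes that as a bitmask, one bit per genome. The genome labels and the encoding are assumptions made for illustration and are not taken from the Mauve source code.

```python
# Hypothetical labels for six aligned genomes; the order fixes the bit position.
GENOMES = ["O157_a", "O157_b", "UPEC_a", "UPEC_b", "K12_a", "K12_b"]


def multiplicity_mask(present_in):
    """Set one bit for each genome that contains the segment."""
    mask = 0
    for index, name in enumerate(GENOMES):
        if name in present_in:
            mask |= 1 << index
    return mask


def describe(mask):
    """Turn a mask back into a readable description of the segment."""
    members = [g for i, g in enumerate(GENOMES) if (mask >> i) & 1]
    if len(members) == len(GENOMES):
        return "backbone (all genomes)"
    if len(members) == 1:
        return f"unique to {members[0]}"
    return "shared by " + ", ".join(members)


if __name__ == "__main__":
    segments = [set(GENOMES), {"O157_a", "O157_b"}, {"K12_a", "K12_b"}, {"UPEC_a"}]
    for present in segments:
        mask = multiplicity_mask(present)
        print(f"{mask:06b}  {describe(mask)}")
```

Segments with the same mask would receive the same color, so a change of color along a genome marks a transition between a shared region and a lineage-specific one.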

Figure 2. Two views of the same region of the ERIC-BRC Mauve 2.0 alignment of 6 E. coli genomes. The visualization on the left uses the default color scheme based on homologous segments. For example, the blue regions are collinear blocks that contain homologous sequence. Importantly, these collinear blocks can include islands unique to a single genome or collinear islands common to a subset of genomes. The visualization of the same aligned region shown on the right is colorized by multiplicity. Here, pink (mauve) blocks indicate that the region is conserved across all six genomes. Other colors mark regions found in only a subset of genomes.

Accessing Mauve 2.0 Alignments through ERIC-BRC
The alignments are available from the feature annotation pages in the ERIC-ASAP system. You can launch Mauve 2.0 on your local computer by clicking on the name of the Mauve alignment in the left column on the annotation page. This will automatically download the most up-to-date version of Mauve and the alignment data. Note that these files include the complete alignment and GenBank flat files for each genome, so the download can take a while the first time you access the alignments. When the alignment opens in Mauve, the display will be centered on the gene from which you launched Mauve. To find other genes you can perform a keyword search for functions, specific gene names, or other types of annotations. For example, you could search for all E. coli O157:H7 strain EDL933 genes involved with iron utilization and uptake, then examine the corresponding region in all six sequenced strains to evaluate whether particular subsystems are conserved and co-localized.

Figure 3. Mauve 2.0 Alignment. This alignment includes six E. coli genomes - two O157:H7 genomes (top two), two uropathogens (middle two) and two K-12 strains (bottom two). Each genome is represented as a tier with colored bars showing the level of sequence similarity on top and annotated genes below. The rRNA-encoding genes (red) are in the middle of a region homologous across all genomes. It is easy to distinguish these "backbone" or conserved regions (pink) in the display colored by multiplicity. Other colors mark regions found in a subset of genomes. The vertical bars show the cursor position and corresponding location in other genomes (one example circled in red). Here, the cursor marks the resumption of homology across all genomes (pink) following lineage-specific segments (various colors). The blue region found in both O157:H7 genomes corresponds to a cryptic prophage. The orange segment encodes unknown proteins common to both K-12 strains. Each uropathogen genome has a (distinct) prophage in this area. At any time, users can re-center the display, visually synchronizing the genomes onto homologous points in the other genomes. This interactive aspect of the display greatly facilitates examination of areas that are flanked by homologous sequence but differ in the middle, such as this hotspot for phage insertions.

Building Your Own Mauve Alignments
You may be interested in building your own Mauve 2.0 alignments using a different set of genomes than those included in the pre-built alignments. The Mauve User's Guide provides detailed instructions. In brief, Mauve 2.0 accepts input genome sequences in either GenBank flat file or FASTA formats, both of which can be downloaded from ERIC-BRC. Incomplete genomes in multiple contigs (or complete genomes with multiple replicons) are acceptable (Mauve concatenates the multiple sequences in input order and displays contig breaks in the visualization tool). Alignment of incomplete sequences with complete genomes of close relatives can be useful for exploring contig order and orientation. You will need to choose an alignment method (original or progressive) and select parameters appropriate for the level of sequence divergence. The pre-built ERIC-BRC alignments were constructed using default parameters for the progressive alignment option. The amount of time required for complete alignment will depend on the number, length, and divergence level of genomes under comparison, as well as your hardware. Upon completion, the Mauve visualization will appear, and corresponding alignment files will be saved in the directory you selected during sequence input. These alignments may be retrieved in subsequent sessions by launching the Mauve application from your desktop. ERIC-BRC curators are available to assist you with construction of other enteropathogen alignments. Please contact info@ericbrc.org for assistance.
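The handling of multi-contig input described above (concatenation in input order, with contig breaks kept for display) can be sketched as follows. This is an illustrative stand-in rather than Mauve's own code; the function names and the tiny example sequence are hypothetical.

```python
def parse_fasta(text):
    """Return [(name, sequence), ...] in the order the records appear."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first word of the header as the contig name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def concatenate(records):
    """Join contigs in input order and remember where each one sits.
    Boundaries are (name, start, end) with 0-based, half-open coordinates."""
    pieces, boundaries, offset = [], [], 0
    for name, seq in records:
        pieces.append(seq)
        boundaries.append((name, offset, offset + len(seq)))
        offset += len(seq)
    return "".join(pieces), boundaries


def contig_at(boundaries, position):
    """Map a coordinate in the concatenated sequence back to its contig."""
    for name, start, end in boundaries:
        if start <= position < end:
            return name
    return None


if __name__ == "__main__":
    draft = ">contig1 example\nACGT\nAC\n>contig2\nGGG\n"
    sequence, bounds = concatenate(parse_fasta(draft))
    print(sequence, bounds, contig_at(bounds, 6))
```

Keeping the boundary table alongside the joined sequence is what lets a position in the alignment be traced back to a specific contig, which is the information needed when exploring contig order and orientation against a finished relative.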

Training
ERIC-BRC offers Mauve training - we are happy to set up individual training opportunities for our user community and encourage you to contact us to describe your needs. Please contact info@ericbrc.org for assistance.




ERIC-BRC Yersinia Genome Updates

Bradley Anderson, Ph.D.
Bradley Anderson is a genome annotator and curator on the ERIC-BRC Team at the University of Wisconsin-Madison.

Yersinia genomics emerged in 2002 with the publication of the CO92 (2) and KIM (3) Yersinia pestis genomes. Since then, complete and draft genome sequences for additional Yersinia strains and species have been generated and annotated by diverse groups. ERIC-BRC currently contains genome data from seven Y. pestis and two Y. pseudotuberculosis strains. The styles and content of the original GenBank deposits for these genomes vary considerably, and the oldest genome annotations are now out of date. In the interest of creating a fresh standard for comparative annotation of Yersinia genomes, ERIC-BRC selected Y. pestis CO92 as a focal point for a substantial annotation update. Over the next few months, these updates will be propagated across orthologous genes from all Yersinia genomes, and relayed to NCBI for integration. In the interim, we invite you to access the CO92 updates already available through ERIC-BRC.

In brief, this update reflects several synergistic efforts. ERIC-BRC curators reviewed Yersinia primary literature published since 2002, incorporating new names and product descriptions with links to the corresponding papers in PubMed. Other substantial updates derived from comparisons with the 2006 re-annotation of the E. coli K-12 genome (4). Database cross-references were added to various other resources, including UniProt. Insertion Sequence (IS) element and pseudogene boundaries and annotations were also revised. In total, the original deposit contained approximately 41,000 annotation records, of which just over 5,000 were removed or restructured by curators, who also added over 21,000 additional annotations. A more complete description of these updates will appear as a chapter in an issue of Advances in Experimental Medicine and Biology (5), detailing the proceedings of the October 2006 ASM Yersinia conference.

Of course, keeping annotations current is an ongoing task and we are seeking your assistance. If you publish a paper that contains information that should be reflected in one or more ERIC-BRC genome annotations, there are several things you can do to ensure that updates happen in a timely way. ERIC-BRC is designed for direct community input, and we are enthusiastic about signing you up and training you to make annotation updates in the ERIC-BRC database yourself. Alternatively, we urge you to communicate your findings to our designated curators. Contact us to pursue either option at info@ericbrc.org.




Annotation of the EPEC plasmid pMAR7 in ERIC-ASAP - A Model for Collaboration in Genomics Projects

Valerie Burland, Ph.D.
Valerie Burland is a Senior Scientist at the University of Wisconsin-Madison on the ERIC-BRC Team, and a specialist in the characterization and annotation of insertion elements.

Project Background
Enteropathogenic Escherichia coli (EPEC) cause acute and persistent diarrhea especially among infants in developing countries, who are particularly vulnerable to infections transmitted by the fecal-oral route. EPEC's virulence factors are encoded by genes on the chromosome and on the large virulence plasmid found in these strains. The plasmid pMAR7 from the prototypic EPEC strain E2348/69 (O127:H6) was sequenced by the NIH-funded University of Wisconsin Pathogen Genomes Initiative in collaboration with Dr. James Kaper of the University of Maryland and his graduate student Carl Brinkley. The complete genome of E. coli (EPEC) E2348/69 (the source of pMAR7) is being sequenced at the Sanger Institute, UK, and will be imported into ERIC-ASAP as soon as it becomes available. When the ERIC-BRC portal was formed, its annotation subsystem ASAP (6) offered the pMAR7 project the means to update the analysis, annotate the DNA sequence, submit the results to GenBank (DQ388534), and provide assistance with preparation of a manuscript describing the results (7). The annotated sequence is accessible in ASAP for the community to update as further information becomes available. We encourage any interested parties to contact us to discuss similar efforts for other plasmids from ERIC organisms, in addition to draft or complete genomes.

Role of pMAR7 in Virulence
pMAR7 is a virulence plasmid derived from the large plasmid pMAR2, the first genetic element implicated in EPEC disease, which was found to be present in the prototypic EPEC strain E2348/69 (O127:H6) (8). Adherence of EPEC to HEp-2 epithelial cells and in vivo adherence to piglet intestinal epithelium was shown to be dependent on this self-transmissible plasmid. Similar virulence plasmids were widely found in other adherent EPEC strains with classical serotypes and they are essential for the distinctive EPEC adherence phenotype, hence the designation EAF (EPEC Adherence Factor) plasmid. Transfer of this plasmid to a non-adherent E. coli K-12 strain will confer the ability to adhere to epithelial cells. The EAF plasmids contain the 14-gene locus bfp, encoding the bundle-forming pilus responsible for the adherence type, and the virulence regulator locus (now called perA) which controls expression of both the bfp gene products (9, 10) and of intimin, an adhesin enabling tight binding to epithelial cells, encoded in the genomic LEE (Locus of Enterocyte Effacement) (11, 12). The importance of the pMAR7 loci to virulence was confirmed by infection of human volunteers with bfpA and perA wild-type and mutant strains (13).

Features Specific to pMAR7
Figure 4 shows a map of the plasmid. pMAR7 (101.5 kb) is self-transmissible and more than 30 kb larger than the related EPEC plasmid pB171 (69 kb) (14). This size difference is accounted for by an intact tra locus in pMAR7. This is a highly conserved 33-kb, 36-gene segment similar to the tra regions of the F and R100 plasmids, encoding conjugation and DNA transfer functions (15). The origin of DNA transfer oriT is also present. Hybridization of 134 EPEC strains showed that a complete tra region is present only in strains of the EPEC1 clonal group. The EAF plasmid's mobility is consistent with its proposed role in the evolution of E. coli virulence (16).

Figure 4. pMAR7 plasmid schematic. Two copies of Insertion Sequence (IS) element ISEc13 flank the pMAR7 tra region to create a transposon-like structure that could potentially mobilize the tra genes, leading to their deletion by recombination between the homologous elements. Insertion sequences (detected using IS Finder) accounted for >18% of the plasmid sequence, but the tra region is IS-free.

 




Microarray Analysis in ERIC with mAdb

John Greene, Ph.D.
John Greene is the Principal Investigator for mAdb at SRA International, and has spent a decade as a bioinformatics scientist - over five of which were spent training scientists on the mAdb system at NIH.

Recently, we added the mAdb microarray database and analysis system as a component of ERIC. mAdb was developed by NIH's Center for Information Technology, with assistance from ERIC's prime contractor SRA International, for the intramural program of the National Cancer Institute. In the nearly eight years of that project, mAdb has proven scalable and reliable enough to handle over 67,000 microarray experiments and has over 1,500 users at the NIH campus, as well as their collaborators world-wide. mAdb is a completely web-based system - all you need is a browser and a good Internet connection. Unlike GEO or ArrayExpress, mAdb is not just a MIAME-compliant repository for array data. In addition to the database, the system incorporates a wide variety of analysis tools, and it is possible to share data with collaborators through mAdb.

mAdb can handle Affymetrix data as well as spotted arrays, and can use the quantitation and composite image files from a number of microarray scanners. The central concept for mAdb is that of creating filtered, reusable datasets for analyzing microarray data. Once the raw data is processed and placed in ERIC's relational database, a user can filter the data for quality, using a variety of quality filters for spot size and signal/background ratio, excluding those spots marked as Bad or Not Found by the scanner software, and applying a number of other quantitative metrics. Normalization can be done either on the raw data or only on those spots which pass the spot quality filters set by the user. This creates a parent filtered dataset, which can then be filtered in other ways (by expression ratios, by genes (rows), or by arrays (columns)), or used directly in the analysis tools. mAdb's flexibility allows you to try multiple quality filter settings to create multiple parent datasets if you choose, since there are few absolutes to date in microarray analysis.
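The parent/child dataset idea can be made concrete with a small sketch. mAdb itself is a web application, so nothing below reflects its internals: the `Spot` fields, the threshold values, and the function names are all hypothetical, chosen only to mirror the filters named above (spot size, signal/background ratio, scanner flags, expression ratio).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    gene: str
    array: str
    size: float        # spot size, arbitrary units (assumed field)
    signal: float
    background: float
    flag: str          # "OK", "Bad", or "Not Found" as set by scanner software
    ratio: float       # expression ratio


def passes_quality(spot, min_size=50.0, min_signal_to_background=2.0):
    """Apply the spot quality filters; thresholds are illustrative defaults."""
    if spot.flag in ("Bad", "Not Found"):
        return False
    if spot.size < min_size:
        return False
    if spot.background <= 0:
        # The ratio is undefined here; this sketch treats the spot as failing.
        return False
    return spot.signal / spot.background >= min_signal_to_background


def parent_dataset(spots, **thresholds):
    """Spots that pass quality filtering form the reusable parent dataset."""
    return [s for s in spots if passes_quality(s, **thresholds)]


def filter_by_ratio(dataset, min_fold=2.0):
    """Child dataset: spots changed at least min_fold up or down."""
    return [s for s in dataset if s.ratio >= min_fold or s.ratio <= 1.0 / min_fold]


SPOTS = [
    Spot("geneA", "array1", 80, 1200, 200, "OK", 3.1),
    Spot("geneB", "array1", 85, 900, 300, "OK", 1.1),
    Spot("geneC", "array1", 30, 1500, 100, "OK", 4.0),   # too small
    Spot("geneD", "array1", 90, 150, 120, "OK", 0.2),    # weak signal
    Spot("geneE", "array1", 75, 2000, 250, "Bad", 5.0),  # flagged by scanner
    Spot("geneF", "array1", 70, 800, 100, "OK", 0.4),
]

if __name__ == "__main__":
    parent = parent_dataset(SPOTS)
    child = filter_by_ratio(parent)
    print([s.gene for s in parent], [s.gene for s in child])
```

Calling `parent_dataset` again with different thresholds yields a second parent dataset from the same raw spots, which corresponds to trying multiple quality filter settings.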

mAdb's main display page for filtered datasets is highly customizable, allowing you to show only the data you need. Spot images can be viewed for human quality control (so you can spot a hair or smudge on a spot that the quantitation might not directly reveal). You can view Gene Ontology terms, adjust the color contrast, and decide whether to store a dataset transiently (for 24 hours), temporarily (30 days since last access), or permanently. Online help is provided for each tool by clicking on the bee image shown on each page.

The analysis tools allow hierarchical, K-means, and self-organizing map (Kohonen) clustering of the data by a number of metrics and linkage methods, as well as other related visualization techniques such as scatter plotting, Principal Components Analysis, and Multidimensional Scaling. One can partition an experiment into subset groups that can be compared against each other by a number of techniques, including averaging the groups; producing the mean, median, and standard deviation for each gene by group; or comparing the groups by statistical tests such as t-tests or Wilcoxon rank tests for two groups, or by ANOVA or the Kruskal-Wallis method for multiple groups. You can perform Boolean comparisons between two or three datasets - and create yet another subset from the desired results. Finally, you can perform PAM (Prediction Analysis for Microarrays), which allows class prediction, or SAM (Significance Analysis of Microarrays), which is used to identify differentially expressed genes in single-, two-, or multiple-class comparisons, and to estimate the False Discovery Rate (FDR).
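For readers who want to see what the per-group summaries and a two-group comparison amount to, here is a minimal sketch using only the Python standard library. It is not how mAdb computes its results: the data values are invented, the function names are hypothetical, and the test statistic shown is Welch's variant of the t statistic, chosen here as an assumption because it does not require equal variances.

```python
import math
import statistics


def group_summary(values):
    """Mean, median, and sample standard deviation for one gene in one group."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def welch_t(group_a, group_b):
    """Welch's t statistic for two groups (each needs at least two values).
    Raises ZeroDivisionError if both groups have zero variance."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


def boolean_compare(dataset_a, dataset_b):
    """Boolean comparison of two gene lists, as sets."""
    a, b = set(dataset_a), set(dataset_b)
    return {"both": a & b, "either": a | b, "only_first": a - b}


# Invented example values for a single gene measured in two groups of arrays.
CONTROL = [1.0, 2.0, 3.0]
TREATED = [4.0, 5.0, 6.0]

if __name__ == "__main__":
    print(group_summary(CONTROL))
    print(group_summary(TREATED))
    print(round(welch_t(CONTROL, TREATED), 4))
    print(boolean_compare(["geneA", "geneB"], ["geneB", "geneC"]))
```

Turning the statistic into a p-value requires the t distribution, which the standard library does not provide, so the sketch stops at the statistic itself.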

Although this may seem complex, once you grasp the idea of creating one or a few filtered datasets, and you understand which tools are available that pertain to your experimental design, mAdb is quite easy to use. Each dataset has a history associated with it, so you can see how it was derived. Also, at any stage in an analysis, if you derive a filtered dataset you wish to work with extensively, you can save it as a new dataset, in effect making the child set into a new parent dataset.

We will be offering both online and in-person training on the mAdb component of ERIC, and there is a set of slides posted on the ERIC portal to assist you in teaching yourself. We hope your microarray research in enteropathogens will benefit from the use of this flexible and powerful microarray analysis system.




Literature Cited in this Issue

  1. Darling, A. C., B. Mau, F. R. Blattner, and N. T. Perna. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14:1394-403.
  2. Parkhill, J., B. W. Wren, N. R. Thomson, R. W. Titball, M. T. Holden, M. B. Prentice, M. Sebaihia, K. D. James, C. Churcher, K. L. Mungall, S. Baker, D. Basham, S. D. Bentley, K. Brooks, A. M. Cerdeno-Tarraga, T. Chillingworth, A. Cronin, R. M. Davies, P. Davis, G. Dougan, T. Feltwell, N. Hamlin, S. Holroyd, K. Jagels, A. V. Karlyshev, S. Leather, S. Moule, P. C. Oyston, M. Quail, K. Rutherford, M. Simmonds, J. Skelton, K. Stevens, S. Whitehead, and B. G. Barrell. 2001. Genome sequence of Yersinia pestis, the causative agent of plague. Nature 413:523-7.
  3. Deng, W., V. Burland, G. Plunkett, 3rd, A. Boutin, G. F. Mayhew, P. Liss, N. T. Perna, D. J. Rose, B. Mau, S. Zhou, D. C. Schwartz, J. D. Fetherston, L. E. Lindler, R. R. Brubaker, G. V. Plano, S. C. Straley, K. A. McDonough, M. L. Nilles, J. S. Matson, F. R. Blattner, and R. D. Perry. 2002. Genome sequence of Yersinia pestis KIM. J Bacteriol 184:4601-11.
  4. Riley, M., T. Abe, M. B. Arnaud, M. K. Berlyn, F. R. Blattner, R. R. Chaudhuri, J. D. Glasner, T. Horiuchi, I. M. Keseler, T. Kosuge, H. Mori, N. T. Perna, G. Plunkett, 3rd, K. E. Rudd, M. H. Serres, G. H. Thomas, N. R. Thomson, D. Wishart, and B. L. Wanner. 2006. Escherichia coli K-12: a cooperatively developed annotation snapshot--2005. Nucleic Acids Res 34:1-9.
  5. Perry, R. D., and J. D. Fetherston (ed.). 2007. The Genus Yersinia: From Genomics to Function. Advances in Experimental Medicine and Biology, vol. 603. Springer, New York.
  6. Glasner, J. D., P. Liss, G. Plunkett, 3rd, A. Darling, T. Prasad, M. Rusch, A. Byrnes, M. Gilson, B. Biehl, F. R. Blattner, and N. T. Perna. 2003. ASAP, a systematic annotation package for community analysis of genomes. Nucleic Acids Res 31:147-51.
  7. Brinkley, C., V. Burland, R. Keller, D. J. Rose, A. T. Boutin, S. A. Klink, F. R. Blattner, and J. B. Kaper. 2006. Nucleotide sequence analysis of the enteropathogenic Escherichia coli adherence factor plasmid pMAR7. Infect Immun 74:5408-13.
  8. Baldini, M. M., J. B. Kaper, M. M. Levine, D. C. Candy, and H. W. Moon. 1983. Plasmid-mediated adhesion in enteropathogenic Escherichia coli. J Pediatr Gastroenterol Nutr 2:534-538.
  9. Tobe, T., G. K. Schoolnik, I. Sohel, V. H. Bustamante, and J. L. Puente. 1996. Cloning and characterization of bfpTVW, genes required for the transcriptional activation of bfpA in enteropathogenic Escherichia coli. Mol Microbiol 21:963-75.
  10. Gomez-Duarte, O. G., and J. B. Kaper. 1995. A plasmid-encoded regulatory region activates chromosomal eaeA expression in enteropathogenic Escherichia coli. Infect Immun 63:1767-76.
  11. Jerse, A. E., and J. B. Kaper. 1991. The eae gene of enteropathogenic Escherichia coli encodes a 94-kilodalton membrane protein, the expression of which is influenced by the EAF plasmid. Infect Immun 59:4302-9.
  12. Perna, N. T., G. F. Mayhew, G. Posfai, S. Elliott, M. S. Donnenberg, J. B. Kaper, and F. R. Blattner. 1998. Molecular evolution of a pathogenicity island from enterohemorrhagic Escherichia coli O157:H7. Infect Immun 66:3810-7.
  13. Bieber, D., S. W. Ramer, C. Y. Wu, W. J. Murray, T. Tobe, R. Fernandez, and G. K. Schoolnik. 1998. Type IV pili, transient bacterial aggregates, and virulence of enteropathogenic Escherichia coli. Science 280:2114-8.
  14. Tobe, T., T. Hayashi, C. G. Han, G. K. Schoolnik, E. Ohtsubo, and C. Sasakawa. 1999. Complete DNA sequence and structural analysis of the enteropathogenic Escherichia coli adherence factor plasmid. Infect Immun 67:5455-62.
  15. Frost, L. S., K. Ippen-Ihler, and R. A. Skurray. 1994. Analysis of the sequence and gene products of the transfer region of the F sex factor. Microbiol Rev 58:162-210.
  16. Reid, S. D., C. J. Herbelin, A. C. Bumbaugh, R. K. Selander, and T. S. Whittam. 2000. Parallel evolution of virulence in pathogenic Escherichia coli. Nature 406:64-7.