Molecular Genetics Resources on the WWW

Compiled by Mary Anne Nelson and Peter Hraber. Please send comments or contributions to us.


Outline:

  1. Before you analyze your unknown sequences
  2. First in silico assignment (the easy one)
  3. Sequence Analysis Sites
  4. RNA Secondary Structure Sites
  5. Visualization Software
  6. Microbial Genomes
  7. Searching Several Genomes at Once
  8. The EcoCyc Database





Before you analyze your unknown sequences

Molecular Genetics class:

Before you analyze your unknown sequences, it might help to consider how another researcher undertook such an analysis. Available on the web:

Sequence to Structure. A short sort-of-tutorial on Secondary Structure Prediction (and one person's cautionary tale of model building...)

http://hornet.mmg.uci.edu/~hjm/projects/biocomp/structure.html

Also, since some of you have been considering proposals involving Neurospora crassa, and the sequences you'll be analyzing are cDNAs from this organism, you might find useful information at this site:

Fungal Genetics Stock Center

http://www.kumc.edu/research/fgsc/

Included is a search engine specific to Neurospora literature. (This is where we get our mutant strains, clones containing Neurospora genes, etc. Also, the cDNA libraries constructed at UNM by the Neurospora Genome Project are available from the FGSC.)

Note: I searched for Neurospora references about telomeres (got two articles) and apoptosis (no articles).

Hope this helps!

Mary Anne



First in silico assignment (the easy one)

Dear Molec. Genetics class,

Please analyze the following cDNA sequence and report your analyses to me (answers to the questions). This is a full-length cDNA sequence. I prefer email answers (let's save some trees), but will accept paper answers.

After you have successfully completed this easy assignment, I will email your individual (and difficult) partial cDNAs for analysis (this is where you'll all have different sequences to analyze). By that time, you should have enough hands-on experience with the programs to attempt your difficult assignments.

cDNA:

TTGGAAGTCAAGAAGGAAGTTGGCAGGCCTTTTTCCCATCATCATCCAACCATCTCATCACGGAACTACCATTCCG ATTGTTTCAGGTTCGAGTTGCCGCTGTTTGTTGGTTTGCTCAGAAATAGGCAAGGTACCAGTGCCGCCAACCATCG AATCGGGTCTTTGCCCTGTTTTTCAGACTACTTTGCGAGACGGAAAAAGGAGTGTATTCGGAGGATAGGCATTCGT TGTCGATCATGAACGGAATCAAGGACCTGGGCCTGGGCTTCTCCAAAGCCTCGACGCGCTTGCATTTCTTTAGTCT GTTGCTTGCTCTTGGCTCATTTGTCTGGGGATACAACGTTGGCGTTCTCGCCTCGGTCCTCGCCCATCCAGGCTTC CGCGAGACCATCCCTGGCTATGGTCCATATAAGCGCGGCCTCATCACATCCATCAACTACCTCGGCACATGGCTGA GTAACATCTTCTTCTCCCGTCCCGCGACGGATCTGCTCGGCCGTCGCTATGCTGCCATGGTGGGCATGTCCGTGTT ATGCGTAGGCCAGGCACTTCAGGCTGCTGCGTCTGGGTCGGTGGCTCTTAGTATGGTGATATTTGGACGGGTTGTT TCTGGCTTGGGCACCGGTATTGTCTCGACGAGTGTGCCATTGTATCAGAGTGAAATTGCCCCGGCCAAACAGCGAG GCAGACTTGTTGTTTTGAACCATGTCGGGTTTGTAGCCGGACTTGCCTCTGGCTTCTGGGTCGGCTATGCCATAAC CTTCTGGAACAGCTCACACGGCTACCTCAGCGGATGGCGCCTGTCCGTGTCCCTCTCCTTCATCCCCGCCCTCATC TTCTTCGTCGGCCTCCCCTTCCTGCACGAGTCTCCGCGGTGGCTCGTCGAGCACGGCCGCCCTGACGAAGCCCTCA AAGCCCTGCAGTTCTACCGCGAAGGATCCTTCACCCCTTCGCAAATCCAAAGCGAACTGACAGACATCAAGCGCAA TGTGTGCGCCTACCAGGCCACTAGCCTGAAGAAATGGACCTCCCTTTTCACCAACCCCAGCTTGTTCACGCGCCTG TGGCGAGCTGCACTGCTCCATTTCATGGCGCAGATGTGCGGCGCTACGGCCATGAAGTATTATCTGCCGGACCTGT TTAGGGTGTTGGGACTGAGCCCGCGCGTGTCGCTGCTGGCGGGCGGGATCGAGAGCACCCTGAAGATTGGGTGTAC GGTGTTGGAGATGTTTGTTATTGATAAGGTGGGAAGGAGGATGACACTGGCTGTTGGGGCGGGGATTATGGCTTTT GCATTGTTGATCAACGGTGCCCTCCCCCTCGCTTACCCCAACAACACCAACCGCGCCTCAGACTACACCTGCGTCG TCTTCATCTTCATCTACTCGCTTGGCTACAGCATGGGTTTCGGCCCGGCAGCTTGGGTCTACGGGTCTGAGATGTT CCCCACTGCCGCTCGCGCGCGTGGTTTGAGCTTCGCTGCTTCTGGTGGCGCGGTCGGGTCAATCATTGTTTCCCAA CTGTGGCCTATCGGGATCGCAGAGCTTGGCTCGAAGATTTACTTCTTCTTCATGGCGGTCAATTTGGCGTGCGTAC CGATTATCTTCTTGCTCTATCCTGAGACCAAGGGACGTCCGCTGGAGGATATGGAGGTGCTGTTTGGCGGGTATGA GGGTGGAACTCCGTCAACTACGTCTTTGCTTTTGGCTGATGAGAGGGAAGATGGGAACGAGGAGGACGAGGAGAAT GAGACACTGGGTAGGCCGTTGCTTGGTGATGGAAGAGGCATGGCAAGGTAGGTCTTGTTTCTCCATAATTTGTTAT GCATGTCATCCGATTATGATTCTGAATGGGCTGCGAGACCCAGGAGTCGCAAGCGTATGTTTGAATGAATAAAGCC 
TTGTGTACATGGAGGCTGTTCTCGACGCCAAAAAAAGAGACTACAGTTGAAGCATGCAGAAACAACTTGTAGTAAG AAGACAAAAGGCTGCTGCCACTCCGCCACGACTTTAGTAAAAGAGCTCAACGGCATCGAAGATCTGGATCCAAAAA AAAAAAAAAAAAA

Analyze this cDNA and report the following to me:

  1. Identify the longest ORF; what is the length in amino acids?
  2. In which reading frame is this ORF?
  3. For the predicted protein, what is the MW? What is the pI?
  4. Is there anything noteworthy about the amino acid composition?
  5. Is this a soluble or membrane-spanning protein?
  6. Define any signal sequences and/or membrane-spanning regions.
  7. Identify any conserved motifs, and tell how significant you feel they are.
  8. Run BLAST using the cDNA sequence as query sequence and report your results. Which BLAST program did you use?
  9. Run BLAST using the predicted protein as query sequence and report your results. Which BLAST program did you use?
  10. Summarize your conclusions about this cDNA.
  11. Other comments/analyses welcome!
Mary Anne
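
For anyone who wants to cross-check the web tools on questions 1 and 2, here is a minimal Python sketch of a six-frame ORF search. It is not the method used by any of the sites listed below. It assumes the standard genetic code, defines an ORF as running from the first ATG to the next in-frame stop codon, and numbers frames +1 to +3 on the given strand and -1 to -3 on the reverse complement. The function names (longest_orf, translate, reverse_complement) and the toy sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
# Assumption: no alternative (e.g. mitochondrial) code is needed.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; codons with unknown bases become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf(dna):
    """Find the longest ATG-to-stop ORF in all six reading frames."""
    seq = "".join(dna.split()).upper()  # drop spaces and line breaks
    best = {"length_aa": 0, "frame": None, "protein": ""}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Split the translation at stop codons. The last piece is not
            # followed by a stop, so it is skipped.
            for segment in translate(strand[offset:]).split("*")[:-1]:
                start = segment.find("M")
                if start == -1:
                    continue
                protein = segment[start:]
                if len(protein) > best["length_aa"]:
                    best = {
                        "length_aa": len(protein),
                        "frame": f"{sign}{offset + 1}",
                        "protein": protein,
                    }
    return best


# Toy example (not the assignment sequence): ATG AAA TTT GGG TAA in frame +3.
toy = "CCATGAAATTTGGGTAACC"
print(longest_orf(toy))
```

Pasting the cDNA above into longest_orf works as-is, because spaces and line breaks are removed before translation. The reported length counts the initial methionine and excludes the stop codon; other tools may count differently.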



Sequence Analysis Sites

This list is not comprehensive, and available tools change daily. Some disappear (often as maintenance becomes too time-consuming and/or expensive) and new utilities appear. Please share any useful tools that you find with the rest of the class!

ExPASy Tools Menu - Sequence Analysis Tools

ExPASy leads you to a number of useful sites including:

For Protein Identification

  • DNA -> Protein
  • Similarity searches
  • Pattern and profile searches
  • Primary sequence analysis
  • Secondary structure prediction
  • Tertiary structure
  • Transmembrane regions detection
  • Alignment
    • Binary
    • Multiple



Pedro's Biomolecular Research Tools

This site leads you to a number of useful sites including:

Molecular Biology Search and Analysis



BCM Search Launcher

Human Genome Center, Baylor College of Medicine, Houston TX


The BCM Search Launcher is an ongoing project to organize, by function, the molecular biology-related search and analysis services available on the WWW. It provides a single point of entry for related searches (e.g., a single page for launching protein sequence searches using standard parameters).

Current Launch Pages



EMBL Argos Group Biocomputing

Services and programs provided by the Argos Group:

Individual URLs (that can't be reached from the sites listed above)



Descriptions of selected individual programs



RNA Secondary Structure sites on the web



Visualization Software



Microbial Genomes

Here is a handful of URLs that I gathered from the small genomes meeting. I will post more to the list as I work through them in the coming week. You can click on the links below, or pick them up from the list archive.

WWW interface for searching several Genomes at once

(NEW--Includes E. coli!)

The E. coli genome sequence was completed in the last week of January. At BMERC, we have added the E. coli genome to our genome BLAST page. The E. coli genome is of particular interest to molecular biologists because the vast majority of functions assigned to E. coli genes were determined by experiment. Most other genomes have their functions assigned by homology.

The complete and nearly complete genomes of Saccharomyces cerevisiae, Methanococcus jannaschii, E. coli, and Bacillus subtilis are now available. Our WWW BLAST interface allows you to search with your sequence, using blastp, against two subsets of the available putative open reading frames of these genomes. The two search options available from this page are a Search Against Annotated ORFs and a Search Against Unannotated ORFs.

Your output will consist of detailed references for the significant BLAST matches and the raw BLAST output. The detailed references consist of a reference key, the annotation where available, the protein sequence, and the DNA for the ORFs.

We are also providing a tool for function keyword searching. We have built a table that relates E. coli to the other three main genomes using BLAST with a Karlin-Altschul score of < 10E-17. This keyword searching tool will print a list of every sequence identifier that is close to the E. coli gene of the cluster where the keyword is found.

  • BLAST Genome Analysis page
  • E. coli functional search
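
To make the keyword tool described above more concrete, here is a rough Python sketch of the idea. It is not BMERC's implementation. It assumes the table is a list of (E. coli gene, matching sequence identifier, score) rows, that the keyword is matched against the E. coli annotation only, and that the cutoff is 1e-17. All gene names, identifiers, annotations, and scores below are invented placeholders.

```python
SCORE_CUTOFF = 1e-17  # assumed reading of the "< 10E-17" threshold


def build_clusters(hits, cutoff=SCORE_CUTOFF):
    """Group matching sequence identifiers under each E. coli gene.

    hits is an iterable of (ecoli_gene, other_id, score) rows; only rows
    whose score is below the cutoff are kept.
    """
    clusters = {}
    for ecoli_gene, other_id, score in hits:
        if score < cutoff:
            clusters.setdefault(ecoli_gene, []).append(other_id)
    return clusters


def keyword_search(keyword, annotations, clusters):
    """Return {gene: [identifiers]} for genes whose annotation has the keyword."""
    wanted = keyword.lower()
    return {
        gene: clusters.get(gene, [])
        for gene, text in annotations.items()
        if wanted in text.lower()
    }


# Hypothetical placeholder data, for illustration only.
example_hits = [
    ("ecoA", "yeast_1", 1e-30),
    ("ecoA", "bsub_7", 1e-5),   # too weak, dropped by the cutoff
    ("ecoB", "mjan_3", 1e-20),
]
example_annotations = {
    "ecoA": "DNA polymerase subunit",
    "ecoB": "sugar transporter",
}

example_clusters = build_clusters(example_hits)
print(keyword_search("transporter", example_annotations, example_clusters))
```

The cutoff is a parameter because the threshold notation in the announcement can be read more than one way.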

EcoCyc database

Announcing EcoCyc Version 3.7, released March 7, 1997

EcoCyc is a database of E. coli genes and metabolic pathways that runs on Unix workstations and through the WWW (see http://www.ai.sri.com/ecocyc/ecocyc.html). One use of EcoCyc is the analysis and annotation of microbial genomes by analogy to E. coli. Its graphical user interface creates drawings of metabolic pathways, of individual reactions, and of the E. coli genomic map. Users can call up objects through a variety of queries (such as retrieving an enzyme by a substring search), and then navigate to related entities shown in the resulting display window. For example, a user could zoom in on a region of the genetic map, click on a gene to obtain detailed information about it, and then navigate to the enzyme product of the gene, and then to the metabolic pathway containing the enzyme. Metabolic pathway drawings are produced automatically, and can be drawn in several styles, such as with compound structures present or absent. EcoCyc contains extensive information about each enzyme, including its cofactors, activators and inhibitors (qualified by type), subunit composition, substrate specificity, and molecular weight. Individual values in the knowledge base are extensively annotated with citations to the literature, as are comment fields.

Changes introduced in this version include:

Current knowledge base size:

:::: EcoCyc KB statistics on Thu Feb 20, 1997 ::::

Reactions: 3241 total; 736 occur in ECOLI; 165 have no EC

Polypeptides: 1029; Protein complexes: 419; Enzymes: 731

Genes: 3025 (2571 are mapped)

Base pathways: 131; Superpathways: 26

Compounds: 1294 (964 have structures)