Molecular Genetics Resources on the WWW
Compiled by Mary Anne Nelson and Peter Hraber. Please send comments or contributions to us.
Outline:
- Before you analyze your unknown sequences
- First in silico assignment (the easy one)
- Sequence Analysis Sites
- RNA Secondary Structure Sites
- Visualization Software
- Microbial Genomes
- Searching Several Genomes at Once
- The EcoCyc Database
Before you analyze your unknown sequences
Molecular Genetics class:
Before you analyze your unknown sequences, it might help to
consider how another researcher undertook such an analysis. Available on
the web:
Sequence to Structure. A short sort-of-tutorial on Secondary
Structure Prediction (and one person's cautionary tale of model
building...)
http://hornet.mmg.uci.edu/~hjm/projects/biocomp/structure.html
Also, since some of you have been considering proposals involving
Neurospora crassa, and the sequences you'll be analyzing are cDNAs from
this organism, you might find useful information at this site:
Fungal Genetics Stock Center
http://www.kumc.edu/research/fgsc/
Included is a search engine specific to Neurospora literature. (This is
where we get our mutant strains, clones containing Neurospora genes, etc.
Also, the cDNA libraries constructed at UNM by the Neurospora Genome
Project are available from the FGSC.)
Note: I searched for Neurospora references about telomeres (got two
articles) and apoptosis (no articles).
Hope this helps!
Mary Anne
First in silico assignment (the easy one)
Dear Molec. Genetics class,
Please analyze the following cDNA sequence and report your analyses
to me (answers to the questions). This is a full-length cDNA sequence. I
prefer email answers (let's save some trees), but will accept paper
answers.
After you have successfully completed this easy assignment, I will
email your individual (and difficult) partial cDNAs for analysis (this is
where you'll all have different sequences to analyze). By that time, you
should have enough hands-on experience with the programs to attempt your
difficult assignments.
cDNA:
TTGGAAGTCAAGAAGGAAGTTGGCAGGCCTTTTTCCCATCATCATCCAACCATCTCATCACGGAACTACCATTCCG
ATTGTTTCAGGTTCGAGTTGCCGCTGTTTGTTGGTTTGCTCAGAAATAGGCAAGGTACCAGTGCCGCCAACCATCG
AATCGGGTCTTTGCCCTGTTTTTCAGACTACTTTGCGAGACGGAAAAAGGAGTGTATTCGGAGGATAGGCATTCGT
TGTCGATCATGAACGGAATCAAGGACCTGGGCCTGGGCTTCTCCAAAGCCTCGACGCGCTTGCATTTCTTTAGTCT
GTTGCTTGCTCTTGGCTCATTTGTCTGGGGATACAACGTTGGCGTTCTCGCCTCGGTCCTCGCCCATCCAGGCTTC
CGCGAGACCATCCCTGGCTATGGTCCATATAAGCGCGGCCTCATCACATCCATCAACTACCTCGGCACATGGCTGA
GTAACATCTTCTTCTCCCGTCCCGCGACGGATCTGCTCGGCCGTCGCTATGCTGCCATGGTGGGCATGTCCGTGTT
ATGCGTAGGCCAGGCACTTCAGGCTGCTGCGTCTGGGTCGGTGGCTCTTAGTATGGTGATATTTGGACGGGTTGTT
TCTGGCTTGGGCACCGGTATTGTCTCGACGAGTGTGCCATTGTATCAGAGTGAAATTGCCCCGGCCAAACAGCGAG
GCAGACTTGTTGTTTTGAACCATGTCGGGTTTGTAGCCGGACTTGCCTCTGGCTTCTGGGTCGGCTATGCCATAAC
CTTCTGGAACAGCTCACACGGCTACCTCAGCGGATGGCGCCTGTCCGTGTCCCTCTCCTTCATCCCCGCCCTCATC
TTCTTCGTCGGCCTCCCCTTCCTGCACGAGTCTCCGCGGTGGCTCGTCGAGCACGGCCGCCCTGACGAAGCCCTCA
AAGCCCTGCAGTTCTACCGCGAAGGATCCTTCACCCCTTCGCAAATCCAAAGCGAACTGACAGACATCAAGCGCAA
TGTGTGCGCCTACCAGGCCACTAGCCTGAAGAAATGGACCTCCCTTTTCACCAACCCCAGCTTGTTCACGCGCCTG
TGGCGAGCTGCACTGCTCCATTTCATGGCGCAGATGTGCGGCGCTACGGCCATGAAGTATTATCTGCCGGACCTGT
TTAGGGTGTTGGGACTGAGCCCGCGCGTGTCGCTGCTGGCGGGCGGGATCGAGAGCACCCTGAAGATTGGGTGTAC
GGTGTTGGAGATGTTTGTTATTGATAAGGTGGGAAGGAGGATGACACTGGCTGTTGGGGCGGGGATTATGGCTTTT
GCATTGTTGATCAACGGTGCCCTCCCCCTCGCTTACCCCAACAACACCAACCGCGCCTCAGACTACACCTGCGTCG
TCTTCATCTTCATCTACTCGCTTGGCTACAGCATGGGTTTCGGCCCGGCAGCTTGGGTCTACGGGTCTGAGATGTT
CCCCACTGCCGCTCGCGCGCGTGGTTTGAGCTTCGCTGCTTCTGGTGGCGCGGTCGGGTCAATCATTGTTTCCCAA
CTGTGGCCTATCGGGATCGCAGAGCTTGGCTCGAAGATTTACTTCTTCTTCATGGCGGTCAATTTGGCGTGCGTAC
CGATTATCTTCTTGCTCTATCCTGAGACCAAGGGACGTCCGCTGGAGGATATGGAGGTGCTGTTTGGCGGGTATGA
GGGTGGAACTCCGTCAACTACGTCTTTGCTTTTGGCTGATGAGAGGGAAGATGGGAACGAGGAGGACGAGGAGAAT
GAGACACTGGGTAGGCCGTTGCTTGGTGATGGAAGAGGCATGGCAAGGTAGGTCTTGTTTCTCCATAATTTGTTAT
GCATGTCATCCGATTATGATTCTGAATGGGCTGCGAGACCCAGGAGTCGCAAGCGTATGTTTGAATGAATAAAGCC
TTGTGTACATGGAGGCTGTTCTCGACGCCAAAAAAAGAGACTACAGTTGAAGCATGCAGAAACAACTTGTAGTAAG
AAGACAAAAGGCTGCTGCCACTCCGCCACGACTTTAGTAAAAGAGCTCAACGGCATCGAAGATCTGGATCCAAAAA
AAAAAAAAAAAAA
Analyze this cDNA and report the following to me:
- Identify the longest ORF; what is the length in amino acids?
- In which reading frame is this ORF?
- For the predicted protein, What is the MW? What is the pI?
- Is there anything noteworthy about the amino acid composition?
- Is this a soluble or membrane-spanning protein?
- Define any signal sequences and/or membrane-spanning regions.
- Identify any conserved motifs, and tell how significant you feel they are.
- Run BLAST using the cDNA sequence as query sequence and report your results. Which BLAST program did you use?
- Run BLAST using the predicted protein as query sequence and report your results. Which BLAST program did you use?
- Summarize your conclusions about this cDNA.
- Other comments/analyses welcome!
Mary Anne
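For the ORF questions above, a minimal Python sketch of a six-frame search may help you check what the web tools report. It assumes the standard genetic code and treats an ORF as running from the first ATG after a stop codon to the next in-frame stop. The function names are illustrative and not part of any tool listed here, and individual web tools may define ORFs differently.

```python
import re

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def clean(dna):
    """Remove whitespace and line breaks, and convert to upper case."""
    return "".join(dna.split()).upper()


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return clean(dna).translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Translate one frame; unknown codons become 'X', stops become '*'."""
    dna = clean(dna)
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(offset, len(dna) - 2, 3)
    )


def longest_orf(dna):
    """Return (frame, start, protein) for the longest ATG-to-stop ORF.

    frame is '+1'..'+3' or '-1'..'-3'; start is the 1-based nucleotide
    position of the ATG on the strand that was translated.
    """
    best = (None, 0, "")
    for sign, strand in (("+", clean(dna)), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(strand, offset)
            # Each match starts at a Met and ends just before a stop codon.
            for m in re.finditer(r"M[^*]*(?=\*)", protein):
                if len(m.group()) > len(best[2]):
                    start = offset + 3 * m.start() + 1
                    best = (sign + str(offset + 1), start, m.group())
    return best
```

Pass the cDNA above to longest_orf() as one string; line breaks are stripped before translation. The length of the returned protein string is the ORF length in amino acids.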
Sequence Analysis Sites
This list is not comprehensive, and available tools change daily. Some disappear (often as maintenance becomes too time-consuming and/or expensive) and new utilities appear. Please share any useful tools that you find with the rest of the class!
ExPASy Tools Menu - Sequence Analysis Tools
ExPASy leads you to a number of useful sites including:
For Protein Identification
- Swiss-Shop: a sequence alerting system for SWISS-PROT that allows you to automatically obtain (by email) new sequence entries relevant to your field(s) of interest.
- AACompIdent: Identify a protein by its amino acid composition. It searches SWISS-PROT for the proteins whose amino acid compositions are closest to the composition given.
- AACompSim: Compare the amino acid composition of a SWISS-PROT entry with all other entries.
- TagIdent: Get the SWISS-PROT proteins closest to a given pI and Mw and identify proteins with a sequence tag (previously GuessProt).
- PeptideMass: Calculate masses of peptides and their post-translational modifications for a SWISS-PROT entry or for a user sequence.
- Compute pI/Mw - Compute the theoretical pI and Mw from a SWISS-PROT entry or for a user sequence.
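To make the Mw and pI calculations less of a black box, here is a small Python sketch of both. The average residue masses are standard values. The pKa values are one commonly used set, chosen here only for illustration; the Compute pI/Mw tool uses its own values, so expect its pI to differ somewhat. All names in the code are hypothetical.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Illustrative pKa values (an assumption; other tools use other sets).
PK_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PK_NTERM, PK_CTERM = 8.6, 3.6


def molecular_weight(protein):
    """Sum of residue masses plus one water for the free termini."""
    return sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER


def net_charge(protein, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    protein = protein.upper()
    positive = 1 / (1 + 10 ** (ph - PK_NTERM)) + sum(
        1 / (1 + 10 ** (ph - PK_POSITIVE[aa]))
        for aa in protein if aa in PK_POSITIVE
    )
    negative = 1 / (1 + 10 ** (PK_CTERM - ph)) + sum(
        1 / (1 + 10 ** (PK_NEGATIVE[aa] - ph))
        for aa in protein if aa in PK_NEGATIVE
    )
    return positive - negative


def isoelectric_point(protein, tolerance=1e-4):
    """Find the pH where the net charge is zero by bisection on 0..14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(protein, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2
```

Bisection works here because the net charge only decreases as the pH rises, so there is a single crossing point.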
DNA -> Protein
- Translate: Translate a nucleotide sequence to a protein
- Protein machine: Nucleotide to protein translation at EBI
Similarity searches
- BLAST: Interface to Basic Local Alignment Search Tool: at NCBI - at EPFL
- BLITZ: EBI's ultra-fast protein database searches using MPsearch
- Bic: Weizmann's ultra-fast rigorous (Smith/Waterman) similarity searches using the Bioccelerator
- BioSCAN: Biological Sequence Comparative Analysis at U. North Carolina
- FDF-SW: Smith/Waterman type searches on Paracel's Fast Data Finder (FDF)
- PropSearch: Search for structural homologs using a 'properties' approach
Pattern and profile searches
- ScanProsite: Scan a sequence against PROSITE or a pattern against SWISS-PROT
- ProfileScan: Scan a sequence against the profile entries in PROSITE
- Bipartite NLS Locator: Detection of bipartite nuclear localization sequences
- FPAT: Regular expression searches in protein databases
Primary sequence analysis
- ProtParam: Physico-chemical parameters of a protein sequence (composition, extinction coefficient, etc.)
- ProtScale: Amino acid scale representation (Hydrophobicity, other conformational parameters, etc.)
- SAPS: Statistical analysis of protein sequences
- PSORT: Prediction of protein sorting signals and localization sites
- Signalp: Prediction of the signal peptide cleavage sites
- NetOglyc: Prediction of type O-glycosylation sites in mammalian proteins
- Coils: Prediction of coiled coil regions in proteins (Lupas's method)
- Paircoil: Prediction of coiled coil regions in proteins (Berger's method)
- REPRO: Recognition of protein sequence repeats at EMBL
- Protein Colourer: Tool for colouring your amino acid sequence
- RandSeq: Random protein sequence generator
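As a rough illustration of the scale-based profiles that a tool like ProtScale draws, the sketch below computes a sliding-window hydropathy average using the Kyte-Doolittle scale. The 19-residue window and the 1.6 cutoff follow the commonly cited Kyte-Doolittle rule of thumb for membrane-spanning segments; they are starting points and not a definitive test. Function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_windows(protein, window=19, threshold=1.6):
    """1-based start positions of windows at or above the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i + 1 for i, value in enumerate(profile) if value >= threshold]
```

Runs of consecutive start positions from hydrophobic_windows() suggest candidate membrane-spanning regions, which you can then compare with the output of the dedicated transmembrane predictors listed on this page.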
Secondary structure prediction
- AntheProt: Institute of Biology and Chemistry of Proteins (IBCP) / Lyon
NB: can get to CLUSTAL, for alignment of multiple proteins, from this link
- BCM PSSP: Baylor College of Medicine
- GOR: Garnier, Osguthorpe and Robson (GOR) secondary structure prediction method (at SBDS)
- nnPredict: University of California at San Francisco (UCSF) (secondary structure prediction)
- PredictProtein: PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from EMBL
- PREDATOR: Protein secondary structure prediction from single sequence at EMBL (Argos' group)
- PSA: BioMolecular Engineering Research Center (BMERC) / Boston
- SSPRED: Protein secondary structure prediction from aligned sequences at EMBL (Argos' group)
Tertiary structure
- Swiss-Model: an automated knowledge-based protein modelling server
- Swiss-PdbViewer: a program to analyse and superimpose protein 3D structures
Transmembrane regions detection
- SOSUI: Prediction of transmembrane regions from TUAT (Tokyo Univ. of Agriculture & Tech.)
- TMpred: Prediction of transmembrane regions and protein orientation at ISREC
- TMAP: Transmembrane detection based on multiple sequence alignment at EMBL
Alignment
Binary
- SIM + LALNVIEW - Alignment of two protein sequences with SIM, results can be viewed with LALNVIEW
Multiple
- MSA: at Washington University
- Align query: at EERIE
- Multalin: at I.N.R.A.
- AMAS: Analyse Multiply Aligned Sequences
- Bork's alignment tools: Various tools to enhance the results of multiple alignments (including consensus building).
This site leads you to a number of useful sites including:
Molecular Biology Search and Analysis
- AA Analysis: Protein Identification in SwissProt and PIR using Amino Acid Composition at EMBL-Heidelberg W3. This is the PROPSEARCH program (see description below).
- AA CompIdent: See description under ExPASy.
- ALIGN: Optimal Global Alignment of Two Sequences
- BCM Search Launcher at Baylor College of Medicine (see below)
- Biologists Search Palette: A collection of the most useful search engines for biological databases on the internet, accessible through either http (WWW) or gopher. You can search Medline, the protein or DNA databases at NCBI, etc.
- BLOCKS: Database of Highly Conserved Regions in Proteins (see description below)
- Codon Usage: Analysis of Different ORFs in a Gene Sequence
- dbEST: Database of Expressed Sequence Tags at NCBI. dbEST (Nature Genetics 4:332-3;1993) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms. You can compare your unknown cDNA sequence with these sequences using this site.
- Dot Plot: Compare a DNA Sequence with Itself
- FASTA: Compare a Nucleic Acid Sequence to Nucleotide Sequence Databases
- Gene Finder: Predict Gene Structure, Internal Exons and Splicing Sites in DNA and Exon-Exon Junction in cDNA at Baylor College of Medicine
- GenoBase at NIH: GenoBase Database at the National Institutes of Health (NIH). Through a system of tables and various query capabilities, this server provides access to an NIH copy of GenoBase, which is a Prolog-based, object-oriented molecular biology database. This installation incorporates and links the contents of several large datasets, including EMBL and Swiss-Prot.
- GenQuest: Search Against Protein Databases with the Q server at Johns Hopkins. You can perform BLAST, FASTA or Smith-Waterman searches.
- Grail: Analysis of the Protein Coding Potential of a DNA Sequence
- LALIGN: Calculates the N-Best Local Alignments Between Two Sequences
- LFASTA: Local Similarity Searches Between Two Sequences Showing Local Alignments
- MOTIF: Search Patterns in Protein Sequences
- PHD: Predict Protein Secondary Structure
- PredictProtein: Predict Protein Secondary Structure
- PROSITE: Protein Sites and Patterns Database (see description below)
- SSPRED: Predict the Secondary Structure of Proteins at EMBL-Heidelberg
- Swiss-Shop: Automated SwissProt and Prosite Information Queries at ExPASy, Switzerland
- TFASTA: Match a Protein Sequence Against All Six Frames of GenBank Sequences
- TMAP: Identification of Transmembrane Segments on a Protein Sequence
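The Dot Plot entry in the list above compares a sequence with itself. A minimal sketch of the idea, assuming exact matches of short words (real dot plot programs typically add scoring windows and filtering), could look like this. The names are illustrative only.

```python
def dot_matches(seq_a, seq_b, word=4):
    """Return 0-based (i, j) pairs where words of length `word` are identical."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    return [
        (i, j)
        for i in range(len(seq_a) - word + 1)
        for j in index.get(seq_a[i:i + word], [])
    ]


def render(seq_a, seq_b, word=4):
    """Draw the matches as text: '*' for a match, '.' otherwise."""
    hits = set(dot_matches(seq_a, seq_b, word))
    columns = range(len(seq_b) - word + 1)
    return "\n".join(
        "".join("*" if (i, j) in hits else "." for j in columns)
        for i in range(len(seq_a) - word + 1)
    )
```

When a sequence is compared with itself, the main diagonal is always filled; any additional diagonals point to internal repeats.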
Human Genome Center, Baylor College of Medicine, Houston TX
The BCM Search Launcher is an on-going project to organize molecular biology-related search and analysis services available on the WWW by function by providing a single point-of-entry for related searches (e.g., a single page for launching protein sequence searches using standard parameters).
Current Launch Pages
- General protein sequence/pattern searches: includes BLASTP+BEAUTY, TBLASTN, BEAUTY, FASTA-SWAP/EC Pattern DB, PROSITE, BLOCKS, BLASTP/SBASE annotated domains, BLITZ/SwissProt, SSEARCH
- Species-Specific protein sequence searches
- Nucleic acid sequence searches: includes BLASTX+BEAUTY/nr protein (BLASTX with RepeatMasker and BEAUTY post-processing that adds annotated domain information), BLASTX/S. cerevisiae, BLASTN/nr dna - with RepeatMasker, Entrez & SRS links, BLASTN/S. cerevisiae Genome, BLASTN/dbest, TBLASTX/dbest, BLASTN/month, BEAUTY-X/CRSeqAnnot (Seq family and domain information added to BLASTX search of BCM CRSeqAnnot db), FASTA/SwissProt (Fasta searches of a sixframe translation against SwissProt), BLASTX/C. elegans, TBLASTX/C. elegans EST (6-frame translation vs. translated C. elegans EST)
- Multiple sequence alignments: includes ClustalW 1.6 (DNA/Protein), MAP (DNA/Protein), PIMA 1.4 (Protein only; Pattern-Induced (local) Multiple Alignment), MSA 2.1 (Protein only; Near-optimal sum-of-pairs global), BLOCK MAKER (Protein only; Finds conserved blocks in seq sets), ClustalW 1.6 (DNA/Protein; Global progressive), MEME 2.0 (Protein/DNA; Multiple EM for Motif Elicitation), Match-Box (Protein only; Blocks alignment with reliability)
- Pairwise sequence alignments: includes SIM (Protein only), ALIGN (optimal global alignment with no short-cuts), LALIGN (calculates the N-best local alignments), LFASTA (local similarity searches showing local alignments)
- Gene feature searches: includes
- Exon, Intron, and Gene Model Prediction: GRAIL-1.3 (exon prediction from genomic sequence), FGENEH (gene model construction from human genomic sequence), Genie (gene finding based on Hidden Markov Models), FEXH (human 5', internal, and 3' exon prediction), HEXON (search for human potential internal exons), HSPL (search for potential human splice sites), NNSSP (human splice site prediction by neural network), RNASPL (search for exon-exon junction positions in human cDNA), POLYAH (recognition of 3'-end cleavage and polyadenylation regions)
- Promoter and Transcription Factor Binding Site Prediction: TESS (search for transcription factor binding sites), TSSG (human PolII recognition using the promotor.dat database), TSSW (Human PolII recognition using the TRANSFAC database), NNPP/Eukaryotic (eukaryotic promoter prediction by neural network), NNPP / Prokaryotic (prokaryotic promoter prediction by neural network), MatInspector/TRANSFAC (transcription factor binding sites), POL3SCAN (eukaryotic PolIII recognition for tRNAs)
- Open Reading Frame Identification: ORF ID / Eukaryotic (any start codon and minimum 25 a.a.), ORF ID / Prokaryotic (Met start codon and minimum 100 a.a.)
- Sequence utilities: ReadSeq (converts nucleic acid/protein sequences to FASTA format), RepeatMasker (identify and mask repeats in DNA sequences), Primer Selection (PCR primer selection), WebCutter (restriction maps using enzymes w/ sites >= 6 bases), 6 Frame Translation (translates a nucleic acid sequence in 6 frames), Reverse Complement (reverse complements a nucleic acid sequence), HBR (finds E.coli contamination in human sequences)
- Protein secondary structure prediction: Coils (prediction of coiled coil regions), nnPredict (uses a 2 layer neural network), PSSP/SSP (segment-oriented prediction), PSSP/NNSSP (nearest-neighbor prediction), SAPS (statistical analysis of protein sequences), TMpred (transmembrane region and orientation prediction), Paircoil (coiled coil regions of pairwise residue correlations), PHDsec (profile network method), PSA (for single domain globular proteins), SOPM (self optimized prediction method), SSPRED (with residue exchange statistics), Swiss-Model (from alignment to crystallographic data)
- BCM Gene Finder
- BCM protein secondary structure prediction
- The Biologist's Control Panel
- YAC data searches
Services and programs provided by the Argos Group:
- TMAP: the prediction of transmembrane segments in proteins using the extended information found in multiple sequence alignments of related proteins
- REPRO: simultaneous identification of different types of repeats within a protein sequence using graph-theoretical methods
- PREDATOR: protein secondary structure prediction from multiple sequences
- SRSWWW: Network Browser for Databanks in Molecular Biology
- SIMPA 96: a secondary structure prediction program
Individual URLs (that can't be reached from the sites listed above)
Descriptions of selected individual programs
- Beauty: Blast Enhanced Alignment Utility; adds domain info to BLAST output
- BLAST: Basic Local Alignment Search Tool
BLAST performs fast database searching combined with rigorous statistics for judging the significance of matches. The BLAST algorithm is a heuristic for finding ungapped, locally optimal sequence alignments. Five BLAST programs search many different combinations of query and database sequences. The BLAST algorithm is described in S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, J. Mol. Biol. 215, 403-10 (1990).
- blastp: compares an amino acid query sequence against a protein sequence database
- blastn: compares a nucleotide query sequence against a nucleotide sequence database
- blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database
- tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
- tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx program cannot be used with the nr database on the BLAST Web page.
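The five program descriptions above amount to a small lookup table keyed on the type of the query and the type of the database. The sketch below encodes only what is stated in that list; the function name and the compare_as_protein flag are hypothetical conveniences, not options of any BLAST server.

```python
# (query type, database type) -> program, as described in the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast(query, database, compare_as_protein=False):
    """Pick a BLAST program for the given query and database types.

    compare_as_protein only matters for nucleotide-vs-nucleotide searches,
    where it selects tblastx (six-frame translations on both sides).
    """
    key = (query, database)
    if key not in BLAST_PROGRAMS:
        raise ValueError("types must be 'protein' or 'nucleotide'")
    if key == ("nucleotide", "nucleotide") and compare_as_protein:
        return "tblastx"
    return BLAST_PROGRAMS[key]
```

For example, a cDNA query against a protein database maps to blastx, and the predicted protein against a protein database maps to blastp.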
- BLITZ
BLITZ is an automatic electronic mail server for the MPsrch program of Shane Sturrock and John Collins, Biocomputing Research Unit, University of Edinburgh, Scotland. MPsrch allows you to perform sensitive and extremely fast comparisons of your protein sequences against the Swiss-Prot protein sequence database using the Smith and Waterman best local similarity algorithm.
- Blocks WWW Server
A Fred Hutchinson Cancer Research Center WWW server for the detection and verification of protein sequence homology in Seattle, Washington.
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Block Searcher, Get Blocks and Block Maker are aids to detection and verification of protein sequence homology. They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.
The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.
The blocks created by Block Maker are created in the same manner as the blocks in the BLOCKS database but with sequences provided by the user. Results are reported in a multiple sequence alignment format without calibration and in the standard BLOCK format for searching.
The Prints Database in Blocks Format: The Blocks WWW Server optionally searches a version of Terri Attwood's Prints Database in Blocks format using the Blimps searching program. This service is not available from the Blocks email server. Because Prints includes families not represented in the Blocks Database, we recommend searching both databases.
Ref.: S. Henikoff & J. G. Henikoff, "Protein family classification based on searching a database of blocks", Genomics 19:97-107 (1994).
- FASTA
The FASTA program was developed by Pearson and Lipman. It allows you to perform fast and sensitive comparisons of your nucleic acid or protein sequences against various databases.
Reference: Pearson, W.R. and Lipman, D.J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444-2448 (1988).
- GRAIL
GRAIL is a suite of tools designed to provide analysis and putative annotation of DNA sequences both interactively and through the use of automated computation. The coding recognition portion of the system uses a neural network which combines a series of coding prediction algorithms. There are three basic versions of this neural network, GRAIL 1, GRAIL 1a and GRAIL 2:
GRAIL 1 uses a neural network described in PNAS 88, 11261-11265, which recognizes coding potential within a fixed size (100 base) window. It evaluates coding potential without looking for additional features (information such as splice junctions, etc).
GRAIL 1a is an updated version of GRAIL 1. It uses a fixed-length window to locate the potential coding regions and then evaluates a number of discrete candidates of different lengths around each potential coding region, using information from the two 60-base regions adjacent to that coding region, to find the "best" boundaries for that coding region.
GRAIL 2 uses variable-length windows tailored to each potential exon candidate, defined as an open reading frame bounded by a pair of start/donor, acceptor/donor or acceptor/stop sites. This scheme facilitates the use of more genomic context information (splice junctions, translation starts, non-coding scores of 60-base regions on either side of a putative exon) in the exon recognition process. GRAIL 2 is therefore not appropriate for sequences without genomic context (when the regions adjacent to an exon are not present).
All three systems have been trained to recognize coding regions in human DNA sequences, although they also work well on a number of other organisms, particularly other mammals.
- PROPSEARCH
The EMBL-Heidelberg Amino Acid Analysis Server Uwe Hobohm, Tony Houthaeve and Chris Sander: "Amino acid analysis and protein database compositional search as a rapid and inexpensive method to identify proteins", Analytical Biochemistry 222 (1994) 202
- Prosite: Dictionary of protein sites and patterns
PROSITE is a resource for inferring the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites, patterns and profiles that help to reliably identify to which known family of proteins (if any) a new sequence belongs.
ScanProsite is a tool that lets you either scan a protein sequence (from SWISS-PROT or provided by the user) for the occurrence of patterns stored in the PROSITE database, or scan the SWISS-PROT database (including weekly releases) for the occurrence of a pattern that can originate from PROSITE or be provided by the user.
Reference: A. Bairoch, "PROSITE: A dictionary of sites and patterns in proteins" Nucleic Acids Res. 20, 2013-2018 (1992)
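To see what scanning a sequence for a PROSITE pattern involves, the following sketch converts the basic PROSITE pattern syntax into a Python regular expression and reports every match. It handles only the core syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) repeats, and the < and > terminal anchors); this is a simplification, and the function names are hypothetical. The N-glycosylation pattern N-{P}-[ST]-{P} is used purely as an illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def scan(protein, pattern):
    """Return (1-based position, matched text) for all matches, overlaps included."""
    regex = "(?=(" + prosite_to_regex(pattern) + "))"
    return [(m.start() + 1, m.group(1)) for m in re.finditer(regex, protein.upper())]
```

Short patterns such as this one match by chance in many proteins, which is worth keeping in mind when judging how significant a reported motif is.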
- Protein motifs
This page provides an interface to a program that determines if a protein motif is encoded by a DNA sequence or database of DNA sequences. Additional information is available about the program and features related to selecting a query and translation.
This program is configured to accept databases as well as individual sequences as the query. To search the sequenced yeast chromosomes, set the query type to Alces and use the sequence name Yeast-chr (case sensitive). A variant of the program can use protein databases directly.
NB: Since this program allows DNA sequences as input, it'll be easier to use than those that require protein sequences, especially for initial analysis of cDNA sequences.
- PSORT WWW Server: Prediction of Protein Sorting Signals and Localization Sites in Amino Acid Sequences
A WWW Server for Analyzing and Predicting Protein Sorting Signals Coded in Amino Acid Sequence
PSORT is an expert system for the prediction of protein localization sites in cells. It receives an amino acid sequence and its source origin, e.g., Gram-negative bacteria, as inputs. The system then analyzes the input sequence by applying stored rules for various sequence features of known protein sorting signals, and reports the possibility for the input protein to be localized at each candidate site, with additional information.
- SCOP: Structural Classification of Proteins
Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. The scop database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in Brookhaven National Laboratory's Protein Data Bank (PDB). As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification. Proteins are classified to reflect both structural and evolutionary relatedness.
The different major levels in the hierarchy are:
- Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
- Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
- Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
- SignalP
The SignalP World Wide Web server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.
- TMAP
The TMAP program predicts transmembrane segments in proteins, utilizing the algorithm described in:
Persson, B. & Argos, P. (1994) Prediction of transmembrane segments in proteins utilising multiple sequence alignments. J. Mol. Biol. 237, 182-192.
- TMpred - Prediction of Transmembrane Regions and Orientation
The TMpred program makes a prediction of membrane-spanning regions and their orientation. The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins. The prediction is made using a combination of several weight-matrices for scoring.
Reference: K. Hofmann and W. Stoffel, TMbase - A database of membrane spanning protein segments. Biol. Chem. Hoppe-Seyler 374, 166 (1993).
RNA Secondary Structure sites on the web
Visualization Software
- Kinemage is a product of the Richardson lab at Duke, used for viewing molecules in three dimensions.
- RasMol software is a package for molecular visualization. The pdb (Brookhaven Protein Data Bank) file format is used to view protein structures from crystallographic data.
Microbial Genomes
Here is a handful of URLs that I gathered from the small genomes meeting. I will post more to the list as I work through them in the coming week. You can click on the links below, or pick them up from the list archive.
WWW interface for searching several Genomes at once
(NEW--Includes E. coli!)
The E. coli genome has just been sequenced as of the last week of January. At BMERC, we have added the E. coli genome to our genome blast page. The E. coli genome is of particular interest to molecular biologists because the vast majority of functions assigned to E. coli genes have been determined by experimentation. Most other genomes have their functions assigned by homology.
The complete and nearly complete genomes of Saccharomyces cerevisiae, Methanococcus jannaschii, E. coli, and Bacillus subtilis are now available. Our WWW Blast interface allows you to search, using your sequence, against two subsets of the available putative open reading frames of these genomes using blastp. The search options available from this page are a Search Against Annotated ORFs and a Search Against Unannotated ORFs of these genomes.
Your output will consist of detailed references for the significant blast matches and the raw blast output. The detailed references consist of a reference key, the annotation where available, the protein sequence, and the DNA for the ORFs.
We are also providing a tool for function keyword searching. We have built a table that relates E. coli to the other three main genomes using blast with a Karlin-Altschul score of < 10E-17. This keyword searching tool will print a list of every sequence identifier that is close to the E. coli gene of the cluster where the keyword is found.
Blast Genome Analysis blast page
E. coli functional search
EcoCyc database
Announcing EcoCyc Version 3.7, released March 7, 1997
EcoCyc is a database of E. coli genes and metabolic pathways that runs on Unix workstations, and through the WWW (see http://www.ai.sri.com/ecocyc/ecocyc.html). One use of EcoCyc is the analysis and annotation of microbial genomes by analogy to E. coli. Its graphical user interface creates drawings of metabolic pathways, of individual reactions, and of the E. coli genomic map. Users can call up objects through a variety of queries (such as retrieving an enzyme by a substring search), and then navigate to related entities shown in the resulting display window. For example, a user could zoom in on a region of the genetic map, click on a gene to obtain detailed information about it, and then navigate to the enzyme product of the gene, and then to the metabolic pathway containing the enzyme. Metabolic pathway drawings are produced automatically, and can be drawn in several styles, such as with compound structures present or absent. EcoCyc contains extensive information about each enzyme, including its cofactors, activators and inhibitors (qualified by type), subunit composition, substrate specificity, and molecular weight. Individual values in the knowledge base are extensively annotated with citations to the literature, as are comment fields.
Changes introduced in this version include:
- An Overview diagram of the entire E. coli metabolic map is now available at http://www.ai.sri.com/ecocyc/ov.html
- We believe that EcoCyc now describes all published pathways of E. coli metabolism. The following pathways were added since version 3.1:
- trehalose biosynthesis
- NAD phosphorylation and dephosphorylation
- betaine biosynthetic pathway
- mannose and GDP-mannose metabolism
- formylTHF biosynthesis
- methylglyoxal metabolism
- nucleotide metabolism
- arginine utilization
- L-serine degradation
- glutamine utilization
- glutamate utilization
- L-cysteine catabolism
- tryptophan utilization
- ornithine degradation
- putrescine degradation
- D-galactarate catabolism
- galactitol catabolism pathway
- mannitol degradation
- sorbitol degradation
- trehalose degradation, low osmolarity
- D-glucarate catabolism
- cobalamin biosynthesis
- glutathione-glutaredoxin redox reactions
- anaerobic respiration, electron acceptors reaction list
- aerobic electron transfer
- aerobic respiration, electron donors reaction list
- anaerobic respiration, electron donors reaction list
- anaerobic respiration
- anaerobic electron transfer
- Compound displays now sort the set of reactions that contain the compound according to the pathways that contain the reactions.
- The X-window version of EcoCyc now displays links to other databases; if you click on a link, EcoCyc will invoke Netscape to query the linked object via the WWW.
- In the X-window version, most modes that allow you to query objects by
their exact name allow you to enter in several names within one pop-up
window, separated by commas, e.g., "Find Gene by Name" allows you to
enter "hisA, hisB, hisC." The exception is compound mode, because many
compounds have commas within their names.
- Links have been added from EcoCyc to the Swiss-Model database (thanks
to Manuel Peitsch for assistance).
- A second gene-classification system has been added to EcoCyc. This
second system, also developed by Riley, is much simpler than the first
system, and classifies genes according to the type of their product,
e.g., enzyme, regulator, transport protein.
- Executables are now available for Solaris but are no longer
available for SunOS.
- EcoCyc does not yet contain the full annotation of the E. coli genome;
that task is next on our list. Thus, the data in slots
centisome-position, left-end-position, and right-end-position, are all
derived from EcoGene7.
Current knowledge base size:
:::: Ecocyc KB statistics on Thu Feb 20, 1997 ::::
Reactions: 3241 total; 736 occur in ECOLI; 165 have no EC
Polypeptides: 1029; Protein complexes: 419; Enzymes: 731
Genes: 3025 (2571 are mapped)
Base pathways: 131; Superpathways: 26
Compounds: 1294 (964 have structures)