FRAGMENT ASSEMBLY LAB EXERCISE
Genomes & Genome Analysis
14 September 2000

Goal: To gain experience with the genome assembly problem and software solutions.

You will work with one of three teams.  Each team will collaborate to assemble one or more contigs from the same fragments, but each team will use a different assembly algorithm.

You will initially work with BAC sub-clones from Arabidopsis thaliana.  You need not be concerned with vector screening, repeat masking, or per-base quality scores for these data.  Contingent on availability of data from a bacterial whole-genome shotgun sequencing project, you may have an opportunity to assemble a whole genome!

Update: the data are in.  You will assemble the entire genome of Chlamydophila pneumoniae AR39, which causes pneumonitis in mice.  Thank you TIGR!  We do not have repeat information about these data, so please just assemble them as you did the BAC data.  The assembly will take longer to run, because the data set contains about 10 times as many fragments and nucleotides.  Work with the BAC data first until you are comfortable running the software and interpreting the output.

Please read carefully the documentation for your assembly algorithm to understand what parameters and input files it expects and what outputs it will produce.  Interpreting the output is likely to be the most challenging task!

Each team is asked to co-author a written report that will be placed on the course web site to share with others.  Please follow the form of Methods and Results sections in a scientific manuscript.  See Read et al., cited below, for a published example.  The due date is to be determined by the Professor.  Students taking the 444 course are not required to co-author the report, but are encouraged to do so.

The report is to describe the whole-genome assembly, if possible.  You may want to address how the computational time required scaled with the increasing data volume.
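
One way to collect the timing data is sketched below in Python.  This is only an illustration, not part of the provided software: the function names time_command and scaling_exponent are made up for this example, and the command shown is a stand-in.  Take the real command line for your assembler from its README.  The sketch saves the program's standard output to a file, since that output may be something you need to keep.

```python
import math
import subprocess
import sys
import time


def time_command(argv, stdout_path):
    """Run argv, saving its standard output to stdout_path.

    Returns (elapsed wall-clock seconds, exit code).
    """
    with open(stdout_path, "w") as out:
        start = time.perf_counter()
        completed = subprocess.run(argv, stdout=out)
        elapsed = time.perf_counter() - start
    return elapsed, completed.returncode


def scaling_exponent(size_small, secs_small, size_large, secs_large):
    """Estimate k in the model time ~ size**k from two measurements."""
    return math.log(secs_large / secs_small) / math.log(size_large / size_small)


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # assembler invocation described in your README (hypothetical here).
    placeholder = [sys.executable, "-c", "print('assembly output goes here')"]
    seconds, status = time_command(placeholder, "timing_demo.out")
    print(f"exit status {status}, {seconds:.2f} s elapsed")
```

With purely invented numbers, a run that takes 2 seconds on one data set and 200 seconds on a data set 10 times larger gives scaling_exponent(1, 2.0, 10, 200.0) = 2, that is, roughly quadratic growth.  Two measurements give only a rough estimate, so report the raw times as well.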

In your reports, be sure to address the following:

NB: The purpose of the exercise is not simply to reproduce the GenBank version, but to learn to what extent the outcome is affected by varied algorithms and parameters.
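
To compare outcomes across algorithms and parameter settings, it can help to reduce each assembly to a few numbers.  The Python sketch below is one possible approach, not a required method.  It assumes the contigs are available in FASTA format, which may not match the native output of your assembler, and it reports N50, a commonly used summary that this handout does not otherwise require.  The function names are made up for this example.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence found in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total, taken from longest
    to shortest, first reaches half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths):
    """Collect a few simple statistics for one assembly."""
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Tiny invented example; in practice pass an open file of contigs instead.
    example = [">contig1", "ACGT", "ACGT", ">contig2", "AC", ">contig3", "ACGTAC"]
    print(summarize(read_fasta_lengths(example)))
```

Running the same summary on each team's contigs, and on runs with different parameters, gives a simple table for the Results section.  These numbers describe contiguity only; they say nothing about whether the contigs are correct.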


Warning: The data and software we are using for this lab were kindly provided under very strict licensing constraints.  Please do not abuse this privilege.  The data and source code are confidential, and are not to be redistributed or used without the express consent of the copyright holders.  Thank you.


About the Data

Arabidopsis thaliana BAC F16P2

Related Links

Chlamydophila pneumoniae AR39


Related Links


About the Algorithms & Software

TIGR Assembler
http://www.tigr.org/softlab/assembler/

Sutton, G., White, O., Adams, M., and Kerlavage, A. (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science & Technology 1:9-19.

and see ~phraber/READMES/TIGR-ASSEMBLER-README
executables are ~phraber/bin/TIGR_Assembler and ~phraber/bin/run_TA
 

Phred/Phrap/Consed
http://www.phrap.org

General documentation
http://www.phrap.org/phrap.docs/general.html

Phrap overview
http://www.phrap.org/phrap.docs/phrap.html

Consed overview
http://www.phrap.org/consed/consed.html

and see ~phraber/READMES/phredphrap/general.doc, phrap.doc, and swat.doc
executables are phrap, swat, cross_match, and phrapview, all in ~phraber/bin/
 

STROLL
http://genetics.med.harvard.edu/~tchen/STROLL/

Chen T, Skiena SS.  A case study in genome-level fragment assembly.
Bioinformatics. 2000 Jun;16(6):494-500.
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10980146&dopt=Abstract

and see ~phraber/READMES/STROLL-README
executable is ~phraber/bin/stroll


LINUX Command References


Good luck, and happy assembling!  Peter Hraber