Goal: To gain experience with the genome assembly problem and software solutions.
You will work with one of three teams. Each team will collaborate to assemble one or more contigs from the same fragments, but each will use a different assembly algorithm.
You will initially work with BAC sub-clones from Arabidopsis thaliana. You will not have to be concerned with vector screening, repeat masking, or per-base quality scores for these data. Contingent on availability of data from a bacterial whole-genome shotgun sequencing project, you may have an opportunity to assemble a whole genome!
Update: the data are in. You will assemble the entire genome of Chlamydophila pneumoniae AR39, which causes pneumonitis in mice. Thank you, TIGR! We do not have repeat information about these data, so please just assemble them as you did the BAC data. Expect longer run times, because the genome data set contains about 10 times as many fragments and nucleotides as the BAC data. Work with the BAC data first until you are comfortable running the software and interpreting the output.
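To see how much larger the genome data set is than the BAC data, it can help to count fragments and nucleotides in each input file before starting a run. The following is a minimal sketch only. It assumes the fragments are stored in FASTA format (the handout does not state the file format), and the function name fasta_stats is made up for illustration.

```python
from typing import Iterable, Tuple


def fasta_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of fragments, total nucleotides) for FASTA text.

    Assumption: each fragment starts with a '>' header line, and every
    other non-blank line holds sequence characters only.
    """
    fragments = 0
    nucleotides = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fragments += 1  # header line marks a new fragment
        else:
            nucleotides += len(line)  # sequence line
    return fragments, nucleotides
```

To use it on a file, open the file and pass the handle, for example fasta_stats(open(path)), where path is the name of your own fragment file. Comparing the two counts for the BAC and genome files gives a rough idea of how much more work the assembler has to do.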
Please read carefully the documentation for your assembly algorithm to understand what parameters and input files it expects and what outputs it will produce. Interpreting the output is likely to be the most challenging task!
Each team is asked to co-author a written report that will be placed on the course web site to share with others. Please follow the form of the Methods and Results sections of a scientific manuscript. See Read et al., cited below, for a published example. The due date is to be determined by the Professor. Students taking the 444 course are not required to co-author the report, but are encouraged to do so.
The report is to describe the whole-genome assembly, if possible. You may want to address how the computational time required scaled with the increasing data volume.
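One way to describe how run time scales with data volume is to fit a power law, time ≈ c · size^k, to your own measurements and report the exponent k. The sketch below estimates k as the least-squares slope of log(time) against log(size). It is only an illustration: the function name scaling_exponent is hypothetical, and the power-law form is an assumption that your measurements may or may not support.

```python
import math
from typing import Sequence


def scaling_exponent(sizes: Sequence[float], times: Sequence[float]) -> float:
    """Estimate k in time ~ c * size**k from paired measurements.

    Computes the least-squares slope of log(time) versus log(size).
    All sizes and times must be positive.
    """
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Sizes can be fragment counts or total nucleotides, and times can be wall-clock seconds for each run. An exponent near 1 suggests roughly linear growth in run time, and an exponent near 2 suggests roughly quadratic growth. With only two data sets (BAC and whole genome) the estimate is crude, so running the assembler on subsets of the fragments would give more points.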
In your reports, be sure to address the following:
Warning: The data and software we are using for this lab were kindly provided under very strict licensing constraints. Please do not abuse this privilege. The data and source code are confidential, and are not to be redistributed or used without the express consent of the copyright holders. Thank you.
About the Data
Arabidopsis thaliana BAC F16P2
Related Links
About the Algorithms & Software
TIGR Assembler
http://www.tigr.org/softlab/assembler/
Sutton, G., White, O., Adams, M., and Kerlavage, A. (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science & Technology 1:9-19.
and see ~phraber/READMES/TIGR-ASSEMBLER-README
executables are ~phraber/bin/TIGR_Assembler and ~phraber/bin/run_TA
Phred/Phrap/Consed
http://www.phrap.org
General documentation
http://www.phrap.org/phrap.docs/general.html
Phrap overview
http://www.phrap.org/phrap.docs/phrap.html
Consed overview
http://www.phrap.org/consed/consed.html
and see ~phraber/READMES/phredphrap/general.doc, phrap.doc, and swat.doc
executables are phrap, swat, cross_match, and phrapview, all in ~phraber/bin/
STROLL
http://genetics.med.harvard.edu/~tchen/STROLL/
Chen T, Skiena SS. A case study in genome-level fragment assembly.
Bioinformatics. 2000 Jun;16(6):494-500.
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10980146&dopt=Abstract
and see ~phraber/READMES/STROLL-README
executable is ~phraber/bin/stroll
LINUX Command References
Good luck, and happy assembling! Peter Hraber