Genome Assembly Part I
23 Apr 2026
DO NOT START THIS LAB UNTIL GIVEN THE OKAY… WE NEED TO RESIZE YOUR INSTANCES
Sequencing has become inexpensive enough, reads have become long enough, and algorithms have become good enough that it is now quite common for a lab to sequence and assemble the genomes of the organisms it works on.
This lab will take you through some basic assembly and evaluation procedures. We will be working with Pacific Biosciences reads generated from genomic DNA extracted from Streptanthus diversifolius (Variable Leaf Jewelflower), a California native wildflower.
Pacific Biosciences technology generates long reads, typically 5-30 kb in length. This makes assembly much easier than it is with short-read data such as Illumina reads.
Intro
The assembly takes about 10 hours to run on a 32-CPU machine. So that we have an assembly to work with on Tuesday, we will start the run today.
Get the raw data
The PacBio reads take up about 250GB. Instead of having each of you download the data, I have set up a shared drive for data access.
The data is located at /revio-data/Revio. You can use the du command to see how much data there is.
Use du to check the directory and make sure you see the following:
1.8G /revio-data/Revio/r84066_20230630_201715_2_A01/statistics
159M /revio-data/Revio/r84066_20230630_201715_2_A01/metadata
141K /revio-data/Revio/r84066_20230630_201715_2_A01/pb_formats
8.8G /revio-data/Revio/r84066_20230630_201715_2_A01/fail_reads
299G /revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads
310G /revio-data/Revio/r84066_20230630_201715_2_A01
310G /revio-data/Revio/
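The listing above is consistent with a du call that prints human-readable sizes and descends two directory levels below the data directory. The following is a minimal sketch of one way to get that kind of listing, assuming a du that accepts -h and -d (both GNU coreutils and BSD du do). The function name show_usage is made up for illustration and is not part of the lab.

```shell
# Hypothetical helper (not part of the lab): summarize disk usage of a
# directory tree, two levels deep, with human-readable sizes.
show_usage() {
    # Default to the shared data path given in the lab text.
    dir="${1:-/revio-data/Revio}"
    # -h : print sizes as K/M/G instead of raw block counts
    # -d 2 : report directories at most two levels below $dir
    du -h -d 2 "$dir"
}
```

Calling show_usage with no argument would summarize /revio-data/Revio; passing another directory summarizes that one instead. On a tree this large the command may take a little while to finish, because du has to visit every file.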
Start the assembly
We don’t have a repo for this exercise yet, and the assembly files will probably be too big to push to GitHub anyway. Instead, we will create a new directory in our home directory to store the assembly.
cd # make sure we are in our home directory
mkdir S_div_assembly
cd S_div_assembly
Finally, start the assembly.
time hifiasm -o S_div.asm -t 30 /revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads/reads.default.fasta.gz
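In the command above, -o sets the prefix for the output files and -t sets the number of threads. The value of 30 on a 32-CPU machine presumably leaves two CPUs free for other work; that reading is an assumption, not something the lab states. Under that assumption, a thread count could be derived from the machine rather than typed by hand. The helper name pick_threads below is hypothetical.

```shell
# Hypothetical helper (an assumption, not part of the lab): choose a thread
# count that leaves two CPUs free, never going below one thread.
pick_threads() {
    # Use the CPU count passed as $1, or ask nproc (GNU coreutils) if none given.
    total="${1:-$(nproc)}"
    threads=$(( total - 2 ))
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}
```

With this defined, the thread option in the command above could be written as -t "$(pick_threads)", which would give 30 on a 32-CPU instance.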