Genome Assembly Part I
25 Apr 2024
DO NOT START THIS LAB UNTIL GIVEN THE OKAY… WE NEED TO RESIZE YOUR INSTANCES
Sequencing has become inexpensive enough, reads have become long enough, and algorithms have become good enough that it is now quite common for a lab to sequence and assemble the genomes of the organisms it works on.
This lab will take you through some basic assembly and evaluation procedures. We will be working with Pacific Biosciences reads generated from genomic DNA extracted from Streptanthus diversifolius (Variable Leaf Jewelflower), a California native wildflower.
Pacific Biosciences technology generates long reads, typically 5-30 kb in length. This makes assembly much easier than it is with short-read data such as Illumina reads.
Intro
The assembly takes about 10 hours to run on a 32 CPU machine. So that we have an assembly to work with on Tuesday, we will start it today.
Get the raw data
The PacBio reads take up about 250GB. Instead of having each of you download the data, I have set up a shared drive for data access. Unfortunately, getting it mounted is a little tricky.
Please do the following steps in the Linux terminal of your instance. PLEASE COPY AND PASTE THE COMMANDS RATHER THAN RETYPING THEM
Create a mount point for the share
sudo mkdir /revio-data
Set up the access key
sudo nano /etc/ceph/ceph.client.bis180l-2024-students.keyring
In the nano editor, paste the following two lines, then save the file and exit nano:
[client.bis180l-2024-students]
key = AQAJaypmmtCiEBAA6zNt0O2z6ykwb9zQWJbevA==
Next, adjust permissions of the file you just created:
sudo chmod 600 /etc/ceph/ceph.client.bis180l-2024-students.keyring
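To confirm that the permissions took effect, a check along the following lines may help. This is only a sketch: the helper name `file_mode` is made up for illustration, and `stat -c` assumes GNU coreutils, as found on a typical Linux instance.

```shell
# Hypothetical helper: print a file's permission bits in octal (GNU stat assumed).
file_mode() {
    stat -c '%a' "$1"
}

keyring=/etc/ceph/ceph.client.bis180l-2024-students.keyring

# Only inspect the keyring if it exists, so this is safe to paste at any point.
if [ -e "$keyring" ]; then
    if [ "$(file_mode "$keyring")" = "600" ]; then
        echo "keyring permissions OK"
    else
        echo "keyring permissions are $(file_mode "$keyring"), expected 600"
    fi
else
    echo "keyring not found at $keyring"
fi
```

Mode 600 means that only the file's owner (here root, since the file was created with sudo) can read or write it.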
Next, edit the file system table (/etc/fstab) to tell the system about the new drive:
sudo cp /etc/fstab /etc/fstab.bak # make a backup copy
sudo nano /etc/fstab
When the file is open, add the following as a single line at the end of the file. Then save the file and exit nano.
149.165.158.38:6789,149.165.158.22:6789,149.165.158.54:6789,149.165.158.70:6789,149.165.158.86:6789:/volumes/_nogroup/6536e3bf-97c9-4122-909d-3591503956f0/ec45978d-f934-475c-83d5-b566a3170f1b /revio-data ceph name=bis180l-2024-students,x-systemd.device-timeout=30,x-systemd.mount-timeout=30,noatime,_netdev,ro 0 2
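An fstab entry consists of six whitespace-separated fields (device, mount point, file system type, options, dump, and pass). Because the line above is long and easy to truncate or wrap when pasting, a quick field count can serve as a sanity check. The helper name `fstab_fields` is an assumption for illustration; it only counts fields and does not validate their contents.

```shell
# Hypothetical helper: print the number of whitespace-separated fields in the
# fstab entry for a given mount point, skipping commented-out lines.
fstab_fields() {
    # $1 = fstab file, $2 = mount point (the second field of the entry)
    awk -v mp="$2" '$1 !~ /^#/ && $2 == mp { print NF }' "$1"
}

# A complete entry should report 6 fields.
if [ -r /etc/fstab ]; then
    fstab_fields /etc/fstab /revio-data
fi
```

If this prints nothing, no entry with the mount point /revio-data was found. If it prints a number other than 6, the line was probably broken during pasting and can be restored from /etc/fstab.bak.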
Now mount the filesystem:
sudo mount -a
Check it:
du -h /revio-data/
This should give you:
1.8G /revio-data/Revio/r84066_20230630_201715_2_A01/statistics
159M /revio-data/Revio/r84066_20230630_201715_2_A01/metadata
141K /revio-data/Revio/r84066_20230630_201715_2_A01/pb_formats
8.8G /revio-data/Revio/r84066_20230630_201715_2_A01/fail_reads
233G /revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads
244G /revio-data/Revio/r84066_20230630_201715_2_A01
244G /revio-data/Revio
244G /revio-data/
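If the output looks different (for example, only a single line with a tiny size), the share may not be mounted. A minimal way to check, assuming the `mountpoint` utility from util-linux is available on the instance, might look like this. The helper name `report_mount` is made up for illustration.

```shell
# Hypothetical helper: report whether a directory is currently a mount point.
# Relies on the `mountpoint` utility from util-linux.
report_mount() {
    if mountpoint -q "$1"; then
        echo "$1 is mounted"
    else
        echo "$1 is NOT mounted"
    fi
}

report_mount /revio-data
```

If the directory is reported as not mounted, recheck the keyring file and the fstab line before running `sudo mount -a` again.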
Start the assembly
We don’t have a repo for this exercise yet, and the assembly files will probably be too big to push to GitHub anyway. Instead, we will create a new directory in our home directory to store the assembly.
cd # make sure we are in our home directory
mkdir S_div_assembly
cd S_div_assembly
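Because the assembly runs for many hours, it may be worth confirming first that the assembler is on your PATH and that the reads file on the shared drive is readable. The sketch below is one way to do that. The helper name `preflight` is hypothetical, and the reads path is the one used in the assembly command in this lab.

```shell
# Hypothetical pre-flight helper: succeed only if the program is on the PATH
# and the input file is readable.
preflight() {
    # $1 = program name, $2 = input file
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "missing program: $1"
        return 1
    fi
    if [ ! -r "$2" ]; then
        echo "cannot read: $2"
        return 1
    fi
    echo "ready: $1 and $2"
}

reads=/revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads/reads.default.fasta.gz
preflight hifiasm "$reads" || echo "fix the problem above before starting the assembly"
```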
Finally, start the assembly.
time hifiasm -o S_div.asm -t 30 /revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads/reads.default.fasta.gz