Genome Assembly from PacBio HiFi reads

Important: before leaving today run the code at the very end of this lab

Regardless of whether or not you finish the exercises, please run the code at the end of this lab page before leaving today. We need the output for Thursday.

Intro

Sequencing has become inexpensive enough, reads have become long enough, and algorithms have become good enough that it is quite common for a lab to decide to sequence and assembly genomes for organisms that they are working on.

This lab will take you through some basic assembly and evaluation procedures. We will be working with Pacific Biosciences reads generated from genomics DNA extracted from Streptanthus diversifolius (Variable Leaf Jewelflower), a California native wildflower.

Pacific Biosciences technology generates long reads, up to 40kb in length. This length makes assembly much easier than using short read data such as Illumina.

Clone the repo

Click on the assignment link to create a repo, and download it. You will work in Assignment_07_template.Rmd

Read Stats

PacBio includes stats on the reads. Open the file /revio-data/Revio/Report_for_dataset_PB950_Std1-Rep1_JewelFlower_HiFiv3_cell1.pdf And find the page that has stats on read length and quality.

*Hint: You can use the linux command open from the terminal or you can navigate to the file in the GUI file explorer or you can download the file to your PC using sftp, WinSCP, or Filezilla (see “File Transfer to and from Your VM” post)

Note: the quality score is a representation of the probability of an error in the sequence. To convert a score to a p-value, you can use this code in R:

qscore <- 22 # replace this number with the mean quality from the run
10^(-qscore / 10)

I will explain this scoring method in more detail in the Illumina lab.

Exercise One (No code needed)

A Create a markdown table that has the mean, median, and N50 read length, and median quality.

B Does it surprise you that N50 is higher than the mean or median length or does that make sense?

C From these stats does this look like a successful PacBio run to you?

D Based on flow cytometry, we expect the genome size to be around 350MB. If that is true, how much coverage should we have from this sequencing?

Assemble the reads

We already assembled the reads using hifiasm in the previous lab.

If you ran the assembly command as directed, they assembly should be in ~/S_div_assembly

If your run didn’t work, you can download the primary assembly as follows:

cd ~/S_div_assembly
wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/HifiAsm_Assignment/S_div.asm.bp.p_ctg.gfa

There are a bunch of files there. The primary assembly is in S_div.asm.bp.p_ctg.gfa. The files S_div.asm.bp.hap1.p_ctg.gfa and S_div.asm.bp.hap2.p_ctg.gfa contain approximate haplotype phased assemblies of the two haploid genomes. We will use the primary assembly for this lab.

Convert to fasta

hifiasm outputs its assembly in Graphical Assembly Format (gfa). We need to convert this to fasta format.

Use this code to do the conversion:

cd ~/S_div_assembly
awk '/^S/{print ">"$2;print $3}' S_div.asm.bp.p_ctg.gfa > S_div.asm.bp.p_ctg.fa

You aren’t responsible for knowing the code above, but if you are curious:

  • awk is a linux text processing language that specializes in finding and modifying patterns.
  • awk ‘ … ‘ S_div.asm.bp.p_ctg.gfa Runs awk on the file S_div.asm.bp.p_ctg.gfa.
  • /^S/ { ... } The pattern /^S/ matches lines that start with the character S. In GFA, S lines are “segment” records, i.e. contigs; other lines (L, P, etc.) are links/paths and are ignored.
  • print ">"$2 For each S line, print a FASTA header line. $2 is the second tab‑separated field on that line, which in GFA is the segment (contig) name. So this prints something like >contig123.
  • print $3 Then print the third field, which in GFA is the sequence of that segment, to a new line.

QC the assembly

Basic stats

The seqkit package has a ton of useful tools for working with sequencing data and assemblies. I have installed that in the pb-minimap2 conda environment.

Activate the conda environment.

conda activate pb-minimap2

Now run seqkit:

seqkit stats -N 50,80,90  S_div.asm.bp.p_ctg.fa

Exercise Two (No Code Needed)

A How does the total assembly length compare to expectation?

B We expect 14 chromosomes. What would the average chromosome length be?

C With this in mind, interpret the max length, N50, and N80.

D Based on A, B, and C, what is your initial assessment of this assembly?

14 largest sequences

Since we expect 14 chromosomes, let’s see what the size of the largest scaffolds are.

First, sort by length

seqkit sort --by-length --reverse S_div.asm.bp.p_ctg.fa > S_div.asm.bp.p_ctg_sort.fa

Next, create a fasta index file (.fai), which will contain the lengths by default. We do this using samtools another incredibly useful library of tools for working with sequencing data.

samtools faidx S_div.asm.bp.p_ctg_sort.fa

The results will be in S_div.asm.bp.p_ctg_sort.fa.fai

Exercise Three. R Code Needed

A Import the .fai into R and make a bar plot that shows the length of the 25 longest contigs. Check the link above to figure out which column you should be plotting.

Important: the .fai file does not have column headers, be sure to import it appropriately. Read the help file for whichever read function you use in R to figure out how to do this.

B What percentage of the assembly is contained in the 14 longest contigs?

BUSCO Analysis

One was to benchmark genome assembly completeness is by analyzing the presence of universal single-copy orthologs BUSCO analysis.

The authors of busco have created curated lists of genes that are expected to be present as single-copy in all genomes. There are different BUSCO lists for different branches along the tree of life.

The busco tool searches an assembly for the presence of the specified BUSCO list.

You can see a list of available busco lists by entering:

busco --list-datasets

Exercise Four Linux code needed

A Use grep to search the busco data sets for a brassicales entry. (S. diversifolius is a Brassicales). Show your command and indicate what you found.

B Figure out how to run busco using the brassicales data set, the assembly fasta file, and 16 cores. You can get help on busco by typing busco --help or by checking the user manual. Important include the option --skip_bbtools in your command!

Hints: You only need to specify the “Mandatory Options” (Oxymoron anyone?), the “Recommended options”, --skip_bbtools, and the option to specify the number of cores to use.

What command did you use? (This will take about 6 minutes to run)

C Examine the BUSCO results and include them here.

D Interpret the results with respect to completeness. See the user guide for help on interpreting the results.

E Interpret the results with respect to single copy versus duplication. Make a hypothesis regarding the duplication rate.

Exercise Five Linux code needed

Most of our data is in contigs larger than 1MB. If we subset our assembly to only retain contigs > 1MB, would our BUSCO score still be good?

A Use the seqkit seq command to create a new fasta file that contains all scaffolds > 1MB. Name the new file S_div.asm.bp.p_ctg_1MB.fa

B What command can you use to make sure that it worked? Show the command and output below

C Run BUSCO on this 1MB data set, provide the results, and comment on the results relative to BUSCO on the full dataset

Exercise Six Linux Code Needed

As noted above, we expect 14 chromosomes. If we only examine the 14 largest scaffolds, how good is our BUSCO score?

A Use Linux commands to create a file that contains the names of the 14 longest scaffolds. You will want to start with S_div.asm.bp.p_ctg_sort.fa.fai

B
Use the seqkit grep command to create a new fasta file that has just these 14 scaffolds. Name the resulting fiel S_div.asm.bp.p_ctg_14_longest.fa

C Run BUSCO on the 14 largest scaffolds. Show results. Have we retained enough of the gene space just to work with these scaffolds?

Compare to Arabidopsis

One interesting analysis to do is to compare the genomic organization of our assembly to that of a relative such as Arabidopsis thaliana (Both species are Brassicaceae).

This can be done either by aligning chromosomes in nucleotide space using tools like minimap2, D-GENIES, or MUMmer, or by looking for gene homology either in protein or nucleotide space, using tools like Promer, MCScanX, or GENESPACE.

Because we haven’t defined genes in our genome, we will do chromosome alignments using minimap2 and then visualize using D-GENIES. Note that for more distant genomes, the Promer tool in the MUMmer suite would be a good choice because it will allow alignment of 6-frame protein translations.

Install minimap2 and D-GENIES

We need to install the software.

conda install -c conda-forge -c bioconda minimap2 dgenies

Consolidate the Arabidopsis genome into a single file.

Exercise Seven

We need to get all of the Arabidopsis chromosomes into a single file. They are available in separate .gz files on you instance in ~/data/A.thaliana/. There are five gzipped files that start with Chr. Unzip and combine these into a single file. Note that ~/data/A.thaliana is write-protected so you will need to put your new file somewhere else. (S_div_assembly directory is okay). Name the combined file A_thaliana_all_chr.fa

show your code here

End of Exercise Seven

Now run minimap2 to align the genomes. For S. diversifolius use the fasta file with contigs 1MB and larger. (Note: D-Genies can run minimap2 on its own, but I get better results if I pre-run minimap2).

minimap2  -x asm10 A_thaliana_all_chr.fa S_div.asm.bp.p_ctg_1MB.fa > A_thal_vs_S_div_asm10.paf

The -x asm10 option tells minimap2 that we are comparing genome assemblies. The output is a .paf (Pairwise mApping Format file) that gives the locations of similarity.

Now start D-GENIES

dgenies run

D-GENES will open in a browser window. Click on run and then plot alignment. The alignment file is the .paf file created by minimap2, the target file is the Arabidopsis fasta file, and the query file is the S. diversifolius 1MB fasta file.

Click submit and then when ready view results.

In the resulting dotplot, Arabidopsis chromsomes are on the x-axis and S. diversifolius contigs are on the y-axis. Dots show regions of sequence similarity between the genomes.

Play with the settings (stroke width, precision, filter, sort, etc to optimize the display)

Exercise Eight

A Examine the D-GENEIES plot. In particular look down the columns: in general, how many contigs in S. diversifolius are matched by each position on an A. thaliana chromosome?

B Is the result consistent with the BUSCO singleton vs duplicate analysis?

C What is the most likely evolutionary explanation for the differences between these two genomes?

When you are ready to quit D-GENIES, type ctrl-c in the terminal window where you started it.

Annotation

A natural next step would be structural annotation…finding where the genes and coding-regions are in the genome. We don’t really have the computational resources / time to undertake this here, but should you need to do it, one of the most common tools is BRAKER.

An interesting alternative has recently been released that used deep learning to find genes: Helixer

Find telomeres

Telomeres are repetitive sequences at the end of DNA and are a pretty exciting research frontier right now. If you want to learn a little more, here is a brief overview. The repetitive sequence of telomeres is highly conserved, with telomeres in plants typically consisting of the repeat TTTAGGG. The telomere repeat in vertebrates is TTAGGG.

Identify telomeric sequence in your assembly

We have started the creation of a nice reference genome! A good quality control check is to determine if you have captured telomeres in your assembly. Telomeres are hard to capture because they are highly repetitive sequences, making them difficult to assemble through. A telomere-to-telomere genome (a genome where all of the chromosomes have been sequenced all the way through, from the starting telomere to the ending telomere) has become the gold standard for genome assembly and has just recently become possible. Here is a page on the telomere-to-telomere human genome, recently completed with help from a lab at UC Davis.

How can we determine if the genome we created has captured telomeres? We can use the fact that telomeric sequence is conserved! Remember, the most common repeat in plants in TTTAGGG. We will use this, what we know about genome structure, and R to see if we have captured these telomeric sequences.

Lets make sure we have the necessary libraries

library(tidyverse)
library(seqinr)
library(Biostrings)

Note: Remember, if you don’t have a package installed, you can use the install.packages() command. All of these should already be installed for you though.

Read in the genome assembly file for the contigs greater than 1MB

genome.fa <- readDNAStringSet("~/S_div_assembly/S_div.asm.bp.p_ctg_1MB.fa")

Now assign a variable to the telomere repeat.

telo_fr_repeat <- "TTTAGGGTTTAGGGTTTAGGG"
telo_rv_repeat <- "CCCTAAACCCTAAACCCTAAA"

Exercise Nine (No code needed)
There were two variables created, telo_fr_repeat and telo_rv_repeat.

A How are these two variables related? What did you notice about them?

B Why do you think we repeated each seven basepair sequence three times? In other words, why didn’t we just use TTTAGGG and CCCTAAA?

We need to create a couple of functions to use inside of the for loop. create_windows creates windows along the genome of a chosen window size. count_sequence will count the occurrence of a specified sequence within a given window.

create_windows <- function(genome, window_size) {
  windows <- substring(genome, seq(1, nchar(genome), window_size),
                       seq(window_size, nchar(genome) + window_size, window_size))
  return(windows)
}

count_sequence <- function(window, sequence) {
  count <- sum(countPattern(sequence, window))
  return(count)
}

Now that the sequences are created, we can create an empty data frame and run a for loop to count how many times the telomeric sequence appears in windows across the genome.

# We need to create an empty data frame to store each iteration
telo_count.table <- data.frame()

# Define Window Size
window_size <- 1000000

# Run a for loop for all of the entries within the fasta
for (i in 1:length(genome.fa)) {
    # Extract the sequence name
    seq_name <- names(genome.fa[i])
    # Make sure the sequence is read as characters
    sequence <- as.character(genome.fa[i])

    # Split the sequence into windows
    windows <- create_windows(genome = sequence, window_size = window_size)

    # Count occurrences of the telomere fr sequence in each window
    telo_fr_counts <- sapply(windows, count_sequence, sequence = telo_fr_repeat)
  
    # Count occurrences of the telomere rv sequence in each window
    telo_rv_counts <- sapply(windows, count_sequence, sequence = telo_rv_repeat)

    # Create a data table for the window counts
    dt_chromosome <- data.frame(chromosome = rep(seq_name, length(windows)),
                                window = seq_along(windows),
                                telo_fr_counts = telo_fr_counts,
                                telo_rv_counts= telo_rv_counts,
                                telo_all_counts = telo_fr_counts + telo_rv_counts,
                                row.names = NULL)
    
    # Bind the output of each iteration to the table
    telo_count.table <- rbind(telo_count.table, dt_chromosome)
    
    }

# View the table to help with answering questions
telo_count.table 

Exercise Ten

A What window size are we creating? Why do you think we chose that window size?

B Look through the table. What is the most common count for each window? Where do you see the highest number of each count?

C What do the numbers in the telo_fr_Counts column represent? Is it basepairs? If not, how would we calculate the number of basepairs per window?

With this table, we can start to filter and assess whether or not we have captured telomere presence in our sequencing.
Note: This is a quick way to take a look and visualize telomere presence. There are much more in depth things you can do to identify if your telomere assembly is accurate, but this is a good starting point.

# First, we filter for rows with some telomere counts
filt_telo.table <- telo_count.table %>%
  filter(telo_all_counts > 20)

What do you think we’re looking for here?

It’s time to prepare for plotting. To start, we need to make a table of the lengths of the fasta sequence. This will be important for plotting. Get the lengths with the following script:

chr_length.table <- data.frame(chromosome = names(genome.fa), length = width(genome.fa))

With this new data table, we can combine it with out filt_telo.table to create our plotting table.

plot.table <- left_join(chr_length.table, filt_telo.table, by = "chromosome") %>%
  mutate(position=(window-1) * window_size + (window_size/2), # position in middle of window.
         telo_size=telo_all_counts*21) %>%
  select(-telo_fr_counts, telo_rv_counts)

plot.table

Exercise Eleven

Here is the skeleton of the plotting code. We want a bar plot where each bar represents a chromosome. Then we want a point for each telomere found, with the size of the point corresponding to the size of the telomere.

A Add in the missing components to the code to make this plot. Additionally, make sure to provide a y axis label, and title.

plot.table %>% ggplot(aes(x = chromosome, y = length)) +
    geom_col(fill = "slateblue", position="identity") +
#    geom_point( aes(y=, size=)) + # add in the appropriate aesthetics and uncomment
    theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5))

B How many telomeres are captured? (No Code needed)

C How many contigs likely correspond to ~complete chromosomes? (No code needed)

D What would your next steps be to further polish this assembly? (No code needed)

Important: please run the code below in preparation for Thursday’s lab. It will take about 18 hours to complete.

First make sure you are in the right conda environment

conda activate pb-minimap2

Now map the PacBio reads back to the assembled genome

cd ~/S_div_assembly
mkdir methylation
cd methylation
time pbmm2 align ../S_div.asm.bp.p_ctg.fa /revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads/m84066_230701_203753_s1.hifi_reads.default.bam --preset HIFI  --sort --num-threads 8 > S_div_hifi_aligned.bam