Genome Assembly from PacBio HiFi reads

Important: before leaving today run the code at the very end of this lab

Regardless of whether or not you finish the exercises, please run the code at the end of this lab page before leaving today. We need the output for Thursday.

Intro

Sequencing has become inexpensive enough, reads have become long enough, and algorithms have become good enough that it is quite common for a lab to decide to sequence and assembly genomes for organisms that they are working on.

This lab will take you through some basic assembly and evaluation procedures. We will be working with Pacific Biosciences reads generated from genomics DNA extracted from Streptanthus diversifolius (Variable Leaf Jewelflower), a California native wildflower.

Pacific Biosciences technology generates long reads, up to 40kb in length. This length makes assembly much easier than using short read data such as Illumina.

Clone the repo

Click on the assignment link to create a repo, and download it. You will work in Assignment_07_template.Rmd

Read Stats

PacBio includes stats on the reads. Open the file /revio-data/Revio/Report_for_dataset_PB950_Std1-Rep1_JewelFlower_HiFiv3_cell1.pdf And find the page that has stats on read length and quality.

Note: the quality score is a representation of the probability of an error in the sequence. To convert a score to a p-value, you can use this code in R:

qscore <- 22 # replace this number with the mean quality from the run
10^(-qscore / 10)

I will explain this scoring method in more detail in the Illumina lab.

Exercise One (No code needed)

A Create a markdown table that has the mean, median, and N50 read length, and median quality.

B Does it surprise you that N50 is higher than the mean or median length or does that make sense?

C From these stats does this look like a successful PacBio run to you?

D Based on flow cytometry, we expect the genome size to be around 350MB. If that is true, how much coverage should we have from this sequencing?

Assemble the reads

We already assembled the reads using hifiasm in the previous lab.

If you ran the assembly command as directed, they assembly should be in ~/S_div_assembly

There a bunch of files there. The primary assembly is in S_div.asm.bp.p_ctg.gfa. The files S_div.asm.bp.hap1.p_ctg.gfa and S_div.asm.bp.hap2.p_ctg.gfa contain approximate haplotype phased assemblies of the two haploid genomes. We will use the primary assembly for this lab.

Convert to fasta

hifiasm outputs its assembly in Graphical Assembly Format (gfa). We need to convert this to fasta format.

Use this code to do the conversion:

cd ~/S_div_assembly
awk '/^S/{print ">"$2;print $3}' S_div.asm.bp.p_ctg.gfa > S_div.asm.bp.p_ctg.fa

QC the assembly

Basic stats

First, install the seqkit package

conda activate pb-minimap2
conda install bioconda::seqkit

Now run it:

seqkit stats -N 50,80,90  S_div.asm.bp.p_ctg.fa

Exercise Two (No Code Needed)

A How does the total assembly length compare to expectation?

B We expect 14 chromosomes. What would the average chromosome length be?

C With this in mind, interpret the max length, N50, and N80.

D Based on these simple stats, what is your initial assessment of this assembly?

14 largest sequences

Since we expect 14 chromosomes, let’s see what the size of the largest scaffolds are.

First, sort by length

seqkit sort --by-length --reverse S_div.asm.bp.p_ctg.fa > S_div.asm.bp.p_ctg_sort.fa

Next, create a fasta index file (.fai), which will contain the lengths by default

samtools faidx S_div.asm.bp.p_ctg_sort.fa

The results will be in S_div.asm.bp.p_ctg_sort.fa.fai

Exercise Three. R Code Needed

A Import the .fai into R and make a bar plot that shows the length of the 25 longest contigs. Check the link above to figure out which column you should be plotting.

Important: the .fai file does not have column headers, be sure to import it appropriately. Read the help file for whichever read function you use in R to figure out how to do this.

B What percentage of the assembly is contained in the 14 longest contigs?

BUSCO Analysis

One was to benchmark genome assembly completeness is by analyzing the presence of universal single-copy orthologs BUSCO analysis.

The busco installed on your instance is out-of-date and missing some dependencies. Let’s remove it and install a fresh copy

sudo apt remove busco
conda activate pb-minimap2 # you should already be in this environment, but just in case...
conda install -c conda-forge -c bioconda busco=5.7.1

The authors of busco have created curated lists of genes that are expected to be present as single-copy in all genomes. There are different BUSCO lists for different branches along the tree of life.

The busco tool searches an assembly for the presence of the specified BUSCO list.

You can see a list of available busco lists by entering:

busco --list-datasets

You will want to use brassicales_odb10

Exercise Four Linux code needed Figure out how to run busco using the brassicales_odb10 data set, the assembly fasta file, and 7 cpus. You can get help on busco by typing busco --help or by checking the user manual

A What command did you use? (This will take about 3 minutes to run)

B Examine the BUSCO results and include them here.

C Interpret the results with respect to completeness. See the user guide for help on interpreting the results.

D Interpret the results with respect to single copy versus duplication. Make a hypothesis regarding the duplication rate.

Exercise Five Linux code needed

Most of our data is in contigs larger than 1MB. If we subset our assembly to only retain contigs > 1MB, would our BUSCO score still be good?

A Use the seqkit seq command to create a new fasta file that contains all scaffolds > 1MB

B What command can you use to make sure that it worked? Show the command and output below

C Run BUSCO on this 1MB data set, provide the results, and comment on the results relative to BUSCO on the full dataset

Exercise Six Linux Code Needed

As noted above, we expect 14 chromosomes. If we only examine the 14 largest scaffolds, how good is our BUSCO score?

A Use Linux commands to create a file that contains the names of the 14 longest scaffolds. You will want to start with S_div.asm.bp.p_ctg_sort.fa.fai

B
Use the seqkit grep command to create a new fasta file that has just these 14 scaffolds

C Run BUSCO on the 14 largest scaffolds. Show results. Have we retained enough of the gene space just to work with these scaffolds?

Compare to Arabidopsis

One interesting analysis to do is to compare the genomic organization of our assembly to that of a relative such as Arabidopsis thaliana (Both species are Brassicaceae).

Relevant tools to perform such a comparison are the command line tool MUMmer or the web tool CoGe. Mummer has some idiosyncrasies that make it challenging to use for this course but an detailed plot from a Mummer analysis of S. diversifolius vs A. thaliana is shown here:

For this plot an all by all 6-frame translated protein similarity search was carried out between the two genomes. Dots represent positions where similar genes were found.

Exercise Seven

A Examine the MUMmer plot, above. In particular look along the rows: in general, how many chromosomes in S. diversifolius are matched by each position on an A. thaliana chromosome?

B Is the result consistent with the BUSCO singleton vs duplicate analysis?

C What is the most likely evolutionary explanation for the differences between these two genomes?

Annotation

A natural next step would be structural annotation…finding where the genes and coding-regions are in the genome. We don’t really have the computational resources / time to undertake this here, but should you need to do it, one of the most common tools is BRAKER.

An interesting alternative has recently been released that used deep learning to find genes: Helixer

Find telomeres

Telomeres are repetitive sequences at the end of DNA and are a pretty exciting research frontier right now. If you want to learn a little more, here is a brief overview. The repetitive sequence of telomeres is highly conserved, with telomeres in plants typically consisting of the repeat TTTAGGG. The telomere repeat in vertebrates is TTAGGG.

Identify telomeric sequence in your assembly

We have started the creation of a nice reference genome! A good quality control check is to determine if you have captured telomeres in your assembly. Telomeres are hard to capture because they are highly repetitive sequences, making them difficult to assemble through. A telomere-to-telomere genome (a genome where all of the chromosomes have been sequenced all the way through, from the starting telomere to the ending telomere) has become the gold standard for genome assembly and has just recently become possible. Here is a page on the telomere-to-telomere human genome, recently completed with help from a lab at UC Davis.

How can we determine if the genome we created has captured telomeres? We can use the fact that telomeric sequence is conserved! Remember, the most common repeat in plants in TTTAGGG. We will use this, what we know about genome structure, and R Studio to see if we have captured these telomeric sequences.

Lets make sure we have the necessary libraries

library(tidyverse)
library(seqinr)
library(Biostrings)

Note: Remember, if you don’t have a package installed, you can use the install.packages() command. All of these should already be installed for you though.

Read in the genome assembly file for the 14 longest contigs

genome.fa <- readDNAStringSet("~/S_div_assembly/S_div.asm.bp.p_ctg_14_longest.fa")

Now assign a variable to the telomere repeat.

telo_fr_repeat <- "TTTAGGGTTTAGGGTTTAGGG"
telo_rv_repeat <- "CCCTAAACCCTAAACCCTAAA"

Exercise Eight (No code needed)
There were two variables created, telo_fr_repeat and telo_rv_repeat.

A How are these two variables related? What did you notice about them?

B Why do you think we repeated each seven basepair sequence three times? In other words, why didn’t we just use TTTAGGG and CCCTAAA?

We need to create a couple of functions to use inside of the for loop. create_windows creates windows along the genome of a chosen window size. count_sequence will count the occurrence of a specified sequence within a given window.

create_windows <- function(genome, window_size) {
  windows <- substring(genome, seq(1, nchar(genome), window_size),
                       seq(window_size, nchar(genome) + window_size, window_size))
  return(windows)
}

count_sequence <- function(window, sequence) {
  count <- sum(countPattern(sequence, window))
  return(count)
}

Now that the sequences are created, we can create an empty data frame and run a for loop to count how many times the telomeric sequence appears in windows across the genome.

# We need to create an empty data frame to store each iteration
telo_count.table <- data.frame()

# Define Window Size
window_size <- 1000000

# Run a for loop for all of the entries within the fasta
for (i in 1:length(genome.fa)) {
    # Extract the sequence name
    seq_name <- names(genome.fa[i])
    # Make sure the sequence is read as characters
    sequence <- as.character(genome.fa[i])

    # Split the sequence into windows
    windows <- create_windows(genome = sequence, window_size = window_size)

    # Count occurrences of the telomere fr sequence in each window
    telo_fr_counts <- sapply(windows, count_sequence, sequence = telo_fr_repeat)
  
    # Count occurrences of the telomere rv sequence in each window
    telo_rv_counts <- sapply(windows, count_sequence, sequence = telo_rv_repeat)

    # Create a data table for the window counts
    dt_chromosome <- data.frame(chromosome = rep(seq_name, length(windows)),
                                window = seq_along(windows),
                                telo_fr_counts = telo_fr_counts,
                                telo_rv_counts= telo_rv_counts,
                                telo_all_counts = telo_fr_counts + telo_rv_counts,
                                row.names = NULL)
    
    # Bind the output of each iteration to the table
    telo_count.table <- rbind(telo_count.table, dt_chromosome)
    
    }

# View the table to help with answering questions
telo_count.table 

Exercise 9

A What window size are we creating? Why do you think we chose that window size?

B Look through the table. What is the most common count for each window? Where do you see the highest number of each count?

C What do the numbers in the telo_fr_Counts column represent? Is it basepairs? If not, how would we calculate the number of basepairs per window?

With this table, we can start to filter and assess whether or not we have captured telomere presence in our sequencing.
Note: This is a quick way to take a look and visualize telomere presence. There are much more in depth things you can do to identify if your telomere assembly is accurate, but this is a good starting point.

# First, we filter for rows with some telomere counts
filt_telo.table <- telo_count.table %>%
  filter(telo_all_counts > 20)

What do you think we’re looking for here?

It’s time to prepare for plotting. To start, we need to make a table on the lengths of the fasta sequence. This will be important for plotting. Get the lengths with the following script:

chr_length.table <- data.frame(chromosome = names(genome.fa), length = width(genome.fa))

With this new data table, we can combine it with out filt_telo.table to create our plotting table.

plot.table <- left_join(chr_length.table, filt_telo.table, by = "chromosome") %>%
  mutate(position=(window-1) * window_size + (window_size/2), # position in middle of window.
         telo_size=telo_all_counts*21) %>%
  select(-telo_fr_counts, telo_rv_counts)

plot.table

Exercise 10

Here is the skeleton of the plotting code. We want a bar plot where each bar represents a chromosome. Then We want a point for each telomere found, with the size of the point corresponding to the size of the telomere.

Add in the missing components to the code to make this plot. Additionally, make sure to provide a y axis label, and title.

Then answer the question: How many telomeres does it look like the assembly captured? How can you tell with this graph?”

plot.table %>% ggplot(aes(x = chromosome, y = length)) +
    geom_col(fill = "slateblue", position="identity") +
#    geom_point( aes(y=, size=)) + # add in the appropriate aesthetics and uncomment
    theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5))

Important: please run the code below in preparation for Thursday’s lab. It will take about 18 hours to complete.

cd ~/S_div_assembly
mkdir methylation
cd methylation
time pbmm2 align ../S_div.asm.bp.p_ctg.fa /revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads/m84066_230701_203753_s1.hifi_reads.default.bam --preset HIFI  --sort --num-threads 8 > S_div_hifi_aligned.bam