CpG Methylation Analysis from HiFi Reads

Introduction

DNA methylation is one of the major types of epigenetic marks on chromatin (along with histone modifications). The most commonly methylated base in higher eukaryotes is cytosine, but adenine residues can also be methylated.

Cytosine methylation can occur in different sequence contexts (also referred to as different sequence motifs). In animals, it is C residues in the motif CG (aka CpG) that are commonly methylated, whereas in plants there are three different motifs methylated: CG, CHH, and CHG, where “H” means any base other than “C”. In plants, C residues in each of those motifs is modified by a different methyl transferase. If you want to know more about plant DNA methylation, read this detailed review article. (NOT required reading for this lab).

Cytosine methylation is thought to play a major role in silencing of transposable elements as well as regulation of gene expression. DNA methylation can be fairly dynamic and can play a role in regulation of gene expression during development, in response to changing environmental conditions, and parental imprinting. Refer to the article linked in the previous paragraph for details.

The traditional method for determining the location of methylated cytosines in the genome is bisulphite sequencing, which can be somewhat challenging to achieve for inexperienced labs.

Recently methods have been developed for PacBio and ONT sequencing that can detect DNA methylation without the need for complicated preparatory steps. For PacBio, detection relies on the fact that the kinetics of DNA polymerization are different when the template strand has a methylated cytosine as compared to an unmethylated cytosine. Detection of methylated bases in this manner has its own challenges and requires a deep-learning model. Luckily, with the deep-learning model in place, the results show a 0.99 correlation with bisulphite sequencing data.

Currently, every PacBio Revio sequencing run contains information on CG methylation. Although CHG and CHH are also important in plants, PacBio does not currently provide this information (ONT does).

Goals

  1. Extract the CG methylation data for the S. diversifolius genome.
  2. Determine if the CG methylation marks follow expected patterns in genes.
  3. Calculate methylation proportion for each gene and ask if that correlates with gene expression levels
  4. Display methylation levels across the chromosome and ask if that pattern is non-random at the chromosome level.

Outline of steps

  1. Map Hifi Reads back to the genome assembly (you have already done this).
  2. Get methylation percentage for each CG site in the genome.
  3. Load the data into R.
  4. Use R to summarize and analyze the methylation data.

Clone the repo

Click on the assignment link to create a repo, and download it. You will work in Assignment_08_template.Rmd

Activate the Conda environment

conda activate pb-minimap2

Read Methylation Stats

PacBio includes some stats on methylation in the reads. Open the file /revio-data/Revio/Report_for_dataset_PB950_Std1-Rep1_JewelFlower_HiFiv3_cell1.pdf And find the page that has the histogram of methylated sites.

Exercise 1 What are the two most probable states of methylation?

Map the reads

Each sequencing read from the PacBio sequencer has information on whether each CG in the sequence was methylated. But to make use of this information we need to know what position in the genome each base corresponds to.

To accomplish this we “map” or “align” the sequencing reads with our newly assembled genome. While there are many programs that align sequencing reads to genome assemblies, here we must use a specific one, pbmm2, a modified version of the program minimap2 program that works with PacBio bam files and preserves the methylation information.

this should have been run at the end of the last lab and DOES NOT need to be rerun. It takes ~ 1 day to complete.

If you did not get it to run, no worries, you can skip.

cd ~/S_div_assembly
mkdir methylation
cd methylation
time pbmm2 align ../S_div.asm.bp.p_ctg.fa /revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads/m84066_230701_203753_s1.hifi_reads.default.bam --preset HIFI  --sort --num-threads 8 > S_div_hifi_aligned.bam

The resulting output is a .bam file. We will learn more about this format in future lab.

Index the bam

Next we need to make an index of the bam file. If you did not get this to run, no worries, you can skip it

samtools index --threads 3- S_div_hifi_aligned.bam 
# 5 minutes

Compute CpG scores

The next step is get a summary of CG methylation. We use PcBio’s CpG Tools for this. If you did not get this to run, no worries, you can skip it.

aligned_bam_to_cpg_scores \
  --bam S_div_hifi_aligned.bam  \
  --output-prefix S_div_cpg \
  --threads 30
# 5.5 minutes

I have placed a copy of the output file from this command in /revio-data/methylation/ in case you were not able to run the commands.

Create a fasta index file of the 14 longest contig assembly.

We will need this in R. You should have a fasta file with the 14 longest contigs from Assignment 07.

Note: if you named your file differently you will need to change the file name below.

cd ~/S_div_assembly
conda activate pb-minimap2
seqkit faidx S_div.asm.bp.p_ctg_14_longest.fa

Analyze CpG in R

Need to install some packages: (a bit slow). Only needs to be run once.

install.packages("ggnewscale")

BiocManager::install("genomation", update=FALSE)
BiocManager::install("plyranges", update=FALSE)
BiocManager::install("ggbio", update=FALSE)

Now, get started. Load the libraries

library(tidyverse)
library(genomation)
library(GenomicRanges)
library(plyranges)
library(ggbio)
library(ggnewscale)

If you get an error loading library(ggbio), then quit Rstudio and at the Linux command line, run the following to update R and its packages.

sudo apt update
sudo apt upgrade

import data

Contig lengths

first, import the index of the 14 longest fasta to get the lengths.

contigLengths.df<- read_delim("~/S_div_assembly/S_div.asm.bp.p_ctg_14_longest.fa.fai", col_names = c("contig", "length")) %>%
  select("contig", "length") %>%
  arrange(contig)
## Rows: 14 Columns: 5
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): contig
## dbl (4): length, X3, X4, X5
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
contigLengths.df
## # A tibble: 14 × 2
##    contig       length
##    <chr>         <dbl>
##  1 ptg000001l 23242224
##  2 ptg000002l 21072181
##  3 ptg000003l 25805113
##  4 ptg000004l 23616377
##  5 ptg000005l 19997638
##  6 ptg000006l 23507006
##  7 ptg000008l 18067884
##  8 ptg000009l 25915809
##  9 ptg000010l 21346277
## 10 ptg000011l 25346973
## 11 ptg000012l 21085664
## 12 ptg000013l 25943873
## 13 ptg000014l 26576959
## 14 ptg000020l 24231524

Next we need to create a GRanges object that contains info on the contigs. GRanges objects are a class of R/Bioconductor objects specifically designed to hold genomic information. They have numerous helpful methods associated with them.

Exercise 2 Complete the code below (replace each XXXX) to create the object contigs.gr. You may want to look at the help for GRanges and IRanges (in particular the “constructor” section).

The info you need to create the GRanges object is in contigLengths.df

contigs.gr <-  GRanges(seqnames = XXXX,
                             ranges = IRanges(start=XXXX, end=XXXX))

# We need to specify the ranges we are interested in (code above) and the total length of each contig
# In our case these are the same thing, but we do have to specify it in both places

seqlengths(contigs.gr) <- XXXX

contigs.gr

If you are successful your object should look like this:

## GRanges object with 14 ranges and 0 metadata columns:
##          seqnames     ranges strand
##             <Rle>  <IRanges>  <Rle>
##    [1] ptg000001l 1-23242224      *
##    [2] ptg000002l 1-21072181      *
##    [3] ptg000003l 1-25805113      *
##    [4] ptg000004l 1-23616377      *
##    [5] ptg000005l 1-19997638      *
##    ...        ...        ...    ...
##   [10] ptg000011l 1-25346973      *
##   [11] ptg000012l 1-21085664      *
##   [12] ptg000013l 1-25943873      *
##   [13] ptg000014l 1-26576959      *
##   [14] ptg000020l 1-24231524      *
##   -------
##   seqinfo: 14 sequences from an unspecified genome

import the gff

gff files specify the location of genes (and other items) in a genome.
We need this to look at patterns of methylation relative to genes.

# read the gff.  We filter to only keep the entries pertaining to mRNAs
gff <- gffToGRanges("/revio-data/methylation/HiFiasm_Sdiv_2026.gff3", filter = "mRNA") 

# subset to only keep genes that are on the 14 contigs we are working with
gff <- gff[seqnames(gff) %in% contigLengths.df$contig] 
seqlevels(gff) <- seqlevelsInUse(gff)

# specify the lengths of the contigs

seqlengths(gff) <-  contigLengths.df %>% pull(length, name = contig)

# only keep the metadata cols that we need
mcols(gff) <- mcols(gff)[,c("type", "ID", "Name")]

gff
## GRanges object with 32984 ranges and 3 metadata columns:
##             seqnames            ranges strand |     type                     ID                   Name
##                <Rle>         <IRanges>  <Rle> | <factor>            <character>            <character>
##       [1] ptg000001l         5293-9443      + |     mRNA Sdiv_ptg000001l_0001-R Sdiv_ptg000001l_0001-R
##       [2] ptg000001l        9318-11441      - |     mRNA Sdiv_ptg000001l_0002-R Sdiv_ptg000001l_0002-R
##       [3] ptg000001l       11148-11327      + |     mRNA Sdiv_ptg000001l_0003-R Sdiv_ptg000001l_0003-R
##       [4] ptg000001l       18869-20479      + |     mRNA Sdiv_ptg000001l_0004-R Sdiv_ptg000001l_0004-R
##       [5] ptg000001l       20475-22524      - |     mRNA Sdiv_ptg000001l_0005-R Sdiv_ptg000001l_0005-R
##       ...        ...               ...    ... .      ...                    ...                    ...
##   [32980] ptg000020l 24212045-24222179      + |     mRNA Sdiv_ptg000022l_2046-R Sdiv_ptg000022l_2046-R
##   [32981] ptg000020l 24223034-24224435      + |     mRNA Sdiv_ptg000022l_2047-R Sdiv_ptg000022l_2047-R
##   [32982] ptg000020l 24224588-24225454      - |     mRNA Sdiv_ptg000022l_2048-R Sdiv_ptg000022l_2048-R
##   [32983] ptg000020l 24225090-24225848      + |     mRNA Sdiv_ptg000022l_2049-R Sdiv_ptg000022l_2049-R
##   [32984] ptg000020l 24225860-24226609      + |     mRNA Sdiv_ptg000022l_2050-R Sdiv_ptg000022l_2050-R
##   -------
##   seqinfo: 14 sequences from an unspecified genome

Exercise 3 Examine the gff object (output above). What does each row represent? What information is given in the seqnames, ranges, strand, and ID columns?

Import the methylation data

If you didn’t generate the file yourself, you can use /revio-data/methylation/S_div_cpg.combined.bed in the command below

cpg <- readGeneric("~/S_div_assembly/methylation/S_div_cpg.combined.bed.gz",
                    skip = 8,
                    meta.cols=list(score=4,
                                   coverage=6))

cpg <- cpg[seqnames(cpg) %in% contigLengths.df$contig] 
seqlevels(cpg) <- seqlevelsInUse(cpg)
seqlengths(cpg) <- contigLengths.df$length %>% set_names(contigLengths.df$contig)


cpg
## GRanges object with 7146232 ranges and 2 metadata columns:
##               seqnames            ranges strand |     score  coverage
##                  <Rle>         <IRanges>  <Rle> | <numeric> <integer>
##         [1] ptg000001l         4575-4576      * |      93.7        35
##         [2] ptg000001l         4642-4643      * |      94.8        35
##         [3] ptg000001l         4648-4649      * |      94.3        35
##         [4] ptg000001l         4702-4703      * |      73.2        35
##         [5] ptg000001l         4727-4728      * |      95.1        35
##         ...        ...               ...    ... .       ...       ...
##   [7146228] ptg000020l 24227102-24227103      * |       3.8        64
##   [7146229] ptg000020l 24227224-24227225      * |       2.6        65
##   [7146230] ptg000020l 24227236-24227237      * |       3.6        65
##   [7146231] ptg000020l 24227253-24227254      * |       3.5        65
##   [7146232] ptg000020l 24227559-24227560      * |       3.8        65
##   -------
##   seqinfo: 14 sequences from an unspecified genome

Determine the average methylation pattern at transcripton start sites

Do methylation patterns change near the transcription start site of genes? To ask that question we need to create an object that defines a window of interest around the transcription start site of each gene. Since have the gff this should be easy (and it is, thanks to GRanges methods)

Create an object with windows around transcription start sites.

Exercise 4 Use promoters() to create a GRanges object that defines a 4000bp wide window surrounding the transcription start site (2000bp upstream and 2000bp downstream of the transcription start site).

First, get help on the promoters function:

?`promoters,GenomicRanges-method`

Then fill this in

TSS <- promoters(XXXX)
TSS

When you are done, TSS should look like this

## GRanges object with 32984 ranges and 3 metadata columns:
##             seqnames            ranges strand |     type                     ID                   Name
##                <Rle>         <IRanges>  <Rle> | <factor>            <character>            <character>
##       [1] ptg000001l         3293-7292      + |     mRNA Sdiv_ptg000001l_0001-R Sdiv_ptg000001l_0001-R
##       [2] ptg000001l        9442-13441      - |     mRNA Sdiv_ptg000001l_0002-R Sdiv_ptg000001l_0002-R
##       [3] ptg000001l        9148-13147      + |     mRNA Sdiv_ptg000001l_0003-R Sdiv_ptg000001l_0003-R
##       [4] ptg000001l       16869-20868      + |     mRNA Sdiv_ptg000001l_0004-R Sdiv_ptg000001l_0004-R
##       [5] ptg000001l       20525-24524      - |     mRNA Sdiv_ptg000001l_0005-R Sdiv_ptg000001l_0005-R
##       ...        ...               ...    ... .      ...                    ...                    ...
##   [32980] ptg000020l 24210045-24214044      + |     mRNA Sdiv_ptg000022l_2046-R Sdiv_ptg000022l_2046-R
##   [32981] ptg000020l 24221034-24225033      + |     mRNA Sdiv_ptg000022l_2047-R Sdiv_ptg000022l_2047-R
##   [32982] ptg000020l 24223455-24227454      - |     mRNA Sdiv_ptg000022l_2048-R Sdiv_ptg000022l_2048-R
##   [32983] ptg000020l 24223090-24227089      + |     mRNA Sdiv_ptg000022l_2049-R Sdiv_ptg000022l_2049-R
##   [32984] ptg000020l 24223860-24227859      + |     mRNA Sdiv_ptg000022l_2050-R Sdiv_ptg000022l_2050-R
##   -------
##   seqinfo: 14 sequences from an unspecified genome

End of Exercise 4

Find the CpG score in each TSS window

We can use scoreMatrix() to summarize the CpG data at each bp position in each TSS window

cpg_sm <- cpg %>% 
  mcolAsRleList("score") %>%
  ScoreMatrix(windows = TSS, strand.aware = TRUE)

And this can be plotted using plotMeta()

plotMeta(cpg_sm, xcoords = -1999:2000)

Exercise 5: On the plot above “0” is the transcription start site and “average score” is the average CpG methylation across all genes. What does the plot tell you about the pattern of methylation in promoters and the transcribed regions of genes?

Distribution of of methylation

The above analysis showed us the average CpG methylation across a gene. Is there substantial gene-to-gene variation in methylation? Let’s make a histogram to start to answer this question.

First we need to calculate the average methylation of CpG in each gene separately. We want to limit our analysis to a smaller portion of the gene focused on the gene body. Let’s take a 1000bp region starting at the TSS. We do NOT need to create a new set of windows or a new ScoreMatrix, we can just subset the one that we have.

Take a look at the score matrix dimensions

dim(cpg_sm)
## [1] 32984  4000

Exercise 6

(A) What do the rows and columns of this matrix represent?

(B) Subset the matrix to keep the 1000bp starting at the TSS. Place the results in an object cpg_sm_genebody (Hnt: one way to do this is with the square brackets)

(C) Calculate the average CpG score for each gene in cpg_sm_genebody. Place the results in cpg_avg_genebody. (Hint: rowMeans or colMeans may be helpful.)

The first six entries should look like this:

##        1        2        3        4        5        6 
## 3.107895 3.956667 4.425806 3.151724 2.880000 2.892683

One you have the averages, when can make a tibble that has those averages and the gene names:

cpg_genebody.df <- tibble(Gene_ID=mcols(TSS)$ID,
                          CpG_methylation = cpg_avg_genebody)

head(cpg_genebody.df)
## # A tibble: 6 × 2
##   Gene_ID                CpG_methylation
##   <chr>                            <dbl>
## 1 Sdiv_ptg000001l_0001-R            3.11
## 2 Sdiv_ptg000001l_0002-R            3.96
## 3 Sdiv_ptg000001l_0003-R            4.43
## 4 Sdiv_ptg000001l_0004-R            3.15
## 5 Sdiv_ptg000001l_0005-R            2.88
## 6 Sdiv_ptg000001l_0006-R            2.89

(D) Make a histogram of the results, both with a normal y-scale and a log10 transformed y-scale (2 separate plots are fine).

(E) Interpret the plot(s). What is the y-axis? What does this tell you about the methylation status of genes?

Is methylation correlated with gene expression levels?

Is there a correlation between CpG methylation in gene bodies and expression?

The file S_div_wp0_cpm.csv in the input directory has RNA expression values from seeds imbibed in water for 24 hours in a growth chamber. (Note that that methylation data comes from leaves of mature, field plants, so this is not ideal).

Get the RNA expression data

rna <- read_csv("../input/S_div_wp0_cpm.csv")
rna

Exercise 7

(A) Combine the rna and cpg data frames. Hint: you are going to have to mutate the Gene_ID column of one of these data frames, either with str_c or with str_remove depending on which way you decide to go

(B) Make a scatter plot (and maybe add a smoothing line) to show the relationship between cpg and rna. I recommend setting alpha to 0.1 or 0.2 for the points.

(C) Use cor.test() with method = "spearman" (since cpg methylation is not normally distributed) to test for a correlation. A low p-value means that the correlation is significant. rho is the amount of correlation (0 = no correlation, 1 = perfect correlation, -1 = perfect inverse correlation)

(D) Given the plot and cor test, what do you conclude about the relationship between CpG methylation and RNA expression in these samples?

Examine the methylation pattern genome wide

Having looked at the average pattern in genes, lets look at how methylation changes across the whole genome.

To do this, we divide the genome into tiles (or bins) and then compute the average methylation in each tile.

# Create tiles of 50,000 bp in length
tiles <- tileGenome(seqlengths(contigs.gr), tilewidth = 50000, cut.last.tile.in.chrom = TRUE)

Next we calculate the average cpg methylation in each bin

# Convert the cpg data to run-length encoding, required for computing average
cpg_rle <- mcolAsRleList(cpg, "score")

# average per bin
plotbins <- binnedAverage(tiles, cpg_rle, "CpG", na.rm = TRUE)

Now lets compute the density of genes in each bin. A value of 100 would mean that 100% of the bin encodes mRNA sequences. We will compute this on the fly and add it to the plotbins GRanges object.

mcols(plotbins) <- cbind(mcols(plotbins),
                         mcols(binnedAverage(tiles, coverage(gff)*100, "genedensity")))

Plot a circular plot

One way to plot genomic information is in a circular plot. We use the package ggbio to accomplish this. ggbio uses a lot of the same grammar as ggplot but is specialized for genomic data.

ggbio() +
  circle(plotbins, geom="heatmap", aes(x=midpoint, color=CpG)) +
  new_scale_color() +
  circle(plotbins, geom="heatmap", aes(x=midpoint, color=genedensity)) +
  scale_color_gradient(high="red") +
    circle(contigs.gr, geom="text", aes(label=seqnames)) 

Plot a linear plot

ggbio also allows us to feed a Granges object straight in to ggplot.

ggplot(plotbins) +
  geom_line(mapping=aes(x=midpoint, y=CpG), color="skyblue") +
  geom_line(mapping=aes(x=midpoint, y=genedensity), color="red") +
  facet_wrap(~seqnames)

Exercise 8

Is there a relationship between CpG methylation and gene density? If so, what is it? What are the genomic features that might explain the pattern that you see?