CpG Methylation Analysis from HiFi Reads Centromere Discovery

Introduction

DNA methylation is one of the major types of epigenetic marks on chromatin (along with histone modifications). The most commonly methylated base in higher eukaryotes is cytosine, but adenine residues can also be methylated.

Cytosine methylation can occur in different sequence contexts (also referred to as different sequence motifs). In animals, it is C residues in the motif CG (aka CpG) that are commonly methylated, whereas in plants there are three different motifs methylated: CG, CHH, and CHG, where “H” means any base other than “C”. In plants, C residues in each of those motifs is modified by a different methyl transferase. If you want to know more about plant DNA methylation, read this detailed review article. (NOT required reading for this lab).

Cytosine methylation is thought to play a major role in silencing of transposable elements as well as regulation of gene expression. DNA methylation can be fairly dynamic and can play a role in regulation of gene expression during development, in response to changing environmental conditions, and parental imprinting. Refer to the article linked in the previous paragraph for details.

The traditional method for determining the location of methylated cytosines in the genome is bisulphite sequencing, which can be somewhat challenging to achieve for inexperienced labs.

Recently methods have been developed for PacBio and ONT sequencing that can detect DNA methylation without the need for complicated preparatory steps. For PacBio, detection relies on the fact that the kinetics of DNA polymerization are different when the template strand has a methylated cytosine as compared to an unmethylated cytosine. Detection of methylated bases in this manner has its own challenges and requires a deep-learning model. Luckily, with the deep-learning model in place, the results show a 0.99 correlation with bisulphite sequencing data.

Currently, every PacBio Revio sequencing run contains information on CG methylation. Although CHG and CHH are also important in plants, PacBio does not currently provide this information (ONT does).

Goals

  1. Extract the CG methylation data for the S. diversifolius genome.
  2. Determine if the CG methylation marks follow expected patterns in genes.
  3. Calculate methylation proportion for each gene and ask if that correlates with gene expression levels
  4. Display methylation levels across the chromosome and ask if that pattern is non-random at the chromosome level.

Outline of steps

  1. Map Hifi Reads back to the genome assembly (you have already done this).
  2. Get methylation percentage for each CG site in the genome.
  3. Load the data into R.
  4. Use R to summarize and analyze the methylation data.

Clone the repo

Click on the assignment link to create a repo, and download it. You will work in Assignment_08_template.Rmd

Activate the Conda environment

conda activate pb-minimap2

Read Methylation Stats

PacBio includes some stats on methylation in the reads. Open the file /revio-data/Revio/Report_for_dataset_PB950_Std1-Rep1_JewelFlower_HiFiv3_cell1.pdf And find the page that has the histogram of methylated sites.

Exercise 1 What are the two most probable states of methylation?

Map the reads

Each sequencing read from the PacBio sequencer has information on whether each CG in the sequence was methylated. But to make use of this information we need to know what position in the genome each base corresponds to.

To accomplish this we “map” or “align” the sequencing reads with our newly assembled genome. While there are many programs that align sequencing reads to genome assemblies, here we must use a specific one, pbmm2, a modified version of the program minimap2 program that works with PacBio bam files and preserves the methylation information.

this should have been run at the end of the last lab and DOES NOT need to be rerun. It takes ~ 1 day to complete.

If you did not get it to run, no worries, you can skip.

cd ~/S_div_assembly
mkdir methylation
cd methylation
time pbmm2 align ../S_div.asm.bp.p_ctg.fa /revio-data/Revio/r84066_20230630_201715_2_A01/hifi_reads/m84066_230701_203753_s1.hifi_reads.default.bam --preset HIFI  --sort --num-threads 8 > S_div_hifi_aligned.bam

The resulting output is a .bam file. We will learn more about this format in future lab.

Index the bam

Next we need to make an index of the bam file. If you did not get this to run, no worries, you can skip it

samtools index --threads 7 S_div_hifi_aligned.bam 
# 20 minutes

Compute CpG scores

The next step is get a summary of CG methylation. We use PcBio’s CpG Tools for this. If you did not get this to run, no worries, you can skip it.

aligned_bam_to_cpg_scores \
  --bam S_div_hifi_aligned.bam  \
  --output-prefix S_div_cpg \
  --model /usr/local/src/pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/models/pileup_calling_model.v1.tflite \
  --threads 7
# 17 minutes

I have placed a copy of the output file from this command in /revio-data/methylation/ in case you were not able to run the commands.

Create a fasta index file of the 14 longest contig assembly.

We will need this in R. You should have a fasta file with the 14 longest contigs from Assignment 07.

Note: if you named your file differently you will need to change the file name below.

cd ~/S_div_assembly
conda activate pb-minimap2
seqkit faidx S_div.asm.bp.p_ctg_14_longest.fa

Analyze CpG in R

Need to install some packages: (a bit slow). Only needs to be run once.

install.packages("ggnewscale")

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("genomation", update=FALSE)
BiocManager::install("plyranges", update=FALSE)
BiocManager::install("ggbio", update=FALSE)

Now, get started. Load the libraries

library(tidyverse)
library(genomation)
library(GenomicRanges)
library(plyranges)
library(ggbio)
library(ggnewscale)

import data

Contig lengths

first, import the index of the 14 longest fasta to get the lengths.

contigLengths.df<- read_delim("~/S_div_assembly/S_div.asm.bp.p_ctg_14_longest.fa.fai", col_names = c("contig", "length")) %>%
  select("contig", "length") %>%
  arrange(contig)
## Rows: 14 Columns: 5
## ── Column specification ───────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): contig
## dbl (4): length, X3, X4, X5
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
contigLengths.df
## # A tibble: 14 × 2
##    contig       length
##    <chr>         <dbl>
##  1 ptg000001l 23258588
##  2 ptg000002l 20898951
##  3 ptg000003l 25805138
##  4 ptg000004l 25598964
##  5 ptg000005l 19994491
##  6 ptg000007l 23572188
##  7 ptg000008l 18043207
##  8 ptg000009l 25934314
##  9 ptg000010l 25671814
## 10 ptg000012l 25467740
## 11 ptg000013l 21085204
## 12 ptg000014l 26529351
## 13 ptg000015l 22446652
## 14 ptg000022l 24016919

Next we need to create a GRanges object that contains info on the contigs. GRanges objects are a class of R/Bioconductor objects specifically designed to hold genomic information. They have numerous helpful methods associated with them.

Exercise 2 Complete the code below (replace each XXXX) to create the object contigs.gr. You may want to look at the help for GRanges and IRanges (in particular the “constructor” section).

The info you need to create the GRanges object is in contigLengths.df

contigs.gr <-  GRanges(seqnames = XXXX,
                             ranges = IRanges(start=XXXX, end=XXXX))

# We need to specify the ranges we are interested in (code above) and the total length of each contig
# In our case these are the same thing, but we do have to specify it in both places

seqlengths(contigs.gr) <- XXXX

contigs.gr

If you are successful your object should look like this:

## GRanges object with 14 ranges and 0 metadata columns:
##          seqnames     ranges strand
##             <Rle>  <IRanges>  <Rle>
##    [1] ptg000001l 1-23258588      *
##    [2] ptg000002l 1-20898951      *
##    [3] ptg000003l 1-25805138      *
##    [4] ptg000004l 1-25598964      *
##    [5] ptg000005l 1-19994491      *
##    ...        ...        ...    ...
##   [10] ptg000012l 1-25467740      *
##   [11] ptg000013l 1-21085204      *
##   [12] ptg000014l 1-26529351      *
##   [13] ptg000015l 1-22446652      *
##   [14] ptg000022l 1-24016919      *
##   -------
##   seqinfo: 14 sequences from an unspecified genome

import the gff

gff files specify the location of genes (and other items) in a genome.
We need this to look at patterns of methylation relative to genes.

# read the gff.  We filter to only keep the entries pertaining to mRNAs
gff <- gffToGRanges("/revio-data/methylation/HiFiasm_S.div.small.gff3", filter = "mRNA") 

# subset to only keep genes that are on the 14 contigs we are working with
gff <- gff[seqnames(gff) %in% contigLengths.df$contig] 
seqlevels(gff) <- seqlevelsInUse(gff)

# specify the lengths of the contigs
seqlengths(gff) <- contigLengths.df$length %>% set_names(contigLengths.df$contig)

# only keep the metadata cols that we need
mcols(gff) <- mcols(gff)[,c("type", "ID", "Name")]

gff
## GRanges object with 33433 ranges and 3 metadata columns:
##             seqnames            ranges strand |     type                     ID
##                <Rle>         <IRanges>  <Rle> | <factor>            <character>
##       [1] ptg000001l         5286-9436      + |     mRNA Sdiv_ptg000001l_0001-R
##       [2] ptg000001l       11141-11320      + |     mRNA Sdiv_ptg000001l_0003-R
##       [3] ptg000001l       18862-20472      + |     mRNA Sdiv_ptg000001l_0004-R
##       [4] ptg000001l       32918-33824      + |     mRNA Sdiv_ptg000001l_0007-R
##       [5] ptg000001l       34123-34566      + |     mRNA Sdiv_ptg000001l_0008-R
##       ...        ...               ...    ... .      ...                    ...
##   [33429] ptg000022l 23979619-23982930      - |     mRNA Sdiv_ptg000022l_2041-R
##   [33430] ptg000022l 23984523-23991887      - |     mRNA Sdiv_ptg000022l_2042-R
##   [33431] ptg000022l 23993455-23993949      - |     mRNA Sdiv_ptg000022l_2044-R
##   [33432] ptg000022l 23994556-23996796      - |     mRNA Sdiv_ptg000022l_2045-R
##   [33433] ptg000022l 24009983-24010849      - |     mRNA Sdiv_ptg000022l_2048-R
##                             Name
##                      <character>
##       [1] Sdiv_ptg000001l_0001-R
##       [2] Sdiv_ptg000001l_0003-R
##       [3] Sdiv_ptg000001l_0004-R
##       [4] Sdiv_ptg000001l_0007-R
##       [5] Sdiv_ptg000001l_0008-R
##       ...                    ...
##   [33429] Sdiv_ptg000022l_2041-R
##   [33430] Sdiv_ptg000022l_2042-R
##   [33431] Sdiv_ptg000022l_2044-R
##   [33432] Sdiv_ptg000022l_2045-R
##   [33433] Sdiv_ptg000022l_2048-R
##   -------
##   seqinfo: 14 sequences from an unspecified genome

Exercise 3 Examine the gff object (output above). What does each row represent? What information is given in the seqnames, ranges, strand, and ID columns?

Import the methylation data

If you didn’t generate the file yourself, you can use /revio-data/methylation/S_div_cpg.combined.bed in the command below

cpg <- readGeneric("~/S_div_assembly/methylation/S_div_cpg.combined.bed",
                    meta.cols=list(score=4,
                                   coverage=6))

cpg <- cpg[seqnames(cpg) %in% contigLengths.df$contig] 
seqlevels(cpg) <- seqlevelsInUse(cpg)
seqlengths(cpg) <- contigLengths.df$length %>% set_names(contigLengths.df$contig)


cpg
## GRanges object with 7218137 ranges and 2 metadata columns:
##               seqnames            ranges strand |     score  coverage
##                  <Rle>         <IRanges>  <Rle> | <numeric> <integer>
##         [1] ptg000001l         4568-4569      * |      93.7        35
##         [2] ptg000001l         4635-4636      * |      94.8        35
##         [3] ptg000001l         4641-4642      * |      94.3        35
##         [4] ptg000001l         4695-4696      * |      73.2        35
##         [5] ptg000001l         4720-4721      * |      95.1        35
##         ...        ...               ...    ... .       ...       ...
##   [7218133] ptg000022l 24012497-24012498      * |       3.8        64
##   [7218134] ptg000022l 24012619-24012620      * |       2.6        65
##   [7218135] ptg000022l 24012631-24012632      * |       3.6        65
##   [7218136] ptg000022l 24012648-24012649      * |       3.5        65
##   [7218137] ptg000022l 24012954-24012955      * |       3.8        65
##   -------
##   seqinfo: 14 sequences from an unspecified genome

Determine the average methylation pattern at transcripton start sites

Do methylation patterns change near the transcription start site of genes? To ask that question we need to create an object that defines a window of interest around the transcription start site of each gene. Since have the gff this should be easy (and it is, thanks to GRanges methods)

Create an object with windows around transcription start sites.

Exercise 4 Use promoters() to create a GRanges object that defines a 4000bp wide window surrounding the transcription start site (2000bp upstream and 2000bp downstream of the transcription start site).

First, get help on the promoters function:

?`promoters,GenomicRanges-method`

Then fill this in

TSS <- promoters(XXXX)
TSS

When you are done, TSS should look like this

## GRanges object with 33433 ranges and 3 metadata columns:
##             seqnames            ranges strand |     type                     ID
##                <Rle>         <IRanges>  <Rle> | <factor>            <character>
##       [1] ptg000001l         3286-7285      + |     mRNA Sdiv_ptg000001l_0001-R
##       [2] ptg000001l        9141-13140      + |     mRNA Sdiv_ptg000001l_0003-R
##       [3] ptg000001l       16862-20861      + |     mRNA Sdiv_ptg000001l_0004-R
##       [4] ptg000001l       30918-34917      + |     mRNA Sdiv_ptg000001l_0007-R
##       [5] ptg000001l       32123-36122      + |     mRNA Sdiv_ptg000001l_0008-R
##       ...        ...               ...    ... .      ...                    ...
##   [33429] ptg000022l 23980931-23984930      - |     mRNA Sdiv_ptg000022l_2041-R
##   [33430] ptg000022l 23989888-23993887      - |     mRNA Sdiv_ptg000022l_2042-R
##   [33431] ptg000022l 23991950-23995949      - |     mRNA Sdiv_ptg000022l_2044-R
##   [33432] ptg000022l 23994797-23998796      - |     mRNA Sdiv_ptg000022l_2045-R
##   [33433] ptg000022l 24008850-24012849      - |     mRNA Sdiv_ptg000022l_2048-R
##                             Name
##                      <character>
##       [1] Sdiv_ptg000001l_0001-R
##       [2] Sdiv_ptg000001l_0003-R
##       [3] Sdiv_ptg000001l_0004-R
##       [4] Sdiv_ptg000001l_0007-R
##       [5] Sdiv_ptg000001l_0008-R
##       ...                    ...
##   [33429] Sdiv_ptg000022l_2041-R
##   [33430] Sdiv_ptg000022l_2042-R
##   [33431] Sdiv_ptg000022l_2044-R
##   [33432] Sdiv_ptg000022l_2045-R
##   [33433] Sdiv_ptg000022l_2048-R
##   -------
##   seqinfo: 14 sequences from an unspecified genome

End of Exercise 4

Find the CpG score in each TSS window

We can use scoreMatrix() to summarize the CpG data at each bp position in each TSS window

cpg_sm <- cpg %>% 
  mcolAsRleList("score") %>%
  ScoreMatrix(windows = TSS, strand.aware = TRUE)

And this can be plotted using plotMeta()

plotMeta(cpg_sm, xcoords = -1999:2000)

Exercise 5: On the plot above “0” is the transcription start site and “average score” is the average CpG methylation across all genes. What does the plot tell you about the pattern of methylation in promoters and the transcribed regions of genes?

Distribution of of methylation

The above analysis showed us the average CpG methylation across a gene. Is there substantial gene-to-gene variation in methylation? Let’s make a histogram to start to answer this question.

First we need to calculate the average methylation of CpG in each gene separately. We want to limit our analysis to a smaller portion of the gene focused on the gene body. Let’s take a 1000bp region starting at the TSS. We do NOT need to create a new set of windows or a new ScoreMatrix, we can just subset the one that we have.

Take a look at the score matrix dimensions

dim(cpg_sm)
## [1] 33433  4000

Exercise 6

(A) What do the rows and columns of this matrix represent?

(B) Subset the matrix to keep the 1000bp starting at the TSS. Place the results in an object cpg_sm_genebody (Hnt: one way to do this is with the square brackets)

(C) Calculate the average CpG score for each gene in cpg_sm_genebody. Place the results in cpg_avg_genebody. (Hint: rowMeans or colMeans may be helpful.)

The first six entries should look like this:

##        1        2        3        4        5        6 
## 3.107895 4.425806 3.151724 2.924324 2.855172 2.705882

One you have the averages, when can make a tibble that has those averages and the gene names:

cpg_genebody.df <- tibble(Gene_ID=mcols(TSS)$ID,
                          CpG_methylation = cpg_avg_genebody)

head(cpg_genebody.df)
## # A tibble: 6 × 2
##   Gene_ID                CpG_methylation
##   <chr>                            <dbl>
## 1 Sdiv_ptg000001l_0001-R            3.11
## 2 Sdiv_ptg000001l_0003-R            4.43
## 3 Sdiv_ptg000001l_0004-R            3.15
## 4 Sdiv_ptg000001l_0007-R            2.92
## 5 Sdiv_ptg000001l_0008-R            2.86
## 6 Sdiv_ptg000001l_0010-R            2.71

(D) Make a histogram of the results, both with a normal y-scale and a log10 transformed y-scale (2 separate plots are fine).

(E) Interpret the plot(s). What is the y-axis? What does this tell you about the methylation status of genes?

Is methylation correlated with gene expression levels?

Is there a correlation between CpG methylation in gene bodies and expression?

The file S_div_wp0_cpm.csv in the input directory has RNA expression values from seeds imbibed in water for 24 hours in a growth chamber. (Note that that methylation data comes from leaves of mature, field plants, so this is not ideal).

Get the RNA expression data

rna <- read_csv("../input/S_div_wp0_cpm.csv")
rna

Exercise 7

(A) Combine the rna and cpg data frames. Hint: you are going to have to mutate the Gene_ID column of one of these data frames, either with str_c or with str_remove depending on which way you decide to go

(B) Make a scatter plot (and maybe add a smoothing line) to show the relationship between cpg and rna. I recommend setting alpha to 0.1 or 0.2 for the points.

(C) Use cor.test() with method = "spearman" (since cpg methylation is not normally distributed) to test for a correlation. A low p-value means that the correlation is significant. rho is the amount of correlation (0 = no correlation, 1 = perfect correlation, -1 = perfect inverse correlation)

(D) Given the plot and cor test, what do you conclude about the relationship between CpG methylation and RNA expression in these samples?

Examine the methylation pattern genome wide

Having looked at the average pattern in genes, lets look at how methylation changes across the whole genome.

To do this, we divide the genome into tiles (or bins) and then compute the average methylation in each tile.

# Create tiles of 50,000 bp in length
tiles <- tileGenome(seqlengths(contigs.gr), tilewidth = 50000, cut.last.tile.in.chrom = TRUE)

Next we calculate the average cpg methylation in each bin

# Convert the cpg data to run-length encoding, required for computing average
cpg_rle <- mcolAsRleList(cpg, "score")

# average per bin
plotbins <- binnedAverage(tiles, cpg_rle, "CpG", na.rm = TRUE)

Now lets compute the density of genes in each bin. A value of 100 would mean that 100% of the bin encodes mRNA sequences. We will compute this on the fly and add it to the plotbins GRanges object.

mcols(plotbins) <- cbind(mcols(plotbins),
                         mcols(binnedAverage(tiles, coverage(gff)*100, "genedensity")))

Plot a circular plot

One way to plot genomic information is in a circular plot. We use the package ggbio to accomplish this. ggbio uses a lot of the same grammar as ggplot but is specialized for genomic data.

ggbio() +
  circle(plotbins, geom="heatmap", aes(x=midpoint, color=CpG)) +
  new_scale_color() +
  circle(plotbins, geom="heatmap", aes(x=midpoint, color=genedensity)) +
  scale_color_gradient(high="red") +
    circle(contigs.gr, geom="text", aes(label=seqnames)) 

Plot a linear plot

ggbio also allows us to feed a Granges object straight in to ggplot.

ggplot(plotbins) +
  geom_line(mapping=aes(x=midpoint, y=CpG), color="skyblue") +
  geom_line(mapping=aes(x=midpoint, y=genedensity), color="red") +
  facet_wrap(~seqnames)

Exercise 8

Is there a relationship between CpG methylation and gene density? If so, what is it? What are the genomic features that might explain the pattern that you see?

What about centromeres?

We’ve been able to cover a lot about genome assembly! We’ve assembled entire chromosomes from small pieces of DNA, compared our genome to other genomes, located telomeric sequences in the assembly, identified epigenetic marks like methylation, and looked at where genes are located on our assembly. Of course, there is always more we can do, and one of the biggest bioinformatic challenges in genome assembly today is detecting centromeres.

Centromeres are where the spindle fibers attach to chromosomes during mitosis and meiosis. Here is a brief refresher on centromeres. Centromeres in eukaryotes are characterized by the centromeric histone H3 variant (CenH3). If you need a quick refresher on nucleosomes and histones, check out this resource. A reliable way of identifying centromeres in your assembly is to find out where CenH3 is, but this is much easier said than done. CenH3 is an epigenetic modification, and to accurately identify where CenH3 is located we would need to do some fancy molecular biology like CUT&Tag.

So is there another way we can identify centromeres, preferably bioinformatically? Well, this is the big question for a lot of bioinformaticians. Earlier we mentioned that detecting centromeres is one of the biggest challenges in bioinformatics. One of the reasons is because centromeres are often very repetitive, which makes them hard to assemble. But telomeres are repetitive too, and we saw that with long read consensus sequencing, like PacBio HiFi, we are able to assemble and detect them. So what makes centromeres different? Remember that the repeat that defines telomeres are often conserved. Also, the repeat is across all chromosomes. We saw earlier that the TTTAGGG repeat defined the telomeres in all of our chromosomes.

Centromeres are a bit more complicated. The repeat size larger, around 100-200 bp, and they under go rapid evolution due to centromeric drive, which you can read more about here. This leads to a couple complications in their detection. One issue is that they are not as conserved as telomeres, with a lot of variation in their sequence. Another is that centromeric repeats can vary among chromosomes with the same organism! This makes their detection even more challenging.

So how are we going to tackle this issue? There have been a plethora of tools that have come out very recently to try and address this challenge. The tool we are going to try today is called RepeatOBserver. You can find the paper here, and the github for the tool here. The basic description of this tool is that it works by identifying regions of the genome with low repetitive sequence diversity. Essentially, where the same sequences are present.

This takes about 75 minutes to run, so we pre-ran it for you.

We aren’t going to have you run the initial analysis, but the code is here if you want it.

You do need to answer the exercises

(SKIP) Let’s start by installing the library in Rstudio.

#only needs to be done once
devtools::install_github("celphin/RepeatOBserverV1") #to install the package
  # Select 1:All to update all the required packages

(SKIP) load the library

library(RepeatOBserverV1) # to load the package

Once you have installed the library in R, we need to create a conda environment for the tool in the terminal. You will have to respond with y when asked for y/n.

(SKIP)

conda create -n repeat-observer

conda activate repeat-observer

conda install -c bioconda seqkit
conda install -c bioconda emboss
conda install -c conda-forge dos2unix

(SKIP) Make a new directory for this run called RepeatObserver, move into that directory, and download the script you need to run

cd ~/S_div_assembly

mkdir RepeatObserver

cd RepeatObserver

wget https://raw.githubusercontent.com/celphin/RepeatOBserverV1/main/Setup_Run_Repeats.sh

(SKIP) We need to make sure this script is runnable in Linux.

cd ~/S_div_assembly/RepeatObserver

dos2unix Setup_Run_Repeats.sh

To run the script properly, we need the genome file to be in the same directory as this script. You should have generated a 14 contig fasta file in the previous lab. Make a symbolic link to that file here. You may need to modify the file name in the command below.

(SKIP)

ln -s ../S_div.asm.bp.p_ctg_14_longest.fa ./

Now it is time to run the script. This will take about 15 minutes to complete with the resources we have. Please make sure to run it exactly like this, but the name of the genome (fasta file) will likely be different for you.

(SKIP)

bash Setup_Run_Repeats.sh -i streptanthus -f S_div.asm.bp.p_ctg_14_longest.fa -h H0 -c 7 -m 28000 -g FALSE

If it doesn’t work the first time, you will need to remove any directories the script creates before running it again.

Exercise 9

Look at the github for RepeatOBserver linked here to help answer these questions.

(A) What do the -i, -f, -c, and -m the parameters indicate?

(B) In our script, the -m parameter is set to 28000. What does that equate to in gigabytes?

After the 15 minutes is up, we will have a plethora of folders generated. First things first lets look at this file:

I have placed a copy of the output in /revio-data/RepeatObserver

# use this first one if you ran it yourself
#open output_chromosomes/streptanthus_H0-AT/Chr9/streptanthus_H0-AT_Chr9_roll_mean_Shannon_250.pdf

# use this one if you didn't run this yourself.
open /revio-data/RepeatObserver/output_chromosomes/streptanthus_H0-AT/Chr9/streptanthus_H0-AT_Chr9_roll_mean_Shannon_250.pdf 

Where we see the dips in the plot indicate regions of low sequence diversity. These are regions where we are seeing the same repeated sequences over and over.

Exercise 10

(A) In the plot you just opened, how many dips do you see?

(B) What do you think the largest dips represent?

(C) What do you think the smallest dip may represent?

Those larger dips are have a very strong signal. Large signals can mask smaller signals in data, and sometimes those smaller signals are things we are interested in. So lets try to correct for that. RepeatOBserver generates data to plot histograms for every chromosome, but we will need to edit them a little to get what we want. Normally I would not condone editing data output files like this, but we need a column defining what chromosome the data is from. Please copy this script exactly.

(SKIP)

cd output_chromosomes/streptanthus_H0-AT/Summary_output/output_data/

files=$(ls *Histogram*)

# Loop through each file
for file in $files; do
    # Extract Chr name from file name
    chr=$(echo "$file" | grep -oP 'Chr\d+')
    
    # Add a column with Chr name to the file
    awk -v chr="$chr" 'NR==1 {print $0, "Chr"} NR>1 {print $0, chr}' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
done

Next we need to create a file with all of this information combined, so that we can plot this out in R.

(SKIP)

# Add the header line to a new file
head -n 1 streptanthus_H0-AT_Chr1_Histogram_input_428_s_0.5std_1_35_2000.txt > combined_histogram.txt

# Loop through each file skipping the first line and concatenate them
for file in *Histogram*; do
        tail -n +2 "$file" >> combined_histogram.txt
done

(RUN THIS) Now that we have this combined file for the histograms, lets plot them in R.

library(tidyverse)
# combo_hist <- read_table("~/S_div_assembly/RepeatObserver/output_chromosomes/streptanthus_H0-AT/Summary_output/output_data/combined_histogram.txt")
combo_hist <- read_table("/revio-data/RepeatObserver/output_chromosomes/streptanthus_H0-AT/Summary_output/output_data/combined_histogram.txt")

combo_hist %>%
ggplot(aes(x = min_powsum_seqval)) +
  geom_histogram() + 
  facet_wrap(~Chr) +
  theme_classic()

What we are looking at is a histogram of the number of repeat lengths with a minimum abundance (y-axis) in a given window (x-axis). So the bar is higher if there is more repetitive sequences at that position in the genome. For our purposes, we will assume the highest bar is our centromere. This is probably a safe assumption, especially because this plot is EXCLUDING the first and last 2 megabases of each chromosome.

Exercise 11

(A) Why was it important that the first and last 2 Mb were removed from the analysis? Why does this solve the problem we had earlier with the other plot?

(B) Do you see any potential problems with this solution?

(C) According to this plot, how many chromosomes did we capture the centromere in?

(D) It looks like something odd is going on in Chromosome 10. What are some possible explanations for what is happening? Think about the biology of centromeres, but also the way the tool is working. Feel free to look up more about centromeres to help inform your answer.