Differential Gene Expression: Now What?

Copy and Paste

Today’s lab has a lot of long code chunks that you do not need to modify. OK to copy and paste.

Assignment

Clone your Assignment_14 repository. Please answer the exercises below using the template file in the scripts folder of your repository.

Knit the file and submit the .Rmd and .html when you are done.

Background

Now what?

We have found our lists of differentially expressed genes. Now what? We want to know the potential functions of those differentially expressed genes and possibly their regulators. Specifically we will:

  1. Merge the list of differentially expressed genes with a gene annotation file to learn about the function of the most up-regulated genes.
  2. Ask if particular classes of genes are up-regulated, by using functional classifiers known as Gene Ontology (GO) terms.
  3. Determine whether the differentially expressed genes have any common promoter motifs.

Install packages

We need to install some new packages, and to make everything compatible we actually need to upgrade R and the other installed packages. In the LINUX terminal do the following.

LINUX installs

sudo apt update
sudo apt upgrade
sudo apt install r-bioc-rrvgo
sudo apt install r-bioc-tfbstools
sudo apt install r-bioc-universalmotif
sudo apt install r-bioc-topgo

Still in the LINUX terminal:

cd ~/Downloads
wget https://meme-suite.org/meme/meme-software/5.5.9/meme-5.5.9.tar.gz
tar zxf meme-5.5.9.tar.gz
cd meme-5.5.9
./configure --prefix=/usr/local/bin/meme
make # ~ 1 minute

If that goes well, then:

make test # ~ 5 minutes
          # Expect 184 PASS, 6 FAIL, 1 ERROR.

sudo make install

update your path and your .Renviron ONLY RUN THIS ONCE

echo "export PATH=/usr/local/bin/meme/bin:/usr/local/bin/meme/libexec/meme-5.5.9:$PATH" >> ~/.profile
echo "MEME_BIN=/usr/local/bin/meme/bin" >> ~/.Renviron

R installs

Return to R studio. RESTART R (Session > Restart R)

Then, in the RStudio console:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("JASPAR2024")
BiocManager::install("org.At.tair.db")
BiocManager::install("memes")

You may see a message “Installation paths not writeable, unable to update packages”. That is okay so long as above that you see “* DONE (memes)”

The Lab

Annotate the differentially expressed genes.

As part of his B_rapa annotation paper a postdoc in my lab, Upendra Devisetty, did a quick annotation of the B. rapa genes by BLASTing each gene to the NCBI non-redundant database and taking the gene description of the best match (above a threshold). You can download the description with this Linux command:

wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/RNAseq_Annotation/FileS9.txt

Place the file into your Assignment_14/input directory.

Now proceed in RStudio.

Let’s load needed libraries:

library(topGO)
library(rrvgo)
library(rtracklayer)
library(GenomicRanges)
library(Biostrings)
library(TFBSTools)
library(JASPAR2024)
library(universalmotif)
library(memes)
library(tidyverse)
library(BiocParallel)

You created lists of differentially expressed genes at the end of Assignment 10. I have saved the output from the test for genes affected by the treatment as DEgenes.trt.csv. Please find this file in the input directory of your repo. If for some reason they are missing (or you are not enrolled in BIS180L), you find it here.

When you read in the DEgene file, the first column won’t have an informative column name, so add one:

DEgene.trt <- read_csv("../input/DEgenes.trt.csv")
head(DEgene.trt)
colnames(DEgene.trt)[1] <- "GeneID"
head(DEgene.trt)

Exercise 1
Import the gene descriptions that you downloaded (File S9, above); pay attention to the read_tsv “col_names” argument. What is appropriate here? Use one of the join() functions (which one?) to add gene descriptions for the genes found to be regulated by the NDP/DP treatment. Output a table of the top 10 genes (based on FDR) that includes the output from edgeR and the descriptions. Have the description be the first (left-most) column so that you can see it in the output (Hint: use tidyverse select, but be sure to keep the other columns)

It is hard to make much sense of these descriptions; how can we get a better idea of what processes are affected by the treatment? GO categories…

Test for enrichment of functional classes of genes.

To test for enrichment of GO terms we essentially ask the question of whether the prevalence of a term in our gene set of interest is higher than its prevalence in the rest of the genome (aka the rest of the universe). For example if 20% of the differentially expressed genes have the GO term “Cell Wall” but only 10% of the not-differentially expressed genes have the term Cell Wall, then the term “Cell Wall” might be over-represented, or enriched, in our differentially expressed genes. In principle this could be tested using a Chi-squared test or Fisher’s exact test for contingency tables. In practice we will use TopGO which implements several algorithms that account for the hierarchical structure of GO terms (where terms are related to parent and child terms).

It is important in these analyses to define the “Universe” correctly. The Universe of genes is all of the genes where you could have detected an expression difference. It would not have been possible to detect an expression difference in genes that were not expressed at a detectable level in any sample in our experiment. So the Universe in this case is all expressed genes, rather than all genes.

Download the GO annotation for B.rapa

GO term annotation for B. rapa is also available from the Devisetty et al paper.

Download the file using the command below (in Linux) and place the file in your Assignment_14/input/ directory.

GO terms:

wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/RNAseq_Annotation/FileS11.txt

Expressed genes

We need a list of all expressed genes. You could generate this yourself, or you can download this with

wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/RNAseq_Annotation/internode_expressed_genes.txt

Place this file in your Assignment_14/input directory.

Format data for TopGO

We need to do a bit of data wrangling to get things in the correct format for TopGO. First we import the GO terms and a table of all expressed genes in the experiment.

Get the data

go.terms <- read_tsv("../input/FileS11.txt",col_names=FALSE)
head(go.terms)
colnames(go.terms) <- c("GeneID","GO")
head(go.terms)

expressed.genes <- read_tsv("../input/internode_expressed_genes.txt")
head(expressed.genes)
names(expressed.genes) <- "GeneID"
head(expressed.genes)

Format go.terms for TopGO. We need to create a gene-to-GO mapping, which TopGO will use as a function.

# Create a list where each element is a gene and contains a vector of GO terms
go.list <- strsplit(go.terms$GO, split=",")
names(go.list) <- go.terms$GeneID
head(go.list)

# Create a function that TopGO will use to retrieve GO terms for each gene
geneID2GO <- function(gene.ids) {
  return(go.list[gene.ids])
}

Format gene expression data for TopGO. We need a named factor vector indicating which genes are “interesting” (differentially expressed).

gene.list <- expressed.genes %>%
  mutate(DE = ifelse(GeneID %in% DEgene.trt$GeneID, 1, 0)) %>% # Create a column with 1s for DE genes and 0 for background genes
  pull(DE, name = GeneID) %>% # create a named vector where 0 indicates background and 1 indicated DE
  as.factor() # Convert to factor as required by TopGO

head(gene.list)
table(gene.list) # should show counts of DE and non-DE genes

Calculate over-representation with TopGO

TopGO can use a modified Fisher’s Exact Test that accounts for the hierarchical structure of GO terms. We’ll test for enrichment in Biological Process (BP) terms using both the classic Fisher Test and a modified version that accounts for the hierarchy.

First set up the data object

# Create the topGOdata object for Biological Process terms
GO.data.BP <- new("topGOdata",
                  ontology = "BP",
                  allGenes = gene.list,
                  annot = annFUN.gene2GO,
                  gene2GO = go.list,
                  nodeSize = 5)  # minimum number of genes for a GO term

# Show summary of the data
GO.data.BP

Now run the enrichment test using the different methods

# Run enrichment tests using different algorithms
# Fisher's test (classic, does not account for GO hierarchy)
result.classic <- runTest(GO.data.BP, algorithm = "classic", statistic = "fisher")

# Weight01 algorithm (accounts for GO hierarchy, recommended)
result.weight01 <- runTest(GO.data.BP, algorithm = "weight01", statistic = "fisher")

# Combine results into a table
GO.results <- GenTable(GO.data.BP,
                       classic = result.classic,
                       weight01 = result.weight01,
                       orderBy = "weight01",
                       ranksOf = "classic",
                       topNodes = sum(result.weight01@score < 0.05 )) %>% # keep significant terms from the "weight01" algorithm 
  mutate(weight01 = as.numeric(weight01),
         classic = as.numeric(classic)) # TopGO outputs p-values as characters

View it. P-values are in the “classic” and “weight01” columns

GO.results

GO visualization

Looking through a long list of GO terms can be overwhelming, especially when many terms are redundant or closely related. We can use the rrvgo package to reduce redundancy in GO terms and create intuitive visualizations. The rrvgo package provides similar functionality to the REVIGO web tool but runs directly in R.

Note: rrvgo requires an organism-specific annotation package. Since B. rapa doesn’t have a dedicated package, we’ll use the Arabidopsis annotation (org.At.tair.db) as these species are closely related.

First, we need to prepare the data and calculate semantic similarity between GO terms:

# Extract GO IDs and p-values from our results
# Use the weight01 p-values (recommended algorithm)
go_analysis <- GO.results %>%
  select(GO.ID, weight01) %>%
  filter(weight01 < 0.05)  # Only use significant terms

# rrvgo needs a named vector of p-values
scores <- setNames(-log10(go_analysis$weight01), go_analysis$GO.ID)

# Calculate the similarity matrix between GO terms
# This uses semantic similarity based on GO structure
# Note: We use Arabidopsis (org.At.tair.db) as reference since B. rapa doesn't have
# a dedicated annotation package, but they are closely related species
simMatrix <- calculateSimMatrix(go_analysis$GO.ID,
                                orgdb="org.At.tair.db",
                                ont="BP",
                                method="Rel")

# Reduce GO terms by grouping similar terms
reducedTerms <- reduceSimMatrix(simMatrix,
                                scores,
                                threshold=0.7,  # similarity threshold (0-1, higher = more stringent)
                                orgdb="org.At.tair.db")

# View the reduced set of terms
head(reducedTerms, 20)

Now we can create visualizations. The rrvgo package offers several visualization methods:

TreeMap Visualization:

A treemap shows GO terms as nested rectangles, where:

  • Rectangle size represents the significance (larger = more significant)
  • Colors group similar GO terms together
  • Labels show the most representative term for each group
# Create a treemap
treemapPlot(reducedTerms)

Scatterplot Visualization:

A scatterplot shows:

  • Each point is a GO term
  • X and Y axes represent the first two dimensions from multidimensional scaling
  • Point size represents significance
  • Color represents which cluster/group the term belongs to
  • Terms closer together are more semantically similar
# Create a scatterplot
scatterPlot(simMatrix, reducedTerms)

Note on other GO ontologies:

The examples above use Biological Process (BP) terms. You can also analyze Cellular Component (CC) or Molecular Function (MF) terms by:

  1. Creating separate TopGO objects with ontology = "CC" or ontology = "MF"
  2. Using the corresponding ont parameter in calculateSimMatrix() (e.g., ont="CC" or ont="MF")

Generally, Biological Process terms are most informative for understanding the biological significance of gene expression changes.

Exercise 2:

Remember that the genes used in this analysis are those that are differentially expressed between plants grown at high density (“DP”, multiple plants per pot) and not densely planted (“NDP”, one plant per pot). Plants in the DP treatment are able to sense their neighbors via photoreceptors that initiate a signaling cascade leading to more elongation growth. Plants in the DP treatment may also be competing for soil nutrients (N, P, K) more than NDP plants. Looking at either the TreeMap or Scatterplot (your choice), list two GO terms from this analysis that could be related to signaling, growth, nutrient levels, or root function/growth. Explain your reasoning. Hint: look up lignin

Promoter motif enrichment

To help us understand what causes these genes to be differentially expressed it would be helpful to know if they have any common transcription factor binding motifs in their promoters.

We will use the TFBSTools Bioconductor package and memes for this analysis.

First we must get the sequence of the promoters. For ease we will define the promoters as the 1500bp upstream of the start of the gene.

How do we get the promoter sequences? We can use the Bioconductor package Genomic Ranges

Genomic Ranges contains a set of tools for defining regions of a genomic sequences.

Create a Granges objects with gene locations

We used the Rtracklayer import.gff() function in the last lab

# you may need to change the path or make a symbolic link
gff <- import.gff("../../Assigment_09/input/Brapa_reference/Brapa_gene_v1.5.gff")
gff

If you created a symbolic link with ln -s:

## GRanges object with 288624 ranges and 7 metadata columns:
##                  seqnames          ranges strand |   source     type      score     phase          ID
##                     <Rle>       <IRanges>  <Rle> | <factor> <factor>  <numeric> <integer> <character>
##        [1]            A03 8794511-8797095      + |     blat     gene          1      <NA>   Bra000001
##        [2]            A03 8794511-8797095      + |     blat     mRNA          1      <NA>   Bra000001
##        [3]            A03 8794511-8794534      + |     blat     CDS         100      <NA>        <NA>
##        [4]            A03 8794622-8794805      + |     blat     CDS         100      <NA>        <NA>
##        [5]            A03 8795258-8795339      + |     blat     CDS         100      <NA>        <NA>
##        ...            ...             ...    ... .      ...      ...        ...       ...         ...
##   [288620] Scaffold005112         131-295      + |    blat      mRNA   1.000000      <NA>   Bra041173
##   [288621] Scaffold005112         131-295      + |    blat      CDS  100.000000      <NA>        <NA>
##   [288622] Scaffold008211          18-251      + |    glean     gene   0.970334      <NA>   Bra041174
##   [288623] Scaffold008211          18-251      + |    glean     mRNA   0.970334      <NA>   Bra041174
##   [288624] Scaffold008211          18-251      + |    glean     CDS          NA         0        <NA>
##                   Name          Parent
##            <character> <CharacterList>
##        [1]   Bra000001                
##        [2]   Bra000001                
##        [3]        <NA>       Bra000001
##        [4]        <NA>       Bra000001
##        [5]        <NA>       Bra000001
##        ...         ...             ...
##   [288620]   Bra041173                
##   [288621]        <NA>       Bra041173
##   [288622]   Bra041174                
##   [288623]   Bra041174                
##   [288624]        <NA>       Bra041174
##   -------
##   seqinfo: 284 sequences from an unspecified genome; no seqlengths

The resulting object contains an entry for each gene, mRNA, and coding region exon (CDS)

We will define the promoter as 1500bp upstream of the mRNA. Let’s first subset the Granges objects to only contain mRNAs. Also we will get rid of some columns we don’t care about. Note that because this is not a normal data frame or tibble we need to use base R functions, not tidyverse.

mRNAranges <- gff[gff$type=="mRNA",c("type", "ID")]
mRNAranges
## GRanges object with 41020 ranges and 2 metadata columns:
##                 seqnames          ranges strand |     type          ID
##                    <Rle>       <IRanges>  <Rle> | <factor> <character>
##       [1]            A03 8794511-8797095      + |     mRNA   Bra000001
##       [2]            A03 8797899-8800379      - |     mRNA   Bra000002
##       [3]            A03 8812470-8813271      + |     mRNA   Bra000003
##       [4]            A03 8816858-8817232      + |     mRNA   Bra000004
##       [5]            A03 8818081-8820123      - |     mRNA   Bra000005
##       ...            ...             ...    ... .      ...         ...
##   [41016] Scaffold004047          11-321      + |     mRNA   Bra041170
##   [41017] Scaffold004813         190-414      - |     mRNA   Bra041171
##   [41018] Scaffold004894           3-410      + |     mRNA   Bra041172
##   [41019] Scaffold005112         131-295      + |     mRNA   Bra041173
##   [41020] Scaffold008211          18-251      + |     mRNA   Bra041174
##   -------
##   seqinfo: 284 sequences from an unspecified genome; no seqlengths

We don’t have any of the “Scaffold” sequences in our fasta file (we only have chromosomes) so let’s subset the object to keep only the chromosomes

chromosomes <- c(paste0("A0", 1:9), "A10")
seqlevels(mRNAranges, pruning.mode="coarse") <- chromosomes
mRNAranges
## GRanges object with 39609 ranges and 2 metadata columns:
##           seqnames            ranges strand |     type          ID
##              <Rle>         <IRanges>  <Rle> | <factor> <character>
##       [1]      A03   8794511-8797095      + |     mRNA   Bra000001
##       [2]      A03   8797899-8800379      - |     mRNA   Bra000002
##       [3]      A03   8812470-8813271      + |     mRNA   Bra000003
##       [4]      A03   8816858-8817232      + |     mRNA   Bra000004
##       [5]      A03   8818081-8820123      - |     mRNA   Bra000005
##       ...      ...               ...    ... .      ...         ...
##   [39605]      A08 11283500-11285461      - |     mRNA   Bra040840
##   [39606]      A09 26845542-26846612      + |     mRNA   Bra041026
##   [39607]      A09 26848837-26850491      - |     mRNA   Bra041027
##   [39608]      A09   6389757-6392224      + |     mRNA   Bra041081
##   [39609]      A09   6394063-6397281      + |     mRNA   Bra041082
##   -------
##   seqinfo: 10 sequences from an unspecified genome; no seqlengths
promoterRanges <- flank(mRNAranges, 1500)
promoterRanges
## GRanges object with 39609 ranges and 2 metadata columns:
##           seqnames            ranges strand |     type          ID
##              <Rle>         <IRanges>  <Rle> | <factor> <character>
##       [1]      A03   8793011-8794510      + |     mRNA   Bra000001
##       [2]      A03   8800380-8801879      - |     mRNA   Bra000002
##       [3]      A03   8810970-8812469      + |     mRNA   Bra000003
##       [4]      A03   8815358-8816857      + |     mRNA   Bra000004
##       [5]      A03   8820124-8821623      - |     mRNA   Bra000005
##       ...      ...               ...    ... .      ...         ...
##   [39605]      A08 11285462-11286961      - |     mRNA   Bra040840
##   [39606]      A09 26844042-26845541      + |     mRNA   Bra041026
##   [39607]      A09 26850492-26851991      - |     mRNA   Bra041027
##   [39608]      A09   6388257-6389756      + |     mRNA   Bra041081
##   [39609]      A09   6392563-6394062      + |     mRNA   Bra041082
##   -------
##   seqinfo: 10 sequences from an unspecified genome; no seqlengths

Exercise 3: Flanking sequences could be defined in a variety of ways. We want flank() to take sequences upstream, not downstream of the mRNAs. Also, remember that genes can be transcribed from the “+” (upper) or “-“ (lower) DNA strand. Thus we want flank() to take coding strand into consideration and take sequences that are upstream of the 5’ end of the gene. Examine the entries for the first two genes in the mRNARanges and promoterRanges to determine if flank() has done what we want. Explain how you know.

Extract the promoter sequences

Now that we have the locations of the promoters we need to get the sequences themselves. Adjust the path below and load in the fasta file of chromosome sequences.

Brapaseq <- readDNAStringSet("../../Assignment_09/input/Brapa_reference/BrapaV1.5_chrom_only.fa")

The chromosome names have the date that the sequence was finalized appended. We need to remove that.

names(Brapaseq)
##  [1] "A01 [12.17-2010]" "A02 [12.17-2010]" "A03 [12.17-2010]" "A04 [12.17-2010]" "A05 [12.17-2010]"
##  [6] "A06 [12.17-2010]" "A07 [12.17-2010]" "A08 [12.17-2010]" "A09 [12.17-2010]" "A10 [12.17-2010]"

Exercise 4: Replace “REGEX” in the code line below with an actual regular expression that removes the dates from the names.

str_remove(names(Brapaseq), "REGEX") #replace REGEX with your regular expression

If you have it working correctly you should see this output”

##  [1] "A01" "A02" "A03" "A04" "A05" "A06" "A07" "A08" "A09" "A10"

Once your REGEX is correct, then replace the names in the object, verify that you have the correct names in the actual sequence object

names(Brapaseq) <- str_remove(names(Brapaseq), "REGEX") #replace REGEX with your regular expression
names(Brapaseq)

Now, magically, we can just use the square brackets to subset the chromosome sequences to a set of promoter sequences. We do have to add the names back.

promoters <- Brapaseq[promoterRanges]
names(promoters) <- promoterRanges$ID
promoters
## DNAStringSet object of length 39609:
##         width seq                                                                            names               
##     [1]  1500 TATCATTCTCCCAAACCCAAAGAATAATGGGGTTCCAA...TGGCGGAGAAAACAGCAGAGCGTTCTATTGCAGAAAC Bra000001
##     [2]  1500 AAAAAAAAAAATACTAAATAAATAAAGATAGTAAAAGT...AAAGAGAAAGACTGTTGATCTAAAGTTATTATCTAAG Bra000002
##     [3]  1500 TCTAAGTAACCAGGGCGTTAGATGGTAAAGGTCCTAAA...CTACTACTGTGTGACTCTTCTCTTGCTTTAGCTGGAT Bra000003
##     [4]  1500 AAATGTATTCACGATTGTGTTGGTATCAAATGTCATAA...AGAGAGAGACGATTATGCTTTAAGAAGCAACCTTTCG Bra000004
##     [5]  1500 TAAAAAGATTGATGTTTTCTAGGTTAGAGAAGCATAAG...CCAGATCTTGTTGTATCACTCGGTGATCGGAGCCACG Bra000005
##     ...   ... ...
## [39605]  1500 TAAATCACCTGTGTCTTTTACTTTCTGCAAAACAAACA...TGTTTCTGTAATTTTGATGTACTAAACCAGAGATATT Bra040840
## [39606]  1500 GAGATAACTTACACGAAGACTCTCAACCCACATTCCTG...ACGCGGAGGCCGTCGTCGTCTCAGGATGCTCTGGATC Bra041026
## [39607]  1500 AAAATTAAGGAAAATATATAATTAATTCATTTATTTTA...TTACTTGGGGCTCAAAAGAGCTCATATAATTTTGTTA Bra041027
## [39608]  1500 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GAGCTTTCATGCTGTGAAAACCAAAGTAAGTTGAACA Bra041081
## [39609]  1500 GAAGCGGATACTCGTAATCATATTGAAGCATATAATCA...CTCTCCTGCAAATTTTCAAATCTCTCTCTCTTCTTCC Bra041082

Let’s change the Ns to “-“ to reduce the likelihood of matches there

promoters <- DNAStringSet(gsub("N","-",promoters)) 

When searching for promoter motifs we should adjust the expected probability of a given base by the background frequency of those bases. Let’s calculate the base frequencies in the promtoers to use as background info

bg_base_freq <- letterFrequency(promoters, letters = c("A", "C", "G", "T"), as.prob = TRUE)
bg_base_freq <- colMeans(bg_base_freq) / sum(colMeans(bg_base_freq)) # overcome rounding errors so that it sums to 1
bg_base_freq
##         A         C         G         T 
## 0.3335061 0.1697490 0.1634457 0.3332993

Load transcription factor binding motifs.

We’ll use the JASPAR database, which contains curated, experimentally validated transcription factor binding motifs. JASPAR provides high-quality Position Weight Matrices (PWMs) for plant transcription factors.

TFBSTools integrates directly with JASPAR, making it easy to retrieve plant-specific motifs.

# Get plant transcription factor motifs from JASPAR
# The CORE collection contains high-quality, manually curated motifs
opts <- list(
  collection = "CORE",
  tax_group = "plants",
  species = "Arabidopsis thaliana",
  all_versions = FALSE  # Get only the latest version of each motif
)

# Access the JASPAR2024 database
JASPAR2024db <- JASPAR2024()@db
pfm_collection <- getMatrixSet(JASPAR2024db, opts)

# Convert PFMatrix to PWMatrix
pwm_collection <- toPWM(pfm_collection, bg=bg_base_freq)

# Convert to universalmotif format for use with memes/FIMO
# This format is required by runFimo()
motif_list <- convert_motifs(pwm_collection, class = "universalmotif-universalmotif")
# See how many motifs we have
length(pwm_collection)
# Look at the first few motif IDs and names
head(data.frame(
  ID = sapply(pwm_collection, function(x) ID(x)),
  Name = sapply(pwm_collection, function(x) name(x)),
  Species = sapply(pwm_collection, function(x) tags(x)$species)
))

Let’s look at a single entry to get a feel for that. This motif is 8 bases long. Each column in the table is one those 8 positions, and the log2 likelihood of each base at that position (relative to background) is shown. An entry of “0” means that the base occurs in the motif with the same frequncy that it does in the background (all promoter sequences). An entry of “1” means that the base it twice as likely as the background base (2^1 = 2)

motif_list[[3]]

Exercise 5 How much more or less likely is “G” at position 6 in this motif compared to the background frequency of “G”?

We can also look at this graphically, although we have to convert to a different format

pfm_collection[[3]] %>% toICM() %>% seqLogo()

Next we need to subset the promoters into those from our differentially expressed genes and those from expressed genes that are not differentially expressed.

#get names to match...there are are few names in the DEgene list not in the promoter set
DEgene.trt.match <- DEgene.trt$GeneID[DEgene.trt$GeneID %in% names(promoters)]

expressed.genes.match <- expressed.genes$GeneID[expressed.genes$GeneID %in% names(promoters)]

#subset promoter files
universe.promoters <- promoters[expressed.genes.match]
target.promoters <- promoters[DEgene.trt.match]
non.target.promoters <- universe.promoters[
  ! names(universe.promoters) %in% names(target.promoters)] # all expressed genes not in target set

Look for over-represented motifs

Before we look for all motifs, let’s start simple and look for a single motif. Here I will use the third motif from the pwm_collection object.

test.motif <- motif_list[[3]] # a single motif from the database
test.motif

We’ll use runFimo() from the memes package to find motif matches. FIMO (Find Individual Motif Occurrences). The “thresh” is the p-value threshold for declaring a match

# Search for the motif in target promoters using FIMO
# thresh sets the p-value threshold (default 1e-4)
target.matches <- runFimo(target.promoters, 
                         test.motif,
                         thresh = 1e-4)  # searches both strands by default

# View the matches
target.matches

Now let’s count how many target promoters have at least one match:

# target promoters with a match
target.count <- length(unique(seqnames(target.matches)))
target.count

# target promoters without a match
length(target.promoters) - target.count

Of the differentially expressed genes, some have the motif and some do not. How many are there in the promoters of genes that were not differentially expressed?

# Search in non-target promoters
non.target.matches <- runFimo(non.target.promoters, 
                             test.motif,
                             thresh = 1e-4)

# non-target promoters with a match
non.target.count <- length(unique(seqnames(non.target.matches)))
non.target.count

# non-target promoters without a match
length(non.target.promoters) - non.target.count

Is there an enrichment? We want to know if the proportion of DE genes with the motif is significantly different from the proportion of non-DE genes with the motif. We use Fisher’s exact test for this. First make a contingency table with the counts:

m <- matrix(c(
  target.count,
  length(target.promoters) - target.count,
  non.target.count,
  length(non.target.promoters) - non.target.count ),
  ncol=2, 
  dimnames = list(motif=c("Y", "N"), target=c("Y", "N")))

m

Now run Fisher’s test

fisher.test(m, alternative = "greater")

Exercise 6: Interpret the result: This motif represents a sequence bound by the transcription factor PIF4. You can read a brief summary of its functions here and here. Is this motif enriched in the promoters of differentially expressed genes? Explain whether the enrichment/non-enrichment is consistent with your understanding of the treatment in this experiment.

Look at all of the motifs

OK but we have many motifs. Do we have to go through them one at a time? Thankfully, no. We can give runFimo a list of motifs. The process is slow, so I split the motifs into groups and process in parallel.

This takes about 45 minutes to run, you can skip it and load the results below

# Determine number of CPUs to use
ncores <- parallel::detectCores() - 1

# Threshold for presence of motif
thresh <- 1e-04

# Split up motifs into groups for parallel processing
motif_groups <- split(
  motif_list,
  cut(seq_along(motif_list), breaks = ncores, labels = FALSE)
)

# Set up parallel processing parameters
bpparam = MulticoreParam(workers = ncores)

# Search all motifs in target promoters using FIMO, use parallel processing
message("Searching target promoters...")
system.time(
  target.matches <- bplapply(
    motif_groups,
    FUN = \(mg) runFimo(sequences = target.promoters, 
                        motifs = mg,
                        thresh = thresh,
                        skip_matched_sequence = TRUE),
    BPPARAM = bpparam)
)

# combine the list into a single object
suppressWarnings( target.matches <- do.call(c, unname(target.matches)) )

# Search all motifs at once in non-target promoters  
message("Searching non-target promoters...")
system.time(
  non.target.matches <- bplapply(
    motif_groups,
    FUN = \(mg) runFimo(sequences = non.target.promoters, 
                        motifs = mg,
                        thresh = thresh,
                        skip_matched_sequence = TRUE),
    BPPARAM = bpparam)
)

# combine the list into a single object
suppressWarnings( non.target.matches <- do.call(c, unname(non.target.matches)) )

# Count unique promoters with matches for each motif
# Note: we are counting presence / absence.  We could also compute average per gene, or sum total
target.counts.df <- target.matches %>%
  as.data.frame() %>%
  group_by(motif_id) %>%
  summarize(target.count = length(unique(seqnames)), .groups = "drop") 

non.target.counts.df <- non.target.matches %>%
  as.data.frame() %>%
  group_by(motif_id) %>%
  summarize(non.target.count = length(unique(seqnames)), .groups = "drop")

# Merge counts (filling in 0s for motifs with no matches)
results.table <- target.counts.df %>%
  full_join(non.target.counts.df, by = "motif_id") %>%
  replace_na(list(target.count = 0, non.target.count = 0))

# Perform Fisher's exact test for each motif
message("Performing statistical tests...")
results.table <- results.table %>%
  rowwise() %>%
  mutate(
    p.value = fisher.test(
      matrix(c(
        target.count,
        length(target.promoters) - target.count,
        non.target.count,
        length(non.target.promoters) - non.target.count
      ), ncol = 2),
      alternative = "greater"
    )$p.value,
    non.target.percent = round(non.target.count / length(non.target.promoters), 3) * 100,
    target.percent = round(target.count / length(target.promoters), 3) * 100
  ) %>%
  ungroup() %>%
  mutate(FDR = p.adjust(p.value, method = "fdr")) %>%
  select(motif = motif_id, non.target.percent, target.percent, FDR, p.value, everything()) %>%
  arrange(p.value)

bpstop(bpparam) # stop the parallel workers

write_csv(results.table, "../output/motif_results_trt.csv.gz")

If you didn’t run the code, load in the table

results.table <- read_csv("../output/motif_results_trt.csv.gz")
## Rows: 535 Columns: 5
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): motif
## dbl (4): non.target.percent, target.percent, p.value, FDR
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The resulting table gives the p-value for enrichment for each motif, as well as the % of promoters in the non-target and in our target gene set that have the motif.

Exercise 7

a. How many motifs are enriched at FDR < 0.05? (show code)

b. For each of the top five motifs, look them up in the JASPAR database and then click on the Uniprot ID link to get information on the protein that binds to that motif. Scroll through ALL of the information there and comment on whether and how it could be related to the response to the treatment in this experiment. Remember that lignin and sometimes suberin are components of the plant cell wall. Remember that soil nutrients can include N, P, K, as well as micronutrients like Fe.