Differential Gene Expression: Now What?
02 Jun 2026Copy and Paste
Today’s lab has a lot of long code chunks that you do not need to modify. OK to copy and paste.
Assignment
Clone your Assignment_14 repository. Please answer the exercises below using the template file in the scripts folder of your repository.
Knit the file and submit the .Rmd and .html when you are done.
Background
Now what?
We have found our lists of differentially expressed genes. Now what? We want to know the potential functions of those differentially expressed genes and possibly their regulators. Specifically we will:
- Merge the list of differentially expressed genes with a gene annotation file to learn about the function of the most up-regulated genes.
- Ask if particular classes of genes are up-regulated, by using functional classifiers known as Gene Ontology (GO) terms.
- Determine whether the differentially expressed genes have any common promoter motifs.
Install packages
We need to install some new packages, and to make everything compatible we actually need to upgrade R and the other installed packages. In the LINUX terminal do the following.
LINUX installs
sudo apt update
sudo apt upgrade
sudo apt install r-bioc-rrvgo
sudo apt install r-bioc-tfbstools
sudo apt install r-bioc-universalmotif
sudo apt install r-bioc-topgo
Still in the LINUX terminal:
cd ~/Downloads
wget https://meme-suite.org/meme/meme-software/5.5.9/meme-5.5.9.tar.gz
tar zxf meme-5.5.9.tar.gz
cd meme-5.5.9
./configure --prefix=/usr/local/bin/meme
make # ~ 1 minute
If that goes well, then:
make test # ~ 5 minutes
# Expect 184 PASS, 6 FAIL, 1 ERROR.
sudo make install
update your path and your .Renviron ONLY RUN THIS ONCE
echo "export PATH=/usr/local/bin/meme/bin:/usr/local/bin/meme/libexec/meme-5.5.9:$PATH" >> ~/.profile
echo "MEME_BIN=/usr/local/bin/meme/bin" >> ~/.Renviron
R installs
Return to R studio. RESTART R (Session > Restart R)
Then, in the RStudio console:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("JASPAR2024")
BiocManager::install("org.At.tair.db")
BiocManager::install("memes")
You may see a message “Installation paths not writeable, unable to update packages”. That is okay so long as above that you see “* DONE (memes)”
The Lab
Annotate the differentially expressed genes.
As part of his B_rapa annotation paper a postdoc in my lab, Upendra Devisetty, did a quick annotation of the B. rapa genes by BLASTing each gene to the NCBI non-redundant database and taking the gene description of the best match (above a threshold). You can download the description with this Linux command:
wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/RNAseq_Annotation/FileS9.txt
Place the file into your Assignment_14/input directory.
Now proceed in RStudio.
Let’s load needed libraries:
library(topGO)
library(rrvgo)
library(rtracklayer)
library(GenomicRanges)
library(Biostrings)
library(TFBSTools)
library(JASPAR2024)
library(universalmotif)
library(memes)
library(tidyverse)
library(BiocParallel)
You created lists of differentially expressed genes at the end of Assignment 10. I have saved the output from the test for genes affected by the treatment as DEgenes.trt.csv. Please find this file in the input directory of your repo. If for some reason they are missing (or you are not enrolled in BIS180L), you find it here.
When you read in the DEgene file, the first column won’t have an informative column name, so add one:
DEgene.trt <- read_csv("../input/DEgenes.trt.csv")
head(DEgene.trt)
colnames(DEgene.trt)[1] <- "GeneID"
head(DEgene.trt)
Exercise 1
Import the gene descriptions that you downloaded (File S9, above); pay attention to the read_tsv “col_names” argument. What is appropriate here? Use one of the join() functions (which one?) to add gene descriptions for the genes found to be regulated by the NDP/DP treatment. Output a table of the top 10 genes (based on FDR) that includes the output from edgeR and the descriptions. Have the description be the first (left-most) column so that you can see it in the output (Hint: use tidyverse select, but be sure to keep the other columns)
It is hard to make much sense of these descriptions; how can we get a better idea of what processes are affected by the treatment? GO categories…
Test for enrichment of functional classes of genes.
To test for enrichment of GO terms we essentially ask the question of whether the prevalence of a term in our gene set of interest is higher than its prevalence in the rest of the genome (aka the rest of the universe). For example if 20% of the differentially expressed genes have the GO term “Cell Wall” but only 10% of the not-differentially expressed genes have the term Cell Wall, then the term “Cell Wall” might be over-represented, or enriched, in our differentially expressed genes. In principle this could be tested using a Chi-squared test or Fisher’s exact test for contingency tables. In practice we will use TopGO which implements several algorithms that account for the hierarchical structure of GO terms (where terms are related to parent and child terms).
It is important in these analyses to define the “Universe” correctly. The Universe of genes is all of the genes where you could have detected an expression difference. It would not have been possible to detect an expression difference in genes that were not expressed at a detectable level in any sample in our experiment. So the Universe in this case is all expressed genes, rather than all genes.
Download the GO annotation for B.rapa
GO term annotation for B. rapa is also available from the Devisetty et al paper.
Download the file using the command below (in Linux) and place the file in your Assignment_14/input/ directory.
GO terms:
wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/RNAseq_Annotation/FileS11.txt
Expressed genes
We need a list of all expressed genes. You could generate this yourself, or you can download this with
wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/RNAseq_Annotation/internode_expressed_genes.txt
Place this file in your Assignment_14/input directory.
Format data for TopGO
We need to do a bit of data wrangling to get things in the correct format for TopGO. First we import the GO terms and a table of all expressed genes in the experiment.
Get the data
go.terms <- read_tsv("../input/FileS11.txt",col_names=FALSE)
head(go.terms)
colnames(go.terms) <- c("GeneID","GO")
head(go.terms)
expressed.genes <- read_tsv("../input/internode_expressed_genes.txt")
head(expressed.genes)
names(expressed.genes) <- "GeneID"
head(expressed.genes)
Format go.terms for TopGO. We need to create a gene-to-GO mapping, which TopGO will use as a function.
# Create a list where each element is a gene and contains a vector of GO terms
go.list <- strsplit(go.terms$GO, split=",")
names(go.list) <- go.terms$GeneID
head(go.list)
# Create a function that TopGO will use to retrieve GO terms for each gene
geneID2GO <- function(gene.ids) {
return(go.list[gene.ids])
}
Format gene expression data for TopGO. We need a named factor vector indicating which genes are “interesting” (differentially expressed).
gene.list <- expressed.genes %>%
mutate(DE = ifelse(GeneID %in% DEgene.trt$GeneID, 1, 0)) %>% # Create a column with 1s for DE genes and 0 for background genes
pull(DE, name = GeneID) %>% # create a named vector where 0 indicates background and 1 indicated DE
as.factor() # Convert to factor as required by TopGO
head(gene.list)
table(gene.list) # should show counts of DE and non-DE genes
Calculate over-representation with TopGO
TopGO can use a modified Fisher’s Exact Test that accounts for the hierarchical structure of GO terms. We’ll test for enrichment in Biological Process (BP) terms using both the classic Fisher Test and a modified version that accounts for the hierarchy.
First set up the data object
# Create the topGOdata object for Biological Process terms
GO.data.BP <- new("topGOdata",
ontology = "BP",
allGenes = gene.list,
annot = annFUN.gene2GO,
gene2GO = go.list,
nodeSize = 5) # minimum number of genes for a GO term
# Show summary of the data
GO.data.BP
Now run the enrichment test using the different methods
# Run enrichment tests using different algorithms
# Fisher's test (classic, does not account for GO hierarchy)
result.classic <- runTest(GO.data.BP, algorithm = "classic", statistic = "fisher")
# Weight01 algorithm (accounts for GO hierarchy, recommended)
result.weight01 <- runTest(GO.data.BP, algorithm = "weight01", statistic = "fisher")
# Combine results into a table
GO.results <- GenTable(GO.data.BP,
classic = result.classic,
weight01 = result.weight01,
orderBy = "weight01",
ranksOf = "classic",
topNodes = sum(result.weight01@score < 0.05 )) %>% # keep significant terms from the "weight01" algorithm
mutate(weight01 = as.numeric(weight01),
classic = as.numeric(classic)) # TopGO outputs p-values as characters
View it. P-values are in the “classic” and “weight01” columns
GO.results
GO visualization
Looking through a long list of GO terms can be overwhelming, especially when many terms are redundant or closely related. We can use the rrvgo package to reduce redundancy in GO terms and create intuitive visualizations. The rrvgo package provides similar functionality to the REVIGO web tool but runs directly in R.
Note: rrvgo requires an organism-specific annotation package. Since B. rapa doesn’t have a dedicated package, we’ll use the Arabidopsis annotation (org.At.tair.db) as these species are closely related.
First, we need to prepare the data and calculate semantic similarity between GO terms:
# Extract GO IDs and p-values from our results
# Use the weight01 p-values (recommended algorithm)
go_analysis <- GO.results %>%
select(GO.ID, weight01) %>%
filter(weight01 < 0.05) # Only use significant terms
# rrvgo needs a named vector of p-values
scores <- setNames(-log10(go_analysis$weight01), go_analysis$GO.ID)
# Calculate the similarity matrix between GO terms
# This uses semantic similarity based on GO structure
# Note: We use Arabidopsis (org.At.tair.db) as reference since B. rapa doesn't have
# a dedicated annotation package, but they are closely related species
simMatrix <- calculateSimMatrix(go_analysis$GO.ID,
orgdb="org.At.tair.db",
ont="BP",
method="Rel")
# Reduce GO terms by grouping similar terms
reducedTerms <- reduceSimMatrix(simMatrix,
scores,
threshold=0.7, # similarity threshold (0-1, higher = more stringent)
orgdb="org.At.tair.db")
# View the reduced set of terms
head(reducedTerms, 20)
Now we can create visualizations. The rrvgo package offers several visualization methods:
TreeMap Visualization:
A treemap shows GO terms as nested rectangles, where:
- Rectangle size represents the significance (larger = more significant)
- Colors group similar GO terms together
- Labels show the most representative term for each group
# Create a treemap
treemapPlot(reducedTerms)
Scatterplot Visualization:
A scatterplot shows:
- Each point is a GO term
- X and Y axes represent the first two dimensions from multidimensional scaling
- Point size represents significance
- Color represents which cluster/group the term belongs to
- Terms closer together are more semantically similar
# Create a scatterplot
scatterPlot(simMatrix, reducedTerms)
Note on other GO ontologies:
The examples above use Biological Process (BP) terms. You can also analyze Cellular Component (CC) or Molecular Function (MF) terms by:
- Creating separate TopGO objects with
ontology = "CC"orontology = "MF" - Using the corresponding
ontparameter incalculateSimMatrix()(e.g.,ont="CC"oront="MF")
Generally, Biological Process terms are most informative for understanding the biological significance of gene expression changes.
Exercise 2:
Remember that the genes used in this analysis are those that are differentially expressed between plants grown at high density (“DP”, multiple plants per pot) and not densely planted (“NDP”, one plant per pot). Plants in the DP treatment are able to sense their neighbors via photoreceptors that initiate a signaling cascade leading to more elongation growth. Plants in the DP treatment may also be competing for soil nutrients (N, P, K) more than NDP plants. Looking at either the TreeMap or Scatterplot (your choice), list two GO terms from this analysis that could be related to signaling, growth, nutrient levels, or root function/growth. Explain your reasoning. Hint: look up lignin
Promoter motif enrichment
To help us understand what causes these genes to be differentially expressed it would be helpful to know if they have any common transcription factor binding motifs in their promoters.
We will use the TFBSTools Bioconductor package and memes for this analysis.
First we must get the sequence of the promoters. For ease we will define the promoters as the 1500bp upstream of the start of the gene.
How do we get the promoter sequences? We can use the Bioconductor package Genomic Ranges
Genomic Ranges contains a set of tools for defining regions of a genomic sequences.
Create a Granges objects with gene locations
We used the Rtracklayer import.gff() function in the last lab
# you may need to change the path or make a symbolic link
gff <- import.gff("../../Assigment_09/input/Brapa_reference/Brapa_gene_v1.5.gff")
gff
If you created a symbolic link with ln -s:
## GRanges object with 288624 ranges and 7 metadata columns:
## seqnames ranges strand | source type score phase ID
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer> <character>
## [1] A03 8794511-8797095 + | blat gene 1 <NA> Bra000001
## [2] A03 8794511-8797095 + | blat mRNA 1 <NA> Bra000001
## [3] A03 8794511-8794534 + | blat CDS 100 <NA> <NA>
## [4] A03 8794622-8794805 + | blat CDS 100 <NA> <NA>
## [5] A03 8795258-8795339 + | blat CDS 100 <NA> <NA>
## ... ... ... ... . ... ... ... ... ...
## [288620] Scaffold005112 131-295 + | blat mRNA 1.000000 <NA> Bra041173
## [288621] Scaffold005112 131-295 + | blat CDS 100.000000 <NA> <NA>
## [288622] Scaffold008211 18-251 + | glean gene 0.970334 <NA> Bra041174
## [288623] Scaffold008211 18-251 + | glean mRNA 0.970334 <NA> Bra041174
## [288624] Scaffold008211 18-251 + | glean CDS NA 0 <NA>
## Name Parent
## <character> <CharacterList>
## [1] Bra000001
## [2] Bra000001
## [3] <NA> Bra000001
## [4] <NA> Bra000001
## [5] <NA> Bra000001
## ... ... ...
## [288620] Bra041173
## [288621] <NA> Bra041173
## [288622] Bra041174
## [288623] Bra041174
## [288624] <NA> Bra041174
## -------
## seqinfo: 284 sequences from an unspecified genome; no seqlengths
The resulting object contains an entry for each gene, mRNA, and coding region exon (CDS)
We will define the promoter as 1500bp upstream of the mRNA. Let’s first subset the Granges objects to only contain mRNAs. Also we will get rid of some columns we don’t care about. Note that because this is not a normal data frame or tibble we need to use base R functions, not tidyverse.
mRNAranges <- gff[gff$type=="mRNA",c("type", "ID")]
mRNAranges
## GRanges object with 41020 ranges and 2 metadata columns:
## seqnames ranges strand | type ID
## <Rle> <IRanges> <Rle> | <factor> <character>
## [1] A03 8794511-8797095 + | mRNA Bra000001
## [2] A03 8797899-8800379 - | mRNA Bra000002
## [3] A03 8812470-8813271 + | mRNA Bra000003
## [4] A03 8816858-8817232 + | mRNA Bra000004
## [5] A03 8818081-8820123 - | mRNA Bra000005
## ... ... ... ... . ... ...
## [41016] Scaffold004047 11-321 + | mRNA Bra041170
## [41017] Scaffold004813 190-414 - | mRNA Bra041171
## [41018] Scaffold004894 3-410 + | mRNA Bra041172
## [41019] Scaffold005112 131-295 + | mRNA Bra041173
## [41020] Scaffold008211 18-251 + | mRNA Bra041174
## -------
## seqinfo: 284 sequences from an unspecified genome; no seqlengths
We don’t have any of the “Scaffold” sequences in our fasta file (we only have chromosomes) so let’s subset the object to keep only the chromosomes
chromosomes <- c(paste0("A0", 1:9), "A10")
seqlevels(mRNAranges, pruning.mode="coarse") <- chromosomes
mRNAranges
## GRanges object with 39609 ranges and 2 metadata columns:
## seqnames ranges strand | type ID
## <Rle> <IRanges> <Rle> | <factor> <character>
## [1] A03 8794511-8797095 + | mRNA Bra000001
## [2] A03 8797899-8800379 - | mRNA Bra000002
## [3] A03 8812470-8813271 + | mRNA Bra000003
## [4] A03 8816858-8817232 + | mRNA Bra000004
## [5] A03 8818081-8820123 - | mRNA Bra000005
## ... ... ... ... . ... ...
## [39605] A08 11283500-11285461 - | mRNA Bra040840
## [39606] A09 26845542-26846612 + | mRNA Bra041026
## [39607] A09 26848837-26850491 - | mRNA Bra041027
## [39608] A09 6389757-6392224 + | mRNA Bra041081
## [39609] A09 6394063-6397281 + | mRNA Bra041082
## -------
## seqinfo: 10 sequences from an unspecified genome; no seqlengths
promoterRanges <- flank(mRNAranges, 1500)
promoterRanges
## GRanges object with 39609 ranges and 2 metadata columns:
## seqnames ranges strand | type ID
## <Rle> <IRanges> <Rle> | <factor> <character>
## [1] A03 8793011-8794510 + | mRNA Bra000001
## [2] A03 8800380-8801879 - | mRNA Bra000002
## [3] A03 8810970-8812469 + | mRNA Bra000003
## [4] A03 8815358-8816857 + | mRNA Bra000004
## [5] A03 8820124-8821623 - | mRNA Bra000005
## ... ... ... ... . ... ...
## [39605] A08 11285462-11286961 - | mRNA Bra040840
## [39606] A09 26844042-26845541 + | mRNA Bra041026
## [39607] A09 26850492-26851991 - | mRNA Bra041027
## [39608] A09 6388257-6389756 + | mRNA Bra041081
## [39609] A09 6392563-6394062 + | mRNA Bra041082
## -------
## seqinfo: 10 sequences from an unspecified genome; no seqlengths
Exercise 3: Flanking sequences could be defined in a variety of ways. We want flank() to take sequences upstream, not downstream of the mRNAs. Also, remember that genes can be transcribed from the “+” (upper) or “-“ (lower) DNA strand. Thus we want flank() to take coding strand into consideration and take sequences that are upstream of the 5’ end of the gene. Examine the entries for the first two genes in the mRNARanges and promoterRanges to determine if flank() has done what we want. Explain how you know.
Extract the promoter sequences
Now that we have the locations of the promoters we need to get the sequences themselves. Adjust the path below and load in the fasta file of chromosome sequences.
Brapaseq <- readDNAStringSet("../../Assignment_09/input/Brapa_reference/BrapaV1.5_chrom_only.fa")
The chromosome names have the date that the sequence was finalized appended. We need to remove that.
names(Brapaseq)
## [1] "A01 [12.17-2010]" "A02 [12.17-2010]" "A03 [12.17-2010]" "A04 [12.17-2010]" "A05 [12.17-2010]"
## [6] "A06 [12.17-2010]" "A07 [12.17-2010]" "A08 [12.17-2010]" "A09 [12.17-2010]" "A10 [12.17-2010]"
Exercise 4: Replace “REGEX” in the code line below with an actual regular expression that removes the dates from the names.
str_remove(names(Brapaseq), "REGEX") #replace REGEX with your regular expression
If you have it working correctly you should see this output”
## [1] "A01" "A02" "A03" "A04" "A05" "A06" "A07" "A08" "A09" "A10"
Once your REGEX is correct, then replace the names in the object, verify that you have the correct names in the actual sequence object
names(Brapaseq) <- str_remove(names(Brapaseq), "REGEX") #replace REGEX with your regular expression
names(Brapaseq)
Now, magically, we can just use the square brackets to subset the chromosome sequences to a set of promoter sequences. We do have to add the names back.
promoters <- Brapaseq[promoterRanges]
names(promoters) <- promoterRanges$ID
promoters
## DNAStringSet object of length 39609:
## width seq names
## [1] 1500 TATCATTCTCCCAAACCCAAAGAATAATGGGGTTCCAA...TGGCGGAGAAAACAGCAGAGCGTTCTATTGCAGAAAC Bra000001
## [2] 1500 AAAAAAAAAAATACTAAATAAATAAAGATAGTAAAAGT...AAAGAGAAAGACTGTTGATCTAAAGTTATTATCTAAG Bra000002
## [3] 1500 TCTAAGTAACCAGGGCGTTAGATGGTAAAGGTCCTAAA...CTACTACTGTGTGACTCTTCTCTTGCTTTAGCTGGAT Bra000003
## [4] 1500 AAATGTATTCACGATTGTGTTGGTATCAAATGTCATAA...AGAGAGAGACGATTATGCTTTAAGAAGCAACCTTTCG Bra000004
## [5] 1500 TAAAAAGATTGATGTTTTCTAGGTTAGAGAAGCATAAG...CCAGATCTTGTTGTATCACTCGGTGATCGGAGCCACG Bra000005
## ... ... ...
## [39605] 1500 TAAATCACCTGTGTCTTTTACTTTCTGCAAAACAAACA...TGTTTCTGTAATTTTGATGTACTAAACCAGAGATATT Bra040840
## [39606] 1500 GAGATAACTTACACGAAGACTCTCAACCCACATTCCTG...ACGCGGAGGCCGTCGTCGTCTCAGGATGCTCTGGATC Bra041026
## [39607] 1500 AAAATTAAGGAAAATATATAATTAATTCATTTATTTTA...TTACTTGGGGCTCAAAAGAGCTCATATAATTTTGTTA Bra041027
## [39608] 1500 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GAGCTTTCATGCTGTGAAAACCAAAGTAAGTTGAACA Bra041081
## [39609] 1500 GAAGCGGATACTCGTAATCATATTGAAGCATATAATCA...CTCTCCTGCAAATTTTCAAATCTCTCTCTCTTCTTCC Bra041082
Let’s change the Ns to “-“ to reduce the likelihood of matches there
promoters <- DNAStringSet(gsub("N","-",promoters))
When searching for promoter motifs we should adjust the expected probability of a given base by the background frequency of those bases. Let’s calculate the base frequencies in the promtoers to use as background info
bg_base_freq <- letterFrequency(promoters, letters = c("A", "C", "G", "T"), as.prob = TRUE)
bg_base_freq <- colMeans(bg_base_freq) / sum(colMeans(bg_base_freq)) # overcome rounding errors so that it sums to 1
bg_base_freq
## A C G T
## 0.3335061 0.1697490 0.1634457 0.3332993
Load transcription factor binding motifs.
We’ll use the JASPAR database, which contains curated, experimentally validated transcription factor binding motifs. JASPAR provides high-quality Position Weight Matrices (PWMs) for plant transcription factors.
TFBSTools integrates directly with JASPAR, making it easy to retrieve plant-specific motifs.
# Get plant transcription factor motifs from JASPAR
# The CORE collection contains high-quality, manually curated motifs
opts <- list(
collection = "CORE",
tax_group = "plants",
species = "Arabidopsis thaliana",
all_versions = FALSE # Get only the latest version of each motif
)
# Access the JASPAR2024 database
JASPAR2024db <- JASPAR2024()@db
pfm_collection <- getMatrixSet(JASPAR2024db, opts)
# Convert PFMatrix to PWMatrix
pwm_collection <- toPWM(pfm_collection, bg=bg_base_freq)
# Convert to universalmotif format for use with memes/FIMO
# This format is required by runFimo()
motif_list <- convert_motifs(pwm_collection, class = "universalmotif-universalmotif")
# See how many motifs we have
length(pwm_collection)
# Look at the first few motif IDs and names
head(data.frame(
ID = sapply(pwm_collection, function(x) ID(x)),
Name = sapply(pwm_collection, function(x) name(x)),
Species = sapply(pwm_collection, function(x) tags(x)$species)
))
Let’s look at a single entry to get a feel for that. This motif is 8 bases long. Each column in the table is one those 8 positions, and the log2 likelihood of each base at that position (relative to background) is shown. An entry of “0” means that the base occurs in the motif with the same frequncy that it does in the background (all promoter sequences). An entry of “1” means that the base it twice as likely as the background base (2^1 = 2)
motif_list[[3]]
Exercise 5 How much more or less likely is “G” at position 6 in this motif compared to the background frequency of “G”?
We can also look at this graphically, although we have to convert to a different format
pfm_collection[[3]] %>% toICM() %>% seqLogo()
Next we need to subset the promoters into those from our differentially expressed genes and those from expressed genes that are not differentially expressed.
#get names to match...there are are few names in the DEgene list not in the promoter set
DEgene.trt.match <- DEgene.trt$GeneID[DEgene.trt$GeneID %in% names(promoters)]
expressed.genes.match <- expressed.genes$GeneID[expressed.genes$GeneID %in% names(promoters)]
#subset promoter files
universe.promoters <- promoters[expressed.genes.match]
target.promoters <- promoters[DEgene.trt.match]
non.target.promoters <- universe.promoters[
! names(universe.promoters) %in% names(target.promoters)] # all expressed genes not in target set
Look for over-represented motifs
Before we look for all motifs, let’s start simple and look for a single motif. Here I will use the third motif from the pwm_collection object.
test.motif <- motif_list[[3]] # a single motif from the database
test.motif
We’ll use runFimo() from the memes package to find motif matches. FIMO (Find Individual Motif Occurrences). The “thresh” is the p-value threshold for declaring a match
# Search for the motif in target promoters using FIMO
# thresh sets the p-value threshold (default 1e-4)
target.matches <- runFimo(target.promoters,
test.motif,
thresh = 1e-4) # searches both strands by default
# View the matches
target.matches
Now let’s count how many target promoters have at least one match:
# target promoters with a match
target.count <- length(unique(seqnames(target.matches)))
target.count
# target promoters without a match
length(target.promoters) - target.count
Of the differentially expressed genes, some have the motif and some do not. How many are there in the promoters of genes that were not differentially expressed?
# Search in non-target promoters
non.target.matches <- runFimo(non.target.promoters,
test.motif,
thresh = 1e-4)
# non-target promoters with a match
non.target.count <- length(unique(seqnames(non.target.matches)))
non.target.count
# non-target promoters without a match
length(non.target.promoters) - non.target.count
Is there an enrichment? We want to know if the proportion of DE genes with the motif is significantly different from the proportion of non-DE genes with the motif. We use Fisher’s exact test for this. First make a contingency table with the counts:
m <- matrix(c(
target.count,
length(target.promoters) - target.count,
non.target.count,
length(non.target.promoters) - non.target.count ),
ncol=2,
dimnames = list(motif=c("Y", "N"), target=c("Y", "N")))
m
Now run Fisher’s test
fisher.test(m, alternative = "greater")
Exercise 6: Interpret the result: This motif represents a sequence bound by the transcription factor PIF4. You can read a brief summary of its functions here and here. Is this motif enriched in the promoters of differentially expressed genes? Explain whether the enrichment/non-enrichment is consistent with your understanding of the treatment in this experiment.
Look at all of the motifs
OK but we have many motifs. Do we have to go through them one at a time? Thankfully, no. We can give runFimo a list of motifs. The process is slow, so I split the motifs into groups and process in parallel.
This takes about 45 minutes to run, you can skip it and load the results below
# Determine number of CPUs to use
ncores <- parallel::detectCores() - 1
# Threshold for presence of motif
thresh <- 1e-04
# Split up motifs into groups for parallel processing
motif_groups <- split(
motif_list,
cut(seq_along(motif_list), breaks = ncores, labels = FALSE)
)
# Set up parallel processing parameters
bpparam = MulticoreParam(workers = ncores)
# Search all motifs in target promoters using FIMO, use parallel processing
message("Searching target promoters...")
system.time(
target.matches <- bplapply(
motif_groups,
FUN = \(mg) runFimo(sequences = target.promoters,
motifs = mg,
thresh = thresh,
skip_matched_sequence = TRUE),
BPPARAM = bpparam)
)
# combine the list into a single object
suppressWarnings( target.matches <- do.call(c, unname(target.matches)) )
# Search all motifs at once in non-target promoters
message("Searching non-target promoters...")
system.time(
non.target.matches <- bplapply(
motif_groups,
FUN = \(mg) runFimo(sequences = non.target.promoters,
motifs = mg,
thresh = thresh,
skip_matched_sequence = TRUE),
BPPARAM = bpparam)
)
# combine the list into a single object
suppressWarnings( non.target.matches <- do.call(c, unname(non.target.matches)) )
# Count unique promoters with matches for each motif
# Note: we are counting presence / absence. We could also compute average per gene, or sum total
target.counts.df <- target.matches %>%
as.data.frame() %>%
group_by(motif_id) %>%
summarize(target.count = length(unique(seqnames)), .groups = "drop")
non.target.counts.df <- non.target.matches %>%
as.data.frame() %>%
group_by(motif_id) %>%
summarize(non.target.count = length(unique(seqnames)), .groups = "drop")
# Merge counts (filling in 0s for motifs with no matches)
results.table <- target.counts.df %>%
full_join(non.target.counts.df, by = "motif_id") %>%
replace_na(list(target.count = 0, non.target.count = 0))
# Perform Fisher's exact test for each motif
message("Performing statistical tests...")
results.table <- results.table %>%
rowwise() %>%
mutate(
p.value = fisher.test(
matrix(c(
target.count,
length(target.promoters) - target.count,
non.target.count,
length(non.target.promoters) - non.target.count
), ncol = 2),
alternative = "greater"
)$p.value,
non.target.percent = round(non.target.count / length(non.target.promoters), 3) * 100,
target.percent = round(target.count / length(target.promoters), 3) * 100
) %>%
ungroup() %>%
mutate(FDR = p.adjust(p.value, method = "fdr")) %>%
select(motif = motif_id, non.target.percent, target.percent, FDR, p.value, everything()) %>%
arrange(p.value)
bpstop(bpparam) # stop the parallel workers
write_csv(results.table, "../output/motif_results_trt.csv.gz")
If you didn’t run the code, load in the table
results.table <- read_csv("../output/motif_results_trt.csv.gz")
## Rows: 535 Columns: 5
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): motif
## dbl (4): non.target.percent, target.percent, p.value, FDR
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The resulting table gives the p-value for enrichment for each motif, as well as the % of promoters in the non-target and in our target gene set that have the motif.
Exercise 7
a. How many motifs are enriched at FDR < 0.05? (show code)
b. For each of the top five motifs, look them up in the JASPAR database and then click on the Uniprot ID link to get information on the protein that binds to that motif. Scroll through ALL of the information there and comment on whether and how it could be related to the response to the treatment in this experiment. Remember that lignin and sometimes suberin are components of the plant cell wall. Remember that soil nutrients can include N, P, K, as well as micronutrients like Fe.