Differential Gene Expression: Now What?

Copy and Paste

Today’s lab has a lot of long code chunks that you do not need to modify. OK to copy and paste.

Assignment

Clone your Assignment_11 repository. Please answer the exercises below using the template file in the scripts folder of your repository.

Knit the file and submit the .Rmd and .html when you are done.

Background

Now what?

We have found our lists of differentially expressed genes. Now what? We want to know the potential functions of those differentially expressed genes and possibly their regulators. Specifically we will:

  1. Merge the list of differentially expressed genes with a gene annotation file to learn about the function of the most up-regulated genes.
  2. Ask if particular classes of genes are up-regulated, by using functional classifiers known as Gene Ontology (GO) terms.
  3. Determine whether the differentially expressed genes have any common promoter motifs.

Annotate the differentially expressed genes.

As part of his B_rapa annotation paper a postdoc in my lab, Upendra Devisetty, did a quick annotation of the B. rapa genes by BLASTing each gene to the NCBI non-redundant database and taking the gene description of the best match (above a threshold). You can download the description with this Linux command:

wget https://bis180ldata.s3.amazonaws.com/downloads/RNAseq_Annotation/FileS9.txt

Place the file into your Assignment_11/input directory.

Now open Rstudio, set the project to your Assignment_11 directory, and proceed.

Let’s load needed libraries:

library(tidyverse)
library(goseq)
library(rtracklayer)
library(GenomicRanges)
library(Biostrings)

You created lists of differentially expressed genes at the end of Assignment 10. I have saved the output from the test for genes affected by the treatment as DEgenes.trt.csv. Please find this file in the input directory of your repo. If for some reason they are missing (or you are not enrolled in BIS180L), you find it here.

When you read in the DEgene file, the first column won’t have an informative column name, so add one:

DEgene.trt <- read_csv("../input/DEgenes.trt.csv")
head(DEgene.trt)
colnames(DEgene.trt)[1] <- "GeneID"
head(DEgene.trt)

Exercise 1
Import the gene descriptions that you downloaded (File S9, above); pay attention to the read_tsv “col_names” argument. What is appropriate here? Use one of the join() functions (which one?) to add gene descriptions for the genes found to be regulated by the NDP/DP treatment. Output a table of the top 10 genes (based on FDR) that includes the output from edgeR and the descriptions. Have the description be the first (left-most) column so that you can see it in the output (Hint: use tidyverse select, but be sure to keep the other columns)

It is hard to make much sense of these descriptions; how can we get a better idea of what processes are affected by the treatment? GO categories…

Test for enrichment of functional classes of genes.

To test for enrichment of GO terms we essentially ask the question of whether the prevalence of a term in our gene set of interest is higher than its prevalence in the rest of the genome (aka the rest of the universe). For example if 20% of the differentially expressed genes have the GO term “Cell Wall” but only 10% of the not-differentially expressed genes have the term Cell Wall, then the term “Cell Wall” might be over-represented, or enriched, in our differentially expressed genes. In principle this could be tested using a Chi-squared test or Fisher’s exact test for contingency tables. In practice we will use GOseq that makes an adjustment for gene-length bias since it is easier to detect differential expression of longer genes.

It is important in these analyses to define the “Universe” correctly. The Universe of genes is all of the genes where you could have detected an expression difference. It would not have been possible to detect an expression difference in genes that were not expressed at a detectable level in any sample in our experiment. So the Universe in this case is all expressed genes, rather than all genes.

Download the GO annotation and gene length for B.rapa

GO term annotation for B. rapa is also available from the Devisetty et al paper.

cDNA gene lengths were estimated by the featureCounts() command used in Tuesday’s lab. I have saved them for you and you can download them as instructed below. (You’re welcome)

Download the files using the commands below (in Linux) and place the file in your Assignment_11/input/ directory.

GO terms:

wget https://bis180ldata.s3.amazonaws.com/downloads/RNAseq_Annotation/FileS11.txt

Gene Length:

wget https://bis180ldata.s3.amazonaws.com/downloads/RNAseq_Annotation/Brapa_CDS_lengths.txt

Expressed genes

We need a list of all expressed genes. You could generate this yourself, or you can download this with

wget https://bis180ldata.s3.amazonaws.com/downloads/RNAseq_Annotation/internode_expressed_genes.txt

Place this file in your Assignment_11/input directory.

Format data for GOseq

We need to do a bit of data wrangling to get things in the correct format for GOSeq. First we import the gene lengths and GO terms. We also import a table of all expressed genes in the experiment (you could have gotten this from Tuesday’s data)

Get the data

go.terms <- read_tsv("../input/FileS11.txt",col_names=FALSE)
head(go.terms)
colnames(go.terms) <- c("GeneID","GO")
head(go.terms)

expressed.genes <- read_tsv("../input/internode_expressed_genes.txt")
head(expressed.genes)
names(expressed.genes) <- "GeneID"

gene.lengths <- read_tsv("../input/Brapa_CDS_lengths.txt")
head(gene.lengths)

#we need to reduce the gene.length data to only contain entries for those genes in our expressed.genes set.  We also need this as a vector
gene.lengths.vector <- gene.lengths$Length[gene.lengths$GeneID %in% expressed.genes$GeneID]
names(gene.lengths.vector) <- gene.lengths$GeneID[gene.lengths$GeneID %in% expressed.genes$GeneID]
head(gene.lengths.vector)

#Do the reverse to make sure everything matches up (it seems that we don't have length info for some genes?)
expressed.genes.match <- expressed.genes[expressed.genes$GeneID %in% names(gene.lengths.vector),]

Format go.terms for goseq. We want them in list format, and we need to separate the terms into separate elements.

go.list <- strsplit(go.terms$GO,split=",")
names(go.list) <- go.terms$GeneID
head(go.list)

Format gene expression data for goseq. We need a vector for each gene with 1 indicating differential expression and 0 indicating no differential expression.

#for each gene in expressed gene, return FALSE if it is not in DEgene.trt and TRUE if it is.
DE.trt <- expressed.genes.match$GeneID %in% DEgene.trt$GeneID
names(DE.trt) <- expressed.genes.match$GeneID
head(DE.trt)

DE.trt <- as.numeric(DE.trt) #convert to 0s and 1s
head(DE.trt)
sum(DE.trt) # number of DE genes

Calculate over-representation

Now we can look for GO enrichment

#determines if there is bias due to gene length.  The plot shows the relationship.
nullp.result <- nullp(DEgenes = DE.trt,bias.data = gene.lengths.vector)

#calculate p-values for each GO term
rownames(nullp.result) <- names(gene.lengths.vector) #because of a bug in nullp()
GO.out <- goseq(pwf = nullp.result,gene2cat = go.list,test.cats=("GO:BP"))
 
#list over-represented GO terms (p < 0.05) 
GO.out[GO.out$over_represented_pvalue < 0.05,]

GO visualization

Looking through a long list can be tough. There is a nice visualizer called REVIGO. To use it we need to cut and paste the column with the GO term and the one with the p-value. Use the command below to save these columns into a file:

write.table(GO.out[GO.out$over_represented_pvalue < 0.05,1:2],row.names=FALSE,file="../output/GO_terms.txt", quote = FALSE,col.names = FALSE)

Cut and paste the GO terms and p-values into REVIGO. Click Start Revigo. You can use the default settings.

Now we can take a look at the results.

The top row of tabs allows you to choose among the three types of GO terms:

  • Biological Process (BP)
  • Cellular Compartment (CC)
  • Molecular Functions (MF)

Generally I find the BP terms most helpful but you can look at each type by clicking in the tabs.

REVIGO has four types of visualizers (Scatterplot, 3D Scatterplot, Interactive Graph, and Tree Map). These are selected by clicking among the second row of tabs. The “TreeMap” can be particular nice because it groups the GO terms into a hierarchy.

Exercise 2:

a: In REVIGO display a “TreeMap” of the BP GO terms. Take a screenshot and include it here.

b: Remember that the genes used in this analysis are those that are differentially expressed between plants grown at high density (“DP”, multiple plants per pot) and not densely planted (“NDP”, one plant per pot). Plants in the DP treatment are able to sense their neighbors via photoreceptors that initiate a signaling cascade leading to more elongation growth. List two GO terms from this analysis that could be related to signaling or growth.

Promoter motif enrichment

To help us understand what causes these genes to be differentially expressed it would be helpful to know if they have any common transcription factor binding motifs in their promoters.

First we must get the sequence of the promoters. For ease we will define the promoters as the 1500bp upstream of the start of the gene.

How do we get the promoter sequences? We can use the Bioconductor package Genomic Ranges

Genomic Ranges contains a set of tools for defining regions of a genomic sequences.

Create a Granges objects with gene locations

We used the Rtracklayer import.gff() function in the last lab

# you may need to change the path or make a symbolic link
gff <- import.gff("../../Assigment_09/input/Brapa_reference/Brapa_gene_v1.5.gff")
gff

If you created a symbolic link with ln -s:

## GRanges object with 288624 ranges and 7 metadata columns:
##                  seqnames          ranges strand |   source     type      score
##                     <Rle>       <IRanges>  <Rle> | <factor> <factor>  <numeric>
##        [1]            A03 8794511-8797095      + |     blat     gene          1
##        [2]            A03 8794511-8797095      + |     blat     mRNA          1
##        [3]            A03 8794511-8794534      + |     blat     CDS         100
##        [4]            A03 8794622-8794805      + |     blat     CDS         100
##        [5]            A03 8795258-8795339      + |     blat     CDS         100
##        ...            ...             ...    ... .      ...      ...        ...
##   [288620] Scaffold005112         131-295      + |    blat      mRNA   1.000000
##   [288621] Scaffold005112         131-295      + |    blat      CDS  100.000000
##   [288622] Scaffold008211          18-251      + |    glean     gene   0.970334
##   [288623] Scaffold008211          18-251      + |    glean     mRNA   0.970334
##   [288624] Scaffold008211          18-251      + |    glean     CDS          NA
##                phase          ID        Name          Parent
##            <integer> <character> <character> <CharacterList>
##        [1]      <NA>   Bra000001   Bra000001                
##        [2]      <NA>   Bra000001   Bra000001                
##        [3]      <NA>        <NA>        <NA>       Bra000001
##        [4]      <NA>        <NA>        <NA>       Bra000001
##        [5]      <NA>        <NA>        <NA>       Bra000001
##        ...       ...         ...         ...             ...
##   [288620]      <NA>   Bra041173   Bra041173                
##   [288621]      <NA>        <NA>        <NA>       Bra041173
##   [288622]      <NA>   Bra041174   Bra041174                
##   [288623]      <NA>   Bra041174   Bra041174                
##   [288624]         0        <NA>        <NA>       Bra041174
##   -------
##   seqinfo: 284 sequences from an unspecified genome; no seqlengths

The resulting object contains an entry for each gene, mRNA, and coding region exon (CDS)

We will define the promoter as 1500bp upstream of the mRNA. Let’s first subset the Granges objects to only contain mRNAs. Also we will get rid of some columns we don’t care about. Note that because this is not a normal data frame or tibble we need to use base R functions, not tidyverse.

mRNAranges <- gff[gff$type=="mRNA",c("type", "ID")]
mRNAranges
## GRanges object with 41020 ranges and 2 metadata columns:
##                 seqnames          ranges strand |     type          ID
##                    <Rle>       <IRanges>  <Rle> | <factor> <character>
##       [1]            A03 8794511-8797095      + |     mRNA   Bra000001
##       [2]            A03 8797899-8800379      - |     mRNA   Bra000002
##       [3]            A03 8812470-8813271      + |     mRNA   Bra000003
##       [4]            A03 8816858-8817232      + |     mRNA   Bra000004
##       [5]            A03 8818081-8820123      - |     mRNA   Bra000005
##       ...            ...             ...    ... .      ...         ...
##   [41016] Scaffold004047          11-321      + |     mRNA   Bra041170
##   [41017] Scaffold004813         190-414      - |     mRNA   Bra041171
##   [41018] Scaffold004894           3-410      + |     mRNA   Bra041172
##   [41019] Scaffold005112         131-295      + |     mRNA   Bra041173
##   [41020] Scaffold008211          18-251      + |     mRNA   Bra041174
##   -------
##   seqinfo: 284 sequences from an unspecified genome; no seqlengths

We don’t have any of the “Scaffold” sequences in our fasta file (we only have chromosomes) so let’s subset the object to keep only the chromosomes

chromosomes <- c(paste0("A0", 1:9), "A10")
seqlevels(mRNAranges, pruning.mode="coarse") <- chromosomes
mRNAranges
## GRanges object with 39609 ranges and 2 metadata columns:
##           seqnames            ranges strand |     type          ID
##              <Rle>         <IRanges>  <Rle> | <factor> <character>
##       [1]      A03   8794511-8797095      + |     mRNA   Bra000001
##       [2]      A03   8797899-8800379      - |     mRNA   Bra000002
##       [3]      A03   8812470-8813271      + |     mRNA   Bra000003
##       [4]      A03   8816858-8817232      + |     mRNA   Bra000004
##       [5]      A03   8818081-8820123      - |     mRNA   Bra000005
##       ...      ...               ...    ... .      ...         ...
##   [39605]      A08 11283500-11285461      - |     mRNA   Bra040840
##   [39606]      A09 26845542-26846612      + |     mRNA   Bra041026
##   [39607]      A09 26848837-26850491      - |     mRNA   Bra041027
##   [39608]      A09   6389757-6392224      + |     mRNA   Bra041081
##   [39609]      A09   6394063-6397281      + |     mRNA   Bra041082
##   -------
##   seqinfo: 10 sequences from an unspecified genome; no seqlengths
promoterRanges <- flank(mRNAranges, 1500)
promoterRanges
## GRanges object with 39609 ranges and 2 metadata columns:
##           seqnames            ranges strand |     type          ID
##              <Rle>         <IRanges>  <Rle> | <factor> <character>
##       [1]      A03   8793011-8794510      + |     mRNA   Bra000001
##       [2]      A03   8800380-8801879      - |     mRNA   Bra000002
##       [3]      A03   8810970-8812469      + |     mRNA   Bra000003
##       [4]      A03   8815358-8816857      + |     mRNA   Bra000004
##       [5]      A03   8820124-8821623      - |     mRNA   Bra000005
##       ...      ...               ...    ... .      ...         ...
##   [39605]      A08 11285462-11286961      - |     mRNA   Bra040840
##   [39606]      A09 26844042-26845541      + |     mRNA   Bra041026
##   [39607]      A09 26850492-26851991      - |     mRNA   Bra041027
##   [39608]      A09   6388257-6389756      + |     mRNA   Bra041081
##   [39609]      A09   6392563-6394062      + |     mRNA   Bra041082
##   -------
##   seqinfo: 10 sequences from an unspecified genome; no seqlengths

Exercise 3: Flanking sequences could be defined in a variety of ways. We want flank() to take sequences upstream, not downstream of the mRNAs. Also, remember that genes can be transcribed from the “+” (upper) or “-“ (lower) DNA strand. Thus we want flank() to take coding strand into consideration and take sequences that are upstream of the 5’ end of the gene. Examine the entries for the first two genes in the mRNARanges and promoterRanges to determine if flank() has done what we want. Explain how you know.

Extract the promoter sequences

Now that we have the locations of the promoters we need to get the sequences themselves. Adjust the path below and load in the fasta file of chromosome sequences.

Brapaseq <- readDNAStringSet("../../Assignment_09/input/Brapa_reference/BrapaV1.5_chrom_only.fa")

The chromosome names have the date that the sequence was finalized appended. We need to remove that.

names(Brapaseq)
##  [1] "A01 [12.17-2010]" "A02 [12.17-2010]" "A03 [12.17-2010]" "A04 [12.17-2010]"
##  [5] "A05 [12.17-2010]" "A06 [12.17-2010]" "A07 [12.17-2010]" "A08 [12.17-2010]"
##  [9] "A09 [12.17-2010]" "A10 [12.17-2010]"
names(Brapaseq) <- str_remove(names(Brapaseq), " \\[.*")
names(Brapaseq)
##  [1] "A01" "A02" "A03" "A04" "A05" "A06" "A07" "A08" "A09" "A10"

Now, magically, we can just use the square brackets to subset the chromosome sequences to a set of promoter sequences. We do have to add the names back.

promoters <- Brapaseq[promoterRanges]
names(promoters) <- promoterRanges$ID
promoters
## DNAStringSet object of length 39609:
##         width seq                                           names               
##     [1]  1500 TATCATTCTCCCAAACCCAAA...AGAGCGTTCTATTGCAGAAAC Bra000001
##     [2]  1500 CTTAGATAATAACTTTAGATC...TTATTTAGTATTTTTTTTTTT Bra000002
##     [3]  1500 TCTAAGTAACCAGGGCGTTAG...CTTCTCTTGCTTTAGCTGGAT Bra000003
##     [4]  1500 AAATGTATTCACGATTGTGTT...GCTTTAAGAAGCAACCTTTCG Bra000004
##     [5]  1500 CGTGGCTCCGATCACCGAGTG...TAGAAAACATCAATCTTTTTA Bra000005
##     ...   ... ...
## [39605]  1500 AATATCTCTGGTTTAGTACAT...GTAAAAGACACAGGTGATTTA Bra040840
## [39606]  1500 GAGATAACTTACACGAAGACT...CGTCTCAGGATGCTCTGGATC Bra041026
## [39607]  1500 TAACAAAATTATATGAGCTCT...TTATATATTTTCCTTAATTTT Bra041027
## [39608]  1500 NNNNNNNNNNNNNNNNNNNNN...AAAACCAAAGTAAGTTGAACA Bra041081
## [39609]  1500 GAAGCGGATACTCGTAATCAT...CAAATCTCTCTCTCTTCTTCC Bra041082

We also need to convert “N” to “-“ in promoters. Otherwise motifs will match strings of “N”s

promoters <- DNAStringSet(gsub("N","-",promoters))

promoters

Load transcription factor binding motifs.

Professor Siobhan Brady has compiled a file of plant transcription factor binding motifs. You can download those from

wget https://bis180ldata.s3.amazonaws.com/downloads/RNAseq_Annotation/element_name_and_motif_IUPACsupp.txt

Place these in your Assignment_11/input/ directory

Load the motifs and convert to a good format for R

motifs <- read.delim("../input/element_name_and_motif_IUPACsupp.txt",header=FALSE,as.is=TRUE)
head(motifs)
motifsV <- as.character(motifs[,2])
names(motifsV) <- motifs[,1]
motifsSS <- DNAStringSet(motifsV)
motifsSS

Next we need to subset the promoters into those from our differentially expressed genes and those from expressed genes that are not differentially expressed.

#get names to match...there are are few names in the DEgene list not in the promoter set
DEgene.trt.match <- DEgene.trt$GeneID[DEgene.trt$GeneID %in% names(promoters)]

expressed.genes.match <- expressed.genes$GeneID[expressed.genes$GeneID %in% names(promoters)]

#subset promoter files
universe.promoters <- promoters[expressed.genes.match]
target.promoters <- promoters[DEgene.trt.match]
non.target.promoters <- universe.promoters[
  ! names(universe.promoters) %in% names(target.promoters)] # all expressed genes not in target set

Look for over-represented motifs

Before we look for all motifs, let’s start simple and look for a single motif. Here I will use the second motif from the motifSS object.

motifsSS[2] # a single motif from the database

If we want to count occurrences of a motif, we can use vcountPDict()

target.counts <- vcountPDict(motifsSS[2], 
            subject = target.promoters, 
            fixed = FALSE) #fixed = FALSE allows IUPAC ambiguity codes

Let’s take a look at the result

dim(target.counts) #1 row for each motif, 1 column for each sequence that we are searching
target.counts[1,1:20] # each cell contains the counts for a particular motif in a particular sequence

Because DNA is double stranded we need to search both strands:

target.counts <- 
  vcountPDict(motifsSS[2], 
            subject = target.promoters, 
            fixed = FALSE) +
    vcountPDict(motifsSS[2], 
            subject = reverseComplement(target.promoters), 
            fixed = FALSE)

We can count the total number of promoters that have at least 1 occurrence of the motif with the following code

target.total <- rowSums(target.counts>0)
target.total
length(target.promoters) - target.total

Of the 586 promoters of differentially expressed genes, 42 have the motif and 544 do not. How many are there in the promoters of genes that were not differentially expressed?

non.target.counts <- vcountPDict(motifsSS[2], 
            subject = non.target.promoters, 
            fixed = FALSE) +
    vcountPDict(motifsSS[2], 
            subject = reverseComplement(non.target.promoters), 
            fixed = FALSE)
non.target.total <- rowSums(non.target.counts > 0)
non.target.total
length(non.target.promoters) - non.target.total

Of the 22,166 expressed genes that are NOT differentially expressed, 1,559 have 1 or more copies of the motif in their promoters and 20,607 do not.

Is there an enrichment? We want to know if 42:544 is significantly different from 1559:20,607. We use Fisher’s exact test for this. First make a contingency table with the counts:

m <- matrix(c(
  target.total,
  length(target.promoters) - target.total,
  non.target.total,
  length(non.target.promoters) - non.target.total ),
  ncol=2, 
  dimnames = list(motif=c("Y", "N"), target=c("Y", "N")))

m

Now run Fisher’s test

fisher.test(m)

In this case there is no enrichment (look at the p-value).

OK but we have 100 motifs. Do we have to go through them one at a time? Thankfully, no.

I have written a function to wrap up the required code. Copy and paste this into R to create the function

#create a function to summarize the results and test for significance
motifEnrichment <- function(target.promoters,non.target.promoters,all.counts=F,motifs=motifsSS) {
  
  #use vcountPDict to count the occurrences of each motif in each promoter
  target.counts <- vcountPDict(motifs,target.promoters,fixed=F) + 
    vcountPDict(motifsSS,reverseComplement(target.promoters),fixed=F)
  non.target.counts <- vcountPDict(motifs,non.target.promoters,fixed=F) + 
    vcountPDict(motifsSS,reverseComplement(non.target.promoters),fixed=F)
  
  if (all.counts) { 
    #count all occurrences of a motif instead of the number of promoters that it occurs in
    target.counts.sum <- rowSums(target.counts)
    non.target.counts.sum <- rowSums(non.target.counts)
  } else {
    target.counts.sum <- rowSums(target.counts > 0)
    non.target.counts.sum <- rowSums(non.target.counts > 0)
  }
  n.motifs <- length(target.counts.sum)
  results <- vector(mode="numeric",length=n.motifs)
  for (i in 1:n.motifs) {
    if (all.counts) { #the contigency tables are different depending on whether we are looking at promoters or overall occurrences
      #test if ratio of occurrences to promoters is the same in the target and the non.target
      m <- matrix(c(
        target.counts.sum[i],                       #number of occurrences within target
        dim(target.counts)[2],                      #number of promoters in target
        non.target.counts.sum[i],                  #number of occurrences within non.target
        dim(non.target.counts)[2]                  #number of promoters in non.target
      ),ncol=2)
    } else { #looking at promoters with and without hits
      m <- matrix(c(
        target.counts.sum[i],                        #number of promoters in target with hit
        dim(target.counts)[2]-target.counts.sum[i],            #number of promoters in target with no hit
        non.target.counts.sum[i],                   #number of promoters in non.target with hit
        dim(non.target.counts)[2]-non.target.counts.sum[i]   #number of promoters in non.target with no hit
      ),ncol=2)
    } #else
    results[i] <- fisher.test(m,alternative="greater")$p.value
  } #for loop
  results.table <- data.frame(
    motif=names(motifs),
    non.target.percent = round(non.target.counts.sum/dim(non.target.counts)[2],3)*100,
    target.percent = round(target.counts.sum/dim(target.counts)[2],3)*100,
    p.value =  results)
  results.table <- results.table[order(results.table$p.value),]
  results.table
}

Now with the function entered we can do the enrichment

motif.results <- motifEnrichment(target.promoters, non.target.promoters)
head(motif.results)

The resulting table gives the p-value for enrichment for each motif, as well as the % of promoters in the non-target and in our target gene set that have the motif.

Exercise 4

a. How many motifs are enriched at P < 0.05? (Can you do this with code?)
b. What is the identity of the most significantly over-enriched motif? (No code needed)
c. What percentage of genes in the “non.target” have this motif? What percentage in our target set? (No code needed)
d. You can find information on the motifs here. Remembering that plants in the “DP” treatment experience shading from their neighbors, where as “NDP” plants do not. Do you think that the most enriched motif represents a biologically relevant result? Discuss why or why not.