Metagenomics

04 Jun 2026

Assignment 15: Metagenomics

Background

Metagenomics is a rapidly expanding field with the power to explain microbial communities with a very high resolution by leveraging next generation sequencing data. There are applications in the clinic, ecological environments, food safety, and elsewhere. By definition, metagenomics is the study of a collection of genetic material (genomes) from a mixed community of organisms, typically microbial.

Today we will walk through a common metagenomics workflow using DADA2and phyloseq by completing the following:

Determine the various microbial communities in our samples
Calculate the diversity within our sample (alpha diversity)
Calculate the diversity between different sample types (beta diversity)

Acknowledgement must be paid to Professor Scott Dawson for sharing his original metagenomics lab that we have adapted for this class, to the Sundaresan Lab for sharing the data from their publication, and to former TA Kristen Beck who wrote an earlier version of the lab.

Clone your repository

Clone your Assignment_15 repository

cd into the repository
There will be an assignment template for today’s lab.

Install DADA2

Divisive Amplicon Denoising Algorithm or DADA2 is an open-source bioinformatics software package for DNA sequencing analysis. It has been cited over 30,000 times since its publication in 2016.

Install Packages

if (!requireNamespace("BiocManager", quietly = TRUE)) # remove # if needed
    install.packages("BiocManager")

#BiocManager::install("dada2")

#install.packages("janitor")

library(dada2); packageVersion("dada2")

## Warning: package 'dada2' was built under R version 4.5.2

## Loading required package: Rcpp

## Warning: package 'Rcpp' was built under R version 4.5.2

## [1] '1.38.0'

library(janitor)

## Warning: package 'janitor' was built under R version 4.5.2

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Data

Switch to your Assignment_15/input directory

Download and unzip the data files:

#wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/Metagenomics/RiceSeqs.zip

We also need a reference file that has 16S sequences from different taxa, so that we can classify our reads. Download these to your input directory also.

#wget https://bis180ldata.s3.us-east-1.amazonaws.com/downloads/Metagenomics/SILVA_SSU_r138.2_v2.RData

Background for our Data

Today, we will be working with the samples collected from the rhizosphere of rice plants. The rhizosphere is an area of soil near the plant roots that contains both bacteria and other microbes associated with roots as well as secretions from the roots themselves. See diagram below from Phillppot et al., Nature, 2013.

plot of rhizosphere

In order to classify microbial diversity, metagenomics often relies on sequencing 16S ribosomal RNA which is the small subunit (SSU) of the prokaryotic ribosome. Our samples are sequenced from the V4-V5 region. This region has a slow rate of evolution and therefore can be advantageous in constructing phylogenies. For this lab, samples for various soil depths and cultivars were sequenced with Illumina sequencing. The de-multiplexed reads that we will be working with are in the RiceSeqs folder, and the sample information is in RiceMetaData.csv.

The naming conventions of the Sample IDs are a little abstract, so I have created a quick reference list for you here.

Replicates are indicated with 1, 2, 3 under the “Plant” column
Cultivars include M104, IR50 and Nipponbare
The different microbiomes are environmental materials are under the env_material column
- Endosphere = inside the root
- Rhizoplane = surface of the root
- Rhizosphere = 1mm from the root

Explore and Quality Control Data

Often times, as bioinformaticians, we will receive data sets with little background. Sometimes the first step is to explore the raw data that we will be working with. This can help us spot inconsistencies or logical fallacies down the line when working with more automated pipelines.

In the Terminal open RiceMetaData.csv with less to view more information about the data you are working with. This file contains information about each sample including the cultivar, microbiome or enviroment material, and number of biological replicates.

Try to determine the number of sequences for each sample. This can be accomplished using just Linux/Unix commands. I’ll start by giving you the tools, so you can try to piece together the command on your own.

Helpful Commands (in no particular order): wc, ls, echo, zcat and good ‘ol | to chain the commands together or create a for loop.

Exercise 1: Using information in the FastQC files or the fastq_files to answer the following questions.
a. Are the number of sequences for each sample approximately the same or are there any outliers? If so, which samples do they belong to?

b. Could a different number of sequences per sample affect future analysis? Explain your reasoning.

Now that we’ve poked around in our raw data, let’s carry on with analyzing the microbes present in our samples.

Set Path

We are using data that has already been demultiplexed and trimmed, so we can skip those steps and jump right into the analysis. Define the following path variable so that it points to the correct directory on your machine:

getwd()

## [1] "C:/Users/brysh/Documents/BIS180L_Spring_2026/Assignment_13_Lab/Assignment_15/scripts"

path <- "../input/RiceSeqs" 
list.files(path)

## character(0)

Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.

# Forward and reverse fastq.gz filenames have format: SAMPLENAME_1.fastq.gz and SAMPLENAME_2.fastq.gz
fnFs <- sort(list.files(path, pattern="_1_trimmed.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_2_trimmed.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

Inspect read quality profiles

Another way to inspect quality profiles of forward and reverse reads. For further details, view the fastqc files that are in the input/fastqc_files directory. For more details about the quality profiles, see the DADA 2tutorial.

# Forward reads

plotQualityProfile(fnFs[1:2]) #uncomment

# Forward reads

plotQualityProfile(fnFs[1:2]) #uncomment

Filter and trim

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

This step takes about 4 minutes to run.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(200,100),
             maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
              compress=TRUE, multithread=TRUE) # can change to TRUE if running within instance or none Windows #computer 
head(out)

Learn the Error Rates

# forward reads
errF <- learnErrors(filtFs, multithread=TRUE)

# reverse reads
errR <- learnErrors(filtRs, multithread=TRUE)

Visualize the estimated error rates

plotErrors(errF, nominalQ=TRUE)

Sample Inference

dadaFs <- dada(filtFs, err=errF, multithread=TRUE)

dadaRs <- dada(filtRs, err=errR, multithread=TRUE)

Merge paired reads

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])

Construct sequence table

seqtab <- makeSequenceTable(mergers)
dim(seqtab)

# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

##Remove chimeras

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)

sum(seqtab.nochim)/sum(seqtab)

Track reads through the pipeline

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)

#BiocManager::install("DECIPHER")

library(DECIPHER); packageVersion("DECIPHER")

Great Job you just processed your ASVs! Now lets assign Taxonomy.

To run the full code without filtering out low frequency ASVs run code chunk below: Or skip this chunk to save time.

dna <- DNAStringSet(getSequences(seqtab.nochim)) # Create a DNAStringSet from the ASVs
load("../input/SILVA_SSU_r138.2_v2.RData") 
ids <- IdTaxa(dna, trainingSet, strand="both", processors=NULL, verbose=TRUE) 
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species") # ranks of interest
# Convert the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomy
taxid <- t(sapply(ids, function(x) {
        m <- match(ranks, x$rank)
        taxa <- x$taxon[m]
        taxa[startsWith(taxa, "unclassified_")] <- NA
        taxa
}))
colnames(taxid) <- ranks; rownames(taxid) <- getSequences(seqtab.nochim)

Assigning taxonomy is one of the longest steps, I filtered the sequence table to reduce the time, but it still takes around 30 minutes even after filtering.

# Remove ASVs seen fewer than 10 times total
seqtab.nochim.filtered <- seqtab.nochim[, colSums(seqtab.nochim) > 10]

Run this Code:

dna_filtered <- DNAStringSet(getSequences(seqtab.nochim.filtered)) # Create a DNAStringSet from the ASVs
load("../input/SILVA_SSU_r138.2_v2.RData") 
ids_filtered <- IdTaxa(dna_filtered, trainingSet, strand="both", processors=NULL, verbose=TRUE) 
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species") # ranks of interest
# Convert the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomy
taxid_filtered <- t(sapply(ids_filtered, function(x) {
        m <- match(ranks, x$rank)
        taxa <- x$taxon[m]
        taxa[startsWith(taxa, "unclassified_")] <- NA
        taxa
}))
colnames(taxid_filtered) <- ranks; rownames(taxid_filtered) <- getSequences(seqtab.nochim.filtered)

Exercise 2: Examine the taxid_filtered object. Do you notice anything interesting about the species column? Why do you think that it is rare to get species level identification with using 16S amplicon sequencing. Do you think there is any way to get a higher level of taxonomic resolution when using ASVs from 16S?

Construct the Phyloseq Object

We will use the R package phyloseq To analyze the OTU data.

First, install it (only needs to be done once).

In R:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

#BiocManager::install("phyloseq")
#BiocManager::install("Biostrings")

You will see many warning messages go by. It is okay to ignore them.
When asked “Update all/some/none? [a/s/n]:” you can choose “n”

Load the libraries

library(phyloseq)
library(tidyverse)
library(Biostrings)

Import the metadata table

meta_dat <- read.csv("../input/RiceMetaData.csv", row.names = 1, check.names = F) %>% 
  clean_names() %>% 
  mutate(cultivar_env = paste(cultivar, env_material, sep = " - "))
head(meta_dat)

# Build each component separately first
OTU <- otu_table(seqtab.nochim.filtered, taxa_are_rows = FALSE)
SAM <- sample_data(meta_dat)
TAX <- tax_table(as.matrix(taxid_filtered))

rice.ps <- phyloseq(OTU, SAM, TAX)

rice.ps

Lets store the DNA sequences of our ASVs in the refseq slot of the phyloseq object, and then rename our taxa to a short string. That way, the short new taxa names will appear in tables and plots.

code <- Biostrings::DNAStringSet(taxa_names(rice.ps))
names(code) <- taxa_names(rice.ps)
rice.ps <- merge_phyloseq(rice.ps, code)
taxa_names(rice.ps) <- paste0("ASV", seq(ntaxa(rice.ps)))

print(rice.ps)

Filter our PS Object

Lets remove any ASV that are not real or uninformative

rice.ps <- subset_taxa(rice.ps, order != "Chloroplast" & family != "Mitochondria" & domain != "Eukaryota" & phylum != "NA" )

print(rice.ps)

Do some more filtering to remove rare sequences

rice.ps.small <- filter_taxa(rice.ps, function(x) sum(x > 5) > 2, prune=TRUE) #require greater than five observation in more than two samples
rice.ps.small <- prune_taxa(complete.cases(tax_table(rice.ps.small)[,"phylum"]),  rice.ps.small) #remove taxa from unknown phylum
rice.ps.small

print(rice.ps.small)

Use the rice.ps.small data set for the remainder of this lab

Heatmap

As we learned in previous labs, we can rely on the human eye to help pick out patterns based on color. We are going to make a heat map of the ASVs per sample. The OTU table is visualized as a heat map where each row corresponds to an ASV and each column corresponds to a sample. The higher the relative abundance of an ASV in a sample, the more intense the color at the corresponding position in the heat map. ASVs are clustered by Bray-Curtis dissimilarity. For a refresher on biological classification, view this helpful wiki page.

#plot_heatmap(rice.ps.small, sample.label = "cultivar_env", sample.order = "env_material")

Exercise 3: Although, the resolution of the y-axis makes it impossible to read each ASV, it is still a valuable preliminary visualization. What types of information can you gain from this heat map? Are there any trends present at this stage with respect to the various samples?

Barplots

Now we’d like to visualize our data with a little higher resolution and summarize the communities by their taxonomic composition.

Use the following command to generate a boxplot to describe our samples. Here we are coloring by phylum, but we could use any level of taxonomic information

plot_bar(rice.ps.small, fill="phylum")

a line separates every ASV, which leads to a lot of black. We can get rid of the black lines with

pl <- plot_bar(rice.ps.small, fill="phylum")
pl + geom_col(aes(fill=phylum, color=phylum)) 

If you prefer to look at the counts, you can use the psmelt() function to create a data frame, and then use tidyverse functions, for example:

rice.ps.small %>%
  psmelt() %>%
  group_by(phylum) %>%
  summarize(Abundance=sum(Abundance)) %>%
  arrange(desc(Abundance))

These are helpful visualizations/summaries, but what if you want to group by cultivar or env_material? That brings us to exercise 4…

Exercise 4:

a. Make a bar plot with samples grouped by env_material

hint: Open the help page for plot_bar to work out how to do this (You may also want to look at the phyloseq plot_bar tutorial)

b. Make a table showing the phylum abundance by env_material (build on code above). I’ve shown what the first two rows should look like; yours should have 11 rows.

c. When comparing by env_material, which two phylum is most abundant across all env_material? Are there any predominant phyla greatly enriched (greater than 10-fold) in one env_material compared to the others?

d. Repeat a, b, and c but this time grouping by cultivar. When comparing by cultivar, is the predominant phyla consistent with that observed in Part B? Are there any predominant phyla unique to a specific cultivar? What does this indicate to you about the effect of the genotype and/or the effect of the sample location?

Now that we know a little more information about the ASVs in our sample, we’d like to calculate the diversity within a sample –the alpha diversity– and between our samples –the beta diversity.

Determine the Diversity Within a Sample

Alpha diversity tells us about the richness of species diversity within our sample. It quantifies how many taxa are in one sample and allow us to answer questions like “Are polluted environments less diverse than pristine environments?”.

There are more than two dozen different established metrics to calculate the alpha diversity. We will start with a small subset of methods. Feel free to read more details about other metrics here.

generate plots

#plot_richness(rice.ps.small, measures=c("Observed", "Chao1", "Shannon"))

Exercise 5:
Look at the help file for plot_richness and plot the samples grouped first by cultivar and then make a separate plot with them grouped by env_material and finally by both. Do either of these categories seem to affect species diversity?

Exercise 6:
Is there an alpha diversity metric that estimates the sample diversity differently than the other metrics? If so, which one is it? (This is not a question about the scale of the y-axis, but the patterns of which samples show relatively high and low diversity).

Visualize the Diversity Between Samples

The definition of beta diversity has become quite contentious among ecologists. For the purpose of this lab, we will define beta diversity as the differentiation among samples. To quantify beta diversity, we will calculate the pairwise Bray-Curtis dissimilarity, resulting in a distance matrix. For more information about other definitions/uses of beta diversity, see the wikipedia page.

First compute the distances and compute MDS coordinates

rice.ord.small <- ordinate(rice.ps.small, method="NMDS", distance="bray")

Now plot:

pl <- plot_ordination(rice.ps.small, rice.ord.small, type="samples", color="cultivar", shape="env_material") 
pl + geom_point(size=4)

Exercise 7:

a. Does cultivar or env_material appear to have more of an influence on the clustering?

b. Which two env_materials are more similar to one another? Does that make biological sense?

Congratulations!

You now have learned how to do some classic metagenomics analyses.