Working with Illumina Sequence data

Today we begin a series of six labs that all work with the same data set. The data is RNA sequencing data generated from Brassica rapa, a plant with varieties that can be grown for turnips, Napa cabbage, or oil seeds (like canola).

In these labs we will explore how gene expression is influenced by plant variety, growth conditions, and organ type. In this experiment we have:

  • Two plant varieties:
    • IMB211: a variety with rapid reproduction that has been bred for laboratory work
    • R500: an oil-seed variety
  • Multiple growth conditions
    • growth chamber: simulated sun or shade
    • greenhouse: crowded or uncrowded
    • field: crowded or uncrowded
  • Multiple tissue types
    • seedling stem (hypocotyl)
    • leaf blade
    • leaf petiole
    • root
    • internode
    • inflorescence (young flower stem)
    • silique (fruit)

We won’t work with all of these samples through all of the labs, but the final section on genetic networks will make use of most of them.

Over the next several weeks the goals are to:

  1. Learn about Illumina reads, how to map them, and quality control (today)
  2. How to view reads in a genome browser and how to find single nucleotide polymorphisms (Thursday)
  3. Find genes that are differentially expressed between genotypes or treatments (next week)
  4. Ask if differentially expressed genes have any common functionality (Gene ontologies) or promoter motifs
  5. Build a gene regulatory network to determine how genes connect to one another (two weeks from now).

Background for today’s lab.

Fastq files

The raw output from an Illumina sequencer is a fastq file. Illumina sequencers generate 350 million sequences of 50 to 250 bp in length per flow-cell “lane”.

Illumina Sequencing

Want more info than was in my lecture?

Multiplexing

Because each lane on a sequencer generates 350M reads we often want to run more than one sample per lane. This is called multiplexing. After sequencing we want to assign each read to the sample that it originally came from. How? When libraries are made we include a short “barcode” or “index” on each adapter that indicates the library from which it came. We can then read the barcode from each sequence to assign it to a particular sample.

There are two general procedures. In line barcodes are placed at the ends of the adapter, adjacent to the DNA of interest. They are the first bases sequenced and form the beginning of the read. This is the type of barcode that we have in this data set. In this case the end user (us) must sort the reads according to the barcodes and trim the barcodes off of the DNA sequence.

Index tags are internal in the adapter and are read in a separate, short sequencing reaction. In this case Illumina software will automatically sort the sequence reads for the end-user before delivery. This is the current standard and what you are most likely to encounter other than in today’s lab.

Quality Control

The Illumina sequencer assigns a phred quality score Q to each base of sequence. The quality ranges from 0(low) to 41 (high) and is defined as Q = -10*log10(P) where P is the probability of an error. So a phred score of 40 corresponds to an error probability of 1 in 10,000 ( P = 10^-(Q/10) ) In current Illumina data these are encoded as the ASCII characters “!”=0 to “J”=41, but in previous Illumina data the range was from “@”=0 or “B”=3 to “h”=40 See the fastq wiki for more information.

Mapping

Our overall goals with this sequence data are to

  1. Find single nucleotide polymorphisms (SNPs) between the two genotypes, R500 and IMB211
  2. Find genes that are differentially expressed.

In order to do either of these we need to know where in the genome the reads map.

There are a number of read mapping programs available. One popular general purpose aligner is bwa. Kallisto is good for mapping RNA-seq data to cDNAs. However, we will be mapping RNAseq to a genomic reference, therefore we want a program that will splice across introns. STAR is one of the best for this task and that is what we will use today.

Outline of work

  1. Check FASTQ quality
  2. Trim reads to keep high-quality reads
  3. Split reads based on barcodes
  4. Map to reference genome

Files

We will work with the following file types today

  • .fastq – file of short read data
  • .fa – fasta files for reference genome
  • .sam – sequence alignment/map file for mapped reads
  • .bam – the binary version of a sam file
  • .bai – index for bam files
  • .gff – genome annotation: information about where the genes are in the genome

The Lab

Setup

Additional software

There are a few additional packages that need to be installed on your computer for today’s lab.

multiqc

multiqc is a program for aggregating the reports of quality check programs such as fastqc, which we will use to assess our data quality.

Install it from the Linux command line with:

python3 -m pip install multiqc # In most cases you can just do pip3 install multiqc
sudo ln -s ~/.local/bin/multiqc /usr/local/bin/multiqc

Trimmomatic

Use the commands below to download Trimmomatic, a program to filter reads based on quality scores and other criteria. You can use the wget command (below) to download directly, or if you download from your browser, be sure to click on the “binary” version.

From the Linux command line:

cd ~/Downloads
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip

Next move that folder into the /usr/local/bin directory, a common place for user installed programs.

sudo mv Trimmomatic-0.39 /usr/local/bin

The trimmomatic program itself is written in java. We use a bash alias to make it easier to start. We will place the alias in the file ~/.bashrc; this is a bash script that is run every time bash starts.

Use nano to open your ~/.bashrc file. (Note that the file is hidden because it starts with a .. To see it at the command line use ls -a). If the file appears empty when you opened it you made a mistake.

cd
ls -a
nano .bashrc

After you have the file open then paste the following lines to the end of it and save the file.

# alias to run trimmomatic
alias trimmomatic="java -jar /usr/local/bin/Trimmomatic-0.39/trimmomatic-0.39.jar"

Finally source your .bashrc file to set the alias for this session. You won’t need to do this again.

source .bashrc

You can test whether this works by typing trimmomatic at the command line. You should get a short usage statement. If you get a command not found error, then something went wrong.

auto_barcode

auto_barcode is a program for splitting fastq reads based on their barcode, written by Mike Covington, a bioinformaticist formerly in the Maloof Lab (and founder of his own company, Amaryllis Nucleics) which was acquired by Active Motif Incorporated in 2022.

auto_barcode is written in Perl and we need to install a few Perl modules before we install autobarcode. To do this we use cpan which is a module installer for Perl.

First step is to update cpan

cpan install CPAN

cpan will ask you a couple of questions, first whether you want it to autoconfigure. Press the return key at the prompt to accept the default yes

Next it will ask you about privileges. Here you should type sudo

Next install the required module:

sudo cpan install Text::Levenshtein::XS

Press the return key at the prompt to accept the default yes for autoconfigure

Now we are now able to install auto_barcode

You can cut-and-paste(!) the commands below into your terminal to install auto_barcode.

First we clone the repo:

cd /usr/local/bin
sudo git clone https://github.com/mfcovington/auto_barcode.git

When you type a command on the command line, the shell only looks in certain directories to find the requested program. We need to either modify the PATH variable to include the path to our new program, or make a symbolic link to place it in a directory already in the path. We will choose the later option here:

sudo ln -s /usr/local/bin/auto_barcode/barcode_split_trim.pl /usr/local/bin/barcode_split_trim.pl

test it:

barcode_split_trim.pl

That should produce a usage statement. If you got an error saying command not found then something went wrong.

Data

  • Clone your Assignment_9 repository.
  • Change directories into the Assignment_9 directory.
  • Now change directories into the input/Brapa_reference directory.

Download and unzip the Brassica rapa fasta reference file:

wget https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/BrapaV1.5_chrom_only.fa.gz
gunzip BrapaV1.5_chrom_only.fa.gz

Download the Brassica rapa gff reference file:

wget https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/Brapa_gene_v1.5.gff.gz
gunzip Brapa_gene_v1.5.gff.gz

Change directories to the input/Brapa_fastq directory

Download the fastq file

wget https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/GH.lane67.fastq.gz

Note: I have placed these filenames in the .gitignore file for this repository so they will not be uploaded to github; they are too big

Index the B. rapa genome

STAR will needs a memory-efficient index of the reference genome file. We create that with the command below. It takes about 3 minutes to run.

The first argument is the filename of the reference (fasta) file. The second argument is the filename “stem” for the resulting index.

Change directories back to the input/Brapa_reference directory, then:

mkdir Brapa_STAR_index

STAR \
    --runThreadN 7 \
    --runMode genomeGenerate \
    --genomeDir Brapa_STAR_index \
    --genomeSAindexNbases 12 \
    --sjdbOverhang 93 \
    --sjdbGTFfile Brapa_gene_v1.5.gff \
    --sjdbGTFfeatureExon CDS \
    --sjdbGTFtagExonParentTranscript Parent \
    --genomeFastaFiles BrapaV1.5_chrom_only.fa 

Checkout the fastq file

Take a look at the fastq file back in the input/Brapa_fastq directory.

Exercise 1:

a What is the read length? (can you do this without manually counting?)
b What is the machine name?
c How may reads are in this file? (show how you figured this out)
d Are the quality scores Phred+33 or Phred+64? (how did you figure this out?)

Check sequence quality

We want to know if our reads have good sequence quality and to check for other possible errors in the sequence data. Type fastqc at the command line. This will open up a GUI. Point it to your fastq or fastq.gz file and have it check the quality. (It is also possible to run fastqc entirely from the command line if you are processing many fastq files).

Exercise 2: Compare your fastq results to the examples of good sequence and bad sequence on the fastqc website. Comment on any FastQC items that have an “X” by them in the report. What might have caused these issues? (hint: think about barcodes).

Filter reads

It is generally a good idea to trim reads when their quality drops below 20 or so. I had always used Trimmomatic in the past, but recently I have been hearing good things about fastp. Let’s try it out.

Exercise 3: Take a look at the fastp README (Or type fastp at the command line to get the usage guide) and figure out how to run fastp to accomplish the following (note: our reads are single end):

  • Output file is named GH.lane67.fastp_trimmed.fastq.gz
  • Minimum length is 50bp
  • Require a quality of 20 or greater in at least 70% of the bases
  • If necessary, tell fastp what Phred encoding we have
  • Use the --cut_right option to trim reads where the average base quality in a 4 base window drops to 20 or less
  • You can leave the rest as default (i.e. you don’t need to turn other filtering off)

a What fastp command did you use?
b Look at the output printed to the screen. How many reads were removed by trimming? How many bases are left in the data set?
c Rerun fastp with exactly the same command. Compare the number of reads removed and the number of bases remaining from the two runs. Any concerns? d Maybe there is some problem with the way they have implemented multi-threading. fastp uses 3 threads by default. Rerun fastp but now set the number of threads to 1 (you can use the --thread 1 argument). Compare the number of reads removed and the number of bases remaining in all three runs. Any concerns?

Exercise 4

a Let’s see how Trimmomatic compares. Run it twice with 4 threads (default) and twice with 1 thread, you can just cut and paste the code below. Which program do you think we should use, and why?

trimmomatic SE GH.lane67.fastq.gz GH.lane67.trimmed.fastq SLIDINGWINDOW:4:20 MINLEN:50
trimmomatic SE GH.lane67.fastq.gz GH.lane67.trimmed.fastq SLIDINGWINDOW:4:20 MINLEN:50
trimmomatic SE -threads 1 GH.lane67.fastq.gz GH.lane67.trimmed.fastq SLIDINGWINDOW:4:20 MINLEN:50
trimmomatic SE -threads 1 GH.lane67.fastq.gz GH.lane67.trimmed.fastq SLIDINGWINDOW:4:20 MINLEN:50

b rerun FastQC on the trimmomatic trimmed sequences. Which issues did the trimming fix?

Barcode Splitting

The fastq file contains reads from many different samples, indexed with a 5’ barcode on each read. We need to split the reads according to the barcode and then remove the barcode from the 5’ end.

Exercise 5: Look at the README for auto_barcode and figure out how to run it to split your samples. Specify that the split fastq files are placed in the directory split_fq. Use the Perl (.pl) version of the script

hint: at the bottom of the README are three examples of how to run the command. The second one is pretty close to what you need

a what command did you use?
b what percentage of reads did not match a barcode? What are possible explanations?

check quality of each sequence library

Now that we have split the sequences by barcode we can check the quality of each library individually.

To do this we will first run fastqc in command line mode and then run multiqc to aggregate the results.

The code below assumes that you are in your Assignment_09/input/Brapa_fastq directory. cd there if you are not.

mkdir ../../output/fastqc_out
fastqc --threads 7 -o ../../output/fastqc_out split_fq/*.fq

cd ../../output
multiqc fastqc_out

You may need to change the paths above.

Now look at the report. You can have Firefox open the html from the command line:

firefox multiqc_report.html

Exercise 6:
a: What library has the worst quality based on the Sequence Quality Histograms? Is it bad enough to concern you?
b: Click on each of the other tests. Which three libraries are most problematic? In which tests?
c: Do you think the libraries should be removed from downstream analysis?

Mapping

If we identified serious problems from multiqc we would probably want to do some additional filtering or remove those libraries. But for now we will just proceed with all the libraries.

Now map/align the sequences to the indexed reference. We will use STAR; we alread made the index for STAR earlier in this lab. Note that there are many options that could be specified (See the manual for more info.

Change directories back to the Assignment_09 directory

# assumes your are in input/Brapa_fastq
cd ../../

A single fastq file could be mapped using:

time STAR --runThreadN 7 \
--genomeDir input/Brapa_reference/Brapa_STAR_index \
--readFilesIn input/Brapa_fastq/split_fq/IMB211_DP_1_SILIQUE.fq \
--outFileNamePrefix output/STARout/IMB211_DP_1_SILIQUE.fq_ \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 2 \
--alignIntronMax 5000

Exercise 7: Use a bash for loop to run STAR on all of the fastq files. Show your code here. Make sure that the results are going somewhere in your output directory and make sure that each input file is getting written to a different output file.

That is it for today. You can leave your loop running once you know that it is working.