Final Exam Info and Study Guide
02 Jun 2026The final is NOT comprehensive, it will focus on topics since the midterm but because the class is cumulative, there may be some questions that require you to apply concepts from earlier in the class.
The final will be in two parts: a take-home part and an in-class part. The take-home part will be open book/web and unlimited time. The in-class part will be closed book but you can bring one hand-written page of notes.
In Class part
This will be a concept and knowledge-based exam and will not include true scripting/coding. You may be asked to pseudo-code or to find a code error.
You can bring one hand-written page of notes.
Study guide
- Know the purpose of the the various file formats we have used (fastq, sam/bam, vcf) and what types of data they contain. I will not ask you to recreate one of these but I likely will give you a snippet of one and ask you to explain/interpret the columns.
- Know what a PHRED score is and how it relates to the probability of error in a base call
- Be able to convert a PHRED score into a probability of error and vice versa. You should be able to do this without a calculator for simple scores (e.g. 10, 20, 30) and you should know the formula for converting between the two so that you could do it with a calculator for more complex scores.
- Know the difference between PHRED+33 and PHRED+64. You do not need to memorize the ASCII codes for the characters, but if I gave you the PHRED+33 and PHRED+64 character keys along with a fastq file snippet, you should be able to determine which PHRED encoding is being used and convert the quality scores to probabilities of error.
- What is mapping quality and how does it differ from base quality?
- How to read a CIGAR string and what the various indicators (M, N, I, D) mean.
- I will NOT ask you to interpret the bitwise FLAG.
- How can you (or a SNP caller) distinguish between a true SNP and a sequencing error?
- What are the steps involved between getting raw sequencing data and performing differential gene expression analysis? What are the tools we have used for each step and what do they do?
- What is clustering, why would we want to do it? What methods have we used for clustering and how do they differ? (Hierarchical vs k-means)
- Know the steps for constructing a gene co-expression network.
- Define node, edge, and degree in the context of a gene co-expression network.
- Explain the difference between degree and betweenness centrality in a gene co-expression network and how they can be used to identify important genes in the network.
- Given a simple network graph, be able to pick the node with the highest degree and the node with the highest betweenness centrality.
- What is the basis for GO and promoter enrichment analyses?
- Know the difference between alpha and beta diversity and what they are used for in the context of metagenomics.
Example Questions
Question 1 Consider the following data from an Illumina sequencing experiment:
A00887:346:H2VK2DSX2:1:1141:3884:31986 163 scaffold_0 43675 255 70M89N73M = 43736 293
CATTTCTCACCTCCTCAAGGCAACTTTCAAGCTCCTTCAATTCTTCATCCTCCGAGAAGCTCACTGTGGCTTGTTTGATTGTGTTCTTCAAATGCATCTCAGCACTAAAGAGCTCTCGCCTGCTTCCTGTGGACACTGAGATC
FI,5FFFFFFFFFFFFFFFFFFF,FFFFFF:FFFFFF,FF:FFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FF:FFFFFFFFFFFFFF:
NH:i:1 HI:i:1 AS:i:276 nM:i:6
A (2pts): The data above comes from what kind of file?
B: (3pts): The sequencing quality is in Phred+33 or Phred+64? (Note: I would give you this chart for the exam.)
C: (2pts) Convert the quality of the fourth base to a Phred score.
D: (2pts) Convert the Phred score from part C to a p-value. If you aren’t sure how to do this, give the formula.
E: (3 pts) Explain what the probability from D means in terms of confidence in the sequence. If you are stuck on C or D then just pick a p-value to illustrate (like what does a p-value of 0.01 mean in this context).
F: (2pts) What does the number “43675” refer to?
G: (2pts) What does the number “255” refer to?
H: (2pts) What does “70M89N73M” mean?
Question 2 You have performed a differential gene expression analysis to find genes expressed differentially in the developing petals of red vs white roses. You have a list of 963 genes that are higher in red petals and 1204 genes that are higher in white petals. Briefly describe three follow-up analyses you could do with these gene lists to try to understand the biological basis for the differences in petal color. For each analysis explain what you would hope to learn. Two sentences for each analysis.
Question 3 Examine the gene co-expression network graph below.

A: What is a node in this graph? Describe what the nodes in this graph represent.
B: What is an edge in this graph? Describe what the edges in this graph represent.
C: Which gene has the highest degree centrality? Explain your choice.
D: Which gene has the highest betweenness centrality? Explain your choice.
Question 4 You are examining a .vcf file. On different rows you see the following entries. What does each of these mean?
A: 0/1
B: 1/1
C: 0/0
Question 5 You have performed a metagenomics analysis of the microbial communities in the soil of a tomato farm and the soil of a nearby grassland. You find that the alpha diversity of the tomato farm soil is much lower than that of the grassland soil. What does this mean? What could be some reasons for this difference?
Note: not all topics covered in the final will be represented in the example questions above. The example questions are meant to give you a sense of the types of questions that will be on the exam, but they are not meant to be an exhaustive list of topics or question types.
Take Home
No study guide for this; it will be open book/web and unlimited time. You will be given data sets and asked to do analyses similar to what you have done in the labs and assignments.