Quality control of RNA-seq data
Do not rush through the quality control step of your RNA sequencing (RNA-seq) data processing pipeline. It is really the data exploration part of whatever pipeline you are running, and you will thank yourself later for taking the time to execute it properly and thoroughly. At BioStrand, quality control (QC) is an important part of RNA expression analysis and variant calling pipelines. The list of things that can go wrong with RNA-seq data is extensive, so your best chance to avoid some funky chemistry or biology is to put extra effort into the quality control step.
In this post, I will briefly review some common steps and associated tools for QC, as well as some tips and tricks that I wish had been mentioned in that off-the-shelf tutorial! I will focus on RNA-seq data specifically, since it involves a couple of additional levels of complexity compared with whole genome sequencing data.
Roughly speaking, QC consists of two main components:
- QC of raw reads
- QC of aligned reads - this step is needed because some problems are invisible before you have aligned the reads to the reference genome/transcriptome.
Quality control of raw reads
This part is essentially the same as for whole genome sequencing data. You want to check at least the following metrics:
- Phred Quality Scores (per base and average per sequence)
- Adapter content
- Duplication rate
- GC content
A canonical tool for collecting all of the above (and more!) is FastQC.
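To make two of these metrics concrete, here is a minimal sketch of how mean Phred quality and GC content could be computed per read. It is not how FastQC works internally; it assumes a plain four-line-per-record FASTQ with Phred+33 encoding, and the function names and the tiny in-memory example are made up for illustration.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or a blank line) stops the iteration
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


def gc_fraction(seq):
    """Fraction of G and C bases in a read; 0.0 for an empty read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def per_read_summary(handle):
    """Collect mean Phred quality and GC fraction for every read."""
    summary = []
    for name, seq, qual in read_fastq(handle):
        scores = phred_scores(qual)
        summary.append({
            "name": name,
            "mean_q": mean(scores) if scores else 0.0,
            "gc": gc_fraction(seq),
        })
    return summary


# Hypothetical two-read example; 'I' encodes Q40 and '!' encodes Q0 in Phred+33.
example = io.StringIO(
    "@read1\nGGCCAATT\n+\nIIIIIIII\n"
    "@read2\nATATATGC\n+\n!!!!IIII\n"
)
summary = per_read_summary(example)
for row in summary:
    print(row)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q40 means roughly one expected error in 10,000 base calls. For real data, a dedicated tool remains the better choice; the sketch only shows what the numbers mean.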
When it comes to reads that underperform on some of the QC metrics, the eternal questions are “To trim or not to trim?” and “To filter or not to filter?” Unfortunately, a “one size fits all” answer does not and cannot exist. Frequently, your computational tools cannot help here, and insight from a biologist becomes indispensable.
Quality control of aligned reads
Once the reads are aligned, some other useful QC metrics become available:
- Percent of reads mapped to the reference
- Number of ambiguous alignments
- Genomic origin (exonic, intronic or intergenic)
- Transcript coverage profile
Qualimap is one of the tools covering these metrics.
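As a rough illustration of how the first three metrics relate to each other, the sketch below turns raw read counts into percentages. It assumes you have already extracted the counts from your aligner or QC tool; the function name, the input layout, and the numbers are hypothetical and do not reflect Qualimap's output format.

```python
def alignment_qc(total_reads, uniquely_mapped, multi_mapped, origin_counts):
    """Summarise alignment-level QC metrics as percentages.

    origin_counts maps a genomic origin (e.g. exonic, intronic, intergenic)
    to the number of aligned reads assigned to it.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    assigned = sum(origin_counts.values())
    if assigned <= 0:
        raise ValueError("origin_counts must contain at least one read")

    mapped = uniquely_mapped + multi_mapped
    return {
        # Share of all sequenced reads that aligned at least once.
        "pct_mapped": 100.0 * mapped / total_reads,
        # Share of all sequenced reads with ambiguous (multiple) alignments.
        "pct_multi_mapped": 100.0 * multi_mapped / total_reads,
        # Distribution of assigned reads over genomic origins.
        "origin_pct": {
            origin: 100.0 * count / assigned
            for origin, count in origin_counts.items()
        },
    }


# Hypothetical counts for a single sample.
metrics = alignment_qc(
    total_reads=1_000_000,
    uniquely_mapped=800_000,
    multi_mapped=100_000,
    origin_counts={"exonic": 720_000, "intronic": 135_000, "intergenic": 45_000},
)
print(metrics)
```

What counts as an acceptable value depends on the organism, the library preparation, and the reference, so these percentages are most useful when compared across samples from the same experiment.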
Personal experience and extra tips
The last time we did RNA-seq data processing at BioStrand on a collection of samples, we had a strong suspicion that the data had significant ribosomal RNA (rRNA) contamination. The suspicion was triggered by a rather high number of ambiguously aligned reads. After an excessive amount of time researching what proportion of multi-mappers should be acceptable, a simple solution emerged: look at the most highly expressed genes. For the samples in question, the rRNA genes were indeed among the most highly expressed genes. Somewhat annoying, but good to be aware of.
The same collection of samples exhibited another unwelcome feature: our processing pipeline required significantly more resources for some of the samples than for the others. In hindsight, those troublesome samples could have been identified in advance by careful examination of the QC metrics.
These hiccups motivated the introduction of additional simple but effective steps into our QC pipeline:
- Look at the top 10 expressed genes and the proportion of reads aligned to them. This can be very useful for identifying rRNA which was not washed away during the library preparation.
- If you have many samples, use principal component analysis (PCA) based on QC metrics to identify outliers. Hierarchical clustering can serve the same purpose.
- Aggregate the output of different tools using MultiQC, especially if you have multiple samples. This will provide a nice overview of the output of all the QC tools and will help to pick out outliers if present.
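The first of these steps is easy to sketch. The example below assumes you already have a per-gene read count table for one sample and a set of gene identifiers you consider to be rRNA; the function names, gene names, and counts are all hypothetical.

```python
def top_gene_share(counts, n=10):
    """Return the n most highly expressed genes with their share of all counted reads."""
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]
    return [(gene, count / total) for gene, count in ranked]


def flag_rrna(top_genes, rrna_genes):
    """Keep only the top genes that appear in a user-supplied set of rRNA gene IDs."""
    return [(gene, share) for gene, share in top_genes if gene in rrna_genes]


# Hypothetical counts for one sample; real tables have thousands of genes.
counts = {
    "rRNA_gene_1": 500,
    "gene_A": 300,
    "gene_B": 150,
    "gene_C": 50,
}
top = top_gene_share(counts, n=2)
flagged = flag_rrna(top, rrna_genes={"rRNA_gene_1"})

for gene, share in top:
    print(f"{gene}: {share:.1%} of counted reads")
print("rRNA among top genes:", flagged)
```

If a handful of rRNA genes account for a large share of the reads in some samples but not in others, that is a strong hint that rRNA removal during library preparation was incomplete for those samples.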
These QC steps have become an integral part of our RNA expression analysis and variant calling pipelines at LENSai (IPA). I hope these tips for QC of raw and aligned reads will be helpful for you as well, and will save you some of your valuable time.
Useful references
There are plenty of tutorials dedicated to RNA-seq data processing. Here are a couple with comparatively extended discussions on different QC steps as well as a detailed interpretation of what different anomalies might indicate chemically or biologically:
- Bulk RNA-seq Data Analysis using High-Performance Computing (bulk RNA-seq Part I – FASTQ to counts)
- Introduction to RNA-Seq - From quality control to pathway analysis