Quantifying reads for RNA-seq using STAR

Quantifying reads in general

There are a couple of issues to consider. Firstly are you interested in genes or transcripts? Secondly do you have a genome? The recipe here is for the genes / genomes pair, using STAR. This provides an easy entry into subsequent analysis of differential expression using e.g. the DESeq2 R package, but there are alternative pipelines, such as kallisto and sleuth, that may better suit your purposes. There is a big literature on accurately inferring differential gene expression - it is a good idea to be familiar with why you are making the choices you’re making.

Using STAR

The STAR manual is good.

I find it helps keep things in order to make three directories for STAR analyses. One to hold the fastq reads to be mapped, one to hold the genome, and one to do the analysis in. This isn’t essential, but there tend to be a lot of files in complex analyses.

For this example, I’m going to assume that we’re working in a base directory with three subdirectories called fastq/, genome/ and star_mapping/

Making a genome database for STAR

Copy the files clytia_hm2 (or your genome fasta file), filtered_for_orf.gtf (or your gene annotation file) and mk_index.com to a new genome directory (so, if you’re following the directory structure above that would be in genome/).

In that directory, run the command in mk_index.com.

STAR --genomeDir . --genomeFastaFiles clytia_hm2 --sjdbGTFfile filtered_for_orf.gtf --genomeChrBinNbits 16 --genomeSAindexNbases 14 --runMode genomeGenerate

This will take a while - it combines the annotation in the gtf file with the sequence in the genome file to make a database that STAR can search, so that it knows which genes are mapped when reads map to a particular region of the genome. The --genomeChrBinNbits 16 and --genomeSAindexNbases 14 relate to issues searching small and fragmentary genome and are discussed in the STAR manual. When it is done there should be many new files in the working directory.

Generally you only need to do this once, although I have seen STAR complain that the database version has changed (i.e. if you use a new version of STAR, you may need to recreate the database).

Making a directory of fastq files

In your base directory, make a directory called fastq. It doesn’t have to be called that, but in my examples, that is its name. Copy fastq files to be mapped to this directory. They can be compressed. For this example I have copied in a set of paired end reads Clytia-aboral-1_1.fq.gz and Clytia-aboral-1_2.fq.gz

Mapping with STAR

In the base directory make a directory called star_mapping. (Again the name isn’t important). cd into that directory. We’re now ready to do the mapping and quantification. The final command is a bit complicated so we’re going to build it up gradually here. The basic form of the command is:

STAR --genomeDir <genome_directory> --readFilesIn <left_reads_file> <right_reads_file> --quantMode GeneCounts

To get this to actually work, we need to tell STAR where to find the genome directory and read files (as we’ve put them in a different directory) - so, like this:

STAR --genomeDir ../genome/ --readFilesIn ../fastq/Clytia-aboral-1_1.fq.gz ../fastq/Clytia-aboral-1_2.fq.gz

Now, these read files are compressed, so we need to tell STAR how to uncompress them…

STAR --genomeDir ../genome/ --readFilesIn ../fastq/Clytia-aboral-1_1.fq.gz ../fastq/Clytia-aboral-1_2.fq.gz --readFilesCommand gunzip -c

and we want it to count genes…

STAR --genomeDir ../genome/ --readFilesIn ../fastq/Clytia-aboral-1_1.fq.gz ../fastq/Clytia-aboral-1_2.fq.gz --readFilesCommand gunzip -c --quantMode GeneCounts

Read mapping is easily paralllelized, so we’ll specify extra CPUs to get it going faster, and provide an output file prefix so that we can easily identify our output in a useful way:

STAR --genomeDir ../genome/ --readFilesIn ../fastq/Clytia-aboral-1_1.fq.gz ../fastq/Clytia-aboral-1_2.fq.gz --readFilesCommand gunzip -c --quantMode GeneCounts --runThreadN 32 --outFileNamePrefix Ch_AB_1_

This is a bit incidental at this point, but he main output of STAR is a very big .sam file. We specify that it be compressed to a .bam file:

STAR --genomeDir ../genome/ --readFilesIn ../fastq/Clytia-aboral-1_1.fq.gz ../fastq/Clytia-aboral-1_2.fq.gz --readFilesCommand gunzip -c --quantMode GeneCounts --runThreadN 32 --outFileNamePrefix Ch_AB_1_ --outSAMtype BAM Unsorted

STAR output files

Check the final log file (*_Log.final.out) - in our case Ch_AB_1_Log.final.out. The key thing to check is the % of uniquely mapped reads. You would like this to be above 60% ideally. The exact figure might depend on the experimental system, how much contamination there is etc. Very low numbers (e.g. close to 0%) suggest that something has gone wrong e.g. you have accidentally mixed up read files / not given correct paired reads.

The gene counts are in the *_ReadsPerGene.out.tab file, Ch_AB_1_ReadsPerGene.out.tab. This contains 4 columns: 1) the gene identifier 2) the total number of reads for the gene 3) the number of left reads 4) the number of right reads. (The left or right reads may be close to 0 if the sequenced library was made using a stranded protocol).