The steps I take to analyse RNA-seq data in the lab are :
  1. Checking mapping qualities, e.g. running fastQC on fastq files.
  2. Check the fastQ files to see if any reads cotain the primer (adapter sequence). See Note 3 ↓. This could also be seen in the fastQC (quality checking) results (see fig 1 & 2).
  3. Remove Illumina primers using a read trimming software, e.g. cutadapt. See Note 4 ↓. This is crucial when the sequenced fragments (High expressed genes) are likely to be shorter than the read length. Moreover, if the Illumina's adapter sequence is frequently observed on the 3' end of the reads (such as shown in Fig 1 and Fig 2) this means that the two paired (opposite directed) reads frequently overlap one another which you should take that into consideration when mapping the reads (set parameters to allow overlap of the pair reads).


    Fig 1: Read 1 adapter content
    Fig 2: Read 2 adapter content
  4. Map the reads (or the trimmed reads if they were trimmed) using mapping software e.g. tophat2. You can also map to mRNA using other mapping software e.g. sailfish, bwa.
  5. Check the statistics for the mapping by running samtools flagstat (if it is not sorted by the mapping software run samtools sort first, Tophat does sort by default !).
  6. Flag and remove PCR duplicates as for instance by using samtools rmdup after you've already sorted the bam file.
  7. For Intron retention I use IntEREst (Intron Exon Retention Estimator, an in-house developed R package soon to be published) and for transcript abundance, alternative splicing and expression analysis I use SUPPA. Other software such as cufflinks, MISO and etc. could also be used.
  8. Various tools, e.g. edgeR, DESeq, and DESeq2, and etc. could be used for differential gene-expression analysis. 
  9. For plotting and visualizing mapped bam files, you could use IGV. Your bam file needs to be sorted (If it is not run samtools sort ) and indexed (samtools index ).
NOTES:
  • Note 1. After checking fastQC results if trimming was decided to be useful it could be performed with fastq trimming software, e.g. cutadapt or fastq_quality_trimmer in  FASTX toolkit.
  • Note 2. fastq-dump from sra toolkit could be used to convert .sra files to .fastq .
  • Note 3. In Linux, to return the number of those in the first 4million reads that feature Illumina primers you could use the following scripts:
#On forward strand reads
gunzip -c <READ1_FILE.fastq.gz> | head -4000000 | grep AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC | wc -l
#89157
#On Opposite strand reads
gunzip -c <READ2_FILE.fastq.gz> | head -4000000 | grep AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA | wc -l
#8172
  • Note 4. Cutadapt could filter the reads with primers in a proper manner: 
#On forward and opposite strand reads
cutadapt --quality-cutoff 20 --minimum-length 50 -b AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -B AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -o <OUTPUT_READ1_FILE.fastq.gz> -p <OUTPUT_READ2_FILE.fastq.gz> <READ1_FILE.fastq.gz> <READ2_FILE.fastq.gz>
  • Note 5. After mapping if you want to extract all the unmapped/mapped reads (Tophat extracts them by default) or extract uniquely mapped reads you could use the following codes. You can find details of samtools FLAGs here.
# Extract unmapped reads
samtools view -b -f 4 file.bam > unmapped.bam

# Extract mapped reads

samtools view -b -F 4 file.bam > mapped.bam

# Extractreliable reads (reads with at least mapped quality of 1)

samtools view -bq 1 file.bam > qualityMapped.bam
Tophat mapping of Illumina RNA-seq:

Sequencing reads can be relatively long (~170 bps) using newer RNA-seq technologes hence errors could occur more often than the default maximum 2 miss-matches allowed in Tophat/Tophat2 mapping! therefore when mapping, a higher miss-match criteria could be allowed, e.g. 10 miss-matches. However, note that due to relaxing the mismatch criteria of the mapping, a great number of reads would map to intron-exon junctions rather than spanning the exon junctions. In order to solve this problem it is also recommended to use the parameter setting --read-realign-edit-dist 0 to realign those reads to the exon junctions. For these type of reads it is also a good strategy to allow the pair reads to cover one another and feature huge overlaps. This could be achieved by increasing the --mate-std-dev. As for instance in one of our Illumina RNA-Seq runs with 170/-140 bps paired-reads we set --mate-inner-dist 150 and --mate-std-dev to be 200.Note that in the follosing script by REPLICATION I actually mean technical replication meaning that all sequencing fastq.gz files are from the same sample.

tophat2 -o <PATH_TO_OUTPUT_DIRECTORY> --mate-inner-dist 150 --mate-std-dev 200 --read-realign-edit-dist 0 --read-mismatches 10 --read-gap-length 10 --read-edit-dist 10 -p 20 <PATH_TO_INDEX> <SAMPLE_REPLICATION1_READ1.fastq.gz>,<SAMPLE_REPLICATION2_READ1.fastq.gz>,<SAMPLE_REPLICATION3_READ1.fastq.gz>,... <SAMPLE_REPLICATION1_READ2.fastq.gz>,<SAMPLE_REPLICATION2_READ2.fastq.gz>,<SAMPLE_REPLICATION3_READ2.fastq.gz>,...
0

Add a comment

In this post I show how groupScatterPlot(), function of the rnatoolbox R package can be used for plotting the individual values in several groups together with their mean (or other statistics). I think this is a useful function for plotting grouped data when some groups (or all groups) have few data points ! You may be wondering why to include such function in the rnatoolbox package ?! Well ! I happen to use it quit a bit for plotting expression values of different groups of genes/transcripts in a sample or expression levels of a specific gene/transcript in several sample groups. These expression value are either FPKM, TPM, LCPM, or PSI values (Maybe I should go through these different normalizations later in a different post 😐!). But of course its application is not restricted to gene expression or RNAseq data analysis.

In this post I show how classifySex(), function of the rnatoolbox R package can be used for inferring the sex of  the studied subjects from their binary alignment bam files. The sex can be a source of unwanted variation within the data, for which you may want to adjust your differential gene expression or splicing analysis. However, complete  metadata are unfortunately not always available. Furthermore, sometimes details within metadata are incorrect or have been misplaced due to manual error. Therefore, it is a good practice to quickly double check some details within the data to either complete the missing metadata information or to make sure that the prior stages have been performed without any accidental mix-ups. For muscle tissues, this showed to be useful on our ribo-depleted RNAseq data. NOTE! Earlier the function referred to in this post was named differently(i.e. getGender). Since version 0.2.1 classifySex() is used.
2

Recently I have started to organize my commonly used functions related to quality assessment and analyzing RNAseq data into an R package. It is called rnatoolbox and it is available here. In this post I introduce getMappedReadsCount(), i.e. a function that can be used for checking the number of aligned/mapped fragments in several bam files and detecting the outliers. The outliers are the bam files with oddly high (i.e. exceeding1.5 times the interquartile) and oddly low (i.e. lower than 1.5 times the interquartile) number of mapped fragments.

The adhan package is available here !

The prayer times cannot always be estimated accurately in some places such as countries located in higher latitudes (e.g. the Nordic countries) . as for instance during midsummer time the Fajr may be impossible to estimate or in other words it may simply not exist ! Some Muslim residents of those countries follow Prayer times of other places such as Mecca and Medina.

Unsupervised machine learning methods such as hierarchical clustering allow us to discover the trends and patterns of similarity within the data. Here, I demonstrate by using a test data, how to apply the Hierarchical clustering on columns of a test data matrix. Note that as my main focus is Bioinformatics application, I assume that the columns of the matrix represent individual samples and the rows represent the genes or transcripts or some other biological feature.

Note ! the & sign is to run the command in background.

Getting MD5 sum for all files and writing it to a txt file in Linux.

md5sum * > myChecklist.txt &

Getting MD5 sum for all files and subfolders and writing it to a txt file in Linux.

find ./ -type f -exec md5sum {} + > myChecklist.txt &

Getting MD5 sum for all files and writing it to a txt file in Mac.

md5 -r * > myChecklist.txt &

Getting MD5 sum for all files and subfolders and writing it to a txt file in Mac.

Many times, in our projects, we may need to compare different measured factors in our samples to one another, and study whether they are linearly dependent. These information can also help us to detect covariates and factors that affect our studies but we would like to adjust for/remove their effects (more on this at sometime later). Here, I mention several functions that can be used to perform correlation tests. All of these functions do support both Pearson and ranked (Spearman) methods. Note that in the end of this post I will focus on these two different methods (i.e. Pearson vs Spearman) and show their differences in application.
2

Occasionally when indexing data frames the format is converted, leading to confusing consequences. As for instance, when indexing to select a single column the result is a 'numeric' or 'integer' vector. The following  demonstrates this :

When analyzing a data constructed of individuals (or samples from individuals) of both male and female of a species (e.g. humans), often it is a good idea to compare the distribution of the various studied parameters for the males to those for the females. As for instance in RNAseq analysis, it is the measured expression of many genes may differ significantly between the studied males and females. In other words the gender may exhibit 'batch effect' in the gene expression data.

Here is an example of plotting 4 venn diagrams in a single screen with a 2*2 layout.

library(VennDiagram)

#defining vectors

av<- 1:10

bv<- 12:20

cv<-  7:15

# Building venndiagram grid objects (i.e.
1
Labels
Blog Archive
About Me
About Me
My Photo
I am a Postdoc researcher at the Neuromuscular Disorders Research lab and Genetic Determinants of Osteoporosis Research lab, in University of Helsinki and Folkhälsan RC. I specialize in Bioinformatics. I am interested in Machine learning and multi-omics data analysis. My go-to programming language is R.
My Blog List
My Blog List
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.