- Checking mapping qualities, e.g. running fastQC on fastq files.
- Check the fastQ files to see if any reads cotain the primer (adapter sequence). See Note 3 ↓. This could also be seen in the fastQC (quality checking) results (see fig 1 & 2).
- Remove Illumina primers using a read trimming software, e.g. cutadapt. See Note 4 ↓. This is crucial when the sequenced fragments (High expressed genes) are likely to be shorter than the read length. Moreover, if the Illumina's adapter sequence is frequently
observed on the 3' end of the reads (such as shown in Fig 1 and Fig 2) this means that the two paired (opposite directed) reads frequently overlap
one another which you should take that into consideration when mapping the reads (set parameters to allow overlap of the pair reads).
Fig 1: Read 1 adapter content Fig 2: Read 2 adapter content - Map the reads (or the trimmed reads if they were trimmed) using mapping software e.g. tophat2. You can also map to mRNA using other mapping software e.g. sailfish, bwa.
- Check the statistics for the mapping by running samtools flagstat (if it is not sorted by the mapping software run samtools sort first, Tophat does sort by default !).
- Flag and remove PCR duplicates as for instance by using samtools rmdup after you've already sorted the bam file.
- For Intron retention I use IntEREst (Intron Exon Retention Estimator, an in-house developed R package soon to be published) and for transcript abundance, alternative splicing and expression analysis I use SUPPA. Other software such as cufflinks, MISO and etc. could also be used.
- Various tools, e.g. edgeR, DESeq, and DESeq2, and etc. could be used for differential gene-expression analysis.
- For plotting and visualizing mapped bam files, you could use IGV. Your bam file needs to be sorted (If it is not run samtools sort ) and indexed (samtools index ).
- Note 1. After checking fastQC results if trimming was decided to be useful it could be performed with fastq trimming software, e.g. cutadapt or fastq_quality_trimmer in FASTX toolkit
.
- Note 2. fastq-dump from sra toolkit could be used to convert .sra files to .fastq .
- Note 3. In Linux, to return the number of those in the first 4million reads that feature Illumina primers you could use the following scripts:
#On forward strand reads
gunzip -c <READ1_FILE.fastq.gz> | head -4000000 | grep AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC | wc -l
#89157
#On Opposite strand reads
gunzip -c <READ2_FILE.fastq.gz> | head -4000000 | grep AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA | wc -l
#8172
- Note 4. Cutadapt could filter the reads with primers in a proper manner:
#On forward and opposite strand reads
cutadapt --quality-cutoff 20 --minimum-length 50 -b AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -B AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -o <OUTPUT_READ1_FILE.fastq.gz> -p <OUTPUT_READ2_FILE.fastq.gz> <READ1_FILE.fastq.gz> <READ2_FILE.fastq.gz>
- Note 5. After mapping if you want to extract all the unmapped/mapped reads (Tophat extracts them by default) or extract uniquely mapped reads you could use the following codes. You can find details of samtools FLAGs here.
# Extract unmapped readsTophat mapping of Illumina RNA-seq:
samtools view -b -f 4 file.bam > unmapped.bam
# Extract mapped reads
samtools view -b -F 4 file.bam > mapped.bam
# Extractreliable reads (reads with at least mapped quality of 1)
samtools view -bq 1 file.bam > qualityMapped.bam
Sequencing reads can be relatively long (~170 bps) using newer RNA-seq technologes hence errors could occur more often than the default maximum 2 miss-matches allowed in Tophat/Tophat2 mapping! therefore when mapping, a higher miss-match criteria could be allowed, e.g. 10 miss-matches. However, note that due to relaxing the mismatch criteria of the mapping, a great number of reads would map to intron-exon junctions rather than spanning the exon junctions. In order to solve this problem it is also recommended to use the parameter setting --read-realign-edit-dist 0 to realign those reads to the exon junctions. For these type of reads it is also a good strategy to allow the pair reads to cover one another and feature huge overlaps. This could be achieved by increasing the --mate-std-dev. As for instance in one of our Illumina RNA-Seq runs with 170/-140 bps paired-reads we set --mate-inner-dist 150 and --mate-std-dev to be 200.Note that in the follosing script by REPLICATION I actually mean technical replication meaning that all sequencing fastq.gz files are from the same sample.
tophat2 -o <PATH_TO_OUTPUT_DIRECTORY> --mate-inner-dist 150 --mate-std-dev 200 --read-realign-edit-dist 0 --read-mismatches 10 --read-gap-length 10 --read-edit-dist 10 -p 20 <PATH_TO_INDEX> <SAMPLE_REPLICATION1_READ1.fastq.gz>,<SAMPLE_REPLICATION2_READ1.fastq.gz>,<SAMPLE_REPLICATION3_READ1.fastq.gz>,... <SAMPLE_REPLICATION1_READ2.fastq.gz>,<SAMPLE_REPLICATION2_READ2.fastq.gz>,<SAMPLE_REPLICATION3_READ2.fastq.gz>,...
Add a comment