In this post I show how classifySex(), a function of the rnatoolbox R package, can be used to infer the sex of the studied subjects from their binary alignment (BAM) files. Sex can be a source of unwanted variation in the data, for which you may want to adjust your differential gene expression or splicing analysis. However, complete metadata are unfortunately not always available. Furthermore, details in the metadata are sometimes incorrect or have been misplaced due to manual error. It is therefore good practice to quickly double-check some details in the data, either to fill in missing metadata or to make sure that the earlier stages were performed without accidental mix-ups. This proved useful on our ribo-depleted RNAseq data from muscle tissues.
NOTE! The function referred to in this post was previously named differently (i.e. getGender). Since version 0.2.1, classifySex() is used.
Recently I have started to organize my commonly used functions for quality assessment and analysis of RNAseq data into an R package. It is called rnatoolbox and it is available here. In this post I introduce getMappedReadsCount(), a function that can be used to check the number of aligned/mapped fragments in several bam files and to detect outliers. The outliers are the bam files with an oddly high (i.e. exceeding 1.5 times the interquartile range) or oddly low (i.e. lower than 1.5 times the interquartile range) number of mapped fragments.
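To make the outlier rule concrete, here is a minimal base R sketch. It does not call getMappedReadsCount(); the counts are made-up numbers, and it assumes the rule corresponds to the usual Tukey fences (below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR), which may differ from what the package does internally.

```r
# Hypothetical mapped-fragment counts for six bam files (made-up numbers).
mappedCounts <- c(s1 = 21e6, s2 = 19e6, s3 = 20e6,
                  s4 = 22e6, s5 = 3e6, s6 = 60e6)

# First and third quartiles, and the interquartile range.
q <- quantile(mappedCounts, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]

# Assumed rule (Tukey fences): flag counts outside
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
lowerFence <- q[1] - 1.5 * iqr
upperFence <- q[2] + 1.5 * iqr

isOutlier <- mappedCounts < lowerFence | mappedCounts > upperFence
outliers <- names(mappedCounts)[isOutlier]
print(outliers)
```

With these numbers the two extreme files (s5 and s6) are flagged, while the four files around 20 million fragments are not.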
The adhan package is available here!
Prayer times cannot always be estimated accurately in some places, such as countries at higher latitudes (e.g. the Nordic countries).
Unsupervised machine learning methods such as hierarchical clustering allow us to discover trends and patterns of similarity within the data. Here, I use test data to demonstrate how to apply hierarchical clustering to the columns of a data matrix.
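As a rough illustration of the idea, the sketch below clusters the columns of a simulated matrix with base R's dist() and hclust(). The matrix, the sample names, the Euclidean distance and the complete linkage are all assumptions chosen for the example, not necessarily the settings used in the post.

```r
set.seed(1)

# Hypothetical test matrix: 20 rows (features) x 6 columns (samples).
# The first three columns are drawn around 0, the last three around 5.
testMat <- cbind(matrix(rnorm(60, mean = 0), nrow = 20),
                 matrix(rnorm(60, mean = 5), nrow = 20))
colnames(testMat) <- paste0("sample", 1:6)

# dist() measures distances between rows, so the matrix is transposed
# in order to cluster its columns.
colDist <- dist(t(testMat), method = "euclidean")
hc <- hclust(colDist, method = "complete")

# Cut the tree into two groups and show the group of each column.
clusters <- cutree(hc, k = 2)
print(clusters)

# plot(hc)  # uncomment to draw the dendrogram
```

The key step is the transpose: without t(), the rows rather than the columns would be clustered.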
Note! The & sign runs the command in the background.
Getting the MD5 sums of all files and writing them to a txt file in Linux.
md5sum * > myChecklist.txt &
Getting the MD5 sums of all files, including those in subfolders, and writing them to a txt file in Linux.
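For readers who prefer to stay inside R, a rough equivalent can be sketched with tools::md5sum() and list.files(recursive = TRUE). This is not the Linux command from the post; the demo directory, the file names and the output layout below are assumptions made for illustration.

```r
# Build a small throwaway directory tree (hypothetical file names).
rootDir <- file.path(tempdir(), "md5demo")
dir.create(file.path(rootDir, "sub"), recursive = TRUE, showWarnings = FALSE)
writeLines("hello", file.path(rootDir, "a.txt"))
writeLines("world", file.path(rootDir, "sub", "b.txt"))

# List all files, descending into subfolders, and compute their MD5 sums.
allFiles <- list.files(rootDir, recursive = TRUE, full.names = TRUE)
sums <- tools::md5sum(allFiles)

# Write one "checksum  path" line per file to a text file.
outFile <- file.path(tempdir(), "myChecklist.txt")
writeLines(paste(unname(sums), names(sums), sep = "  "), outFile)
print(readLines(outFile))
```

Replacing rootDir with a real project folder gives a checklist for that folder and everything beneath it.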
Often in our projects we need to compare different measured factors in our samples with one another and study whether they are linearly dependent. This information can also help us detect covariates and factors that affect our studies and whose effects we would like to adjust for or remove (more on this some time later). Here, I mention several functions that can be used to perform correlation tests. All of these functions support both the Pearson and the rank-based (Spearman) methods. Note that at the end of this post I focus on these two methods (i.e. Pearson vs Spearman) and show how they differ in application.
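A minimal sketch of the Pearson vs Spearman contrast, using base R's cor.test() on simulated data. The variables are made up, and cor.test() is only one of the functions that could be used here; it is not necessarily among those covered in the post.

```r
set.seed(42)

# Hypothetical measurements: one roughly linear relationship and one
# that is monotonic but strongly nonlinear.
x <- 1:30
yLinear <- 2 * x + rnorm(30, sd = 3)
yMonotone <- exp(x / 3)

# Pearson measures linear dependence.
pearsonLinear <- cor.test(x, yLinear, method = "pearson")
pearsonMono <- cor.test(x, yMonotone, method = "pearson")

# Spearman works on ranks, so any strictly increasing relationship
# gives a coefficient of 1.
spearmanMono <- cor.test(x, yMonotone, method = "spearman")

print(c(pearson = unname(pearsonMono$estimate),
        spearman = unname(spearmanMono$estimate)))
```

For the exponential curve the Spearman coefficient is 1 because the ranks agree perfectly, while the Pearson coefficient is lower because the relationship is not linear.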
Occasionally, when indexing data frames, the format is converted, leading to confusing consequences. For instance, when indexing to select a single column, the result is a 'numeric' or 'integer' vector rather than a data frame. The following demonstrates this:
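A small sketch of this behaviour with a made-up data frame (the column names are hypothetical). It also shows two standard ways to keep the data frame format: the drop = FALSE argument and list-style indexing.

```r
# Hypothetical data frame with an integer and a numeric column.
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

# Matrix-style indexing of a single column drops to a plain vector.
oneCol <- df[, "a"]
print(class(oneCol))     # "integer"

# drop = FALSE keeps the result as a one-column data frame.
kept <- df[, "a", drop = FALSE]
print(class(kept))       # "data.frame"

# List-style indexing (no comma) also returns a data frame.
listStyle <- df["a"]
print(class(listStyle))  # "data.frame"
```

Using drop = FALSE is the safer habit inside functions, where the number of selected columns is not known in advance.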
When analyzing data from individuals (or samples from individuals) of both sexes of a species (e.g. humans), it is often a good idea to compare the distributions of the studied parameters in males with those in females.
Here is an example of plotting four Venn diagrams on a single screen in a 2×2 layout.
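One simple way to make such a comparison is sketched below on simulated data. The data frame, its column names, and the choice of a Wilcoxon rank-sum test are assumptions for illustration; the post itself may use different plots or tests.

```r
set.seed(7)

# Hypothetical data: one measured parameter and the sex of each individual.
dat <- data.frame(
  sex = rep(c("female", "male"), each = 25),
  value = c(rnorm(25, mean = 10), rnorm(25, mean = 12))
)

# Numeric summary of the parameter within each sex.
groupSummary <- tapply(dat$value, dat$sex, summary)
print(groupSummary)

# Non-parametric test of whether the two distributions are shifted.
sexTest <- wilcox.test(value ~ sex, data = dat)
print(sexTest$p.value)

# boxplot(value ~ sex, data = dat)  # uncomment for a visual comparison
```

A box plot or density plot per sex is usually the quickest first look; the test only adds a formal check of the difference.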
library(VennDiagram)
# Defining vectors
av <- 1:10
bv <- 12:20
cv <- 7:15
# Building venndiagram grid objects (i.e.
Planning to draw a density line plot with a gapped (or broken) Y-axis in R, I initially tried out the plotrix package (version 3.8.1). However, after facing a couple of problems, I ended up using standard R graphics code to draw the correct gapped line plot.
In January 2018 our lab purchased a modest but good enough espresso machine and coffee grinder. The following table shows the coffee beans that we have used so far to make espresso in our lab.
Many times in my career I have needed to merge several PDF files into a single PDF file. For instance, the most common format of a PhD thesis in our department is that it begins with a comprehensive review (referred to as the 'review of the literature') and continues with several published papers.
Ali Oghabian
2019-04-25
The topics covered in this post are:
R and IntEREst version
Files in the zip
Annotating u12 type introns of HG38
ncRNAs with U12-type introns
U12 annotation comparison
Reference
Files in the zip
You can download the zip file that includes all the scripts and R objects from
An example of a fastq file for read 1 (in paired-read sequencing) is as follows:
@SRR3117565.1.1 1 length=100
NCAAAACAGCTCTCCCTCCTTTGATCTGATGGTCTGCAGAGGTCCTCAAATCCACACACTGCCACTCTTCAAGACCAACCACTGGGCCTTCTTAATCTCA
+SRR3117565.1.1 1 length=100
#1=BDDDDHFHHAHDE?GFEEDHG@HFHEECDCGHE:FDFHD*?DHFDEHHF>;;B<;A;>=A=??@CCCC>5>>AC
@SRR3117565.2.1 2 length=100
NTCCTGACTCACACGCCACAACCATGACTGGCTCAGCTCCCTTAATTCCAGCTTCCCTTACATGACGCAATTCCTTCTCAGATTCGGGTTTTCAGCTGAG
+SRR3117565.2.1 2 length=100
#4BDFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIHHFEFEEEEEEDDEDDDDDDDDBDDDDDDDDDDD
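The four-line record structure shown above (header, sequence, separator, quality string) can be parsed with a few lines of base R. The sketch below uses two tiny made-up records rather than the reads above, and it assumes Phred+33 quality encoding; the file name in the comment is hypothetical.

```r
# Two tiny made-up fastq records (4 lines each).
fastqLines <- c(
  "@read1", "ACGTACGT", "+", "IIIIHHHH",
  "@read2", "NNACGTTA", "+", "##IIIIII"
)
# For a real file one could use, e.g.:
# fastqLines <- readLines("reads_1.fastq")  # hypothetical file name

# Every record starts at lines 1, 5, 9, ...
idx <- seq(1, length(fastqLines), by = 4)
reads <- data.frame(
  id = sub("^@", "", fastqLines[idx]),
  sequence = fastqLines[idx + 1],
  quality = fastqLines[idx + 3],
  stringsAsFactors = FALSE
)

# Assuming Phred+33 encoding: quality score = ASCII code - 33.
reads$meanQual <- sapply(reads$quality,
                         function(q) mean(utf8ToInt(q) - 33),
                         USE.NAMES = FALSE)
print(reads)
```

For large files a dedicated package is a better choice than readLines(), but the record layout is the same.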
The steps I take to analyse RNA-seq data in the lab are:
Checking mapping qualities, e.g. running fastQC on fastq files. Check the fastq files to see if any reads contain the primer (adapter sequence). See Note 3 ↓. This can also be seen in the fastQC (quality checking) results (see fig 1 & 2).
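A quick way to check for adapter contamination by hand is to search the read sequences for the adapter string. The sketch below is only an illustration: the reads are made up, and the adapter shown (the start of the common Illumina adapter) is an assumption that should be replaced with the sequence used in your own library preparation.

```r
# Assumed adapter sequence; replace with the one from your library prep.
adapter <- "AGATCGGAAGAGC"

# Made-up read sequences (in practice, every 2nd line of each fastq record).
sequences <- c(
  "ACGTACGTACGTAGATCGGAAGAGCACACG",
  "TTGACCATGGCATTAGCCGATAGGCTAACC",
  "GGGAGATCGGAAGAGCTTTTTTTTTTTTTT"
)

# fixed = TRUE searches for the literal string instead of a regular expression.
hasAdapter <- grepl(adapter, sequences, fixed = TRUE)
fractionWithAdapter <- mean(hasAdapter)

print(hasAdapter)
print(fractionWithAdapter)
```

An exact match like this misses adapters with sequencing errors or partial adapters at the read end, so it complements rather than replaces the fastQC report.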
Painstaking text modifications, especially when dealing with large data, can sometimes be handled easily with a few lines of script in a programming language.
Yesterday (on the opening day of the new Batman movie) I searched the Internet for the Batman formula and its implementations in R. I found several links that used the ggplot2 library; however, only one was working with the latest version of the package. With minor changes, this is the result that I got for the Batman curve.
I just needed to come up with a name for my blog, so, as I mostly do when working in the lab, I got some help from my favourite programming language, R. See if you can find a better name by playing around with the seedNumber and nameSize parameters.