Ali Oghabian

2019-04-25

The topics covered in this post are:
  • R and IntEREst version
  • Files in the zip
  • Annotating u12 type introns of HG38
  • ncRNAs with U12-type introns
  • U12 annotation comparison
  • Reference

Files in the zip

You can download the zip file the includes all the scripts and R objects from here. The files in the zip/folder include:
  • annoFil.rda R object file. A data frame that includes various annotations and information for the latest ENSEMBL genes (GRCh38.p12).
  • AllU12Introns.tsv Tab separated text file. Includes the detecetd U12-type introns and their annotations.
  • ens38UncolMod.rda ENSEMBL reference that includes gene and transcript IDs together with coordinates of the introns and exons.
  • ensHg38UncolFilpaste60AnnoMat_40_don07_bp14.rda R object file. Data frame with U12/U2 type annotation of the reference (hg38) introns, the U12 GT-AG or AT-AC subtype intfo and the PWM match scores.
  • allIntronsU12annotationGeneType.tsv Tab separated text file. Includes all annotated U12 type introns.
  • nonProtCodingU12Introns.tsv Tab separated text file. Includes the detecetd U12 type introns that were annotated as NOT protein-coding.
  • report.Rmd an RMarkdown file that was used to generate this report.
  • report.html the HTML file that was generated from report.Rmd.
  • U12LncRNA.tsv Tab separated text file. Includes the detecetd U12-type introns that were annotated to be lncRNAs.

R and IntEREst version

Please note the R version and the IntEREst version used for these analysis: 
R.Version()
## $platform
## [1] "x86_64-pc-linux-gnu"
## 
## $arch
## [1] "x86_64"
## 
## $os
## [1] "linux-gnu"
## 
## $system
## [1] "x86_64, linux-gnu"
## 
## $status
## [1] "Under development (unstable)"
## 
## $major
## [1] "3"
## 
## $minor
## [1] "6.0"
## 
## $year
## [1] "2018"
## 
## $month
## [1] "05"
## 
## $day
## [1] "02"
## 
## $`svn rev`
## [1] "74682"
## 
## $language
## [1] "R"
## 
## $version.string
## [1] "R Under development (unstable) (2018-05-02 r74682)"
## 
## $nickname
## [1] "Unsuffered Consequences"
packageVersion("IntEREst")
## [1] '1.7.5'

Annotating u12 type introns of HG38

Initially, a reference data frame was produced from ENSEMBL HG38 (GRCh38.p12). Note that in order to be able to run the R scripts in the document you should set the R working directory to the folder that contains the extrcated files form the zip file.

# Time demanding
ens38Uncol<- referencePrepare (sourceBuild="biomaRt", 
    biomart="ENSEMBL_MART_ENSEMBL",
    biomartDataset="hsapiens_gene_ensembl",
    biomartHost="www.ensembl.org", 
    ignore.strand=FALSE, 
    annotateGeneIds=TRUE, 
    collapseExons=FALSE)

Next the U12-type introns were annotated using donor socre threoshold 0.07 and branch point score threshold 0.14. If an intron was not annotated as a U12-type or a U2-type we considered it as U2-type (i.e. the default parameter setting setNaAs= "U2" was used). 
# Time demanding
library(BSgenome.Hsapiens.UCSC.hg38)
# Building PWM
pwmPaste<-buildSsTypePwms(u12DonorBegin=13, u2DonorBegin=11, 
    u2AcceptorBegin=38, u12DonorEnd=17, u2DonorEnd=17, 
    u2AcceptorEnd=40, pasteSites=TRUE)
save(pwmPaste, file="./pwmPaste.rda")

# Ignore 3' end nucleotide
# Annotations will be based on Donor and BP scores
accPasteScore=as.matrix(pwmPaste[[3]])
for(i in 1:nrow(accPasteScore))accPasteScore[i,1]=0

# Modify the chr names of the reference data.frame

if(length(grep("^chr",ens38Uncol$chr))==0)
    ens38Uncol$chr<- paste("chr", ens38Uncol$chr, sep="")

accPasteScore=as.matrix(pwmPaste[[3]])
for(i in 1:nrow(accPasteScore))accPasteScore[i,1]=0

ens38UncolMod<- ens38Uncol[which(ens38Uncol$chr %in% 
    paste("chr", c(1:22,"X","Y", "MT"), sep="")),]
ens38UncolMod[which(ens38UncolMod$chr == "chrMT"),]<- "chrM"
save(ens38UncolMod, file="./ens38UncolMod.rda")

# Annotate U12-type introns using donor and BP of U12- and Donor and acceptor 
# sites of U2-type introns.

ensHg38UncolFilpaste60AnnoMat_40_don07_bp14<- annotateU12(
    pwmU12U2= list(
        pwmPaste[[1]],
        pwmPaste[[2]],
        accPasteScore,
        pwmPaste[[4]],
        pwmPaste[[5]]),
    pwmSsIndex= list(
        indexDonU12=1, 
        indexBpU12=1, 
        indexAccU12=1, 
        indexDonU2=1, 
        indexAccU2=1), 
    referenceChr= ens38UncolMod[,"chr"], 
    referenceBegin= as.numeric(ens38UncolMod[,"begin"]), 
    referenceEnd= as.numeric(ens38UncolMod[,"end"]), 
    referenceIntronExon= ens38UncolMod[,"int_ex"],
    intronExon= "intron",
    matchWindowRelativeUpstreamPos= c(2,-40,NA,NA,NA),
    matchWindowRelativeDownstreamPos= c(6,-3,NA,NA,NA), 
    minMatchScore= c( .07, .14, 0, .07,  0), 
    refGenome= BSgenome.Hsapiens.UCSC.hg38, 
    setNaAs= "U2", 
    annotateU12Subtype= TRUE,
    includeMatchScores= TRUE)

save(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14, 
    file="./ensHg38UncolFilpaste60AnnoMat_40_don07_bp14.rda")

The number of the detected U12-type introns are as follows: 
load(file="./ensHg38UncolFilpaste60AnnoMat_40_don07_bp14.rda")

# Number of U12-type intron
table(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$int_type)
## 
##     U12      U2 
##    3528 1051845
# Number of unique U12-type intron
indU12<- which(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$int_type=="U12")
length(
    unique(paste(ens38UncolMod[indU12,"chr"], 
        ens38UncolMod[indU12,"begin"], ens38UncolMod[indU12,"end"], sep="_")))
## [1] 826
# Number of U12-type intron subtypes
table(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$u12_subtype)
## 
## AT-AC GT-AG 
##   903  2625
# Number of unique U12-type intron subtypes
uniCoord<- unique(paste(ens38UncolMod[indU12,"chr"], 
        ens38UncolMod[indU12,"begin"], ens38UncolMod[indU12,"end"], sep="_"))
coord<- paste(ens38UncolMod[indU12,"chr"], 
        ens38UncolMod[indU12,"begin"], ens38UncolMod[indU12,"end"], sep="_")
indUni<-match(uniCoord, coord)
table(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$u12_subtype[indU12[indUni]])
## 
## AT-AC GT-AG 
##   195   631

ncRNAs with U12-type introns

To detect the ncRNAs, initially various annotations of ENSEMBL genes, e.g. extrenal gene name and biotype (description), were downloaded using Using biomaRt.
ensembl<- biomaRt::useMart("ENSEMBL_MART_ENSEMBL", 
    host="www.ensembl.org", dataset="hsapiens_gene_ensembl")

annoFil<- biomaRt::getBM( attributes=c("ensembl_gene_id", 
        "ensembl_transcript_id", "external_gene_name", "gene_biotype", 
        "transcript_biotype"), 
    filters = "ensembl_transcript_id", 
    values = unique(ens38UncolMod$transcript_id), 
    mart=ensembl, uniqueRows = TRUE)

save(annoFil, file="./annoFil.rda")


indPick<- which((transcript_biotype=="lincRNA" | 
    transcript_biotype=="bidirectional_promoter_lncRNA") & 
    ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$int_type=="U12")
U12Linc<-ens38UncolMod[indPick,]

donPos=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=U12Linc[U12Linc[,4]=="+",1],
                    start=
                        as.numeric(U12Linc[U12Linc[,4]=="+",2]),
                    end=
                        as.numeric(U12Linc[U12Linc[,4]=="+",2])+9,
                    as.character=TRUE)
accNeg=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=U12Linc[U12Linc[,4]=="-",1],
                    start=
                        as.numeric(U12Linc[U12Linc[,4]=="-",2]),
                    end=
                        as.numeric(U12Linc[U12Linc[,4]=="-",2])+39,
                    as.character=TRUE)

accPos=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=U12Linc[U12Linc[,4]=="+",1],
                    start=as.numeric(U12Linc[U12Linc[,4]=="+",3])-39,
                    end=as.numeric(U12Linc[U12Linc[,4]=="+",3])
                        ,
                    as.character=TRUE)
donNeg=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=U12Linc[U12Linc[,4]=="-",1],
                    start=as.numeric(U12Linc[U12Linc[,4]=="-",3])-9,
                    end=as.numeric(U12Linc[U12Linc[,4]=="-",3]),
                    as.character=TRUE)

donNeg<- as.vector(
    as.character(Biostrings::reverseComplement(DNAStringSet(donNeg))))
accNeg<- as.vector(
    as.character(Biostrings::reverseComplement(DNAStringSet(accNeg))))
accPos<- as.vector(accPos)
donPos<- as.vector(donPos)

donorSeq<- rep("",nrow(U12Linc))
acceptorSeq<- rep("",nrow(U12Linc))
donorSeq[U12Linc[,4]=="+"]<-donPos
donorSeq[U12Linc[,4]=="-"]<-donNeg
acceptorSeq[U12Linc[,4]=="+"]<-accPos
acceptorSeq[U12Linc[,4]=="-"]<-accNeg
U12LincMat<- cbind(U12Linc, donorSeq, acceptorSeq)

# Write lincRNA U12-type introns 
write.table(U12LincMat, file="./U12LncRNA.tsv", 
    col.names=T, row.names=F, sep='\t', quote=F)
# write data frame with info for all introns
write.table(
    cbind(ens38UncolMod, external_gene_name, 
        ensHg38UncolFilpaste60AnnoMat_40_don07_bp14, 
        transcript_biotype),
    file="./allIntronsU12annotationGeneType.tsv",
    col.names=T, row.names=F, sep='\t', quote=F)

# Build datat frame with U12 type introns info
u12Out<- 
    cbind(ens38UncolMod, ensHg38UncolFilpaste60AnnoMat_40_don07_bp14,
            external_gene_name, 
            transcript_biotype)[which(
                ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$int_type=="U12"),]

donPos=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=u12Out[u12Out[,4]=="+",1],
                    start=
                        as.numeric(u12Out[u12Out[,4]=="+",2]),
                    end=
                        as.numeric(u12Out[u12Out[,4]=="+",2])+9,
                    as.character=TRUE)
accNeg=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=u12Out[u12Out[,4]=="-",1],
                    start=
                        as.numeric(u12Out[u12Out[,4]=="-",2]),
                    end=
                        as.numeric(u12Out[u12Out[,4]=="-",2])+39,
                    as.character=TRUE)

accPos=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=u12Out[u12Out[,4]=="+",1],
                    start=as.numeric(u12Out[u12Out[,4]=="+",3])-39,
                    end=as.numeric(u12Out[u12Out[,4]=="+",3])
                        ,
                    as.character=TRUE)
donNeg=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=u12Out[u12Out[,4]=="-",1],
                    start=as.numeric(u12Out[u12Out[,4]=="-",3])-9,
                    end=as.numeric(u12Out[u12Out[,4]=="-",3]),
                    as.character=TRUE)

donNeg<- 
as.vector(as.character(Biostrings::reverseComplement(DNAStringSet(donNeg))))
accNeg<- 
as.vector(as.character(Biostrings::reverseComplement(DNAStringSet(accNeg))))
accPos<- as.vector(accPos)
donPos<- as.vector(donPos)

donorSeq<- rep("",nrow(u12Out))
acceptorSeq<- rep("",nrow(u12Out))
donorSeq[u12Out[,4]=="+"]<-donPos
donorSeq[u12Out[,4]=="-"]<-donNeg
acceptorSeq[u12Out[,4]=="+"]<-accPos
acceptorSeq[u12Out[,4]=="-"]<-accNeg
u12OutSeq<- cbind(u12Out, donorSeq, acceptorSeq)

write.table(
    u12OutSeq,
    file="./AllU12Introns.tsv",
    col.names=T, row.names=F, sep='\t', quote=F)
write.table(
    u12OutSeq[which(u12OutSeq$transcript_biotype!="protein_coding"),],
    file="./nonProtCodingU12Introns.tsv",
    col.names=T, row.names=F, sep='\t', quote=F)

U12 annotation comparison

Changes in the parameter settings of the annotateU12() function may lead to detection of different sets of U12-type introns. Here, we compare the U12-type introns detected from the same annotateU12() run as mentioned above (on a reference built from HG19) to those reported by Merico, D. et al.

Detected U12 type introns for HG19 compared to those detected by merico, D. et al.

Finally

The following scripts were used in R to generate the suppl1.html file from the suppl1.Rmd file. 
library("rmarkdown")
render("./report.Rmd")

Reference



  • Merico,D. et al. (2015) Compound heterozygous mutations in the noncoding RNU4ATAC cause Roifman Syndrome by disrupting minor intron splicing. Nat Commun, 6, 8718.
0

Add a comment

In this post I show how groupScatterPlot(), function of the rnatoolbox R package can be used for plotting the individual values in several groups together with their mean (or other statistics). I think this is a useful function for plotting grouped data when some groups (or all groups) have few data points ! You may be wondering why to include such function in the rnatoolbox package ?! Well ! I happen to use it quit a bit for plotting expression values of different groups of genes/transcripts in a sample or expression levels of a specific gene/transcript in several sample groups. These expression value are either FPKM, TPM, LCPM, or PSI values (Maybe I should go through these different normalizations later in a different post 😐!). But of course its application is not restricted to gene expression or RNAseq data analysis.

In this post I show how classifySex(), function of the rnatoolbox R package can be used for inferring the sex of  the studied subjects from their binary alignment bam files. The sex can be a source of unwanted variation within the data, for which you may want to adjust your differential gene expression or splicing analysis. However, complete  metadata are unfortunately not always available. Furthermore, sometimes details within metadata are incorrect or have been misplaced due to manual error. Therefore, it is a good practice to quickly double check some details within the data to either complete the missing metadata information or to make sure that the prior stages have been performed without any accidental mix-ups. For muscle tissues, this showed to be useful on our ribo-depleted RNAseq data. NOTE! Earlier the function referred to in this post was named differently(i.e. getGender). Since version 0.2.1 classifySex() is used.
2

Recently I have started to organize my commonly used functions related to quality assessment and analyzing RNAseq data into an R package. It is called rnatoolbox and it is available here. In this post I introduce getMappedReadsCount(), i.e. a function that can be used for checking the number of aligned/mapped fragments in several bam files and detecting the outliers. The outliers are the bam files with oddly high (i.e. exceeding1.5 times the interquartile) and oddly low (i.e. lower than 1.5 times the interquartile) number of mapped fragments.

The adhan package is available here !

The prayer times cannot always be estimated accurately in some places such as countries located in higher latitudes (e.g. the Nordic countries) . as for instance during midsummer time the Fajr may be impossible to estimate or in other words it may simply not exist ! Some Muslim residents of those countries follow Prayer times of other places such as Mecca and Medina.

Unsupervised machine learning methods such as hierarchical clustering allow us to discover the trends and patterns of similarity within the data. Here, I demonstrate by using a test data, how to apply the Hierarchical clustering on columns of a test data matrix. Note that as my main focus is Bioinformatics application, I assume that the columns of the matrix represent individual samples and the rows represent the genes or transcripts or some other biological feature.

Note ! the & sign is to run the command in background.

Getting MD5 sum for all files and writing it to a txt file in Linux.

md5sum * > myChecklist.txt &

Getting MD5 sum for all files and subfolders and writing it to a txt file in Linux.

find ./ -type f -exec md5sum {} + > myChecklist.txt &

Getting MD5 sum for all files and writing it to a txt file in Mac.

md5 -r * > myChecklist.txt &

Getting MD5 sum for all files and subfolders and writing it to a txt file in Mac.

Many times, in our projects, we may need to compare different measured factors in our samples to one another, and study whether they are linearly dependent. These information can also help us to detect covariates and factors that affect our studies but we would like to adjust for/remove their effects (more on this at sometime later). Here, I mention several functions that can be used to perform correlation tests. All of these functions do support both Pearson and ranked (Spearman) methods. Note that in the end of this post I will focus on these two different methods (i.e. Pearson vs Spearman) and show their differences in application.
2

Occasionally when indexing data frames the format is converted, leading to confusing consequences. As for instance, when indexing to select a single column the result is a 'numeric' or 'integer' vector. The following  demonstrates this :

When analyzing a data constructed of individuals (or samples from individuals) of both male and female of a species (e.g. humans), often it is a good idea to compare the distribution of the various studied parameters for the males to those for the females. As for instance in RNAseq analysis, it is the measured expression of many genes may differ significantly between the studied males and females. In other words the gender may exhibit 'batch effect' in the gene expression data.

Here is an example of plotting 4 venn diagrams in a single screen with a 2*2 layout.

library(VennDiagram)

#defining vectors

av<- 1:10

bv<- 12:20

cv<-  7:15

# Building venndiagram grid objects (i.e.
1
Labels
Blog Archive
About Me
About Me
My Photo
I am a Postdoc researcher at the Neuromuscular Disorders Research lab and Genetic Determinants of Osteoporosis Research lab, in University of Helsinki and Folkhälsan RC. I specialize in Bioinformatics. I am interested in Machine learning and multi-omics data analysis. My go-to programming language is R.
My Blog List
My Blog List
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.