Ali Oghabian

2019-04-25

The topics covered in this post are:
  • R and IntEREst version
  • Files in the zip
  • Annotating U12-type introns of HG38
  • ncRNAs with U12-type introns
  • U12 annotation comparison
  • Reference

Files in the zip

You can download the zip file that includes all the scripts and R objects from here. The files in the zip/folder include:
  • annoFil.rda R object file. A data frame that includes various annotations and information for the latest ENSEMBL genes (GRCh38.p12).
  • AllU12Introns.tsv Tab separated text file. Includes the detected U12-type introns and their annotations.
  • ens38UncolMod.rda ENSEMBL reference that includes gene and transcript IDs together with coordinates of the introns and exons.
  • ensHg38UncolFilpaste60AnnoMat_40_don07_bp14.rda R object file. Data frame with U12/U2 type annotation of the reference (hg38) introns, the U12 GT-AG or AT-AC subtype info and the PWM match scores.
  • allIntronsU12annotationGeneType.tsv Tab separated text file. Includes all annotated U12-type introns.
  • nonProtCodingU12Introns.tsv Tab separated text file. Includes the detected U12-type introns that were annotated as NOT protein-coding.
  • report.Rmd an RMarkdown file that was used to generate this report.
  • report.html the HTML file that was generated from report.Rmd.
  • U12LncRNA.tsv Tab separated text file. Includes the detected U12-type introns that were annotated to be lncRNAs.

R and IntEREst version

Please note the R version and the IntEREst version used for these analyses: 
R.Version()
## $platform
## [1] "x86_64-pc-linux-gnu"
## 
## $arch
## [1] "x86_64"
## 
## $os
## [1] "linux-gnu"
## 
## $system
## [1] "x86_64, linux-gnu"
## 
## $status
## [1] "Under development (unstable)"
## 
## $major
## [1] "3"
## 
## $minor
## [1] "6.0"
## 
## $year
## [1] "2018"
## 
## $month
## [1] "05"
## 
## $day
## [1] "02"
## 
## $`svn rev`
## [1] "74682"
## 
## $language
## [1] "R"
## 
## $version.string
## [1] "R Under development (unstable) (2018-05-02 r74682)"
## 
## $nickname
## [1] "Unsuffered Consequences"
packageVersion("IntEREst")
## [1] '1.7.5'

Annotating U12-type introns of HG38

Initially, a reference data frame was produced from ENSEMBL HG38 (GRCh38.p12). Note that in order to run the R scripts in this document you should set the R working directory to the folder that contains the files extracted from the zip file.

# Time demanding
ens38Uncol<- referencePrepare (sourceBuild="biomaRt", 
    biomart="ENSEMBL_MART_ENSEMBL",
    biomartDataset="hsapiens_gene_ensembl",
    biomartHost="www.ensembl.org", 
    ignore.strand=FALSE, 
    annotateGeneIds=TRUE, 
    collapseExons=FALSE)

Next, the U12-type introns were annotated using a donor score threshold of 0.07 and a branch point score threshold of 0.14. If an intron was annotated as neither U12-type nor U2-type, we considered it U2-type (i.e. the default parameter setting setNaAs= "U2" was used). 
# Time demanding
library(BSgenome.Hsapiens.UCSC.hg38)
# Building PWM
pwmPaste<-buildSsTypePwms(u12DonorBegin=13, u2DonorBegin=11, 
    u2AcceptorBegin=38, u12DonorEnd=17, u2DonorEnd=17, 
    u2AcceptorEnd=40, pasteSites=TRUE)
save(pwmPaste, file="./pwmPaste.rda")

# Ignore 3' end nucleotide
# Annotations will be based on Donor and BP scores
accPasteScore=as.matrix(pwmPaste[[3]])
for(i in 1:nrow(accPasteScore))accPasteScore[i,1]=0

# Modify the chr names of the reference data.frame

if(length(grep("^chr",ens38Uncol$chr))==0)
    ens38Uncol$chr<- paste("chr", ens38Uncol$chr, sep="")


ens38UncolMod<- ens38Uncol[which(ens38Uncol$chr %in% 
    paste("chr", c(1:22,"X","Y", "MT"), sep="")),]
ens38UncolMod$chr[which(ens38UncolMod$chr == "chrMT")]<- "chrM"
save(ens38UncolMod, file="./ens38UncolMod.rda")

# Annotate U12-type introns using donor and BP of U12- and Donor and acceptor 
# sites of U2-type introns.

ensHg38UncolFilpaste60AnnoMat_40_don07_bp14<- annotateU12(
    pwmU12U2= list(
        pwmPaste[[1]],
        pwmPaste[[2]],
        accPasteScore,
        pwmPaste[[4]],
        pwmPaste[[5]]),
    pwmSsIndex= list(
        indexDonU12=1, 
        indexBpU12=1, 
        indexAccU12=1, 
        indexDonU2=1, 
        indexAccU2=1), 
    referenceChr= ens38UncolMod[,"chr"], 
    referenceBegin= as.numeric(ens38UncolMod[,"begin"]), 
    referenceEnd= as.numeric(ens38UncolMod[,"end"]), 
    referenceIntronExon= ens38UncolMod[,"int_ex"],
    intronExon= "intron",
    matchWindowRelativeUpstreamPos= c(2,-40,NA,NA,NA),
    matchWindowRelativeDownstreamPos= c(6,-3,NA,NA,NA), 
    minMatchScore= c( .07, .14, 0, .07,  0), 
    refGenome= BSgenome.Hsapiens.UCSC.hg38, 
    setNaAs= "U2", 
    annotateU12Subtype= TRUE,
    includeMatchScores= TRUE)

save(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14, 
    file="./ensHg38UncolFilpaste60AnnoMat_40_don07_bp14.rda")

The numbers of detected U12-type introns are as follows: 
load(file="./ensHg38UncolFilpaste60AnnoMat_40_don07_bp14.rda")

# Number of U12-type intron
table(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$int_type)
## 
##     U12      U2 
##    3528 1051845
# Number of unique U12-type intron
indU12<- which(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$int_type=="U12")
length(
    unique(paste(ens38UncolMod[indU12,"chr"], 
        ens38UncolMod[indU12,"begin"], ens38UncolMod[indU12,"end"], sep="_")))
## [1] 826
# Number of U12-type intron subtypes
table(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$u12_subtype)
## 
## AT-AC GT-AG 
##   903  2625
# Number of unique U12-type intron subtypes
uniCoord<- unique(paste(ens38UncolMod[indU12,"chr"], 
        ens38UncolMod[indU12,"begin"], ens38UncolMod[indU12,"end"], sep="_"))
coord<- paste(ens38UncolMod[indU12,"chr"], 
        ens38UncolMod[indU12,"begin"], ens38UncolMod[indU12,"end"], sep="_")
indUni<-match(uniCoord, coord)
table(ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$u12_subtype[indU12[indUni]])
## 
## AT-AC GT-AG 
##   195   631
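As a quick aside (my own arithmetic, not part of the original scripts), the table above implies that U12-type introns form only a tiny fraction of all annotated intron records:

```r
# Fraction of U12-type intron records among all annotated introns,
# using the counts printed above
u12<- 3528; u2<- 1051845
round(100 * u12 / (u12 + u2), 2)
## [1] 0.33
```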

ncRNAs with U12-type introns

To detect the ncRNAs, various annotations of the ENSEMBL genes, e.g. external gene name and biotype (description), were first downloaded using biomaRt.
ensembl<- biomaRt::useMart("ENSEMBL_MART_ENSEMBL", 
    host="www.ensembl.org", dataset="hsapiens_gene_ensembl")

annoFil<- biomaRt::getBM( attributes=c("ensembl_gene_id", 
        "ensembl_transcript_id", "external_gene_name", "gene_biotype", 
        "transcript_biotype"), 
    filters = "ensembl_transcript_id", 
    values = unique(ens38UncolMod$transcript_id), 
    mart=ensembl, uniqueRows = TRUE)

save(annoFil, file="./annoFil.rda")


# Map the transcript-level annotations to the rows of the reference data frame
external_gene_name<- annoFil$external_gene_name[
    match(ens38UncolMod$transcript_id, annoFil$ensembl_transcript_id)]
transcript_biotype<- annoFil$transcript_biotype[
    match(ens38UncolMod$transcript_id, annoFil$ensembl_transcript_id)]

indPick<- which((transcript_biotype=="lincRNA" | 
    transcript_biotype=="bidirectional_promoter_lncRNA") & 
    ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$int_type=="U12")
U12Linc<-ens38UncolMod[indPick,]

donPos=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=U12Linc[U12Linc[,4]=="+",1],
                    start=
                        as.numeric(U12Linc[U12Linc[,4]=="+",2]),
                    end=
                        as.numeric(U12Linc[U12Linc[,4]=="+",2])+9,
                    as.character=TRUE)
accNeg=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=U12Linc[U12Linc[,4]=="-",1],
                    start=
                        as.numeric(U12Linc[U12Linc[,4]=="-",2]),
                    end=
                        as.numeric(U12Linc[U12Linc[,4]=="-",2])+39,
                    as.character=TRUE)

accPos=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=U12Linc[U12Linc[,4]=="+",1],
                    start=as.numeric(U12Linc[U12Linc[,4]=="+",3])-39,
                    end=as.numeric(U12Linc[U12Linc[,4]=="+",3])
                        ,
                    as.character=TRUE)
donNeg=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=U12Linc[U12Linc[,4]=="-",1],
                    start=as.numeric(U12Linc[U12Linc[,4]=="-",3])-9,
                    end=as.numeric(U12Linc[U12Linc[,4]=="-",3]),
                    as.character=TRUE)

donNeg<- as.vector(
    as.character(Biostrings::reverseComplement(DNAStringSet(donNeg))))
accNeg<- as.vector(
    as.character(Biostrings::reverseComplement(DNAStringSet(accNeg))))
accPos<- as.vector(accPos)
donPos<- as.vector(donPos)

donorSeq<- rep("",nrow(U12Linc))
acceptorSeq<- rep("",nrow(U12Linc))
donorSeq[U12Linc[,4]=="+"]<-donPos
donorSeq[U12Linc[,4]=="-"]<-donNeg
acceptorSeq[U12Linc[,4]=="+"]<-accPos
acceptorSeq[U12Linc[,4]=="-"]<-accNeg
U12LincMat<- cbind(U12Linc, donorSeq, acceptorSeq)

# Write lincRNA U12-type introns 
write.table(U12LincMat, file="./U12LncRNA.tsv", 
    col.names=T, row.names=F, sep='\t', quote=F)
# write data frame with info for all introns
write.table(
    cbind(ens38UncolMod, external_gene_name, 
        ensHg38UncolFilpaste60AnnoMat_40_don07_bp14, 
        transcript_biotype),
    file="./allIntronsU12annotationGeneType.tsv",
    col.names=T, row.names=F, sep='\t', quote=F)

# Build data frame with U12-type intron info
u12Out<- 
    cbind(ens38UncolMod, ensHg38UncolFilpaste60AnnoMat_40_don07_bp14,
            external_gene_name, 
            transcript_biotype)[which(
                ensHg38UncolFilpaste60AnnoMat_40_don07_bp14$int_type=="U12"),]

donPos=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=u12Out[u12Out[,4]=="+",1],
                    start=
                        as.numeric(u12Out[u12Out[,4]=="+",2]),
                    end=
                        as.numeric(u12Out[u12Out[,4]=="+",2])+9,
                    as.character=TRUE)
accNeg=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=u12Out[u12Out[,4]=="-",1],
                    start=
                        as.numeric(u12Out[u12Out[,4]=="-",2]),
                    end=
                        as.numeric(u12Out[u12Out[,4]=="-",2])+39,
                    as.character=TRUE)

accPos=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=u12Out[u12Out[,4]=="+",1],
                    start=as.numeric(u12Out[u12Out[,4]=="+",3])-39,
                    end=as.numeric(u12Out[u12Out[,4]=="+",3])
                        ,
                    as.character=TRUE)
donNeg=Biostrings::getSeq(BSgenome.Hsapiens.UCSC.hg38, 
                    names=u12Out[u12Out[,4]=="-",1],
                    start=as.numeric(u12Out[u12Out[,4]=="-",3])-9,
                    end=as.numeric(u12Out[u12Out[,4]=="-",3]),
                    as.character=TRUE)

donNeg<- 
as.vector(as.character(Biostrings::reverseComplement(DNAStringSet(donNeg))))
accNeg<- 
as.vector(as.character(Biostrings::reverseComplement(DNAStringSet(accNeg))))
accPos<- as.vector(accPos)
donPos<- as.vector(donPos)

donorSeq<- rep("",nrow(u12Out))
acceptorSeq<- rep("",nrow(u12Out))
donorSeq[u12Out[,4]=="+"]<-donPos
donorSeq[u12Out[,4]=="-"]<-donNeg
acceptorSeq[u12Out[,4]=="+"]<-accPos
acceptorSeq[u12Out[,4]=="-"]<-accNeg
u12OutSeq<- cbind(u12Out, donorSeq, acceptorSeq)

write.table(
    u12OutSeq,
    file="./AllU12Introns.tsv",
    col.names=T, row.names=F, sep='\t', quote=F)
write.table(
    u12OutSeq[which(u12OutSeq$transcript_biotype!="protein_coding"),],
    file="./nonProtCodingU12Introns.tsv",
    col.names=T, row.names=F, sep='\t', quote=F)

U12 annotation comparison

Changes in the parameter settings of the annotateU12() function may lead to detection of different sets of U12-type introns. Here, we compare the U12-type introns detected by an annotateU12() run with the same settings as above, but on a reference built from HG19, to those reported by Merico, D. et al. (2015).

Detected U12-type introns for HG19 compared to those detected by Merico, D. et al.
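No comparison script is shown for this step; as a minimal sketch of how such a comparison can be done with the same chr_begin_end coordinate keys used earlier in this post (the data frames and coordinates below are made up for illustration; the real inputs would be the HG19 U12-type introns detected here and the Merico et al. supplementary coordinates):

```r
# Hypothetical toy intron sets, each with chr/begin/end columns
ours<- data.frame(chr=c("chr1","chr2","chr3"),
    begin=c(100, 200, 300), end=c(150, 250, 350))
merico<- data.frame(chr=c("chr1","chr4"),
    begin=c(100, 400), end=c(150, 450))

# Build exact-coordinate keys, as done for the unique-intron counts above
ourKey<- paste(ours$chr, ours$begin, ours$end, sep="_")
mericoKey<- paste(merico$chr, merico$begin, merico$end, sep="_")

length(intersect(ourKey, mericoKey))  # detected by both: 1
length(setdiff(ourKey, mericoKey))    # detected here only: 2
length(setdiff(mericoKey, ourKey))    # reported by Merico et al. only: 1
```

Note that this matches introns by exact coordinates only; introns whose boundaries differ slightly between annotations would need an overlap-based comparison instead.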

Finally

The following script was used in R to generate the report.html file from the report.Rmd file. 
library("rmarkdown")
render("./report.Rmd")

Reference



  • Merico,D. et al. (2015) Compound heterozygous mutations in the noncoding RNU4ATAC cause Roifman Syndrome by disrupting minor intron splicing. Nat Commun, 6, 8718.

  1. In this post I show how groupScatterPlot(), a function of the rnatoolbox R package, can be used for plotting the individual values of several groups together with their mean (or another statistic). I think this is a useful function for plotting grouped data when some (or all) of the groups have few data points! You may be wondering why to include such a function in the rnatoolbox package?! Well! I happen to use it quite a bit for plotting expression values of different groups of genes/transcripts in a sample, or expression levels of a specific gene/transcript in several sample groups. These expression values are either FPKM, TPM, LCPM, or PSI values (maybe I should go through these different normalizations later in a different post 😐!). But of course its application is not restricted to gene expression or RNAseq data analysis.

    For the test, I first generate a list with three vectors of random values. The values are drawn from normal distributions with different means and standard deviations.

    library(rnatoolbox)
    datList<- list(
      l1=rnorm(n=30, mean = 10, sd = 3),
      l2=rnorm(n=20, mean = 0, sd = 1),
      l3=rnorm(n=25, mean = 10, sd = 1)
    )


    Then I plot the grouped values. By default the mean function is used to add a summary of the values; however, other functions (e.g. median) can be supplied via the FUN parameter.


    png(
      "/proj/pehackma/ali/test/test_rnatoolbox/test_groupedScatterPlot_3.png",
      width=500, height=500, pointsize=21)
    groupScatterPlot(l=datList, col=rainbow(3),
                     lty=1, lwd=1.5,
                     ylab="Test values")
    dev.off()




  2.  

    In this post I show how classifySex(), a function of the rnatoolbox R package, can be used for inferring the sex of the studied subjects from their binary alignment (bam) files. Sex can be a source of unwanted variation within the data, for which you may want to adjust your differential gene expression or splicing analysis. However, complete metadata are unfortunately not always available. Furthermore, sometimes details within the metadata are incorrect or have been misplaced due to manual error. Therefore, it is good practice to quickly double-check some details within the data, either to complete the missing metadata or to make sure that the prior stages were performed without any accidental mix-ups. This proved useful on our ribo-depleted RNAseq data from muscle tissues.

    NOTE! Earlier the function referred to in this post was named differently (i.e. getGender). Since version 0.2.1, classifySex() is used.

    The classifySex() function can be run on a vector that contains the path to several bam files.

    Let's first check the package version. This code should work with version 0.2.1 and later.

     

    packageVersion("rnatoolbox")
    [1] ‘0.2.1’ 

    In the example, I first construct a vector that has the paths to several bam files (from various places on the server!) that I would like to analyze.

    library(rnatoolbox)

    bam1<-scan(
      "/proj/pehackma/ali/OBSCN_ExonUsage/encode/bamOnlyFiles.txt",
      what="character", sep='\n')
    bam2<- list.files(
      path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/",
      pattern = ".bam$", recursive = TRUE, full.names = TRUE)
    bam3<- list.files(
      path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/",
      pattern = ".bam$", recursive = TRUE, full.names = TRUE)
    bam2<- bam2[-c(grep("TX_",bam2))]
    bam4<-scan(
      "/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles_05_10_2022.txt",
      what="character", sep='\n')
    bam5<-scan(
      "/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles2_05_10_2022.txt",
      what="character", sep='\n')
    bam6<- list.files(
      path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/",
      pattern = ".bam$", recursive = TRUE, full.names = T)
    bam7<- list.files(
      path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/",
      pattern = ".bam$", recursive = TRUE, full.names=T)
    bam6<- bam6[c(grep("TX_",bam6))]
    bam7<- bam7[c(grep("TX_",bam7))]

    bamVec<- unique(c(bam1,bam2,bam3,bam4,bam5,bam6,bam7))
    names(bamVec)<- gsub("\\..*","",basename(bamVec))

    The classification is based on the number of fragments mapping to chromosome Y (defined via the numChr parameter) vs the number of fragments mapping to chromosome X (defined via the denumChr parameter). I need to define a ratio cutoff; for now I use 0.05. Later, however, I will show how a suitable cutoff can be chosen and why 0.05 makes sense for this data.

    ratioCuftoff=0.05

    (chrRatio<- classifySex(bamFiles=bamVec,
        plotFile="/proj/pehackma/ali/test/test_rnatoolbox/classifySex_res_3.png",
        height=600, width=1100, fileFormat="png", pointsize=14,
        numChr="chrY", denumChr="chrX",
        mar=c(24.1, 4.1, 1.1, 1.1),
        hLine=ratioCuftoff, main=""))

    # [1] 0.01506109 0.07722695 0.01344776 0.01196316 0.02678638 0.06733287
    # [7] 0.01923814 0.07279685 0.06990508 0.01842288 0.06224493 0.01863105
    # [13] 0.02040690 0.01802919 0.01715347 0.01968033 0.01868028 0.03803660
    # [19] 0.01974870 0.01619692 0.07069193 0.06678158 0.07593763 0.10010664
    # [25] 0.07228590 0.06714948 0.07386076 0.06651051 0.06846690 0.06959353
    # [31] 0.08471437 0.07470632 0.09894125 0.08619078 0.01249461 0.07840080
    # [37] 0.06956370 0.01244372 0.07121727 0.06335510 0.02085689 0.10241148
    # [43] 0.07971098 0.09502325 0.02094541 0.07242126 0.02400726 0.01946038
    # [49] 0.02036327 0.08630780 0.01860873 0.06303455 0.01895086 0.01871783
    # [55] 0.02924722 0.08475490 0.08575956 0.08299903 0.08161365 0.02532606
    # [61] 0.02747711 0.09117602 0.08533186 0.11900393 0.08611137 0.02049278
    # [67] 0.08114836 0.07215539 0.06997266 0.08368385 0.08486157 0.07210998
    # [73] 0.09763424 0.09177915 0.08260885 0.01730037 0.07390761 0.10200900
    # [79] 0.08319738 0.09293284 0.09269107 0.02454981 0.10519674 0.09717035
    # [85] 0.08505843 0.09700546

    and the plot looks like this:

     

    Now let's see how to find a good ratio cutoff. The best cutoff is where the biggest jump within the sorted ratio values is! This is how it can be detected.

    diffDf<- data.frame(ratio=sort(chrRatio),
                        consequtive_difference=c(NA,diff(sort(chrRatio))))

    head(diffDf)
    # ratio consequtive_difference
    # 1 0.01196316                     NA
    # 2 0.01244372           4.805589e-04
    # 3 0.01249461           5.088199e-05
    # 4 0.01344776           9.531592e-04
    # 5 0.01506109           1.613323e-03
    # 6 0.01619692           1.135837e-03


    indJump<- which(
      diffDf$consequtive_difference==
        max(diffDf$consequtive_difference, na.rm=TRUE))

    We notice that after sorting (in increasing order), the biggest jump in the ratio values is from 0.03803660 to 0.06224493. The sum of these two values divided by 2, which is ~0.05, is a good choice.

    diffDf[c(indJump-1,indJump),]
    # ratio consequtive_difference
    # 31 0.03803660            0.008789386
    # 32 0.06224493            0.024208324


    sum(diffDf[c(indJump-1,indJump),1])/2
    #[1] 0.05014077

    Finally, we can use the ratios and the cutoff to label males and females.

    (sex<- c("FEMALE","MALE")[as.numeric(chrRatio>ratioCuftoff)+1])
    #[1] "FEMALE" "MALE"   "FEMALE" "FEMALE" "FEMALE" "MALE"   "FEMALE" "MALE"  
    #[9] "MALE"   "FEMALE" "MALE"   "FEMALE" "FEMALE" "FEMALE" "FEMALE" "FEMALE"
    #[17] "FEMALE" "FEMALE" "FEMALE" "FEMALE" "MALE"   "MALE"   "MALE"   "MALE"  
    #[25] "MALE"   "MALE"   "MALE"   "MALE"   "MALE"   "MALE"   "MALE"   "MALE"  
    #[33] "MALE"   "MALE"   "FEMALE" "MALE"   "MALE"   "FEMALE" "MALE"   "MALE"  
    #[41] "FEMALE" "MALE"   "MALE"   "MALE"   "FEMALE" "MALE"   "FEMALE" "FEMALE"
    #[49] "FEMALE" "MALE"   "FEMALE" "MALE"   "FEMALE" "FEMALE" "FEMALE" "MALE"  
    #[57] "MALE"   "MALE"   "MALE"   "FEMALE" "FEMALE" "MALE"   "MALE"   "MALE"  
    #[65] "MALE"   "FEMALE" "MALE"   "MALE"   "MALE"   "MALE"   "MALE"   "MALE"  
    #[73] "MALE"   "MALE"   "MALE"   "FEMALE" "MALE"   "MALE"   "MALE"   "MALE"  
    #[81] "MALE"   "FEMALE" "MALE"   "MALE"   "MALE"   "MALE"


  3. Recently I have started to organize my commonly used functions related to quality assessment and analysis of RNAseq data into an R package. It is called rnatoolbox and it is available here. In this post I introduce getMappedReadsCount(), i.e. a function that can be used for checking the number of aligned/mapped fragments in several bam files and detecting the outliers. The outliers are the bam files with an oddly high or oddly low number of mapped fragments (i.e. more than 1.5 times the interquartile range above or below the quartiles).


    The getMappedReadsCount() function is especially useful when our aim is to analyze bam files from several samples. It may be a good idea to filter out those samples whose bam files have very few mapped reads!

    In the example, I first construct a vector that has the paths to several bam files (from various paths on the server!) that I would like to analyze: 

    library(rnatoolbox)
    bam1<- scan(
        "/proj/pehackma/ali/OBSCN_ExonUsage/encode/bamOnlyFiles.txt",
        what="character", sep='\n')
    bam2<- list.files(
        path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/",
        pattern = ".bam$", recursive = TRUE, full.names = TRUE)
    bam3<- list.files(
        path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/",
        pattern = ".bam$", recursive = TRUE, full.names = TRUE)
    bam2<- bam2[-c(grep("TX_",bam2))]
    bam4<- scan(
        "/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles_05_10_2022.txt",
        what="character", sep='\n')
    bam5<- scan(
        "/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles2_05_10_2022.txt",
        what="character", sep='\n')
    bam6<- list.files(
        path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/",
        pattern = ".bam$", recursive = TRUE, full.names = TRUE)
    bam7<- list.files(
        path = "/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/",
        pattern = ".bam$", recursive = TRUE, full.names = TRUE)
    bam6<- bam6[c(grep("TX_",bam6))]
    bam7<- bam7[c(grep("TX_",bam7))]

    bamVec<- unique(c(bam1,bam2,bam3,bam4,bam5,bam6,bam7))

    Next, I set the names of the vector elements to the sample names. This consists of extracting the file name from the full path of each file and removing everything after the ".".

    names(bamVec)<- gsub("\\..*","",basename(bamVec))

    Finally, I run the function to get the read counts and plot the values. Note that the chromosomes to be considered for the read/fragment counts are defined with the chr parameter. Here, chr1-22, chrX and chrY are considered.

    getReadCount( bamFiles= bamVec,
        plotFile= "/proj/pehackma/ali/test/test_rnatoolbox/getMappedReadCount_res_3.png",
        chr=paste("chr",c(1:22, "X", "Y"), sep=""),
        fileFormat="png",
        height=9, width=19, mar=c(24.1, 4.1, 1.1, 1.1), main="",
        cex=1.5)

    # [1] 159088406 316574542 205082424 244803844 104856472 191436762 184803592
    # [8] 197802126 122037534 160421014 122235780 181429548 199137268 161879702
    # [15] 184806182 182604280 176479586 188499272 173812402 180758764 154386574
    # [22] 184470344 175950728 178088586 169036670 176570622 194848566 228952806
    # [29] 204706202 197660756 140237062 184285630 157409076 368553534 316282098
    # [36] 252580714 204283090 449077092 334729570 320968158 260299694 253185212
    # [43] 230665084 264365018 262947800 227354874 307361834 275600346 293804574
    # [50] 278743518 220880234 329853936 266015994 280758612 219623626 234773624
    # [57] 215822812 289486614 265109110 241848712 218349294 268142972 228934248
    # [64] 251021610 238898076 220085138 251466112 235138264 296144082 233070926
    # [71] 232363502 218421700 202184526 293047712 303007888 217222656 243148532
    # [78] 281507178 163940324 152692722 197206004 118648912 163852344 185098294
    # [85] 188690242 153280640

    and the plot is as follows!

    Similar to the default setting of boxplot, the horizontal gray lines show the interquartile range (the solid narrow lines), 1.5 times the interquartile range (the dashed lines) and the median (the solid thick line). The outlier sample with an oddly high number of mapped read counts is shown with a red up-pointing triangle.
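    The outlier rule described above can be sketched in base R (my own illustration of the boxplot-style 1.5×IQR convention, not the package's internal code):

```r
# Toy mapped-fragment counts; the last one is suspiciously high
counts<- c(10, 11, 12, 11, 10, 50)

q<- quantile(counts, c(.25, .75))   # the quartiles
iqr<- q[[2]] - q[[1]]               # interquartile range
low<- q[[1]] - 1.5*iqr              # lower whisker limit
high<- q[[2]] + 1.5*iqr             # upper whisker limit

counts[counts < low | counts > high]   # flagged outliers
## [1] 50
```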




  4.  

    The adhan package is available here!

    Prayer times cannot always be estimated accurately in some places, such as countries at high latitudes (e.g. the Nordic countries): during midsummer, for instance, the Fajr may be impossible to estimate, or in other words it may simply not exist! Some Muslim residents of those countries follow the prayer times of other places, such as Mecca and Medina. However, daylight saving time can make this complicated! Some align the Dhuhr prayer time of Mecca to the local Dhuhr time and measure all the remaining times based on their differences from Dhuhr in Mecca. This also resolves complications caused by daylight saving time. The adhan package facilitates mapping the prayer times of two locations based on alignment over a specific time (e.g. Dhuhr). It can also show the prayer times of a city using several methods. The package depends on the Aladhan API.

    The library is available on GACATAG GitHub. It can be installed using the install_github() function from the devtools package.

    devtools::install_github("gacatag/adhan")

    The following script shows the local prayer times of Helsinki, Finland for today (1st of April 2024), measured by the "Institute of Geophysics, University of Tehran" method. A specific day, month and year can also be given. For more info check the parameter settings by typing ?adhan::adhan.

    adhan::adhan(city="Helsinki", country="Finland", method=7)
    #          date           Fajr        Sunrise          Dhuhr            Asr         Sunset        Maghrib
    #  "01-04-2024" "04:07 (EEST)" "06:44 (EEST)" "13:24 (EEST)" "16:50 (EEST)" "20:05 (EEST)" "20:36 (EEST)"
    #          Isha          Imsak       Midnight     Firstthird      Lastthird
    #"22:02 (EEST)" "03:57 (EEST)" "01:24 (EEST)" "23:38 (EEST)" "03:11 (EEST)"

    Currently, 17 methods are supported by the Aladhan API. Defining a custom method is also possible. The following shows the Helsinki prayer times for the entire month of April 2024.

    HelsinkiAdhanApr2024<- adhan::adhanMonth(
        method="7",
        city="Helsinki",
        country="Finland",
        month=4,
        year=2024)

    The following maps the prayer times of Mecca to Helsinki by aligning the Dhuhr of the two cities. The differences of Fajr, Sunrise, Sunset and Midnight from Dhuhr (noon) will thus match those in Mecca, while the Dhuhr prayer time itself is according to Helsinki, Finland.

    HelsinkiMeccaAdhanApr2024<- adhan::adhanMapMonth(
        method="7",
        city="Helsinki",
        country="Finland",
        mapCity="Mecca",
        mapCountry="Saudi Arabia",
        mapBy="Dhuhr",
        month=4,
        year=2024)

    The tables can be organized into a nicer format using functions from the kableExtra library.

    library("kableExtra")

    x<- kbl(HelsinkiMeccaAdhanApr2024,
        table.attr = "style='width:100%;'" ) %>%
        kable_classic(full_width = TRUE, position = "center" )

    kableExtra::save_kable(x,file="April.pdf")

    as_image(x, file="April.png")


    Link of the cover photo: https://www.flaticon.com/free-icon/adhan_2918161.


  5. Unsupervised machine learning methods such as hierarchical clustering allow us to discover trends and patterns of similarity within the data. Here, I demonstrate how to apply hierarchical clustering to the columns of a test data matrix. Note that, as my main focus is Bioinformatics application, I assume that the columns of the matrix represent individual samples and the rows represent genes, transcripts or some other biological feature. However, as the application of clustering algorithms is not restricted to biology, the rows or columns of the matrix may represent other things depending on the field of research! For the distance metric, I will use the Spearman correlation based distance supported by the Dist function of the amap package. For skewed data, it is a good idea to check the similarity of the orders of the values rather than their linear relationship (i.e. Pearson correlation) or how geometrically close the values are (i.e. Euclidean distance). For more info, you can see an example that I provided in one of my previous posts on how Spearman correlation may discover associations more efficiently for skewed data. Furthermore, check the "Details" in the manual for the various methods supported by the hclust function.


    values<- matrix(rnorm(1000),ncol=20)
    colnames(values)<- paste("col",1:20,sep="")
    library(amap)
    hRes<- hclust(Dist(t(values), method="spearman"))
    plot(hRes)



     

    After running hierarchical clustering we can cut the resulting binary tree at a certain depth, or request that it be cut so as to yield a certain number of clusters. Here, I request that the resulting tree be cut in a way that yields 2 sample clusters. Furthermore, I convert the tree to a "dendrogram" object and colour the branches and labels to visualize the 2 clusters. One can use the color_branches and color_labels functions to cut and colour the trees.

    library(dendextend)

    # Cut and colour
    hResDen<- as.dendrogram(hRes)
    hResCut<- cutree(hResDen,2)
    hResDen <- color_branches(hResDen, k= 2)
    hResDen <- color_labels(hResDen, k= 2)
    plot(hResDen)


     

    Alternatively, one can use the color_branches and color_labels functions to manually define the colours of the labels and branches of the tree.

    # manual colouring based on cut results
    colours<- c(2,3)
    hResDen<- as.dendrogram(hRes)
    colOrder<- hRes$order
    hResDen <- color_branches(hResDen,clusters=hResCut[colOrder],col=colours)
    lableCol<- colours
    names(lableCol)<- unique(hResCut[colOrder])
    hResDen <- color_labels(hResDen,col=lableCol[as.character(hResCut[colOrder])])
    plot(hResDen)




    But what if we want to colour the branches and labels of the tree based on a predefined grouping of the samples? Here, we colour the labels and the edges leading to them to visualize the positions of the "class1", "class2" and "class3" samples in the tree.

    # Manual colouring based on some predefined classes

    sampleClass<- c(rep("class1",5), rep("class2",6), rep("class3",9))
    colours<- c("lightblue","green", "red")
    hResDen<- as.dendrogram(hRes)
    colOrder<- hRes$order
    hResDen <- color_branches(hResDen,clusters=as.numeric(as.factor(sampleClass[colOrder])),col=colours)
    lableCol<- colours
    names(lableCol)<- unique(sampleClass[colOrder])
    hResDen <- color_labels(hResDen,col=lableCol[as.character(sampleClass[colOrder])])
    plot(hResDen)



  6.  


     Note! The & sign runs the command in the background.

    Getting MD5 sum for all files and writing it to a txt file in Linux.

    md5sum * > myChecklist.txt &


    Getting MD5 sum for all files and subfolders and writing it to a txt file in Linux.

    find ./ -type f -exec md5sum {} + > myChecklist.txt &


    Getting MD5 sum for all files and writing it to a txt file in Mac.

    md5 -r * > myChecklist.txt &


    Getting MD5 sum for all files and subfolders and writing it to a txt file in Mac.

    find ./ -type f -exec md5 -r {} + > myChecklist.txt &
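    Once the checksum list exists, it can be used later to verify that the files are unchanged. A minimal sketch using coreutils' md5sum -c on Linux (the scratch directory and file name below are just for illustration):

    ```shell
    # Create a scratch file, record its MD5, then verify the list.
    tmpdir=$(mktemp -d)
    cd "$tmpdir"
    echo "hello" > a.txt
    md5sum a.txt > myChecklist.txt
    # -c re-computes each file's MD5 and compares it to the stored value;
    # it prints "a.txt: OK" and exits 0 when nothing has changed.
    md5sum -c myChecklist.txt
    ```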

  7. Many times in our projects we may need to compare different measured factors in our samples to one another and study whether they are linearly dependent. This information can also help us detect covariates and factors that affect our studies, whose effects we would like to adjust for or remove (more on this some other time). Here, I mention several functions that can be used to perform correlation tests. All of these functions support both the Pearson and the ranked (Spearman) methods. Note that at the end of this post I will focus on these two methods (i.e. Pearson vs Spearman) and show how their applications differ.

    First, let's construct a 10×5 matrix and replace 10 randomly picked positions within the matrix with NAs (i.e. missing values).


    # Make test data matrix
    dat<- matrix(rnorm(50), nrow=10, ncol=5)

    # Replace 10 randomly picked positions within the matrix with NAs
    set.seed(877)
    naInd<- sample(1:length(dat), 10)
    dat[naInd]<- NA
    colnames(dat)<- paste("col", 1:ncol(dat), sep="")
    rownames(dat)<- paste("row", 1:nrow(dat), sep="")
    dat
    #             col1       col2       col3       col4        col5
    # row1  -0.8882978 -1.3064945 -0.8559183 -1.2621139  0.28889517
    # row2  -1.1934817         NA  1.0021094  0.2707312  2.65574584
    # row3   0.5436480 -0.9709940         NA -2.0137933 -0.03901379
    # row4  -0.1557453 -1.6251252 -0.2549788 -0.5652703          NA
    # row5  -0.7226121  2.7137291 -0.5804944  0.4200483 -0.18883746
    # row6          NA -0.9527775  2.1885032 -0.3665413  1.14035680
    # row7          NA  1.4430923         NA  0.3362986          NA
    # row8   0.4691104 -1.5502340         NA         NA  0.22606033
    # row9  -0.3557879  0.1540679 -0.4542577  0.4951978 -1.11224029
    # row10  0.6162009         NA -0.9514461 -1.0438710 -1.48530042

    Normally, we can use the cor() function from the stats package to get pairwise correlations over the columns of a data frame such as dat. However, as demonstrated in the following, due to the NAs in the data frame the results of the correlations will mostly be NA. Note that parameter settings such as na.rm=TRUE (which ignores the NAs in functions such as median and mean) are not supported by the cor() function, although it does offer a use argument for handling missing values. Furthermore, the cor() function does not run statistical tests, hence it does not return p values for the comparisons (execute ?cor for more info).

    cor(dat)
    #      col1 col2 col3 col4 col5
    # col1    1   NA   NA   NA   NA
    # col2   NA    1   NA   NA   NA
    # col3   NA   NA    1   NA   NA
    # col4   NA   NA   NA    1   NA
    # col5   NA   NA   NA   NA    1
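    As a side note, cor() can remove NAs per pair of columns via its use argument, which recovers the full r matrix (though still without p values). A quick sketch on a rebuilt toy matrix mirroring the dat example above:

    ```r
    # Rebuild a small matrix with 10 random NAs (mirrors the dat example)
    set.seed(877)
    dat2 <- matrix(rnorm(50), nrow = 10, ncol = 5)
    dat2[sample(seq_along(dat2), 10)] <- NA

    # use="pairwise.complete.obs" drops NAs separately for each pair of
    # columns, so the full correlation matrix is computed; no p values.
    r <- cor(dat2, use = "pairwise.complete.obs")
    r
    ```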


    One may be tempted to remove those rows from the data frame that have one or more missing values. However, you may be left with too few rows in your data to run any meaningful correlation analysis.

    #Removing rows with NAs
    as.data.frame(na.omit(dat))

    #            col1       col2       col3       col4       col5
    # row1 -0.5903153  1.1200880 -1.4642429  0.2085692  1.1770598
    # row5  0.4496017  0.8385497 -0.4793778 -0.1731461  0.8716287
    # row9 -0.5647845 -1.6658176 -0.5613469  0.7549264 -1.1794651

    #Running correlation on NA filtered data
    cor(na.omit(dat))

    #            col1       col2       col3       col4       col5
    # col1  1.0000000  0.1517447  0.9106435  0.7645283 -0.9994524
    # col2  0.1517447  1.0000000  0.5465933  0.7531387 -0.1843677
    # col3  0.9106435  0.5465933  1.0000000  0.9625528 -0.9238171
    # col4  0.7645283  0.7531387  0.9625528  1.0000000 -0.7854387
    # col5 -0.9994524 -0.1843677 -0.9238171 -0.7854387  1.0000000

    We would, of course, prefer to get the most out of our data. Therefore, we would like to ignore the NAs in each paired correlation test. One correlation function in R's stats package that removes the NAs is cor.test(). However, this function only runs a correlation on a pair of vectors and does NOT accept a data.frame/matrix as its input (to run correlations on the columns of the data frame and build a pairwise correlation matrix accordingly). Note that the cor.test() function also returns measured p values for the comparisons. Here, we define a mycor() function that adapts cor.test() to run pairwise correlations over all columns of an input data frame and returns two matrices holding the r values and p values of the pairwise comparisons.

    # Returns r and p values but does not accept data frame as input !
    cor.test(dat[,1], dat[,2], na.action=na.omit)
    # Pearson's product-moment correlation
    #
    # data:  dat[, 1] and dat[, 2]
    # t = -1.063, df = 4, p-value = 0.3477
    # alternative hypothesis: true correlation is not equal to 0
    # 95 percent confidence interval:
    #  -0.9275857  0.5527702
    # sample estimates:
    #        cor
    # -0.4693404

    # Defining a correlation function that suits a data frame input
    # and returns r and p values
    mycor<- function(x, ...){
      # r values of all pairwise column correlations
      r<- apply(x, 2, function(j){
        apply(x, 2, function(i){
          as.numeric(cor.test(i, j, ...)$estimate)
        })
      })
      # p values of all pairwise column correlations
      P<- apply(x, 2, function(j){
        apply(x, 2, function(i){
          as.numeric(cor.test(i, j, ...)$p.value)
        })
      })
      out<- list(P=P, r=r)
      return(out)
    }

    # Running the defined correlation function and measuring
    # its running time
    time1<- Sys.time()
    myCorDat<- mycor(dat, method="pearson", na.action=na.omit)
    time2<- Sys.time()


    (runTimeMyCor<- difftime(time2,time1))

    #Time difference of 0.02293086 secs


    One problem with using cor.test() in a loop or apply function is that it is inefficient (in other words, slow), especially for large data. This problem can also be seen in the mycor() function above, as it takes ~0.02 seconds on my system to run on data of size 50 (i.e. a 10×5 matrix). If you are willing to install the Hmisc R package (available on CRAN), it provides an rcorr() function which is robust, accepts a numeric matrix (or 2 numeric matrices!) as input, removes NAs pairwise per comparison, and returns measured r and p values for the pairwise comparisons.

    #Running rcorr function and measuring run time
    time1<- Sys.time()
    rcorrDat<- Hmisc::rcorr(dat, type="pearson")
    time2<- Sys.time()
    (runTimeRcorr<- difftime(time2,time1))
    #Time difference of 0.0009348392 secs

    #mycor vs rcorr run time
    c(runTimeMyCor, runTimeRcorr)
    # Time differences in secs
    # [1] 0.0229308605 0.0009348392

    rcorrDat
    #       col1  col2  col3  col4  col5
    # col1  1.00 -0.47 -0.59 -0.61 -0.60
    # col2 -0.47  1.00 -0.25  0.69 -0.40
    # col3 -0.59 -0.25  1.00  0.25  0.71
    # col4 -0.61  0.69  0.25  1.00  0.19
    # col5 -0.60 -0.40  0.71  0.19  1.00
    #
    # n
    #      col1 col2 col3 col4 col5
    # col1    8    6    6    7    7
    # col2    6    8    5    7    6
    # col3    6    5    7    7    6
    # col4    7    7    7    9    7
    # col5    7    6    6    7    8
    #
    # P
    #        col1   col2   col3   col4   col5 
    # col1        0.3477 0.2166 0.1428 0.1506
    # col2 0.3477        0.6842 0.0854 0.4283
    # col3 0.2166 0.6842        0.5851 0.1115
    # col4 0.1428 0.0854 0.5851        0.6782
    # col5 0.1506 0.4283 0.1115 0.6782      



     

    Correlation plotting

    One interesting plotting method is corrplot(), provided by the corrplot R package. This function mainly visualizes the r measurements of the paired correlations. The size and colour of the circles in the figure represent the r values. If a p value is higher than the defined sig.level threshold parameter, an X sign marks that correlation as NOT significant.

    library(corrplot)
    jpeg("corplotTest.jpg", width=800, height=800, quality=100, pointsize=24)
    corrplot(corr=myCorDat$r,
             p.mat = myCorDat$P,
             type="full", insig="pch", sig.level =.1, pch.cex = .9)
    dev.off()


    The default colour palette is not too popular in our lab though! We prefer to use a warm (e.g. red) colour for the higher values and a cold (e.g. blue) colour for the lower values. Using the following code I usually build a custom colour palette function whose order is the reverse of the default colours used by corrplot.

    # Build a custom colour palette function:
    # cold (blue) for low values, warm (red) for high values
    col2 <- colorRampPalette(c("#053061", "#2166AC", "#4393C3",
                               "#92C5DE", "#D1E5F0", "#FFFFFF",
                               "#FDDBC7", "#F4A582", "#D6604D",
                               "#B2182B", "#67001F"))

    jpeg("corplotTest2.jpg", width=800, height=800, quality=100, pointsize=24)
    corrplot(corr=myCorDat$r,
             p.mat = myCorDat$P,
             type="full", insig="pch", sig.level =.1, pch.cex = .9, col=col2(200))
    dev.off()


    Since we have randomly generated the values in our test data, the moderately high and significant correlation of col4 with col2 may come as a surprise! First of all, note that its p value is actually ~0.085, which is not less than 0.05; we chose a more relaxed 0.1 p value significance threshold in the plot. Nevertheless, this shows that as the number of comparisons increases (in relation to the size of the data), the chance of achieving an accidental significant correlation increases. P value adjustments can help correct for multiple testing. In the following we first recheck the correlation of col4 vs col2. Next, we adjust the p values obtained from the correlation tests using the Benjamini and Hochberg method. Finally, we plot col4 vs col2 and fit a line to the data points in the plot. Note that since the pairwise correlation matrices are symmetric and the diagonals are predictable (being the result of correlating every column with itself!), I run the p adjustment over the lower triangle of the p value matrix only. As you can see, the adjusted p values are no longer significant!

    # Recheck correlation of col4 vs col2
    cor.test(x=dat[,"col2"], y=dat[,"col4"], method="pearson")

    #
    # Pearson's product-moment correlation
    #
    # data:  dat[, "col2"] and dat[, "col4"]
    # t = 2.1394, df = 5, p-value = 0.08538
    # alternative hypothesis: true correlation is not equal to 0
    # 95 percent confidence interval:
    #  -0.1287915  0.9498704
    # sample estimates:
    #       cor
    # 0.6913155

    p.adjust(myCorDat$P[lower.tri(myCorDat$P, diag = FALSE)], method="BH")
    # [1] 0.9401500 0.8955621 0.9401500 0.9401500 0.9401500 0.6116986
    # [7] 0.6116986 0.7738987 0.6116986 0.9401500
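    If a full adjusted matrix is preferred (e.g. to pass as p.mat to corrplot), the adjusted lower-triangle values can be mirrored back into a symmetric matrix. A sketch, using a toy matrix that stands in for a symmetric p value matrix such as myCorDat$P:

    ```r
    # Toy symmetric p value matrix (stands in for myCorDat$P)
    set.seed(1)
    P <- matrix(runif(25), 5, 5)
    P[upper.tri(P)] <- t(P)[upper.tri(P)]  # force symmetry
    diag(P) <- 0

    # BH-adjust the lower triangle only, then mirror it back
    Padj <- P
    Padj[lower.tri(Padj)] <- p.adjust(P[lower.tri(P)], method = "BH")
    Padj[upper.tri(Padj)] <- t(Padj)[upper.tri(Padj)]
    ```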

    # Plot col4 (y-axis) vs col2 (x-axis)
    jpeg("highCorPlot.jpg", width=800, height=800, quality=100, pointsize=24)
    plot(dat[,"col2"], dat[,"col4"], pch=16, xlab="col2", ylab="col4")
    abline(lm(dat[,"col4"]~dat[,"col2"]), col="red", lwd=2)
    dev.off()


    Pearson vs Spearman correlation

    All the code above uses Pearson, i.e. the default method in most correlation-related functions. Many of these functions do, however, support Spearman (i.e. the ranked correlation method) as well. As demonstrated with a simple exponential test data set, Spearman correlation is more forgiving towards skewness and extreme outliers, and manages to detect the strong correlation between the x and y defined in the following scripts.

    # Defining an unskewed data
    x<- 1:50
    # Defining a highly skewed data
    y<-exp(x)

    e1071::skewness(x)
    #[1] 0
    e1071::skewness(y)
    #[1] 5.574573

    cor.test(x=x, y=y, method="pearson")
    #
    # Pearson's product-moment correlation
    #
    # data:  x and y
    # t = 2.6098, df = 48, p-value = 0.01205
    # alternative hypothesis: true correlation is not equal to 0
    # 95 percent confidence interval:
    #  0.08223784 0.57449344
    # sample estimates:
    #       cor
    # 0.3525162 

    cor.test(x=x, y=y, method="spearman")
    #
    # Spearman's rank correlation rho
    #
    # data:  x and y
    # S = 0, p-value < 2.2e-16
    # alternative hypothesis: true rho is not equal to 0
    # sample estimates:
    # rho
    #   1

    jpeg("whySpearman.jpg", width=800, height=800, quality=100, pointsize=24)
    plot(x, y, pch=16, xlab="x", ylab="y = exp(x)")
    dev.off()


  8.  

    Occasionally, when indexing data frames, the format is converted, leading to confusing consequences. For instance, when indexing to select a single column, the result is a vector (e.g. 'numeric' or 'integer'). The following demonstrates this:



    df<- data.frame(num=1:10, al=letters[1:10], bool=c(rep(TRUE,5), rep(FALSE,5)) )

    rownames(df)<- df$al

    df
    #   num al  bool
    # a   1  a  TRUE
    # b   2  b  TRUE
    # c   3  c  TRUE
    # d   4  d  TRUE
    # e   5  e  TRUE
    # f   6  f FALSE
    # g   7  g FALSE
    # h   8  h FALSE
    # i   9  i FALSE
    # j  10  j FALSE

    class(df[,1])
    #[1] "integer"

    class(df[,2])
    #[1] "factor"

    class(df[,3])
    #[1] "logical"

    df[,1]
    #[1]  1  2  3  4  5  6  7  8  9 10

    # Note that the following returns an error !

    rowSums(df[,1])
    #Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
    #  'x' must be an array of at least two dimensions

    Using the drop=FALSE parameter setting, it is possible to maintain the data frame format.


    class(df[,1, drop=FALSE])
    #[1] "data.frame"

    df[,1, drop=FALSE] 

    #  num
    #a   1
    #b   2
    #c   3
    #d   4
    #e   5
    #f   6
    #g   7
    #h   8
    #i   9
    #j  10

    # No error raised by the following command!

    rowSums(df[,1, drop=FALSE])
    # [1]  1  2  3  4  5  6  7  8  9 10
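    A related detail worth knowing: single-bracket indexing with one index (list-style) also preserves the data.frame class, without needing drop=FALSE:

    ```r
    df <- data.frame(num = 1:10, al = letters[1:10])

    # List-style single-bracket indexing keeps the data.frame class
    class(df["num"])   # "data.frame"

    # $ (and [ , ] with drop=TRUE) extraction drops down to a vector
    class(df$num)      # "integer"
    ```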




  9. When analyzing data constructed of individuals (or samples from individuals) of both the male and the female of a species (e.g. humans), it is often a good idea to compare the distributions of the various studied parameters in the males to those in the females. In RNAseq analysis, for instance, the measured expression of many genes may differ significantly between the studied males and females. In other words, gender may introduce a 'batch effect' into the gene expression data. Here is one way to use the neat Venus and Mars signs in an R plot to label the data related to the females and males. Note that the male and female signs were assigned randomly here, so no gender batch effect is noticeable in this figure.

    #Generate normally distributed random variables
    # and organize them in a 50*2 matrix

    randM<-matrix(rnorm(100, 500, 50), ncol = 2)

    #Output plot into png
    png("MaleFemale.png", res=350, width=2000, height=2000)

    #Randomly generate 50 MALE or FEMALE signs
    gender<- sample(c("M","F"), 50, replace=T)

    #Define a suitable pch vector based on the Unicodes of
    #the Venus (Female) and Mars (Male) signs

    pchVec<-rep(-0x2642L, nrow(randM))
    pchVec[gender=="F"]<- -0x2640L

    # Randomly assign colors to define 3 subgroups within
    #the 50 individuals

    colors<- rainbow(3)
    colVec<-sample(colors, 50, replace=T)

    plot(randM, pch=pchVec, cex=2, col=colVec, xlab="X", ylab="Y")
    legend("topleft", legend=c("group1","group2","group3"),
           fill =colors)
    dev.off()

  10. Here is an example of plotting 4 Venn diagrams on a single screen with a 2*2 layout.


    library(VennDiagram)

    #defining vectors
    av<- 1:10
    bv<- 12:20
    cv<-  7:15

    # Building venndiagram grid objects (i.e. gList)
    a<- venn.diagram(list(av= av, bv=bv), filename=NULL, main="VD1")
    b<- venn.diagram(list(av=av, cv=cv), filename=NULL, main="VD2")
    c<- venn.diagram(list(bv=bv, cv=cv), filename=NULL, main="VD3")
    d<- venn.diagram(list(av=av, cv=cv), filename=NULL, main="VD4")

    # Draw the diagrams
    pushViewport(plotViewport(layout=grid.layout(2, 2)))
    pushViewport(plotViewport(layout.pos.col=1, layout.pos.row=1))
    grid.draw(a)
    popViewport()
    pushViewport(plotViewport(layout.pos.col=2, layout.pos.row=1))
    grid.draw(b)
    popViewport()
    pushViewport(plotViewport(layout.pos.col=1, layout.pos.row=2))
    grid.draw(c)
    popViewport()
    pushViewport(plotViewport(layout.pos.col=2, layout.pos.row=2))
    grid.draw(d)
    popViewport(2)




About Me
I am a Postdoc researcher at the Neuromuscular Disorders Research lab and Genetic Determinants of Osteoporosis Research lab, in University of Helsinki and Folkhälsan RC. I specialize in Bioinformatics. I am interested in Machine learning and multi-omics data analysis. My go-to programming language is R.