Hierarchical clustering, cutting the tree and colouring the tree leaves based on sample classes

Unsupervised machine learning methods such as hierarchical clustering allow us to discover trends and patterns of similarity within data. Here, I use a test data matrix to demonstrate how to apply hierarchical clustering to the columns of a matrix. As my main focus is bioinformatics, I assume that the columns of the matrix represent individual samples and the rows represent genes, transcripts or some other biological feature. However, as the application of clustering algorithms is not restricted to biology, the rows and the columns of the matrix may represent other things depending on the field of research. For the distance metric, I use the Spearman correlation based distance supported by the Dist function of the amap package. For skewed data, it is a good idea to check the similarity of the orders of the values rather than their linear relationship (i.e. Pearson correlation) or how geometrically close the values are (i.e. Euclidean distance). For more info, see the example that I provided in one of my previous posts on how Spearman correlation may discover associations more efficiently in skewed data. Furthermore, check the "Details" section in the manual of the hclust function for the various agglomeration methods that it supports.

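# Test data: 50 features (rows) x 20 samples (columns)
values <- matrix(rnorm(1000), ncol=20)
colnames(values) <- paste("col", 1:20, sep="")
library(amap)
# Cluster the samples (columns) using the Spearman based distance
hRes <- hclust(Dist(t(values), method="spearman"))
plot(hRes)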
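
As a rough cross-check that does not need amap, the sketch below builds a Spearman based dissimilarity with base R only. Note the assumptions: it uses 1 - rho as the distance, which is not the value returned by Dist. As far as I understand the amap manual, Dist with method="spearman" returns the sum of squared rank differences (S), and without ties the two are linearly related through rho = 1 - 6S/(n(n^2-1)). The names rho, dSpearman, hResBase, S12 and rho12 are mine and only serve the illustration.

```r
set.seed(1)  # only to make the test data reproducible
values <- matrix(rnorm(1000), ncol = 20)
colnames(values) <- paste0("col", 1:20)

# Spearman correlation between every pair of columns (samples)
rho <- cor(values, method = "spearman")

# Assumed dissimilarity: 1 - rho (0 = identical ranking, 2 = reversed ranking)
dSpearman <- as.dist(1 - rho)
hResBase <- hclust(dSpearman)  # complete linkage, the hclust default

# Link between rho and the sum of squared rank differences (no ties here)
n <- nrow(values)
S12 <- sum((rank(values[, 1]) - rank(values[, 2]))^2)
rho12 <- 1 - 6 * S12 / (n * (n^2 - 1))
```

Running plot(hResBase) afterwards draws the tree. The merge heights are on a different scale than with Dist, so the two plots are not expected to look identical.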

After running hierarchical clustering, we can cut the resulting binary tree at a certain height or request that it be cut in a manner that results in a certain number of clusters. Here, I request that the tree be cut in a way that results in 2 sample clusters. Furthermore, I convert the resulting tree to a "dendrogram" object and colour the branches and the labels of the tree to visualize the 2 clusters. The color_branches and color_labels functions of the dendextend package can be used to cut and colour the tree.

library(dendextend)

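# Cut and colour
hResDen <- as.dendrogram(hRes)
# Cluster membership of each sample (reused in the manual colouring further down)
hResCut <- cutree(hResDen, 2)
# k=2 colours the branches and the labels of the 2 clusters
hResDen <- color_branches(hResDen, k=2)
hResDen <- color_labels(hResDen, k=2)
plot(hResDen)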
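
The two ways of cutting mentioned above (by number of clusters or by height) can be sketched with the base R cutree function. This is a minimal sketch under assumptions: it uses 1 - Spearman correlation as a stand-in distance so that it runs without amap, and the cut height is picked halfway between the last two merges, which is my choice for illustration. The names cutByK, cutHeight and cutByH are hypothetical.

```r
set.seed(1)  # only to make the test data reproducible
values <- matrix(rnorm(1000), ncol = 20)
colnames(values) <- paste0("col", 1:20)
# Assumption: 1 - Spearman correlation as a stand-in for the amap distance
hRes <- hclust(as.dist(1 - cor(values, method = "spearman")))

# Cut by requesting a number of clusters
cutByK <- cutree(hRes, k = 2)

# Cut at a height: any height between the last two merges leaves 2 clusters
cutHeight <- mean(tail(hRes$height, 2))
cutByH <- cutree(hRes, h = cutHeight)

table(cutByK)  # cluster sizes
```

Both calls return a vector of cluster numbers named by sample and given in the order of the columns of the data, not in the order of the leaves of the tree.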

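Alternatively, one can pass the cluster assignments and the colours to the color_branches and color_labels functions to manually define the colours of the branches and the labels of the tree. Note that the cluster assignments must first be reordered to match the order of the leaves in the tree (hRes$order).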

# Manual colouring based on cut results
colours <- c(2,3)
hResDen <- as.dendrogram(hRes)
# Leaf order of the tree, used to reorder the cluster assignments
colOrder <- hRes$order
hResDen <- color_branches(hResDen, clusters=hResCut[colOrder], col=colours)
# Map each cluster number to a colour, then look up the colour of every leaf
labelCol <- colours
names(labelCol) <- unique(hResCut[colOrder])
hResDen <- color_labels(hResDen, col=labelCol[as.character(hResCut[colOrder])])
plot(hResDen)
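
The reordering and the name based colour lookup used above can be followed step by step in base R. This is a minimal sketch under assumptions: the tree is cut with the base cutree on the hclust object rather than on the dendrogram, and 1 - Spearman correlation stands in for the amap distance. The names cutInTreeOrder and leafColours are mine.

```r
set.seed(1)  # only to make the test data reproducible
values <- matrix(rnorm(1000), ncol = 20)
colnames(values) <- paste0("col", 1:20)
# Assumption: 1 - Spearman correlation as a stand-in for the amap distance
hRes <- hclust(as.dist(1 - cor(values, method = "spearman")))

hResCut <- cutree(hRes, k = 2)  # cluster numbers in data (column) order
colOrder <- hRes$order          # position of each leaf, left to right

# Cluster numbers rearranged to follow the leaves of the tree
cutInTreeOrder <- hResCut[colOrder]

# The first cluster met in the tree gets colours[1], the next colours[2]
colours <- c(2, 3)
labelCol <- colours
names(labelCol) <- unique(cutInTreeOrder)

# One colour per leaf, looked up by cluster number
leafColours <- labelCol[as.character(cutInTreeOrder)]
```

Printing cutInTreeOrder next to leafColours shows that every leaf of a cluster receives the same colour, and that the leftmost leaf always receives the first colour.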

But what if we want to colour the branches and the labels of the tree based on a predefined grouping of the samples? Here, we colour the labels and the edges leading to them to visualize the positions of the "class1", "class2" and "class3" samples in the tree.

# Manual colouring based on some predefined classes
sampleClass <- c(rep("class1",5), rep("class2",6), rep("class3",9))
colours <- c("lightblue","green", "red")
hResDen <- as.dendrogram(hRes)
# Leaf order of the tree, used to reorder the classes
colOrder <- hRes$order
hResDen <- color_branches(hResDen, clusters=as.numeric(as.factor(sampleClass[colOrder])), col=colours)
# Colours are assigned to the classes in their order of appearance in the tree
labelCol <- colours
names(labelCol) <- unique(sampleClass[colOrder])
hResDen <- color_labels(hResDen, col=labelCol[as.character(sampleClass[colOrder])])
plot(hResDen)
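
For comparison, the labels can also be coloured by class without dendextend, using the base R dendrapply function. This is only a sketch of an alternative, not the approach used above: it colours the labels but not the branches, it ties each class to a colour by name (the classCol vector is my assumption), and it uses 1 - Spearman correlation as a stand-in distance. The names classCol, colourLeaf and hResDenBase are hypothetical.

```r
set.seed(1)  # only to make the test data reproducible
values <- matrix(rnorm(1000), ncol = 20)
colnames(values) <- paste0("col", 1:20)
# Assumption: 1 - Spearman correlation as a stand-in for the amap distance
hRes <- hclust(as.dist(1 - cor(values, method = "spearman")))

# Predefined classes, named by sample so they can be looked up by leaf label
sampleClass <- c(rep("class1", 5), rep("class2", 6), rep("class3", 9))
names(sampleClass) <- colnames(values)

# Fixed class-to-colour mapping, independent of the leaf order
classCol <- c(class1 = "lightblue", class2 = "green", class3 = "red")

# Set the label colour of every leaf according to the class of its sample
colourLeaf <- function(node) {
  if (is.leaf(node)) {
    leafClass <- sampleClass[[attr(node, "label")]]
    attr(node, "nodePar") <- c(attr(node, "nodePar"),
                               list(lab.col = classCol[[leafClass]], pch = NA))
  }
  node
}
hResDenBase <- dendrapply(as.dendrogram(hRes), colourLeaf)
```

Calling plot(hResDenBase) draws the tree with the coloured labels. Because the lookup is done by sample name, no reordering by hRes$order is needed here.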

gacatag
