21.1 Cluster analysis
Clustering procedures group features or observations into homogeneous sets by minimising within-group distances and maximising among-group distances.
21.1.1 Hierarchical clustering
Hierarchical clustering produces a stratified organisation of features or observations in which relatively similar objects are grouped together. The distance between clusters can be measured using different linkage criteria (e.g., single linkage, complete linkage, average linkage and Ward’s minimum variance), and the choice of criterion affects the final outcome of the analysis.
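# Load the dataset (assumes the first column holds row names and the rest are numeric)
data <- read.csv("mydata.csv", row.names=1)
# Perform hierarchical clustering
dist_matrix <- dist(data) # calculate Euclidean distance matrix between rows
hc <- hclust(dist_matrix, method="complete") # complete linkage is the default; see ?hclust for "single", "average", "ward.D2"
# Plot dendrogram of clustering
plot(hc, hang=-1) # hang=-1 aligns all labels at the bottom of the plot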
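The effect of the linkage criterion can be checked by running hclust() with several methods on the same distance matrix. The following is a minimal sketch rather than part of the original workflow: it assumes the built-in USArrests dataset as a stand-in for an omic table, and the choice of four groups is arbitrary.

```r
# Illustrative stand-in data (assumption): built-in USArrests, standardised per column
x <- scale(USArrests)
d <- dist(x)  # Euclidean distances between observations

# Linkage criteria to compare ("ward.D2" implements Ward's minimum variance)
methods <- c("single", "complete", "average", "ward.D2")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Cut every tree into the same (arbitrary) number of groups
k <- 4
groups <- sapply(trees, cutree, k = k)

# Cross-tabulate two criteria to see how far the partitions disagree
table(single = groups[, "single"], ward = groups[, "ward.D2"])

# Cophenetic correlation: how faithfully each tree preserves the original distances
sapply(trees, function(tr) cor(d, cophenetic(tr)))
```

Single linkage tends to chain observations into one large group, whereas Ward's criterion tends to produce more balanced groups, so the cross-tabulation usually shows substantial disagreement between the two.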
A useful exploratory analysis to reveal general patterns in an omic layer can be obtained by simultaneous application of hierarchical clustering to the rows and columns of the data matrix, and visualising the results in a heatmap.
# Load the dataset
data <- read.csv("mydata.csv", row.names=1)
# Perform hierarchical clustering of rows and columns
row_clusters <- hclust(dist(data))
col_clusters <- hclust(dist(t(data)))
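# Plot heatmap with row and column dendrograms
library(gplots)
heatmap.2(as.matrix(data),
          Rowv=as.dendrogram(row_clusters), # heatmap.2 expects dendrogram objects, not hclust objects
          Colv=as.dendrogram(col_clusters),
          scale="row", # rows are z-scored for display only; the clustering above used unscaled values
          dendrogram="both",
          key=TRUE,
          keysize=1.5,
          col=redgreen(75))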
21.1.2 Disjoint clustering
Disjoint clustering techniques aim to separate the objects into individual, usually mutually exclusive, and in most cases unconnected clusters. K-means clustering is one of the most widely used algorithms: objects are assigned to k clusters using an iterative procedure that minimises the within-cluster sum of squares. Other available clustering methods include TWINSPAN, self-organising maps, DBSCAN and Dirichlet multinomial mixtures (DMM). DMM were specifically developed to analyse metagenomic (MG) data but can be equally useful for other sequencing-based omic datasets.
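K-means clustering can be run with the kmeans() function from base R. The sketch below is illustrative only: it assumes the numeric columns of the built-in iris dataset as stand-in data, and both k = 3 and the range of k explored are arbitrary choices rather than recommendations.

```r
set.seed(1)  # k-means starts from random centres, so fix the seed for reproducibility

# Illustrative stand-in data (assumption): numeric columns of iris, standardised
x <- scale(iris[, 1:4])

# k = 3 is an assumption for this example; nstart repeats the random initialisation
km <- kmeans(x, centers = 3, nstart = 25)

table(km$cluster)  # cluster sizes
km$tot.withinss    # total within-cluster sum of squares, the quantity being minimised

# Total within-cluster sum of squares across a range of k ("elbow" curve)
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Because the within-cluster sum of squares always shrinks as k grows, it cannot by itself select k; the curve is usually inspected for the point where additional clusters stop yielding a clear reduction.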