21.1 Cluster analysis

Clustering procedures group features or observations into homogeneous sets by minimising within-group distances and maximising among-group distances.

21.1.1 Hierarchical clustering

Hierarchical clustering produces a stratified organisation of features or observations in which relatively similar objects are grouped together. Different criteria can be used to measure the distance between clusters (e.g., single linkage, complete linkage, average linkage and Ward’s minimum variance), and the choice of criterion affects the final outcome of the analysis.

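# Load the dataset (first column holds the row names; remaining columns must be numeric)
data <- read.csv("mydata.csv", row.names=1)

# Perform hierarchical clustering
dist_matrix <- dist(data, method="euclidean")  # Euclidean distance matrix between rows
hc <- hclust(dist_matrix, method="complete")   # complete linkage (the hclust default)

# Plot dendrogram of clustering
plot(hc, hang=-1)  # hang=-1 aligns all labels at the bottom of the plot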
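
Because the linkage criterion affects the outcome, it can be informative to compare several criteria on the same distance matrix. The following is a minimal sketch rather than a recommended workflow: it assumes the built-in `USArrests` dataset in place of `mydata.csv`, standardises the variables, cuts every tree into an arbitrary four groups, and uses the cophenetic correlation as one possible measure of how faithfully each dendrogram preserves the original distances.

```r
# Illustrative data: built-in USArrests (an assumption; replace with your own table)
x <- scale(USArrests)               # standardise variables before computing distances
d <- dist(x, method="euclidean")

# Linkage criteria to compare (names as accepted by hclust)
linkages <- c("single", "complete", "average", "ward.D2")

# Fit one tree per linkage criterion
trees <- lapply(linkages, function(m) hclust(d, method=m))
names(trees) <- linkages

# Cophenetic correlation: agreement between tree distances and original distances
coph_cor <- sapply(trees, function(tr) cor(d, cophenetic(tr)))
print(round(coph_cor, 3))

# Cut every tree into 4 groups (arbitrary choice) and compare group sizes
groups <- lapply(trees, cutree, k=4)
print(lapply(groups, table))
```

A higher cophenetic correlation indicates that the tree distorts the original distances less, but it does not by itself show that the resulting groups are biologically meaningful.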

A useful exploratory analysis to reveal general patterns in an omic layer can be obtained by simultaneous application of hierarchical clustering to the rows and columns of the data matrix, and visualising the results in a heatmap.

# Load the dataset
data <- read.csv("mydata.csv", row.names=1)

# Perform hierarchical clustering of rows and columns
row_clusters <- hclust(dist(data))
col_clusters <- hclust(dist(t(data)))

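# Plot heatmap with row and column dendrograms
# (heatmap.2 expects dendrogram objects, so the hclust results are converted)
library(gplots)
heatmap.2(as.matrix(data),
          Rowv=as.dendrogram(row_clusters),
          Colv=as.dendrogram(col_clusters),
          scale="row",        # rows are standardised for display only
          dendrogram="both",
          key=TRUE,
          keysize=1.5,
          col=redgreen(75))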

21.1.2 Disjoint clustering

Disjoint clustering techniques aim to separate the objects into distinct clusters that are usually mutually exclusive and, in most cases, unconnected. K-means clustering is one of the most widely used algorithms: objects are assigned to k clusters by an iterative procedure that minimises the within-cluster sum of squares. Other available clustering methods include TWINSPAN, self-organising maps, DBSCAN and Dirichlet multinomial mixtures (DMM). DMM were specifically developed to analyse MG data but can be equally useful for other sequencing-based omic datasets.

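# Load the dataset (first column holds the row names; remaining columns must be numeric)
data <- read.csv("mydata.csv", row.names=1)

# Perform K-means clustering
set.seed(123)  # K-means starts from random centres; fix the seed for reproducibility
k <- 3  # number of clusters
km <- kmeans(data, centers=k, nstart=25)  # keep the best of 25 random starts

# View the cluster assignments
head(km$cluster)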
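
Since K-means minimises the within-cluster sum of squares, one common heuristic for choosing k is to plot that quantity against a range of k values and look for an "elbow". The sketch below is illustrative only: it assumes the numeric columns of the built-in `iris` dataset in place of `mydata.csv`, and the range of k values and the number of random starts are arbitrary choices.

```r
# Illustrative data: numeric columns of built-in iris (an assumption)
x <- scale(iris[, 1:4])

set.seed(42)  # K-means starts from random centres; fix the seed for reproducibility

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers=k, nstart=25)$tot.withinss)

# Look for the "elbow" where adding clusters stops reducing the sum of squares much
plot(ks, wss, type="b",
     xlab="Number of clusters (k)",
     ylab="Total within-cluster sum of squares")
```

The elbow is a visual heuristic and is often ambiguous, so it is best treated as a guide rather than a formal test.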

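# Load the package
library(DirichletMultinomial)

# Load the count table (samples in rows, features in columns; integer counts)
counts <- as.matrix(read.csv("mydata.csv", row.names=1))

# Fit Dirichlet multinomial mixture model with 3 components
model <- dmn(counts, k=3, verbose=FALSE)

# View the cluster assignments
head(mixture(model, assign=TRUE))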