20.1 Transformations to account for statistical assumptions
Most of the statistical techniques applied to omic datasets have a set of requirements, known as assumptions, that the analysed data must meet for statistical models to make accurate and unbiased predictions. Violating these assumptions can lead to biased and inaccurate results, so statistical tests and methods should be selected and applied carefully, taking into account the specific assumptions required for each analysis.
The specific assumptions required vary depending on the type of statistical analysis being performed, but some common examples include:
- Normality: The assumption that the distribution of the data is approximately normal.
- Independence: The assumption that the observations are independent of each other.
- Homogeneity of variance: The assumption that the variance of the data is the same across all levels of the independent variable.
- Linearity: The assumption that there is a linear relationship between the independent and dependent variables.
- Randomness: The assumption that the data was obtained randomly and that there are no systematic biases in the sample selection process.
- Stationarity: The assumption that the statistical properties of the data do not change over time.
Unfortunately, biological datasets rarely meet these assumptions. However, the original values can be transformed so that the modified values conform better to those assumptions. Some of the most typical transformations include the following (a short R sketch follows the list):
- Log transformation: This transformation is used to reduce the effect of extreme values in the data and to stabilize the variance. It is often used when the data is highly skewed or when the relationship between the variables is multiplicative rather than additive.
- Square root transformation: This transformation is similar to the log transformation, but is less extreme. It is often used when the data is moderately skewed and when the variance increases with the mean.
- Box-Cox transformation: This is a more general transformation that allows for a range of power transformations to be applied to the data. It is used when the data is highly skewed or when the variance is not constant across the range of values.
- Arcsine transformation: This transformation is used for data that is bounded between 0 and 1, such as proportions or percentages. It is used to stabilise the variance and to improve the normality of the data.
- Rank transformation: This transformation involves converting the data to ranks, which can be useful for non-parametric tests. It is often used when the data is highly skewed or when there are outliers.
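As a quick illustration, the simpler of these transformations can be applied with base R functions. The snippet below is a minimal sketch using a made-up vector of strictly positive values; the Box-Cox case is covered in the worked example that follows.
# Made-up vector of strictly positive abundance values (illustration only)
x <- c(3.2, 0.8, 15.6, 1.1, 42.0)
log_x  <- log(x)                  # log transformation
sqrt_x <- sqrt(x)                 # square root transformation
asin_p <- asin(sqrt(x / sum(x)))  # arcsine transformation of proportions (bounded between 0 and 1)
rank_x <- rank(x)                 # rank transformation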
Transforming data to meet the normality assumption
In this example, we first load the multivariate dataset and use the Shapiro-Wilk test to check each variable for normality. We then loop through the variables and, whenever the Shapiro-Wilk test indicates that a variable is not normally distributed, estimate the optimal Box-Cox lambda with the boxcox function from the MASS package and apply the corresponding power transformation. Finally, we run the Shapiro-Wilk test again on the transformed variables.
# Load the multivariate dataset
data <- read.csv("mydata.csv")
# Check for normality of each variable using the Shapiro-Wilk test
apply(data, 2, function(x) shapiro.test(x)$p.value)
# Apply a Box-Cox transformation to each variable that is not normally distributed
library(MASS)
data_transformed <- data
for (i in 1:ncol(data)) {
  if (shapiro.test(data[, i])$p.value < 0.05) { # if not normal
    # Estimate the optimal lambda (requires strictly positive values)
    bc <- boxcox(data[, i] ~ 1, plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]
    # Apply the power transformation with the estimated lambda
    data_transformed[, i] <- if (lambda == 0) log(data[, i]) else (data[, i]^lambda - 1) / lambda
  }
}
# Check for normality again on the transformed variables
apply(data_transformed, 2, function(x) shapiro.test(x)$p.value)
20.1.1 Transformations to account for compositional data
When dealing with multi-omic datasets, it’s important to determine whether the measurements represent absolute or relative values. While HG provides qualitative information about host genomes, MG, HT, and MT provide quantitative information that depends on the amount of sequencing performed and is therefore compositional. Consequently, raw quantitative values of genome abundance or gene expression cannot be compared directly across samples. The most common solution is to transform the raw abundance values into relative abundances. However, this transformation reduces the independence of individual variables, which is an assumption of many statistical methods. An alternative is to use ratio transformations such as the centred log-ratio (CLR), which better suit compositional data analysis by removing the effect of the constant-sum constraint on the covariance and correlation matrices.
Transforming data using centred log-ratio
In this example, we first load the multivariate dataset and then apply the CLR transformation using the clr() function from the compositions package. The CLR transformation is a commonly used method for analysing compositional data, where the data represent proportions or percentages that add up to a constant sum. It is designed to remove the constant-sum constraint and make the data amenable to standard multivariate statistical methods.
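The code for this step might look as follows; this is a minimal sketch in which mydata.csv is a placeholder abundance table with samples as rows and strictly positive values as columns (zeros would need to be replaced with a pseudocount or imputed before the log-ratio step).
# Load the multivariate dataset (samples as rows, features as columns)
data <- read.csv("mydata.csv")
# Apply the centred log-ratio (CLR) transformation
library(compositions)
data_transformed <- clr(data)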
The resulting data_transformed object will be a transformed version of the original dataset, with each variable now representing the logarithm of the ratio between that variable and the geometric mean of all variables in the sample. Note that the CLR transformation requires strictly positive data, so it may not be appropriate for all types of multivariate data.
20.1.2 Transformations to account for scaling
It’s important to keep in mind that scaling can also impact the results when analysing multi-omic data. For instance, in ME data, certain metabolites such as ATP may occur at much higher concentrations than other important metabolites like signalling molecules, potentially overshadowing significant differences in the less abundant yet meaningful metabolites. In transcriptomics, transcript length biases may cause similar distortions. One solution is to standardise the features using transformations such as z-score normalisation, but this can amplify the influence of measurement error, which is typically higher for less abundant features.
Transforming data using z-score normalisation
In this example, we first load the dataset and then use the apply() function to perform z-score normalisation on each variable. The apply() function applies a function to either the rows or columns of a matrix; the argument 2 specifies that the function should be applied to the columns (i.e., the variables) of the dataset. The function applied to each column calculates the z-score of each observation by subtracting the mean of the variable and dividing by its standard deviation, which centres the data around 0 and scales it to a standard deviation of 1.
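A minimal sketch of this step, again using mydata.csv as a placeholder for the feature table:
# Load the multivariate dataset
data <- read.csv("mydata.csv")
# Apply z-score normalisation to each column (variable)
data_norm <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
Base R’s scale() function performs the same centring and scaling in a single call.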
The resulting data_norm object will be a normalised version of the original dataset, with each variable now having a mean of 0 and a standard deviation of 1.