Assigned Reading:
Zuur, A. F., E. N. Ieno, and C. S. Elphick. 2010. A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1: 3-14. DOI: 10.1111/j.2041-210X.2009.00001.x
Homogeneity Y
Not all steps always needed
The code below relies on the phyloseq
package from Bioconductor. See these instructions for how to install it.
Zurr et al. recommends plotting your data using boxplots and dotcharts to detect outliers. Before removing suspected outliers, make sure they are actually outliers!
Since my data is multivariate, I used sequence count per sample for outlier detection in the following examples.
Boxplot
boxplot(sample_sums(merged_data), ylab = "Sequences per sample")
Cleveland dotchart
dotchart(sample_sums(merged_data), xlab = "Sequences per sample",
ylab = "Samples", labels = F)
Plotly - great visualization tool
seq_num <- data.frame(ID=sample_data(merged_data)$ID, Seq=sample_sums(merged_data))
dot_gg <- ggplot(seq_num, aes(x=Seq,y=ID)) + geom_point(col="coral") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
ggplotly(dot_gg)
Dotchart for multiple variables
Since most of the variables in my dataset are categorical, here I used raw data (sequence count) subsetted by bird species.
Z <- qpcR:::cbind.na(sample_sums(data.BTSA), sample_sums(data.CCRO), sample_sums(data.OBNT),
sample_sums(data.RCWA), sample_sums(data.SWTH), sample_sums(data.YWAR))
colnames(Z) <- c("BTSA", "CCRO", "OBNT", "RCWA", "SWTH", "YWAR")
dotplot(as.matrix(Z), groups = FALSE,
strip = strip.custom(bg = 'white',
par.strip.text = list(cex = 0.8)),
scales = list(x = list(relation = "free"),
y = list(relation = "free"),
draw = FALSE),
col = 1, cex = 0.5, pch = 16,
xlab = "Number of sequences per sample",
ylab = "Order of the data from text file")
Some statistical tests assume normal distributions, making it important to check the shape of your data. Plot a histogram with all of your data and then groups of your data.
“If you want to apply a statistical test to determine whether there is significant group separation in a discriminant analysis, however, normality of observations of a particular variable within each group is important”
Histogram
hist(sample_sums(merged_data), breaks = 100,col = "#a6bddb", xlab="Sequences per sample", main = ' ')
Q1: When should you let go of an outlier?
Q2: How can you check for independence using multivariate data?
“Hence, it is important to check whether there is dependence in the raw data before doing the analysis, and also the residuals afterwards. These checks can be made by plotting the response variable vs. time or spatial coordinates.”
Q3: To transform or not to transform? That is the question.
“There are three main reasons for a transformation: to reduce the effect of outliers (especially in covariates), to stabilize the variance and to linearize relationships. However, using more advanced techniques like GLS and GAMs, heterogenity and nonlinearity problems can be solved, making transformation less important.”
Q4: What are some data exploration techniques you have used?
Also, from Sandra: the PCNM (truncated distance matrix) method is discussed starting on page 244 of this book: https://link.springer.com/content/pdf/10.1007%2F978-1-4419-7976-6.pdf