Assigned Reading:
Appendix A from: Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A. and Smith, G. M. 2009. Mixed Effects Models and Extensions in Ecology with R. Springer. Here’s the whole book or you can download just the appendix.
summary(M1)
, drop1(M1, test="F")
, anova(M1)
- Is drop1()
the best method?step(M1)
- easy one-line command; AIC-based backwards selectionIn this example, we’ll use linear regression to examine whether there is any sort of relationship beween Damage Type (DT) diversity and herbivorized leaf area on fossil leaves.
First, let’s read in the data:
# Read in a data file from the data folder on the BIO 202 GitHub repository
data <- read.csv("https://raw.githubusercontent.com/FukamiLab/BIO202/master/data/03-B-permian.csv")
data <- data[order(data$dts),]
Now, we can simply plot DT diversity against herbivorized leaf area:
plot(data$dts~data$rem)
Not a particularly clear relationship there! We can try running a simple linear regression:
fit <- lm(dts~0+rem, data=data)
summary(fit)
##
## Call:
## lm(formula = dts ~ 0 + rem, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5759 0.0000 0.6205 1.6602 10.9268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## rem 19.876 1.255 15.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.879 on 1025 degrees of freedom
## Multiple R-squared: 0.1966, Adjusted R-squared: 0.1959
## F-statistic: 250.9 on 1 and 1025 DF, p-value: < 2.2e-16
Very low p-value (< 2.2e-16), very unconvincing R-squared (0.10).
We can try plotting a different line for each site (as in Fig. A.3 from the reading, on page 544).
fit <- lm(dts~0+rem+site, data=data)
summary(fit)
##
## Call:
## lm(formula = dts ~ 0 + rem + site, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8717 -1.2556 -0.2946 0.6452 9.7026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## rem 11.35979 1.08391 10.480 <2e-16 ***
## siteCCP 1.25560 0.05452 23.031 <2e-16 ***
## siteMCF 0.25235 0.12777 1.975 0.0485 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 1023 degrees of freedom
## Multiple R-squared: 0.4717, Adjusted R-squared: 0.4702
## F-statistic: 304.5 on 3 and 1023 DF, p-value: < 2.2e-16
D1 <- data[data$site=="CCP",]
D2 <- data[data$site=="MCF",]
P1 <- predict(fit, newdata = D1)
P2 <- predict(fit, newdata = D2)
plot(data$dts~data$rem)
lines(D1$rem, P1, col="blue")
lines(D2$rem, P2, col="red")
We can try plotting a different line for each species instead of each site.
fit <- lm(dts~0+rem+sp, data=data)
D1 <- data[data$sp=="cau",]
D2 <- data[data$sp=="cev",]
D3 <- data[data$sp=="cta",]
D4 <- data[data$sp=="mgi",]
D5 <- data[data$sp=="mta",]
P1 <- predict(fit, newdata = D1)
P2 <- predict(fit, newdata = D2)
P3 <- predict(fit, newdata = D3)
P4 <- predict(fit, newdata = D4)
P5 <- predict(fit, newdata = D5)
plot(data$dts~data$rem)
lines(D1$rem, P1, co="red")
lines(D2$rem, P2, col="blue")
lines(D3$rem, P3, col="green")
lines(D4$rem, P4, col="purple")
lines(D5$rem, P5, col="yellow")
To be thorough, we can try adding all possible predictor variables, dropping them, and seeing what happens.
fit.full <- lm(dts~0+rem+sp+site+area, data=data)
drop1(fit.full, test="F")
## Single term deletions
##
## Model:
## dts ~ 0 + rem + sp + site + area
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1882.4 636.66
## rem 1 188.86 2071.3 732.76 102.236 < 2.2e-16 ***
## sp 4 506.35 2388.8 873.08 68.526 < 2.2e-16 ***
## site 0 0.00 1882.4 636.66
## area 1 470.84 2353.2 863.71 254.878 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now a run-of-the-mill GAM:
library(mgcv)
AM1 <- gam(dts ~ s(rem) + sp + site + s(area), data = data)
anova(AM1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## dts ~ s(rem) + sp + site + s(area)
##
## Parametric Terms:
## df F p-value
## sp 4 6.303 5.2e-05
## site 1 2.061 0.151
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(rem) 8.368 8.887 41.91 <2e-16
## s(area) 5.344 6.455 58.11 <2e-16
All predictors except “site” are significant.
Now a fancy GAM with cubic regression spline with shrinkage.
AM2 <- gam(dts ~ s(rem, bs = "cs") + sp + site + s(area, bs = "cs"), data = data)
anova(AM2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## dts ~ s(rem, bs = "cs") + sp + site + s(area, bs = "cs")
##
## Parametric Terms:
## df F p-value
## sp 4 4.75 0.000845
## site 1 0.29 0.590265
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(rem) 8.355 9.000 56.61 <2e-16
## s(area) 2.781 9.000 37.70 <2e-16
Values change but no difference in significance at p = 0.05. So let’s get rid of “site.”
AM3 <- gam(dts ~ s(rem, bs = "cs") + sp + s(area, bs = "cs"), data = data)
anova(AM3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## dts ~ s(rem, bs = "cs") + sp + s(area, bs = "cs")
##
## Parametric Terms:
## df F p-value
## sp 4 4.75 0.000845
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(rem) 8.355 9.000 57.1 <2e-16
## s(area) 2.781 9.000 37.7 <2e-16
We can plot the smoothing function.
plot(AM3)
drop1()
analysis above because some species were observed only in one of the two sites.