Assigned Reading:

Chapters 1.1, 1.2, 2, 3, 9 and 10 in: Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer. DOI: Stanford Full Text

The sections on mutating and filtering joins in this vignette

Optional Reading:

Chapter 4 in: Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer. DOI: Stanford Full Text

RStudio has produced several helpful cheatsheets on data management and visualization with ggplot2 and tidyr. You may want to view or print one or several of them.

Before Class

Download this R script and save it in your code folder in the research project you created on the first day of class. You will need to click the ‘Raw’ button and then download the text file that appears. Be sure to remove the .txt extension that may have been added when you saved the file to your computer so that the file ends in .R. You may want to run the first part of the script that downloads the data to be sure that there will be no issues with data access during class.

Key Points

  • Data should be organized so that observations are in rows and attributes (variables) are in columns.
  • The tidyr and dplyr packages contain functions for manipulating data:
    • spread and gather (tidyr) change the format of data tables between wide and long layouts.
    • separate and unite (tidyr) split one variable into several or combine several variables into one.
    • filter (dplyr) creates a subset of a data table.
    • mutate (dplyr) creates new variables.
    • group_by and summarise (dplyr) are used to summarize observations according to the values of their variables.
  • The pipe operator %>% is used to link several operations together (i.e., function composition).
  • The ggplot2 package is a set of functions for visualizing data.
    • A ggplot2 plot contains three components: (1) the data, (2) the aesthetic mappings between variables and visual properties, and (3) layers describing how to display the observations.
    • The aesthetic mapping aes must minimally describe which variables define the plot coordinate space.
    • facet_wrap creates plots for specified data subsets.
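
The data-manipulation functions listed above can be chained with the pipe, as in the following minimal sketch. The samples table and all of its column names are hypothetical toy data made up for illustration; they are not part of the course data set.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per plot, one height column per year
samples <- data.frame(
  site_plot   = c("A_1", "A_2", "B_1", "B_2"),
  height_2015 = c(10, 12, 7, 9),
  height_2016 = c(11, 15, 8, 9),
  stringsAsFactors = FALSE
)

tidy <- samples %>%
  # gather: reshape from wide to long, one row per plot and year
  gather(key = "year", value = "height", height_2015, height_2016) %>%
  # separate: split the combined label into two variables
  separate(site_plot, into = c("site", "plot"), sep = "_") %>%
  # mutate: replace the year label with a numeric year
  mutate(year = as.numeric(sub("height_", "", year))) %>%
  # filter: keep only the rows with height greater than 7
  filter(height > 7)

# group_by + summarise: one summary row per site
site_means <- tidy %>%
  group_by(site) %>%
  summarise(mean_height = mean(height), n = n())
```

In tidyr 1.0.0 and later, gather and spread still work but are superseded by pivot_longer and pivot_wider.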

Manipulating data with dplyr

Open the R project that you created on the first day of class. When RStudio opens, open the R script file that you downloaded for today’s class (see above). This script downloads a data set from our course website and then proceeds to manipulate and summarise the data. The data set contains three tables from the Winter 2016 BIO46 class on lichen microorganisms:

  1. trees is a table where observations are trees on which lichens were collected. Columns describe each tree’s location, elevation, and species name.
  2. lichens is a table where each observation is a lichen that was collected. Columns describe the tree on which the lichen was collected, the team that collected the lichen, the collection date, and the lichen species name.
  3. algae is a table where each observation is a fragment of a lichen from which we attempted to amplify and sequence green algal ITS2 rDNA. We attempted to amplify algal DNA from four sections of each lichen in order to quantify algal diversity within a single lichen. If sequencing was successful, then the DNA sequence is given along with a code identifying its unique haplotype (GenotypeID). The TaxonID column indicates haplotypes that matched those reported in Werth and Sork (2010).
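
Because the three tables describe nested observations (fragments within lichens within trees), they can be linked with the mutating and filtering joins from the assigned vignette. The sketch below uses miniature stand-in tables. The key columns TreeID and LichenID, the TreeSpecies and Team values, and the use of NA for a failed sequence are all assumptions made for illustration; check the actual column names in the downloaded data.

```r
library(dplyr)

# Hypothetical miniature tables; TreeID and LichenID are assumed key names
trees <- data.frame(
  TreeID      = c("T1", "T2"),
  TreeSpecies = c("species_a", "species_b"),
  stringsAsFactors = FALSE
)
lichens <- data.frame(
  LichenID = c("L1", "L2", "L3"),
  TreeID   = c("T1", "T1", "T2"),
  Team     = c(1, 2, 2),
  stringsAsFactors = FALSE
)
algae <- data.frame(
  LichenID   = c("L1", "L1", "L2", "L3"),
  GenotypeID = c("G1", "G2", "G1", NA),  # NA assumed to mean sequencing failed
  stringsAsFactors = FALSE
)

# Mutating joins: add lichen and tree columns to every fragment
fragments <- algae %>%
  left_join(lichens, by = "LichenID") %>%
  left_join(trees, by = "TreeID")

# Filtering joins: keep or drop lichens based on matches in another table
successful  <- filter(algae, !is.na(GenotypeID))
sequenced   <- semi_join(lichens, successful, by = "LichenID")
unsequenced <- anti_join(lichens, successful, by = "LichenID")
```

A mutating join adds columns from the second table, whereas a filtering join only decides which rows of the first table to keep.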

The R script explores algal diversity among lichens and trees. Working in pairs, execute each line of code and then add a comment (using #) above each line of code with a description of what the code does.

When you have finished commenting the code, try to complete the following challenge:


How could you modify the code that creates the lichenXgeno table so that it displays the fraction of successful sequences belonging to each GenotypeID for each lichen?

Plotting data with ggplot2

Working with your partner, create the following plots using ggplot2:

  1. A plot that maps the locations where each lichen was collected and colors each point by the Chao1 richness estimator.
  2. A modified version of the above plot in which three panels are shown, one for each tree species.
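
As a reminder of the general ggplot2 pattern (data, aesthetic mappings, layers, and optional facets), here is a minimal sketch on hypothetical toy data. The pts table and its columns x, y, richness and group are invented for illustration and do not correspond to columns in the lichen tables.

```r
library(ggplot2)

# Hypothetical toy data: point coordinates, a numeric value, and a grouping variable
pts <- data.frame(
  x        = c(1, 2, 3, 1.5, 2.5, 3.5),
  y        = c(3, 1, 2, 2.5, 1.5, 3.5),
  richness = c(2, 5, 3, 4, 1, 6),
  group    = rep(c("a", "b", "c"), times = 2)
)

# Data + aesthetic mappings (position and color) + a point layer
p <- ggplot(pts, aes(x = x, y = y, color = richness)) +
  geom_point(size = 3)

# facet_wrap draws one panel per value of the grouping variable
p_facets <- p + facet_wrap(~ group)

print(p_facets)
```

Mapping a numeric variable to color gives a continuous color scale, while facet_wrap splits the same plot into one panel per level of the faceting variable.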