Downloading datasets

The BacDive website lets you easily search for all strains within a given taxonomic unit. For example, searching for “Bacillus” returns a paginated list of strains that you can then access, download and analyse further. All of that is manual work, though. BacDiveR automates this workflow.

Suppose we want to compare the optimal growth temperatures of strains from the taxon Bacillus halodurans with those of another taxon, Aneurinibacillus thermoaerophilus. bd_retrieve_taxon() will get you all data for a taxon in a single (albeit large) list of dataframes. Feel free to take a break while the computers do some work for you.
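A minimal sketch of the download step could look like the following. It assumes that bd_retrieve_taxon() accepts the taxon name as its first argument. The names taxon_1, taxon_2 and At_data are chosen here for illustration. The calls need network access to BacDive, so they are guarded and only run when the BacDiveR package is installed.

```r
taxon_1 <- "Bacillus halodurans"
taxon_2 <- "Aneurinibacillus thermoaerophilus"

# Assumption: the taxon name is the first argument of bd_retrieve_taxon().
# The downloads are skipped if BacDiveR is not installed.
if (requireNamespace("BacDiveR", quietly = TRUE)) {
  Bac_data <- BacDiveR::bd_retrieve_taxon(taxon_1)
  At_data <- BacDiveR::bd_retrieve_taxon(taxon_2)
}
```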

Extracting data fields

We wanted the growth temperatures, right? As with any other database field, you now need to determine its path within the list data structure that BacDiveR returned to you. Use either

  1. RStudio’s Environment > Data viewer, or
  2. str(Bac_data), or
  3. your web browser’s JSON viewer on the dataset’s URL: BacDive.DSMZ.de/api/bacdive/taxon/Bacillus/halodurans,

to find the $-marked path to the field you are interested in. In our example, it’s $culture_growth_condition$culture_temp$temp, which we’ll now use to extract that field from all entries in our downloaded datasets.

Multiple steps are necessary here, which could easily result in hard-to-read code if we used the regular assignment operator <-, intermediate variables and/or nested function calls. We will avoid this with the pipe operator %>% from the magrittr package. It indicates that

  1. an object is passed into a function as its first argument, and that
  2. the function’s output is “piped” into the next function.
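The extraction can be sketched as follows. Because the real download needs network access, this example runs on a small hypothetical list, mock_data, that only mimics the path named above. The function name extract_temps and the mock values are illustrative assumptions, not BacDive data.

```r
library(magrittr)  # provides the pipe operator %>%
library(purrr)     # provides map()

# Hypothetical stand-in for a downloaded dataset; only the path
# $culture_growth_condition$culture_temp$temp is mocked.
mock_data <- list(
  "1" = list(culture_growth_condition = list(culture_temp = list(temp = "30"))),
  "2" = list(culture_growth_condition = list(culture_temp = list(temp = "28-32"))),
  "3" = list(culture_growth_condition = list(culture_temp = list(temp = "37")))
)

extract_temps <- function(dataset, taxon_name) {
  temps <- dataset %>%
    # access the same path in every entry and convert it to a number;
    # strings that are not plain numbers (e.g. ranges) become NA
    map(~ suppressWarnings(as.numeric(.x$culture_growth_condition$culture_temp$temp))) %>%
    # flatten the list of results into a plain numeric vector
    unlist(use.names = FALSE)
  data.frame(temp_C = temps, taxon = rep(taxon_name, length(temps)))
}

temperature_mock <- extract_temps(mock_data, "Mock taxon")
```

With real downloads, the same function would be applied once per taxon, for example extract_temps(Bac_data, "Bacillus halodurans"), to obtain dataframes such as temperature_Bac and temperature_At.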

Note the ~ .x prepended to the path $culture_growth_condition$culture_temp$temp! This is map()’s way of indicating that each element in the piped-in dataset will be accessed at that path.

Keep in mind that a growth temperature may also be given as a range (e.g. “28-32”). The above code turns such a range into NA, but you could instead compute the mid-point (mean) of each range.
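One possible way to compute such mid-points is sketched below. The helper range_midpoint is a hypothetical name. It assumes that ranges are written as two non-negative numbers separated by a hyphen, so a negative temperature such as “-5” would not be handled correctly and would yield NA.

```r
# Hypothetical helper: turn "28-32" into 30 and leave single values unchanged.
range_midpoint <- function(temp) {
  # split each string at the hyphen; a single value stays a one-element vector
  parts <- strsplit(as.character(temp), "-", fixed = TRUE)
  # average the numeric parts of each entry
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

midpoints <- range_midpoint(c("30", "28-32", "37"))
```

Here, midpoints holds 30, 30 and 37. Such a helper could replace the plain numeric conversion in the extraction step if ranges should be kept rather than dropped.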

Plotting the extracted data

Before visualising the data, we need to combine the two datasets into a single dataframe.

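library("ggplot2")
library("magrittr")  # provides the pipe operator %>%

# Stack both taxa into one dataframe and plot the temperatures per taxon
rbind(temperature_Bac, temperature_At) %>%
  ggplot(aes(x = taxon, y = temp_C)) +
  geom_boxplot(notch = TRUE) +          # notches approximate a 95% CI of the median
  geom_jitter(height = 0, alpha = 0.5)  # overlay the strains, jittered horizontally only
#> notch went outside hinges. Try setting notch=FALSE.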

And thus we find that Aneurinibacillus thermoaerophilus contains strains with different growth optima (note the groups of data points), reaching up into the 50-something °C range, as the “thermo” part of its name suggests. On the other hand, the Bacillus halodurans strains known to BacDive were found to grow best at a lower temperature of about 31.3 °C. Thanks to the notch in Aneurinibacillus thermoaerophilus’s box, we can also say that the median growth temperatures of these two taxa very likely differ, even before digging into the numbers.
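Digging into the numbers could start with a per-taxon summary. The sketch below uses a small dataframe of made-up values in the same two-column layout (taxon, temp_C) as the plotted data. The values and taxon labels are purely illustrative and are not BacDive results.

```r
# Hypothetical values for illustration only; not BacDive data.
temps <- data.frame(
  taxon = c("Taxon A", "Taxon A", "Taxon A", "Taxon B", "Taxon B", "Taxon B"),
  temp_C = c(30, 30, 37, 45, 50, 55)
)

# Median and mean growth temperature per taxon, ignoring missing values
medians <- tapply(temps$temp_C, temps$taxon, median, na.rm = TRUE)
means <- tapply(temps$temp_C, temps$taxon, mean, na.rm = TRUE)

# Full five-number summary (plus mean) per taxon
summaries <- tapply(temps$temp_C, temps$taxon, summary)
```

Applied to the real data, the same calls would take the combined dataframe built with rbind(temperature_Bac, temperature_At).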