Tidying & Packaging Datasets

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • What are possible forms that a dataset can have?

  • What advantages and disadvantages do these forms have?

  • Which features make a dataset more or less reusable?

  • How can we add datasets to R packages?

Objectives
  • Use tidyr::gather() to convert wide data to its long form.

  • Find self-descriptive variable/column names.

Tidying our inflammation datasets

In the preparations episode, we read in, visualised, centered, and rescaled a dataset about inflammation in 60 patients measured over 40 days. Let’s pretend we now need to tidy it up for publication alongside the package we have been constructing.

Staying in spreadsheets

It is not always possible to abandon Excel & Co. and switch to R or Python. In order to learn more about how to organise data in spreadsheets, please read DataCarpentry.org/spreadsheets-socialsci and/or DataCarpentry.org/spreadsheet-ecology-lesson.

First, run library(usethis) and use_data_raw() and note the Next: instructions in the console. In particular, copy the inflammation.csv file to data-raw and start a tidy-inflammation.R file there to which you add the necessary lines of the following code examples. Think about which of them are purely for interactive checks of the tidying process! Don’t save those to the “data creation script”.

dat <- read.csv(file = "inflammation.csv", header = FALSE)
head(colnames(dat), 10)
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"
head(rownames(dat), 10)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

You probably noticed that neither “patients” nor “days” is listed anywhere. We should therefore label the data first, starting with pseudonyms for the patients. If this were real data from real human patients, we would have obtained their written, informed consent to pseudonymous data publication before collecting it, as part of our study’s approval by the ethics committee. We would also have pre-registered our study, e.g. on AllTrials.net.

patient_ID <- paste("patient", seq(nrow(dat)), sep = "_")
dat_labelled <- cbind(dat, patient_ID)
head(dat_labelled)
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1  0  0  1  3  1  2  4  7  8   3   3   3  10   5   7   4   7   7  12  18
2  0  1  2  1  2  1  3  2  2   6  10  11   5   9   4   4   7  16   8   6
3  0  1  1  3  3  2  6  2  5   9   5   7   4   5   4  15   5  11   9  10
4  0  0  2  0  4  2  2  1  6   7  10   7   9  13   8   8  15  10  10   7
5  0  1  1  3  3  1  3  5  2   4   4   7   6   5   3  10   8  10   6  17
6  0  0  1  2  2  4  2  1  6   4   7   6   6   9   9  15   4  16  18  12
  V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
1   6  13  11  11   7   7   4   6   8   8   4   4   5   7   3   4   2   3
2  18   4  12   5  12   7  11   5  11   3   3   5   4   4   5   5   1   1
3  19  14  12  17   7  12  11   7   4   2  10   5   4   2   2   3   2   2
4  17   4   4   7   6  15   6   4   9  11   3   5   6   3   3   4   2   3
5   9  14   9   7  13   9  12   6   7   7   9   6   3   2   2   4   2   0
6  12   5  18   9   5   3  10   3  12   7   8   4   7   3   5   4   4   3
  V39 V40 patient_ID
1   0   0  patient_1
2   0   1  patient_2
3   1   1  patient_3
4   2   1  patient_4
5   1   1  patient_5
6   2   1  patient_6

Now we have labelled our observations with patient_ID as the column/variable name and patient_… as the values. Next we are going to “gather” the observations, meaning that we:

  1. define day as the common key of all the V1, V2, etc. variables,
  2. define a name for the newly created column, in which the values from the messy variables will be gathered, and
  3. specify which variables we want to gather into key-value pairs.
dat_long <- tidyr::gather(dat_labelled,
                         key = "day",
                         value = "inflammatory_response",
                         V1:V40)
tail(dat_long)
     patient_ID day inflammatory_response
2395 patient_55 V40                     1
2396 patient_56 V40                     1
2397 patient_57 V40                     1
2398 patient_58 V40                     1
2399 patient_59 V40                     0
2400 patient_60 V40                     0

Notice the last row number? Because we gathered the observations into one column, with one observation per row, that number should be the product of the initial 60 rows and 40 columns (60 × 40 = 2400). We should verify this with:

nrow(dat_long)
[1] 2400

This time it matches, but data frames also have row names, which may look numeric yet don’t necessarily represent a contiguous row count. That’s why we should rely only on nrow() for counting rows. You can read more about this on PerfectlyRandom.org: “Never trust the row names…”.
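A minimal illustration of the difference, using a small made-up data frame (not part of our inflammation dataset): subsetting keeps the original row names, so the last row name and the row count diverge.

```r
df <- data.frame(x = 1:5)
df_subset <- df[df$x > 2, , drop = FALSE]  # keeps rows 3, 4 and 5

tail(rownames(df_subset), 1)  # "5" -- the original label, not a count
nrow(df_subset)               # 3  -- the actual number of rows
```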

Because the days shouldn’t remain labelled with Vs, we also convert them to numbers. Beware of the seemingly obvious route via as.numeric(as.factor(...)): factor levels sort alphabetically (“V1”, “V10”, “V11”, …, “V2”, …), so with our 40 days the resulting numbers would not match the actual days. Stripping the “V” prefix instead is safe:

dat_long$day <- as.numeric(sub("V", "", dat_long$day))
str(dat_long)
'data.frame':	2400 obs. of  3 variables:
 $ patient_ID           : Factor w/ 60 levels "patient_1","patient_10",..: 1 12 23 34 45 56 58 59 60 2 ...
 $ day                  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ inflammatory_response: int  0 0 0 0 0 0 0 0 0 0 ...
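To see the alphabetical-sorting pitfall of a factor detour concretely, here is a tiny made-up example with three day labels:

```r
days <- c("V1", "V2", "V10")
as.numeric(as.factor(days))     # 1 3 2  -- levels sort as "V1", "V10", "V2"
as.numeric(sub("V", "", days))  # 1 2 10 -- stripping the prefix keeps the order
```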

The structure of our inflammation dataset is now a lot tidier than before and we can finally package it up in an R-supported format. Rename the dataframe to something more self-descriptive than dat..., using for example inflammation <- dat_long, and then run use_data(inflammation).

In case you need to “widen” a long/tidy dataset again:

Use the opposite of gather(): spread(). You can learn more about both in the Data Carpentry’s “Data Manipulation” lesson or the “R for Data Science” book.
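A minimal sketch of spreading, using a made-up two-patient example (requires the tidyr package):

```r
library(tidyr)

long <- data.frame(
  patient_ID = rep(c("patient_1", "patient_2"), each = 2),
  day = rep(1:2, times = 2),
  inflammatory_response = c(0, 1, 0, 2)
)

# one row per patient, one column per day
wide <- spread(long, key = day, value = inflammatory_response)
```

In newer tidyr versions, pivot_wider() supersedes spread(), just as pivot_longer() supersedes gather().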

Documenting datasets

Just as good function documentation makes your code more accessible and reusable for others and your future self, datasets benefit from documentation as well. Please read through r-pkgs.had.co.nz/data.html#documenting-data and keep in mind that the descriptive tags for datasets are not entirely the same as those for function documentation.

Challenge: Document the dataset!

  1. Which file should we create to document our dataset?
  2. Which items of our dataset should be documented?
  3. How could the documentation file look in the end?

Solutions

  1. R/inflammation.R
  2. patient_ID, day & inflammatory_response
  3. Like this:
    #' Inflammation In Patients During Study...
    #'
    #' @source Pre-registration: \url{http://www.alltrials.net/study...}.
    #'   Method: \url{...}.
    #' @format A data frame with the variables/columns:
    #' \describe{
    #'  \item{patient_ID}{A factor representing the different patients}
    #'  \item{day}{Number of days after start of the study}
    #'  \item{inflammatory_response}{Measured daily as described in the methods section of ...}
    #' }
    "inflammation"
    

    Note that this is exactly the “LaTeX-like” syntax mentioned in the “Writing Documentation” section. We could reduce some of those \commands to the simpler Markdown format with usethis::use_roxygen_md().

After saving the raw data, the data creation script, the tidy dataset, and the documentation file, what is left to do in our little personal package?

Challenge: Last packaging steps (for now; an incomplete list)

Solutions

  1. Run roxygen2::roxygenise().
  2. Run RStudio’s Build > Check. Do any errors remain?
  3. After Build > Install and Restart, start writing your paper as a vignette.
  4. Commit the accumulated changes. Which commits would you suggest?

Checking CSVs for structural validity

This was an average case of cleaning a dataset. On the one hand, we had to label columns (patient_ID, day, inflammatory_response). On the other hand, the data itself was easily tidied. You have probably encountered many other .csv files which were messy in their own ways. Two useful tools to find structural problems in CSVs are CSVLint.io and try.goodtables.io.

Challenge: Search for structural problems in both the initial inflammation.csv and in your tidied version

How many errors or warnings does each check produce? Then run write.csv(dat_long, "inflammation-tidy.csv") and check that version as well. Are any warnings left? If so, which one(s)? How would you approach eliminating them?

Solutions

  1. CSVLint.io should report “Structural problem: Possible title row detected” and try.goodtables.io should report “1x Blank Header”.
  2. View inflammation-tidy.csv in RStudio, or a text editor and read ?write.csv.
  3. An empty "" at the beginning of the first row indicates that we need to set row.names = FALSE. Row names are a bit deceptive, because they will often be numbers, but the number you see with tail(...) is not necessarily equal to the number of rows. Always use nrow(...) to count rows. We recommend moving any genuinely useful information about a row / observation into its own column.
  4. Depending on your operating system, you could set eol = "\r\n" to meet the most common CSV format specification (RFC 4180).
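The row-name fix from step 3 can be sketched as follows, writing a small made-up data frame to a temporary file just to show the effect on the header line:

```r
df <- data.frame(patient_ID = c("patient_1", "patient_2"),
                 inflammatory_response = c(0, 1))

path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)  # no leading "" in the header line

readLines(path, n = 1)  # the header now starts with "patient_ID"
```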

Reusing and citing public datasets

This is a bonus episode that shows how FAIR datasets, together with the right mindset of finding and reusing existing R code can help you in scientific data analyses.

Key Points

  • Spreadsheets incentivise the wide data format, which can spread a single variable’s values across many columns.

  • If variables in different datasets are comparable methodologically, their variable/column names should be spelled exactly alike.

  • (Raw) Data can be packaged by itself (and/or alongside related cleaning and analysis code).