Tidying & Packaging Datasets
Overview
Teaching: 45 min
Exercises: 15 min
Questions
What are possible forms that a dataset can have?
What advantages and disadvantages do these forms have?
Which features make a dataset more or less reusable?
How can we add datasets to R packages?
Objectives
Use tidyr::gather() to convert wide data to its long form.
Find self-descriptive variable/column names.
Tidying our inflammation datasets
In the preparations episode, we read in, visualised, centered and rescaled a dataset about inflammation in 60 patients measured over 40 days. Let’s pretend we need to tidy it up for publication alongside the package we have been constructing.
Staying in spreadsheets
It is not always possible to abandon Excel & Co. and switch to R or Python. In order to learn more about how to organise data in spreadsheets, please read DataCarpentry.org/spreadsheets-socialsci and/or DataCarpentry.org/spreadsheet-ecology-lesson.
First, run library(usethis) and use_data_raw(), and note the Next: instructions in the console. In particular, copy the inflammation.csv file to data-raw and start a tidy-inflammation.R file there, to which you add the necessary lines of the following code examples. Think about which of them are purely for interactive checks of the tidying process! Don’t save those to the “data creation script”.
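For reference, the one-time setup described above could look like the following sketch (run interactively in the package’s root directory; only the tidying code itself belongs in the data creation script):
library(usethis)  # provides use_data_raw() and, later, use_data()
use_data_raw()    # creates data-raw/ and lists it in .Rbuildignore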
dat <- read.csv(file = "inflammation.csv", header = FALSE)
head(colnames(dat), 10)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10"
head(rownames(dat), 10)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
You probably noticed that neither “patients” nor “days” is listed anywhere. Thus, we should label the data first, starting with pseudonyms for the patients. If this were real data from real human patients, we would have acquired their written and informed consent to the pseudonymous data publication before collecting it, and during approval of our study by the ethics committee. We would also have pre-registered our study, e.g. on AllTrials.net.
patient_ID <- paste("patient", sep = "_", seq(nrow(dat)))
dat_labelled <- cbind(dat, patient_ID)
head(dat_labelled)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 0 0 1 3 1 2 4 7 8 3 3 3 10 5 7 4 7 7 12 18
2 0 1 2 1 2 1 3 2 2 6 10 11 5 9 4 4 7 16 8 6
3 0 1 1 3 3 2 6 2 5 9 5 7 4 5 4 15 5 11 9 10
4 0 0 2 0 4 2 2 1 6 7 10 7 9 13 8 8 15 10 10 7
5 0 1 1 3 3 1 3 5 2 4 4 7 6 5 3 10 8 10 6 17
6 0 0 1 2 2 4 2 1 6 4 7 6 6 9 9 15 4 16 18 12
V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
1 6 13 11 11 7 7 4 6 8 8 4 4 5 7 3 4 2 3
2 18 4 12 5 12 7 11 5 11 3 3 5 4 4 5 5 1 1
3 19 14 12 17 7 12 11 7 4 2 10 5 4 2 2 3 2 2
4 17 4 4 7 6 15 6 4 9 11 3 5 6 3 3 4 2 3
5 9 14 9 7 13 9 12 6 7 7 9 6 3 2 2 4 2 0
6 12 5 18 9 5 3 10 3 12 7 8 4 7 3 5 4 4 3
V39 V40 patient_ID
1 0 0 patient_1
2 0 1 patient_2
3 1 1 patient_3
4 2 1 patient_4
5 1 1 patient_5
6 2 1 patient_6
Now we have labelled our observations with patient_ID as the column/variable name and patient_… as the values. Next we are going to “gather” the observations, meaning that we:
- define day as the common key of all the V1, V2, etc. variables,
- define a name for the newly created column, in which the values from the messy variables will be gathered, and
- specify which variables we want to gather into key-value pairs.
dat_long <- tidyr::gather(dat_labelled,
key = "day",
value = "inflammatory_response",
V1:V40)
tail(dat_long)
patient_ID day inflammatory_response
2395 patient_55 V40 1
2396 patient_56 V40 1
2397 patient_57 V40 1
2398 patient_58 V40 1
2399 patient_59 V40 0
2400 patient_60 V40 0
Notice the last row number? Because we gathered observations into one column, and thus one observation per row, that number should be the product of the initial 60 rows and 40 columns. However, we should verify this with:
nrow(dat_long)
[1] 2400
This time it matches, but dataframes also have row names, which may be numeric yet don’t necessarily represent a contiguous row count. That’s why we should only rely on nrow() for counting. You can read more about this on PerfectlyRandom.org: “Never trust the row names…”.
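To illustrate why (a hypothetical interactive check, not part of the data creation script): subsetting keeps the original row names, so the last row name can differ from the row count:
dat_sub <- dat_long[dat_long$inflammatory_response > 5, ]  # keep only some rows
tail(rownames(dat_sub), 1)  # a leftover row name from dat_long, not the row count
nrow(dat_sub)               # the reliable count of remaining rows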
Because the days shouldn’t remain labelled with Vs, we also convert them to numbers. Be careful here: converting with as.factor() followed by as.numeric() would number the days in the alphabetical order of the factor levels (V1, V10, V11, …, V2, V20, …) rather than chronologically. Stripping the V prefix before converting avoids this:
dat_long$day <- as.numeric(sub("V", "", dat_long$day))
str(dat_long)
'data.frame': 2400 obs. of 3 variables:
$ patient_ID : Factor w/ 60 levels "patient_1","patient_10",..: 1 12 23 34 45 56 58 59 60 2 ...
$ day : num 1 1 1 1 1 1 1 1 1 1 ...
$ inflammatory_response: int 0 0 0 0 0 0 0 0 0 0 ...
The structure of our inflammation dataset is now a lot tidier than before and we can finally package it up in an R-supported format. Rename the dataframe to something more self-descriptive than dat..., using for example inflammation <- dat_long, and then run use_data(inflammation).
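Put together, the final lines of data-raw/tidy-inflammation.R could then look like this sketch (use_data() saves the object as data/inflammation.rda):
inflammation <- dat_long
usethis::use_data(inflammation)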
In case you need to “widen” a long/tidy dataset again: use the opposite of gather(), namely spread(). You can learn more about both in Data Carpentry’s “Data Manipulation” lesson or the “R for Data Science” book.
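As a quick sketch of that reverse operation (using the dat_long dataframe from above), spread() turns the day values back into columns:
dat_wide <- tidyr::spread(dat_long, key = "day", value = "inflammatory_response")
head(colnames(dat_wide))  # patient_ID followed by one column per day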
Documenting datasets
Just as good function documentation makes your code more accessible and reusable for others and your future self, datasets benefit from documentation as well. Please read through r-pkgs.had.co.nz/data.html#documenting-data and keep in mind that the descriptive tags for datasets are not entirely the same as those for function documentation.
Challenge: Document the dataset!
- Which file should we create to document our dataset?
- Which items of our dataset should be documented?
- How could the documentation file look in the end?
Solutions
- R/inflammation.R
- patient_ID, day & inflammatory_response
- Like this:
#' Inflammation In Patients During Study...
#'
#' @source Pre-registration: \url{http://www.alltrials.net/study...}.
#' Method: \url{...}.
#' @format A data frame with the variables/columns:
#' \describe{
#'   \item{patient_ID}{A factor representing the different patients}
#'   \item{day}{Number of days after start of the study}
#'   \item{inflammatory_response}{Measured daily as described in the methods section of ...}
#' }
"inflammation"
Note that this is exactly the “LaTeX-like” syntax mentioned in the “Writing Documentation” section. We could reduce some of those \commands to the simpler Markdown format with usethis::use_roxygen_md().
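For example (a sketch, assuming use_roxygen_md() has been run for the package), the @source line could then use a Markdown autolink instead of \url{}:
#' @source Pre-registration: <http://www.alltrials.net/study...>.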
After saving the raw data, the data creation script, the tidy dataset, and the documentation file, what is left to do in our little personal package?
Challenge: Last packaging steps (for now; an incomplete list)
Solutions
- Run roxygen2::roxygenise() (see the console sketch below).
- Run RStudio’s Build > Check. Do any errors remain?
- After Build > Install and Restart, start writing your paper as a vignette.
- Commit the accumulated changes. Which commits would you suggest?
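If you prefer the console over RStudio’s Build pane, roughly equivalent commands (assuming the devtools package is installed) are:
roxygen2::roxygenise()  # regenerates the man/ pages and NAMESPACE
devtools::check()       # corresponds to Build > Check
devtools::install()     # corresponds to Build > Install and Restart (minus the restart)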
Checking CSVs for structural validity
This was an average case of cleaning a dataset. On the one hand, we had to label
columns (patient_ID, day, inflammatory_response). On the other hand, the data itself was easily
tidied. You have probably encountered many other .csv
files which were messy
in their own ways. Two useful tools to find structural problems in CSVs are
CSVLint.io and try.goodtables.io.
Challenge: Search for structural problems in both the initial inflammation.csv and in your tidied version
How many errors or warnings does that produce? Then run write.csv(dat_long, "inflammation-tidy.csv") and check that version. A few warnings may be left. If yes, which one(s)? How would you approach eliminating those?
Solutions
- CSVLint.io should report “Structural problem: Possible title row detected” and try.goodtables.io should report “1x Blank Header”.
- View inflammation-tidy.csv in RStudio or a text editor, and read ?write.csv.
- An empty "", at the beginning of the first row indicates that we need to set row.names = FALSE. Row names are a bit weird: they will often be numbers, but the number you see with tail(...) is not necessarily equal to the number of rows. Always use nrow(...) to count those. We recommend moving actually useful information about a row / observation into its own column.
- Depending on your operating system, you could also set eol = "\r\n" to meet the most common CSV format specification (RFC 4180), as shown in the sketch below.
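Putting those two fixes together, the export call could look like this sketch (row.names = FALSE drops the row names; eol = "\r\n" only matters on systems that do not already write such line endings):
write.csv(dat_long, "inflammation-tidy.csv", row.names = FALSE, eol = "\r\n")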
Reusing and citing public datasets
This is a bonus episode that shows how FAIR datasets, together with the right mindset of finding and reusing existing R code, can help you in scientific data analyses.
Key Points
Spreadsheets incentivise the wide data format, which may spread variables across columns.
If variables in different datasets are comparable methodologically, their variable/column names should be spelled exactly alike.
(Raw) Data can be packaged by itself (and/or alongside related cleaning and analysis code).