Learning Objectives

  • Load external data from a .csv file into a data frame.
  • Describe what a data frame is.
  • Summarize the contents of a data frame.
  • Extract a vector from a data frame

Presentation of the Survey Data

We are studying the species repartition and weight of animals caught in plots in our study area. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:

Column Description
record_id Unique id for the observation
month month of observation
day day of observation
year year of observation
plot_id ID of a particular plot
species_id 2-letter code
sex sex of animal (“M”, “F”)
hindfoot_length length of the hindfoot in mm
weight weight of the animal in grams
genus genus of animal
species species of animal
taxon e.g. Rodent, Reptile, Bird, Rabbit
plot_type type of plot

We are going to use the R function download.file() to download the CSV file that contains the survey data from Figshare, and we will use read_csv() to load into memory the content of the CSV file as an object of class data.frame. Inside the download.file command, the first argument is a character string with the source URL. The second argument is the intended destination on your local machine. You’ll need to have a folder on your machine called “data_raw” where you’ll download the file. So this command downloads a file from Figshare, names it “portal_data_joined.csv,” and adds it to a preexisting folder named “data_raw”.

download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "data_raw/portal_data_joined.csv")

If we haven’t already done so, we can type install.packages("tidyverse") straight into the console. In fact, it’s better to write this in the console than in our script for any package, as there’s no need to re-install packages every time we run the script.

Then, to load the package type:

## load the tidyverse packages
library(tidyverse)

You are now ready to load the data:

surveys <- read_csv("data_raw/portal_data_joined.csv")

This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name: surveys.

In RStudio’s Environment pane, this Data object can now be inspected by clicking its name and the round > button left of it. The latter provides an overview of all columns/variables grouped by data type/class: character and numeric.

Note

read_csv assumes that fields are delineated by commas, however, in several countries, the comma is used as a decimal separator and the semicolon (;) is used as a field delineator. If you want to read in this type of files in R, you can use the read_csv2 function. It behaves exactly like read_csv but uses different parameters for the decimal and the field separators. If you are working with another format, they can be both specified by the user. Check out the help for read_csv() by typing ?read_csv to learn more. There is also the read_delim() for in tab separated data files.

In addition to the above versions of the csv format, you should develop the habits of looking at and record some parameters of your csv files. For instance, the character encoding, control characters used for line ending, date format (if the date is not splitted into three variables), and the presence of unexepected newlines are important characteristics of your data files. Those parameters will ease up the import step of your data in R.

Using the columns, we’re going to learn some common analysis functions later:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • arrange(): sort results
  • count(): count discrete values

But first, some theory.

What are data frames (and tibbles)?

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. Tibbles are the tidyverse’s modern variant. Essentially, they are a bit nicer to work with.

Both can be created by hand, but most commonly they are generated by the read... functions; in other words, when importing spreadsheets from your hard drive (or the web).

A data frame / tibble is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.

We can see this when inspecting the structure of a data frame with the function str():

str(surveys)

Inspecting data.frame Objects

We already saw how calling surveys itself or str(surveys) provided us with content and the structure of the data frame / tibble. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

  • Size:
    • dim(surveys) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(surveys) - returns the number of rows
    • ncol(surveys) - returns the number of columns
  • Content:
    • head(surveys) - shows the first 6 rows
    • tail(surveys) - shows the last 6 rows
    • unique(surveys$colum_name) - shows distinct values within a column
  • Names:
    • names(surveys) - returns the column names (synonym of colnames() for data.frame objects)
    • rownames(surveys) - returns the row names
  • Summary:
    • str(surveys) - structure of the object and information about the class, length and content of each column
    • summary(surveys) - summary statistics for each column

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Extracting vectors from data frames

Our survey data frame has rows and columns (its 2 dimensions). If we want to extract some specific column we can do:

surveys$species_id

RStudio conveniently supports this $-notation by autocompletion, which is the reason why column names should not contain spaces. If you have a dataset that does, use dataset[["column name"]].

Challenge

Based on the output of str(surveys), can you answer the following questions?

  • What is the class of the object surveys?
  • How many rows and how many columns are in this object?
  • How many species have been recorded during these surveys?

Answer

str(surveys)

## * "Classes ... 'tbl' and 'data.frame'""
## * "34786 obs" = rows
## * "13 variables" = columns

unique(surveys$species_id)

## * how many species: 48

If this lesson is useful to you, consider subscribing to the Carpentries newsletter or making a donation to support the work of The Carpentries.