Starting with data

Learning Objectives

Load external data from a .csv file into a data frame.

Describe what a data frame is.

Summarize the contents of a data frame.

Extract a vector from a data frame.

Presentation of the Survey Data

We are studying the species distribution and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:

Column	Description
record_id	Unique id for the observation
month	month of observation
day	day of observation
year	year of observation
plot_id	ID of a particular plot
species_id	2-letter code
sex	sex of animal (“M”, “F”)
hindfoot_length	length of the hindfoot in mm
weight	weight of the animal in grams
genus	genus of animal
species	species of animal
taxon	e.g. Rodent, Reptile, Bird, Rabbit
plot_type	type of plot

We are going to use the R function download.file() to download the CSV file that contains the survey data from Figshare, and we will use read_csv() to load the content of the CSV file into memory as a data frame (more precisely a tibble, the tidyverse’s variant of a data frame). Inside the download.file() command, the first argument is a character string with the source URL. The second argument is the intended destination on your local machine. You’ll need a folder called “data_raw” in your working directory to receive the file. So this command downloads a file from Figshare, names it “portal_data_joined.csv”, and adds it to a preexisting folder named “data_raw”.

download.file(url = "https://ndownloader.figshare.com/files/2292169",
              destfile = "data_raw/portal_data_joined.csv")
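download.file() cannot write into a folder that does not exist. As a minimal sketch, assuming the folder should sit in the current working directory, it can be created from R before downloading:

```r
# Create the destination folder only if it is missing
# (assumes the current working directory is the project folder)
if (!dir.exists("data_raw")) {
  dir.create("data_raw")
}

# Confirm that the folder is now available
dir.exists("data_raw")
```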

If we haven’t already done so, we can type install.packages("tidyverse") straight into the console. In fact, it’s better to write this in the console than in our script for any package, as there’s no need to re-install packages every time we run the script.

Then, to load the package, type:

## load the tidyverse packages
library(tidyverse)

You are now ready to load the data:

surveys <- read_csv("data_raw/portal_data_joined.csv")

This statement doesn’t print the data because, as you might recall, assignments don’t display anything (read_csv() may still print a message describing the column types it guessed). If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name: surveys.

In RStudio’s Environment pane, the surveys object now appears under Data. Clicking its name opens it in a viewer, and clicking the round > button to its left expands an overview of all columns/variables with their data type/class (here character and numeric).

Note

read_csv() assumes that fields are delimited by commas. However, in several countries the comma is used as a decimal separator and the semicolon (;) is used as a field delimiter. If you want to read this type of file into R, you can use the read_csv2() function. It behaves exactly like read_csv() but uses different defaults for the decimal and field separators. If you are working with another format, both can be specified by the user. Check out the help for read_csv() by typing ?read_csv to learn more. There is also read_delim(), which handles other delimiters such as tabs.
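The difference between these readers can be tried without any file on disk. The sketch below uses small made-up text snippets, wrapped in I() so that readr treats them as literal data rather than as file paths:

```r
library(readr)

# Semicolon-separated text with decimal commas (made-up example data)
semicolon_text <- I("plot_id;weight\n1;40,5\n2;8,2\n")
semicolon_data <- read_csv2(semicolon_text)

# Tab-separated text, read by naming the delimiter explicitly
tab_text <- I("plot_id\tweight\n1\t40.5\n2\t8.2\n")
tab_data <- read_delim(tab_text, delim = "\t")

# Both readers end up with the same numeric weight column
semicolon_data$weight
tab_data$weight
```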

In addition to the above versions of the CSV format, you should develop the habit of looking at and recording some parameters of your CSV files. For instance, the character encoding, the control characters used for line endings, the date format (if the date is not split into three variables), and the presence of unexpected newlines are important characteristics of your data files. Knowing these parameters will ease the step of importing your data into R.
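Once these parameters are known, they can be passed to read_csv() instead of being left to guesswork. The following is a sketch on a tiny made-up file (the column names and the day/month/year format are assumptions chosen for illustration, not properties of the survey data):

```r
library(readr)

# Write a tiny made-up CSV file whose dates use day/month/year order
csv_path <- tempfile(fileext = ".csv")
writeLines(c("record_id,date", "1,16/07/1977", "2,01/08/1977"), csv_path)

# Declare the encoding and the date format explicitly
records <- read_csv(
  csv_path,
  locale = locale(encoding = "UTF-8"),
  col_types = cols(
    record_id = col_integer(),
    date = col_date(format = "%d/%m/%Y")
  )
)

# The date column is parsed as Date values rather than kept as text
records$date
```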

Using the columns, we’re going to learn some common analysis functions later:

select(): subset columns
filter(): subset rows on conditions
mutate(): create new columns by using information from other columns
group_by() and summarize(): create summary statistics on grouped data
arrange(): sort results
count(): count discrete values
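As a brief preview, the sketch below chains these functions on a small made-up tibble (the values are invented and are not taken from the survey data):

```r
library(dplyr)

# A tiny made-up table with one missing weight
toy <- tibble::tibble(
  species_id = c("DM", "DM", "PF", "PF", "PF"),
  sex = c("M", "F", "F", "M", "F"),
  weight = c(40, 44, 8, 7, NA)
)

toy_summary <- toy %>%
  filter(!is.na(weight)) %>%                     # keep rows with a weight
  mutate(weight_kg = weight / 1000) %>%          # new column from an old one
  select(species_id, weight_kg) %>%              # keep two columns
  group_by(species_id) %>%                       # one group per species
  summarize(mean_weight_kg = mean(weight_kg)) %>%
  arrange(desc(mean_weight_kg))                  # heaviest species first

# Number of rows per species, including the row with a missing weight
toy_counts <- count(toy, species_id)

toy_summary
toy_counts
```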

But first, some theory.

What are data frames (and tibbles)?

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. Tibbles are the tidyverse’s modern variant. Essentially, they are a bit nicer to work with.

Both can be created by hand, but most commonly they are generated by the read... functions; in other words, when importing spreadsheets from your hard drive (or the web).
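Creating both by hand shows how closely related they are. In this sketch the column values are made up; the point is that a tibble carries extra classes but still counts as a data frame:

```r
library(tibble)

# The same two columns, once as a data frame and once as a tibble
by_hand_df <- data.frame(species_id = c("DM", "PF"), weight = c(40, 8))
by_hand_tbl <- tibble(species_id = c("DM", "PF"), weight = c(40, 8))

class(by_hand_df)            # "data.frame"
class(by_hand_tbl)           # "tbl_df" "tbl" "data.frame"
is.data.frame(by_hand_tbl)   # TRUE
```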

A data frame / tibble is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, a data frame can comprise a numeric, a character, and a logical vector.

We can see this when inspecting the structure of a data frame with the function str():

str(surveys)
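The same two properties, one type per column and equal column lengths, can be checked on a small hand-built data frame. This is a sketch with made-up values; is_adult is a hypothetical column that does not exist in the survey data:

```r
# Three columns of three different types, all of length 3
survey_sample <- data.frame(
  weight = c(40.5, 8.2, 22),
  species_id = c("DM", "PF", "NL"),
  is_adult = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column is a vector with a single class and the same length
column_classes <- sapply(survey_sample, class)
column_lengths <- sapply(survey_sample, length)
column_classes
column_lengths

# Columns whose lengths cannot be reconciled are rejected with an error
mismatch <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error: columns differ in length"
)
mismatch
```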

Inspecting `data.frame` Objects

We already saw how calling surveys itself or str(surveys) provided us with content and the structure of the data frame / tibble. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

Size:
- dim(surveys) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow(surveys) - returns the number of rows
- ncol(surveys) - returns the number of columns
Content:
- head(surveys) - shows the first 6 rows
- tail(surveys) - shows the last 6 rows
- unique(surveys$column_name) - shows distinct values within a column
Names:
- names(surveys) - returns the column names (synonym of colnames() for data.frame objects)
- rownames(surveys) - returns the row names
Summary:
- str(surveys) - structure of the object and information about the class, length and content of each column
- summary(surveys) - summary statistics for each column

Note: most of these functions are “generic”; they can be used on other types of objects besides data.frame.
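These functions can be tried without the downloaded file. The sketch below uses iris, a small data frame that ships with R, in place of surveys, and ends with two calls on plain vectors to illustrate the note about generic functions:

```r
# Size
dim(iris)     # 150 rows and 5 columns
nrow(iris)
ncol(iris)

# Content
head(iris)
tail(iris)
unique(iris$Species)

# Names
names(iris)

# Summary
str(iris)
summary(iris)

# Generic functions also accept other objects, such as plain vectors
head(letters)
summary(c(1, 2, 3, 4))
```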

Extracting vectors from data frames

Our surveys data frame has rows and columns (it has 2 dimensions). To extract a specific column as a vector, we can use the $ operator:

surveys$species_id

RStudio conveniently autocompletes column names after $. This notation is awkward when column names contain spaces, which is one reason to avoid them. If you have a dataset whose column names do contain spaces, use dataset[["column name"]].
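Both notations can be compared on a small made-up data frame. In this sketch, check.names = FALSE is what lets data.frame() keep a column name that contains a space:

```r
# A made-up table with one column name that contains a space
plots <- data.frame(
  plot_id = c(1, 2, 3),
  "plot type" = c("Control", "Rodent Exclosure", "Control"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# $ works directly for names without spaces
plots$plot_id

# [[ ]] takes the name as a string, so spaces are not a problem
plots[["plot type"]]

# $ still works with spaces if the name is wrapped in backticks
plots$`plot type`
```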

Challenge

Based on the output of str(surveys), can you answer the following questions?

What is the class of the object surveys?

How many rows and how many columns are in this object?

How many species have been recorded during these surveys?

Answer

str(surveys)

## * "Classes ... 'tbl' and 'data.frame'"
## * "34786 obs" = rows
## * "13 variables" = columns

unique(surveys$species_id)

## * how many species: 48
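Rather than counting the printed values by eye, the number of distinct values can be computed by wrapping unique() in length(). A sketch on a short made-up vector:

```r
# Six made-up observations covering three species codes
species_seen <- c("DM", "PF", "DM", "NL", "PF", "DM")

unique(species_seen)           # the distinct codes
length(unique(species_seen))   # how many distinct codes there are
```

The same pattern applied to the survey data would be length(unique(surveys$species_id)).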

Data Carpentry, 2014-2019.


Data Carpentry contributors
