Learning Objectives
- Load external data from a .csv file into a data frame.
- Describe what a data frame is.
- Summarize the contents of a data frame.
- Extract a vector from a data frame
We are studying the species repartition and weight of animals caught in plots in our study area. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxon | e.g. Rodent, Reptile, Bird, Rabbit |
plot_type | type of plot |
We are going to use the R function download.file()
to download the CSV file that contains the survey data from Figshare, and we will use read_csv()
to load into memory the content of the CSV file as an object of class data.frame
. Inside the download.file
command, the first argument is a character string with the source URL. The second argument is the intended destination on your local machine. You’ll need to have a folder on your machine called “data_raw” where you’ll download the file. So this command downloads a file from Figshare, names it “portal_data_joined.csv,” and adds it to a preexisting folder named “data_raw”.
download.file(url = "https://ndownloader.figshare.com/files/2292169",
destfile = "data_raw/portal_data_joined.csv")
If we haven’t already done so, we can type install.packages("tidyverse")
straight into the console. In fact, it’s better to write this in the console than in our script for any package, as there’s no need to re-install packages every time we run the script.
Then, to load the package type:
You are now ready to load the data:
This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name: surveys
.
In RStudio’s Environment
pane, this Data
object can now be inspected by clicking its name and the round >
button left of it. The latter provides an overview of all columns/variables grouped by data type/class: character
and numeric
.
Note
read_csv
assumes that fields are delineated by commas, however, in several countries, the comma is used as a decimal separator and the semicolon (;
) is used as a field delineator. If you want to read in this type of files in R, you can use theread_csv2
function. It behaves exactly likeread_csv
but uses different parameters for the decimal and the field separators. If you are working with another format, they can be both specified by the user. Check out the help forread_csv()
by typing?read_csv
to learn more. There is also theread_delim()
for in tab separated data files.In addition to the above versions of the csv format, you should develop the habits of looking at and record some parameters of your csv files. For instance, the character encoding, control characters used for line ending, date format (if the date is not splitted into three variables), and the presence of unexepected newlines are important characteristics of your data files. Those parameters will ease up the import step of your data in R.
Using the columns, we’re going to learn some common analysis functions later:
select()
: subset columnsfilter()
: subset rows on conditionsmutate()
: create new columns by using information from other columnsgroup_by()
and summarize()
: create summary statistics on grouped dataarrange()
: sort resultscount()
: count discrete valuesBut first, some theory.
Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. Tibbles are the tidyverse’s modern variant. Essentially, they are a bit nicer to work with.
Both can be created by hand, but most commonly they are generated by the read...
functions; in other words, when importing spreadsheets from your hard drive (or the web).
A data frame / tibble is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
We can see this when inspecting the structure of a data frame with the function str()
:
data.frame
ObjectsWe already saw how calling surveys
itself or str(surveys)
provided us with content and the structure of the data frame / tibble. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!
dim(surveys)
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow(surveys)
- returns the number of rowsncol(surveys)
- returns the number of columnshead(surveys)
- shows the first 6 rowstail(surveys)
- shows the last 6 rowsunique(surveys$colum_name)
- shows distinct values within a columnnames(surveys)
- returns the column names (synonym of colnames()
for data.frame
objects)rownames(surveys)
- returns the row namesstr(surveys)
- structure of the object and information about the class, length and content of each columnsummary(surveys)
- summary statistics for each columnNote: most of these functions are “generic”, they can be used on other types of objects besides data.frame
.
Our survey data frame has rows and columns (its 2 dimensions). If we want to extract some specific column we can do:
RStudio conveniently supports this $
-notation by autocompletion, which is the reason why column names should not contain spaces. If you have a dataset that does, use dataset[["column name"]]
.
Challenge
Based on the output of
str(surveys)
, can you answer the following questions?
- What is the class of the object
surveys
?- How many rows and how many columns are in this object?
- How many species have been recorded during these surveys?
Data Carpentry, 2014-2019.
Questions? Feedback?
Please start or join
a discussion/issue on GitHub.
If this lesson is useful to you, consider