Overview
Teaching: 45 min
Exercises: 15 min
Questions
How do I read data into R?
How do I assign variables?
What is a data frame?
How do I calculate simple statistics like mean and median?
Where can I get help?
How can I plot my data?
Objectives
Read tabular data from a file into a program.
Perform operations on a data frame of data.
Display simple graphs.
We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets. The data sets are stored in comma-separated values (CSV) format. Each row holds the observations for just one patient. Each column holds the inflammation measured in a day, so we have a set of values in successive days. The first few rows of our first file look like this:
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
We want to:
- Load data into memory,
- Calculate the average value of inflammation per day across all patients, and
- Plot the results.
To do all that, we’ll have to learn a little bit about programming.
Let’s import the file called
inflammation.csv into our R environment.
To import the file, first we need to tell our computer where the file is.
We do that by choosing a working directory, which is possible using
setwd("~/Desktop/FAIR-R/"). However, a more elegant way is creating an RStudio project:
New project > Existing Directory > Browse to the folder you created during the setup.
.Rproj-fying your other work
If you have already worked with R before, but not with RStudio projects, repeat the above steps (New project > Existing Directory) to upgrade a few of your existing work directories. Note how double-clicking a resulting
.Rproj file launches RStudio right in the correct working directory.
Setting the working directory correctly is very important in R.
If it’s not, we’ll often get errors that files are not where we expect them to be.
That’s why the .Rproj files are more useful than using setwd().
Now, we can load the downloaded data file into R:
(dat <- read.csv(file = "inflammation.csv", header = FALSE))
read.csv(...) is a function call that asks R to run the function read.csv; the <- operator assigns its output to the variable dat.
The parentheses ( ) around the whole line instruct R to not only do the above, but also
show the result in the console.
read.csv has two arguments: the name of the file we want to read, and whether the first line of the file contains names for the columns of data.
The filename needs to be a character string (or string for short), so we put it in quotes. Setting the second argument, header, to FALSE indicates that the data file does not have column headers. In the case of our inflammation.csv example, R auto-generates column names in the sequence V1 (for “variable 1”), V2, and so on, until V40.
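The effect of header = FALSE can be tried without the lesson's data. The following is a minimal sketch that writes a tiny headerless CSV to a temporary file and reads it back; the values and the names tmp and demo are made up for illustration.

```r
# Write a small headerless CSV to a temporary file (made-up values, not the lesson data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("0,0,1,3", "0,1,2,1", "0,1,1,3"), tmp)

# header = FALSE: every line is treated as data and R generates the column names
demo <- read.csv(file = tmp, header = FALSE)

names(demo)  # "V1" "V2" "V3" "V4"
dim(demo)    # 3 rows, 4 columns
```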
Other Options for Reading CSV Files
read.csv actually has many more arguments that you may find useful when importing your own data in the future. You can learn more about these options in this supplementary lesson.
Loading Data with Headers
What happens if you forget to put
header = FALSE? The default value is
header = TRUE, which you can check with
help(read.csv). What do you expect will happen if you leave the default value? Before you run any code, think about what will happen to the first few rows of your data frame, and its overall size. Then run the following code and see if your expectations agree:
read.csv(file = "inflammation.csv")
Reading Different Decimal Point Formats
Depending on the country you live in, your standard can use the dot or the comma as decimal mark. Also, different devices or software can generate data with different decimal points. Take a look at
?read.csv and write the code to load a file called
commadec.txt that has numeric values with commas as decimal mark, separated by semicolons.
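One possible approach, sketched on a made-up temporary file rather than on commadec.txt itself, is to set the sep and dec arguments of read.csv. Whether header should be TRUE or FALSE depends on the real file; header = FALSE here is an assumption that only fits this demo data.

```r
# Made-up file: semicolon-separated fields, comma as the decimal mark
tmp <- tempfile(fileext = ".txt")
writeLines(c("1,5;2,25", "3,0;4,75"), tmp)

# sep sets the field separator, dec sets the decimal mark
commadec_demo <- read.csv(file = tmp, sep = ";", dec = ",", header = FALSE)

commadec_demo$V1  # 1.5 3.0
commadec_demo$V2  # 2.25 4.75
```

The related function read.csv2 uses these two settings as its defaults.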
A function will perform its given action on whatever value is passed to the argument(s).
For example, in this case if we provided the name of a different file to the argument file,
read.csv would read that instead.
We’ll learn more about the details of functions and their arguments in the next lesson.
Since we wrapped the whole line in parentheses, the console will display the full contents of the file
inflammation.csv, which we read in and assigned to the variable dat.
Try it out.
We can think of a variable as a container with a name, such as
subject_id that contains one or more values.
We can create a new variable and assign a value to it using the assignment operator <-.
Once a variable is created, we can use the variable name to refer to the value it was assigned. The variable name now acts as a tag.
To see the value of a variable, we can print it by typing the name of the variable and hitting Return (or Enter).
In general, R will print to the console any object returned by a function or operation
unless we assign it to a variable. To print the result of an assignment as well,
wrap the whole expression in ( ) parentheses.
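The printing rules described here can be sketched as follows. The variable names weight_kg and weight_lb are illustrative, and 2.2 is used as an approximate conversion factor.

```r
# Assigning a value prints nothing
weight_kg <- 57.5

# Typing the variable name prints its value
weight_kg                        # 57.5

# The result of an operation is printed when it is not assigned
2.2 * weight_kg                  # 126.5

# Wrapping an assignment in parentheses assigns and prints in one step
(weight_lb <- 2.2 * weight_kg)   # 126.5
```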
Variable Naming Conventions
Historically, R programmers have used a variety of conventions for naming variables. The . character in R can be a valid part of a variable name; thus an assignment such as weight_kg <- 57.5 could just as easily have been written weight.kg <- 57.5. This is often confusing to R newcomers who have programmed in languages where . has a more significant meaning. Today, most R programmers 1) start variable names with lowercase letters, 2) separate words in variable names with underscores, and 3) use only lowercase letters, underscores, and numbers in variable names. The book R Packages includes a chapter on this and other style considerations.
When you assign a value to a variable, R only stores the value, not the calculation you used to create it. This is an important point if you’re used to the way a spreadsheet program automatically updates linked cells. Let’s look at an example.
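A minimal sketch of this behaviour, using the illustrative variables weight_kg and weight_lb and an approximate conversion factor of 2.2:

```r
weight_kg <- 57.5

# weight_lb stores the number 126.5, not the formula that produced it
weight_lb <- 2.2 * weight_kg

# Changing weight_kg afterwards does not update weight_lb
weight_kg <- 100

weight_kg  # 100
weight_lb  # still 126.5
```

Unlike a linked spreadsheet cell, weight_lb only changes if the calculation is run again.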
For large data sets it is convenient to use the function
head to display only the first few rows of data.
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1  0  0  1  3  1  2  4  7  8   3   3   3  10   5   7   4   7   7  12  18
2  0  1  2  1  2  1  3  2  2   6  10  11   5   9   4   4   7  16   8   6
3  0  1  1  3  3  2  6  2  5   9   5   7   4   5   4  15   5  11   9  10
4  0  0  2  0  4  2  2  1  6   7  10   7   9  13   8   8  15  10  10   7
5  0  1  1  3  3  1  3  5  2   4   4   7   6   5   3  10   8  10   6  17
6  0  0  1  2  2  4  2  1  6   4   7   6   6   9   9  15   4  16  18  12
  V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
1   6  13  11  11   7   7   4   6   8   8   4   4   5   7   3   4   2   3
2  18   4  12   5  12   7  11   5  11   3   3   5   4   4   5   5   1   1
3  19  14  12  17   7  12  11   7   4   2  10   5   4   2   2   3   2   2
4  17   4   4   7   6  15   6   4   9  11   3   5   6   3   3   4   2   3
5   9  14   9   7  13   9  12   6   7   7   9   6   3   2   2   4   2   0
6  12   5  18   9   5   3  10   3  12   7   8   4   7   3   5   4   4   3
  V39 V40
1   0   0
2   0   1
3   1   1
4   2   1
5   1   1
6   2   1
Now that our data are loaded into R, we can start doing things with them.
First, let’s ask what type of thing dat is.
The output tells us that dat is a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with. Data frames are very useful for storing data and you will use them frequently when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.
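One way to ask this question is the base R function class. The sketch below uses a small made-up data frame called demo in place of the lesson's data; dim is added to show the number of rows and columns.

```r
# A small made-up data frame: rows are observations, columns are variables
demo <- data.frame(V1 = c(0, 0, 0), V2 = c(0, 1, 1), V3 = c(1, 2, 1))

class(demo)  # "data.frame"
dim(demo)    # 3 rows, 3 columns
```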
What if we need the average inflammation of each patient, or the average for each day?
avg_patient_inflammation <- rowMeans(dat)
And to obtain the average inflammation of each day we will need to calculate the mean of all of the columns of the data frame.
avg_day_inflammation <- colMeans(dat)
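To see what the two functions return, here is a small sketch on a made-up data frame with two patients (rows) and three days (columns). The name demo and its values are illustrative only.

```r
# Two patients (rows), three days (columns); made-up values
demo <- data.frame(V1 = c(0, 0), V2 = c(2, 4), V3 = c(4, 8))

# One mean per row, i.e. per patient
avg_patient_demo <- rowMeans(demo)
avg_patient_demo  # values 2 and 4

# One mean per column, i.e. per day
avg_day_demo <- colMeans(demo)
avg_day_demo      # values 0, 3 and 6
```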
The common row- or column-wise mean() operations have the above shortcuts. But what if you need to perform other calculations? In base R, the family of apply functions will help you do that; see ?apply. You can also learn about the modern approach with purrr.
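As a brief sketch of apply on a made-up data frame (demo and the result names are illustrative): the second argument, MARGIN, is 1 to work across rows and 2 to work across columns, and the third is the function to run.

```r
# Two patients (rows), three days (columns); made-up values
demo <- data.frame(V1 = c(0, 0), V2 = c(2, 4), V3 = c(4, 8))

# MARGIN = 1: apply max to each row (patient)
max_per_patient <- apply(demo, 1, max)
max_per_patient   # values 4 and 8

# MARGIN = 2: apply median to each column (day)
median_per_day <- apply(demo, 2, median)
median_per_day    # values 0, 3 and 6
```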
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few of R’s plotting features.
Let’s take a look at
avg_day_inflammation using the function plot.
Above, we gave the function
plot a vector of numbers corresponding to the average inflammation per day across all patients.
plot created a scatter plot where the y-axis is the average inflammation level and the x-axis is the order, or index, of the values in the vector, which in this case correspond to the 40 days of treatment.
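The same kind of plot can be sketched on a small made-up data frame. The names demo and avg_day_demo and the axis labels are illustrative; xlab and ylab are optional arguments that replace the default axis labels.

```r
# Two patients (rows), three days (columns); made-up values
demo <- data.frame(V1 = c(0, 0), V2 = c(2, 4), V3 = c(4, 8))

# Average per day (one value per column)
avg_day_demo <- colMeans(demo)

# Given a single numeric vector, plot puts the index on the x-axis
# and the values on the y-axis
plot(avg_day_demo, xlab = "Day (index)", ylab = "Average inflammation")
```

When run outside an interactive session, R usually writes the figure to a file named Rplots.pdf instead of opening a plot window.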
The result is roughly a linear rise and fall, which is suspicious: based on other studies, we expect a sharper rise and slower fall.
So, either there’s a mistake in our calculations or something is wrong with our data.
In the next episode, we’ll create some functions to help us
Create a plot showing the average inflammation for each patient across all days.
Objects are created on demand whenever a value is assigned to them.
Applying the mean calculation across rows or columns has shortcuts, but other statistics may need apply.
Use plot to create simple visualizations.