Introduction to R and RStudio
Why R and Not Another Statistical Program?
The breadth and complexity of statistical techniques continue to grow, making it difficult to learn the details of every method. Historically, techniques like analysis of variance (ANOVA) and regression developed in separate fields, and only later were they integrated into what is now called the generalized linear model. Since then, many additional methods have been developed, including those that relax the assumptions of earlier methods. Traditional software that provides menus for selecting a specific method was useful when the number of methods was manageable, but the continued development of new methods makes it very difficult to fit them all into a reasonably navigable set of menus. Clicking through menus also poses problems for reproducibility, because the steps taken are not recorded in a form that someone else can rerun. Most of the developing methods continue to fit nicely into, or are extensions of, the generalized linear model. For example, many methods in machine learning and artificial intelligence build on this generalized linear model.
So, while it may not be as easy to get started learning statistics with a programming language, it will pay off in the long run.
Statistical Programs versus Statistical Programming Language
Here are some qualities of menu-based statistical programs and statistical programming languages.
- Statistical Programs
  - fixed menus
  - limited procedures (at least in the menus)
  - lead to compartmentalizing models (e.g. ANOVA, regression, GLM)
- Statistical Programming Languages (SPLs)
  - Turing complete: if you can create an algorithm, you can program it
  - Very flexible
  - Integration of models: One model to rule them all!
R is a statistical programming language, which means it is a programming language designed specifically to do statistics. It is widely used across many fields of science, it is free to use, and you can almost always find an implementation of a method using R. For these reasons, and many more, I teach statistics using R.
Installing R and RStudio
The goal of this chapter is to get you up and running with the R statistical programming language and the RStudio integrated development environment.
If you are reading this because you are taking one of my courses, you must decide how you want to use R and RStudio for the course. You have two basic options:
- you can install them on your own computer, or
- you can use Auburn University's education virtual lab (VLab), online.
If you have a computer that you will be using consistently for this course, I recommend installing R and RStudio on that computer. Both are free and will be much easier to use if you install them directly on your computer. If you have decided to install the software on your computer, you can skip to the following video. Note that you must be a student at the university and have DUO set up to use VLab. If you think you want to use the virtual lab, watch this video:
Installing R
To install R, go to www.cran.r-project.org, select the appropriate operating system, and follow the instructions to install R. You must have R installed to use RStudio, so do this first.
Installing R on Windows
To install R on Windows, click on the “Download R for Windows” link on the CRAN page. On the next page click “base” under the Subdirectories heading. On the next page you will see a link entitled “Download R-4.2.2 For Windows” (or the latest version). Click that link to download R, and install it the way you would most software on Windows. Note that this page is also a good resource if you have problems installing R.
Installing R on Mac
To install R on Mac, click the “Download R for macOS” link on the CRAN page. You will likely see two links toward the top of the page that look similar to the following, one for each of the two types of processors available on macOS.
M1 chip:
R-4.2.2-arm64.pkg
Intel:
R-4.2.2.pkg
Select the one appropriate for your computer. If you do not know what type of chip you have, click on the Apple icon in the top left corner of your Mac and then click “About this Mac”. Under the processor heading you should see a string of characters. If it includes Intel, you have an Intel chip; if not, you likely have an M1 chip. Newer computers are more likely to have M1 than older ones.
Here is a video demonstrating the installation of R:
Installing RStudio
To install RStudio go to www.posit.co and follow the links to download the free desktop version of RStudio.
Here is a video demonstrating the installation of RStudio:
A Brief Tour of RStudio
RStudio is a powerful tool, but it can be a little intimidating at first. The video below is a quick tour of the software:
R as a Statistical Programming Language
To help you understand R, I describe some basic concepts important to understanding R as a statistical programming language (SPL). These concepts will hopefully help you organize what you are learning. This is important because you will not be able to memorize all of the things you need to do to use R. But having some general concepts should help you build a solid foundation of skills. This explanation will be a gross oversimplification of R, but it should be a good starting model that you can build on later.
Elements of Statistical Programming
An object is a thing that has one or more states, and one or more behaviors. Take, for example, your cell phone. It has many states, such as on or off, and many behaviors, such as making phone calls, sending texts, or surfing the web. Everything in R is an object. Objects in R are very similar to objects like your cell phone, in that they have states and behaviors. Our goal is to learn how to use these objects to help us do science.
There are basically two types of objects in R: data objects and function objects. Data objects store information, while function objects process or manipulate information.
Expressions
We use objects in R through expressions. An expression is simply a combination of objects that R can evaluate. So, we type something into R, R processes it and gives us the results. For example, if we type 1 + 2 into the R console, it will give us the result 3:
1 + 2
[1] 3
So, expressions are simply objects or combinations of objects submitted to R in a way R can evaluate them.
Basic Elements of a Good SPL
- a rich set of primitive expressions
- mechanisms for combining expressions into more complex expressions
- means of abstraction, which allow for naming and manipulating compound objects
Primitive Expressions
Everything in R is an object
Primitive objects are the simplest elements of a programming language, and include:
- primitive data
- primitive functions
They can be thought of as the basic building blocks for everything else in the language.
An expression is an input that the programming language can evaluate, and consists of function and data objects.
Primitive Data Types:
Data objects are the primary means of storing information in R. R has a few basic data types:
- Numeric - numeric
  - int - integers (1, 2)
  - num - real numbers (1.2, -3.1, 200.0)
- Character or string - character
  - "Hello world!", "Ten", 'Cat'
  - "This is a sentence, which is a string"
  - "10"
  - (in single or double quotes, as long as they match)
- Boolean or Logical - logical
  - TRUE or FALSE (use operators such as or, and, and not).
  - They will evaluate to numbers, where FALSE evaluates to zero and TRUE evaluates to one.
  - For example, if you enter TRUE + 1 you will get 2 in return.
mode(TRUE)
[1] "logical"
TRUE + 1
[1] 2
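If you want to check these types for yourself, the short sketch below may help. It uses only base R functions; the variable names are made up for illustration. One detail worth knowing: a whole number typed without the L suffix is stored as a real number, so the L is needed if you truly want an integer.

```r
# Check the type of a few primitive data objects (base R only).
# The names whole, real, word, and flag are illustrative.
whole <- 2L      # the L suffix asks R for an integer
real  <- 1.2     # a real number
word  <- "Ten"   # a character string
flag  <- TRUE    # a logical value

class(whole)     # "integer"
class(real)      # "numeric"
class(word)      # "character"
class(flag)      # "logical"

# A plain 2 (no L suffix) is stored as a real number, not an integer.
is.integer(2)    # FALSE
is.integer(2L)   # TRUE

# Logical values count as 0 and 1 in arithmetic.
sum(c(TRUE, FALSE, TRUE))   # 2
```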
Primitive Functions
R uses functions to do all computations. When you open R it loads the base R functions. You can do lots of things with the base R functions. Primitive functions are built into R. Below are some of the types of primitive functions and examples.
Operators
- Arithmetic Operators
  - +, -, *, /, ^
- Comparison (also called Boolean, Logical or Predicate) Operators
  - <, >, ==, <=, >=, !=
  - less than, greater than, equal to, less than or equal to, greater than or equal to, not equal to
  - return TRUE or FALSE
- Logical Operators
  - &, |, !
  - also return TRUE or FALSE
- Other functions
  - mode(), length(), sum(), sqrt(), log(), exp()
- Assignment operators (assignment will be discussed below)
  - <- is the preferred assignment operator: always use this one
  - = will also work, but can be confusing (note it is different from ==, the comparison operator)
  - -> is also an assignment operator, but we will not use it.
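To see a few of these operators and functions in action, you could type lines like the following into the console. This is only a small sample using numbers I picked for illustration; the comment after each line shows the value R returns.

```r
# Arithmetic operators
7 + 2   # 9
7 - 2   # 5
7 * 2   # 14
7 / 2   # 3.5
7 ^ 2   # 49

# Comparison operators return TRUE or FALSE
3 < 5    # TRUE
3 == 5   # FALSE
3 != 5   # TRUE

# Logical operators combine TRUE and FALSE values
(3 < 5) & (2 > 4)   # FALSE: both sides must be TRUE
(3 < 5) | (2 > 4)   # TRUE: at least one side is TRUE
!(3 < 5)            # FALSE: ! flips TRUE to FALSE

# A few other built-in functions
sqrt(16)              # 4
exp(0)                # 1
log(1)                # 0
length(c(4, 8, 15))   # 3
sum(c(4, 8, 15))      # 27
```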
Programming Languages are Not Forgiving
Syntactically valid expressions
Expressions must be syntactically valid. This means they must be organized in a way that R understands.
- syntax (form)
  - English: “cat dog boy” - not syntactically valid
  - English: “cat hugs boy” - syntactically valid
  - programming language:
    - "hi" 5 - not syntactically valid
    - 3.2*5 - syntactically valid
Semantically valid expressions
R statements must also be semantically valid. Semantics has to do with meaning.
- English: “I are hungry” - syntactically valid but semantic error
- programming language: 3 + "hi" - semantic error (you can’t use addition on character strings)
- Chomsky: “colorless green ideas sleep furiously”
This last statement is syntactically valid, but it does not make sense, so it is a semantic error.
In R you have to combine expressions in a way that R “understands” and this combination should be meaningful.
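If you are curious how R reports the two kinds of problems, here is one way you might look at them side by side. This is only a sketch: it uses the base R functions parse() and tryCatch(), which go a little beyond what you need at this point, and the names syntax_check and semantic_check are made up for illustration.

```r
# A syntax error: R cannot even read the text as an expression.
# parse() turns text into an expression without running it, and
# tryCatch() captures the error instead of stopping.
syntax_check <- tryCatch(
  parse(text = '"hi" 5'),
  error = function(e) "syntax error"
)
syntax_check

# A semantic error: the text is well formed, but R cannot evaluate it
# because addition is not defined for character strings.
semantic_check <- tryCatch(
  3 + "hi",
  error = function(e) conditionMessage(e)
)
semantic_check   # the error message R produced, stored as a string
```

In everyday use you would simply see the error message printed in the console; the point here is that the first expression fails before R ever tries to run it, while the second fails only when R tries to carry it out.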
Assignment
We will often want to save data in a variable. We can do that with assignment, which utilizes an assignment operator.
x <- 2
x
[1] 2
pet <- "dog"
pet
[1] "dog"
Assignments are special expressions that are composed of three parts, a name, an assignment operator, and an expression.
For the following assignment,
x <- 1:10
x is the name, <- is the assignment operator, and 1:10 is an expression. Names in R can be anything that includes letters, numbers, a period (.) or an underscore (_), as long as it begins with either a letter or a period. Here are some valid names, followed by some invalid ones:
# Valid
IQ
c3p0
Height_inches
weight.lbs
.hidden
# Invalid (you will get an error message)
_cat
1dog
%sales
Height-Inches
There are also some names that cannot be used because they are names of primitive R objects (e.g. if, for, else, in). Type ?reserved in the R console for a complete list.
There are at least two names that can be used but should not be: the letters T and F, which in R are short for TRUE and FALSE. Assigning something else to them can quietly break code that relies on these shortcuts.
T
[1] TRUE
F
[1] FALSE
There are at least three assignment operators, as mentioned above, but it is commonly recommended that you use <-, because it makes clear that you are taking some expression and putting it in an object. So we would say of the assignment x <- 1:10 that x gets the integers 1 through 10, suggesting that we are putting the integers into the object x.
Just about any expression can be passed to a name with the assignment operator.
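As a small illustration of that last point, the sketch below assigns the results of a few different expressions to names. The names total, root, and is_big are made up for this example.

```r
# R evaluates the expression on the right first,
# then stores the result under the name on the left.
total  <- 1 + 2 + 3          # total gets 6
root   <- sqrt(total + 10)   # a name can be used in a later expression; root gets 4
is_big <- total > 5          # the result of a comparison can be stored too; TRUE

total
root
is_big
```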
Combining Expressions
Complex Data Types
Scalars, Vectors, Matrices, and Arrays
A scalar is a single value such as:
1
[1] 1
or
"cat"
[1] "cat"
A vector is a one-dimensional series of values. For example, the integers 1 through 5 would be a vector of length 5. In R you can create a vector as follows:
series1_5 <- c(1, 2, 3, 4, 5)
Note in R there really are no scalars per se. To R a scalar is a vector of length one.
x <- 4
is.vector(x)
[1] TRUE
The is.vector() function tests if an object is a vector. The result lets us know the primitive object 4 is indeed a vector.
length(x)
[1] 1
And its length is one.
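The same checks work on longer vectors and on character data. The sketch below is just an illustration; the name words is made up, and series1_5 is recreated so the example can be run on its own.

```r
# A numeric vector of length five
series1_5 <- c(1, 2, 3, 4, 5)
is.vector(series1_5)   # TRUE
length(series1_5)      # 5
sum(series1_5)         # 15

# A character vector of length two
words <- c("cat", "dog")
is.vector(words)       # TRUE
length(words)          # 2

# A single string is still a vector of length one
length("cat")          # 1
```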
Lists are the most complex primitive type of data object. A list is a series of any type of object. For example, we might want to record some personal information.
personalInfo <- list(
  name = "Rosalind",
  age = 6,
  pet_names = c("Sparkles", "Mr. Bingo Clakerson", "scruffy"),
  favorite_colors = c("pink", "purple")
)
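Once a list exists, you can ask R what it contains. The sketch below is one way to do that with base R; it rebuilds a shorter version of the list under the made-up name personal_info so that it runs on its own, and it uses the $ operator to pull out an element by name.

```r
# A shortened version of the list above, rebuilt so this example is self-contained.
personal_info <- list(
  name = "Rosalind",
  age = 6,
  pet_names = c("Sparkles", "Mr. Bingo Clakerson", "scruffy")
)

length(personal_info)             # 3: the list holds three objects
names(personal_info)              # "name" "age" "pet_names"

# Each element keeps its own type and length.
personal_info$name                # "Rosalind"
class(personal_info$age)          # "numeric"
length(personal_info$pet_names)   # 3
```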
Data frames are special types of lists that have the same number of values in each of the series in the list. We will use these very often for data analysis. Each row in a data frame is a different unit and each column is a different variable. So, in a data frame each column has the same number of rows, and each row has the same number of columns.
class_info <- data.frame(
  name = c("Rosalind", "Emily", "Drake"),
  age = c(6, 7, 5),
  height = c(46, 48, 44)
)
class_info
      name age height
1 Rosalind   6     46
2    Emily   7     48
3    Drake   5     44
Notice that the data frame class_info is an object that contains other objects. If we want to use one of the objects inside a data frame, we can do so by letting R know where to find that object using the $ operator. So, if we wanted to see the ages in the class_info data frame we could do so by:
class_info$age
[1] 6 7 5
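Because a column pulled out with $ is just a vector, you can hand it to other functions. The sketch below recreates the class_info data frame so it can be run on its own, and then uses a few base R functions to summarize it.

```r
# Recreate the data frame so this example is self-contained.
class_info <- data.frame(
  name = c("Rosalind", "Emily", "Drake"),
  age = c(6, 7, 5),
  height = c(46, 48, 44)
)

nrow(class_info)          # 3 rows (units)
ncol(class_info)          # 3 columns (variables)
is.list(class_info)       # TRUE: a data frame is a special kind of list

mean(class_info$age)      # 6
max(class_info$height)    # 48
```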
Grouping Homogeneous Data Types
- combining scalars: c()
- combining expressions: {}
- combining vectors: cbind(), rbind()
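Here is a brief sketch of each of these in use. The object names are made up for illustration, and the values are borrowed from the class example earlier in the chapter.

```r
# c() combines single values into a vector
ages    <- c(6, 7, 5)
heights <- c(46, 48, 44)

# cbind() places vectors side by side as columns
by_column <- cbind(ages, heights)
dim(by_column)   # 3 2 (three rows, two columns)

# rbind() stacks vectors as rows
by_row <- rbind(ages, heights)
dim(by_row)      # 2 3 (two rows, three columns)

# Braces group several expressions into one;
# the value of the group is the value of its last expression.
result <- {
  a <- 2
  b <- 3
  a * b
}
result           # 6
```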
Complex Functions
- Vectorization
- Nested Functions
- Loops and Conditional execution
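The three ideas listed above can be sketched in a few lines of base R. This is only an illustration under assumptions of my own: the heights are the values from the class example, 2.54 is the number of centimeters in an inch, and the cutoff of 45 inches is arbitrary.

```r
heights_in <- c(46, 48, 44)

# Vectorization: the operation is applied to every element at once.
heights_cm <- heights_in * 2.54

# Nested functions: the inner function is evaluated first,
# and its result is passed to the outer function.
rounded_mean <- round(mean(heights_cm), 1)
rounded_mean   # 116.8

# A loop with conditional execution: count heights above 45 inches.
n_tall <- 0
for (h in heights_in) {
  if (h > 45) {
    n_tall <- n_tall + 1
  }
}
n_tall   # 2

# The same count written in a vectorized way.
sum(heights_in > 45)   # 2
```

The vectorized version at the end is usually shorter and easier to read than the loop, which is one reason vectorization is used so often in R.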
Abstraction
- Assignment
Data Abstraction
Functional Abstraction
Anatomy of a Function
name <- function(arg_1, arg_2, ...) {
  expression_1
  expression_2
  ...
  output <- expression_3
  return(output)
}
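As a concrete instance of this template, here is a small function of my own invention. The name inches_to_cm and its arguments are hypothetical, chosen only to show how the parts fit together.

```r
# A hypothetical function that follows the template:
# a name, arguments, expressions in the body, and a returned output.
inches_to_cm <- function(inches, cm_per_inch = 2.54) {
  output <- inches * cm_per_inch
  return(output)
}

# Calling the function with a single value
inches_to_cm(10)                # 25.4

# Because multiplication is vectorized, it also works on a whole vector.
inches_to_cm(c(46, 48, 44))
```

The second argument has a default value, so you only need to supply it when you want something other than 2.54.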
Some R Resources
Below I give links to R resources if you desire to learn more about R.
A good next step is Roger Peng’s book R Programming for Data Science, which can be read free online at https://bookdown.org/rdpeng/rprogdatascience/. You can also download a pdf or epub of the book at https://leanpub.com/rprogramming. Both these links also have links to purchase a printed copy if that works better for you.