Learning R

It doesn’t have to be scary!

Elizabeth Oneita Davis

San Diego Zoo Global

2020-06-30

Welcome

This mini-course will be structured around four themes.

1) Establishing good workflow habits

2) Learning how to R

3) Performing data analysis

4) Graphing

Establishing good workflow habits

Every time you are working in R, it should be in a project

Projects are R's way of helping you keep all of your files in one place. It may seem like an unnecessary step, but trust me that it will make your life so much easier to get in the habit of working in R projects.

Go ahead and open RStudio. In the top right you should see a button. For me, my button looks like this:

r project

When you click on that button, it comes up with a drop down menu. One of the options is "New Project". Click on that.

You'll be taken through a screen that asks you a bunch of questions. Make sure you choose "Project", and then choose which directory you'll work out of. For now, go ahead and make it a folder on your desktop, named

"My First Gorgeous R Project"

(or similar)

Once you are in your R project, you will click the little icon in the top left that looks like this:

r script

You should see that the very first option that comes up is "R Script". Click on that, and you'll see a blank document (your R script) appear in your top left pane.

What is an R script?

R Scripts are where you keep all of your important code. They are also where you explain your workings and the code to yourself so that you can remember to do the same thing again later.

This is VERY important because you may set down a project and do other things for a few days. If that happens you may forget what you've done and end up doing things over again.

This is one of the central tenets of modern R:

Make your life as easy as possible!

To make your life as easy as possible, you should generally have R Scripts that look like this:

#12 March 2020: This R Script is an example of what a good R script looks like
##
1+1 #I can't remember what 1+1 is- this way, R will tell me!
## [1] 2

Another example

R Script example

The conventional wisdom for R scripts is that you should annotate them with comments (lines starting with a hash, #) "as though you may be hit on the head at any moment".
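
Here is a minimal sketch of what a script written in that spirit might look like. The scores and the object names (scores, avg_score) are made up purely for illustration; the point is only how the comments explain each step.

```r
# 12 March 2020: example of a commented R script
# Purpose: work out the average of three hypothetical survey scores

# Step 1: store the scores (made-up numbers, purely for illustration)
scores <- c(4, 7, 10)

# Step 2: store the average so I can reuse it later
avg_score <- mean(scores)

# Step 3: print it
avg_score
```

If you came back to this script after a few weeks, the comments alone should tell you what it does and why.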

How to use RStudio

RStudio has a lot of built-in tools that help with effective coding.


For example, RStudio will automatically close parentheses and quotation marks for you; however, you need to be careful that this doesn't cause mistakes.


RStudio will also show you when you make a mistake in your R script, by underlining the faulty code in red and showing a red "X" next to it.

R Script example of errors

You can see here where RStudio is telling you that your code is faulty.

You can also see that RStudio colors each part of your script according to what it is: one color for non-code "objects", another for comments, and another for "significant" commands like telling R which library to use.

Error Codes

"Start associating error messages with joy because it probably means you are about to learn something!"

What to do when you get an error message

Read the message. This seems self-explanatory but it is where you can gain some understanding of what went wrong.

As an example, what happens if you type b in the console and press enter?


You should see "Error: object 'b' not found"

This message has three parts

1) It tells us that it is an error

2) The location of the error, which is object "b"

3) The problem this mistake caused: there is no object b, because I haven't created it!
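
If you would like to see this happen without the red text, here is a small sketch that catches the error and keeps only its message. The object name not_yet_created is made up for this illustration, and using tryCatch() here is my own choice rather than something the course relies on.

```r
# Deliberately ask for an object that was never created.
# `not_yet_created` is a made-up name used only for this illustration.
msg <- tryCatch(
  not_yet_created,
  error = function(e) conditionMessage(e)  # keep just the message text
)
msg

# Creating the object first makes the error go away
not_yet_created <- 5
not_yet_created
```

The stored message names the missing object, which is exactly the clue you need to fix the problem.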

Sometimes errors are difficult to understand

If you can't understand the error message immediately, then you have several options:

1) Check out the help file for the function/command you are using, if applicable

2) Google it!

3) Ask for help from Stack Overflow (bearing in mind that it can be challenging) or from R-Ladies (if you are a woman)

Checking out the help file

The help file can be challenging to understand, so Googling may be the easiest option for you. However, it is important to be familiar with R help files.
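
As a quick sketch, these are the usual ways to pull up a help file from the console. I have used read.csv() as the example function; any function name works the same way.

```r
# Two equivalent ways to open the help file for read.csv()
?read.csv
help("read.csv")

# args() shows just the arguments and their default values,
# which is often the part of the help file you need most
args(read.csv)
```

In RStudio the help file opens in the "Help" tab of the bottom right pane.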

R help first bit

R help second bit

Moving on... What are packages?

Packages are extra functions or groups of functions that other people have created to help make everyone's life easier. They can be downloaded from CRAN (the R repository) or from GitHub.

This is very easy to do. Go ahead and run the following command, without the leading hash (I have hashed it out here so that the slides don't reinstall it):

#install.packages("devtools")

You now have the package "devtools", which will allow you to install packages from GitHub. Sometimes packages live on GitHub for somewhat mysterious reasons that I won't get into here.

Let's install a package from GitHub. Go ahead and run the following code:

#I have hashed it out so that I don't install it
#devtools::install_github("hadley/emo", force = TRUE)

You now have the "emo" (emoji) package installed! Try the following:

emo::ji("party")
## 🎉

You are ready to R! 💃

Getting Data into R

Go to the following site:

https://vincentarelbundock.github.io/Rdatasets/datasets.html

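and download the dataset “survival” as a CSV file, saving it in your project folder. read.csv() looks for the file in your working directory, which is the project folder when you are working in a project. We will read the file into R using the following command:

Hanoi <- read.csv("survival.csv")

When you read this into R, you should see the following: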

survival dataset

And over in your top right pane, you will see "Hanoi" listed. It tells you how many observations there are (your n) and how many variables you have.

hanoi title

As you can see, we use the arrow (<-) to assign our dataset to a name. Therefore, when you type "Hanoi" it will show you the dataset that you have loaded into R.

However, RStudio can save you from typing this command yourself. If you look at RStudio, you’ll note that in the top right corner there’s a box that has the tab “Environment”.

You’ll see that there’s a button that says “Import Dataset”. You can use this button to quickly import any dataset you desire. Try it now with the "survival" file you just read into R as “Hanoi”.
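
If you don't have the downloaded file to hand, here is a self-contained sketch of the same read-and-assign step. It rebuilds only the six rows that head(Hanoi) displays in these slides, saves them to a temporary CSV file, and reads that file back in. The names first_rows, csv_path and demo_data are placeholders of my own.

```r
# Rebuild the six rows that head(Hanoi) displays in these slides
first_rows <- data.frame(
  dose = c(117.5, 117.5, 235, 235, 470, 470),
  surv = c(44, 55, 16, 13, 4, 1.96)
)

# Write them to a temporary CSV file (a stand-in for your downloaded file)
csv_path <- tempfile(fileext = ".csv")
write.csv(first_rows, csv_path, row.names = FALSE)

# read.csv() takes the path to the file; the arrow stores the result
demo_data <- read.csv(csv_path)

nrow(demo_data)   # number of observations
ncol(demo_data)   # number of variables
```

These are the same two numbers that the "Environment" tab reports next to a dataset's name.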

A quick, basic lesson

Before we go any further, I must stress that R is very, very particular. If a command goes wrong, it is 99% [1] probable that you did not write the name of the dataset/vector/variable properly (names are case-sensitive, so "hanoi" is not "Hanoi"), OR that you left out a parenthesis, a comma, or some other form of punctuation where there should be one. The beauty of RStudio is that it’s very good at putting parentheses in the right places, but errors still do occur.

[1] Accurate math

R is, quite simply, an entirely new language. That being said, it operates in an intuitive manner. For instance, try the command we had previously in our R script:

1+1

If you press enter, you'll see that it executes the code and gives you the answer, which is

## [1] 2

Of course, we all know what 1 + 1 is (but imagine if we didn’t!).

Assigning equations to vectors

Now, what if we wanted to assign the value of the equation 1+1 to an object in R?

a <- 1+1
a
## [1] 2

You can see that the value of this expression is stored in the object a. Now, perhaps you can begin to see what kinds of things you can do with this object, e.g.

a + 56
## [1] 58

R is simple enough to understand if you think of it as small building blocks, such as a <- 1+1, being put together to comprise a whole (the action you are trying to accomplish).

This may not make sense right now, but (hopefully!) will by the end of this course.

Using R for data analysis


"Statistics [and data analysis] is a problem-solving and decision-making process that is fundamental to scientific inquiry and essential for making sound decisions."

Requires critical thinking (and practice!)

Taking a look at our dataset

Before starting analysis on a dataset, you need to familiarize yourself with the data. This can be done in R very easily. One command that's useful is the following, on our "Hanoi" dataset:

head(Hanoi)
##   X  dose  surv
## 1 1 117.5 44.00
## 2 2 117.5 55.00
## 3 3 235.0 16.00
## 4 4 235.0 13.00
## 5 5 470.0  4.00
## 6 6 470.0  1.96

As you can see, this shows you the top of the dataset (the "head", if you will), which is helpful if one wants to get a quick snapshot of the dataset.

The next command is:

names(Hanoi)
## [1] "X" "dose" "surv"

Our variables here are "dose" and "surv" ("X" is just the row number). To give you some context, this dataset is about the survival of rats after being given radiation doses. "surv" is the survival rate of the batches expressed as a percentage, while "dose" is the dose of radiation administered (in rads).

Now, it doesn't seem to make sense for our dataset to be called "Hanoi". How do you think we would go about renaming it?

survival <- Hanoi

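Now, you should see in your top right pane, under "Environment", the dataset survival, which is an exact copy of the Hanoi dataset. (Assigning to a new name copies rather than renames, so Hanoi is still there; rm(Hanoi) removes it if you no longer want it.)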

Summary statistics in R

This is very easy to do in R. Simply type the following:

summary(survival)
##        X              dose             surv
##  Min.   : 1.00   Min.   : 117.5   Min.   : 0.0060
##  1st Qu.: 4.25   1st Qu.: 293.8   1st Qu.: 0.1625
##  Median : 7.50   Median : 587.5   Median : 1.3300
##  Mean   : 7.50   Mean   : 654.6   Mean   :10.1250
##  3rd Qu.:10.75   3rd Qu.: 940.0   3rd Qu.:11.2800
##  Max.   :14.00   Max.   :1410.0   Max.   :55.0000

As you can see, this command gives you a nice little snapshot of the data, with means and medians. You can also find the mean for each specific variable by simply typing the following:

mean(survival$dose)
## [1] 654.6429

As you see, this mean is the same as the mean we see in the summary statistics we did with summary(survival).

You probably noticed the use of the "$" operator. This is your way of telling R where to "look".


When you type mean(survival$dose) you are telling R "I want the mean of the variable dose in the survival dataset."
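
To see the "$" operator on its own, here is a small sketch with a made-up data frame (pets, with invented names and weights), so that it doesn't give away the exercise coming up.

```r
# A tiny made-up data frame, just for illustration
pets <- data.frame(
  name   = c("Ada", "Bo", "Cy"),
  weight = c(4.2, 30.5, 12.1)
)

pets$weight          # the weight column as a plain vector
pets[["weight"]]     # the same column, using double brackets
mean(pets$weight)    # functions then work on that one column
```

The pattern is always dataset$variable: the name before the "$" says which dataset, the name after it says which column.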


Now on your own, try finding the median of "surv"

You should have typed in the following, and gotten the following result:

median(survival$surv)
## [1] 1.33

How do you make sense of your data?

In earlier versions of this lecture, I would tell students to investigate three key assumptions (namely, normality, homogeneity of variances, and whether samples were independent).

Nowadays, I don't think those are as important. You absolutely should understand your data, but for social scientists our data is 99% [1] of the time going to be non-normal, and often with wide variances. And as social scientists, we should already know whether our samples are dependent or independent.

[1] Definitely not accurate math.

That being said, it is still important to know what your data looks like

We can easily visualize our data with a histogram, using the hist command:

hist(survival$surv)

As you can see, the data really aren’t normally distributed. In fact, the data are right-skewed (i.e. a long right tail).

This tells us already that there is something going on with our data, so we should expect to see some kind of effect, depending on what we are investigating.

In the old days, we used null hypothesis significance tests (NHST)

These are now rightly criticized for obscuring the "true" effect we can see, and/or being a "lazy" way to interrogate data. Also, there are plenty of instances where it is simply unnecessary to perform NHSTs.

e.g....

kenya-conf

Confidence intervals

These are really what you should rely on the most. A confidence interval is basically just the mean/prevalence estimate/proportion (the measure we're interested in), plus and minus a margin of error (ME). For 95% confidence intervals (the standard in social science), the ME is the standard error multiplied by 1.96.

There are plenty of ways to calculate confidence intervals in R. Many are found in packages, although base R's t.test() also reports one.
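
To make the arithmetic concrete, here is a sketch that builds a 95% confidence interval for a mean by hand and checks it against base R's t.test(). It uses the mpg column of the built-in mtcars data as a stand-in, and every object name is my own. One assumption to flag: I take the multiplier from the t distribution with qt(), which is close to 1.96 for large samples and a little larger for small ones.

```r
x <- mtcars$mpg                 # any numeric vector will do
n <- length(x)
estimate <- mean(x)
std_error <- sd(x) / sqrt(n)    # standard error of the mean

# Multiplier for 95% confidence: about 1.96 for large samples,
# a little larger for small ones (taken from the t distribution)
multiplier <- qt(0.975, df = n - 1)

margin_of_error <- multiplier * std_error
ci_by_hand <- c(lower = estimate - margin_of_error,
                upper = estimate + margin_of_error)
ci_by_hand

# Base R's t.test() reports the same interval
t.test(x)$conf.int
```

With only 14 observations, as in our survival data, the multiplier works out to about 2.16 rather than 1.96, so small samples get slightly wider intervals.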

For example, let's say that we want to find the confidence interval for mean survival rate.

We will install the package gmodels, and perform the following code:

library(gmodels) #run install.packages("gmodels") first if you don't have it
survival$surv <- as.numeric(survival$surv) #make sure surv is stored as a number
ci(survival$surv) #estimate, lower and upper 95% limits, and standard error
## Warning in ci.numeric(survival$surv): No class or unkown class. Using default
## calcuation.
## Warning in ci.numeric(survival$surv): No class or unkown class. Using default
## calcuation.
##    Estimate    CI lower    CI upper  Std. Error
## 10.12500000 -0.01426651 20.26426651  4.69330384

We can double check this against the mean of survival, which if you'll remember is found by running:

mean(survival$surv)
## [1] 10.125

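What this tells us is that our mean is 10.1, with a 95% confidence interval of (-0.01, 20.26): indicating that there is a huge amount of variance within this sample! (The lower limit is below zero, which is impossible for a percentage; that is a side effect of putting a symmetric interval around skewed data.)

Although of course we already knew that, after seeing how skewed our distribution was.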

As always, having more data points will decrease your level of error.

sample-depiction

Most common NHSTs (t-tests, ANOVA, correlation) are special cases of linear regression. Linear regression, in turn, belongs to a wider family that covers linear models and generalized linear models (GLMs). In social science, we will often be working with GLMs.

GLMS

Often seen as scary and traumatic...

fainting

But in reality, they are pretty simple and even easy- especially with the power of R!

dolphin

GLMs follow the basic formula of:

y ~ x

Where y is your dependent variable (e.g. use of bear bile, consumption of giraffe meat, whether they've poached in the last year, etc.) and x stands for your independent variables, such as gender, age, education, etc.

You also need to know what your distribution looks like, so that you can choose which GLM family is appropriate:

binomial: if you have a 0,1 dependent variable. For example, "yes" or "no" to using bear bile in the last year

poisson: count data, also sometimes called frequency. Used if you're exploring the number of times individuals in your sample used bear bile last year

gaussian: a "normal" distribution. If your dependent variable is something like height, you would use this family
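
As a rough sketch of how the family choice shows up in code, here are three small models fitted to R's built-in mtcars data. The choice of variables is mine and is only meant to give one example of each kind of outcome; it is not a recommended analysis.

```r
# Yes/no outcome (vs is coded 0 or 1) -> binomial
fit_binomial <- glm(vs ~ wt, data = mtcars, family = binomial)

# Count outcome (carb is the number of carburetors) -> poisson
fit_poisson <- glm(carb ~ wt, data = mtcars, family = poisson)

# Continuous outcome (mpg is miles per gallon) -> gaussian
fit_gaussian <- glm(mpg ~ wt, data = mtcars, family = gaussian)

# Check which family each model used
family(fit_binomial)$family
family(fit_poisson)$family
family(fit_gaussian)$family
```

The formula has the same y ~ x shape every time; only the family argument changes to match the kind of dependent variable.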

GLM structure

With GLMs, you really need to think through what you're exploring. I have made many mistakes with this, and it has always come down to overcomplicating the model and/or not starting at the "top". It's best to start with ALL of the variables you want to interrogate, something like:

bearbile ~ gender + age + province + education + religion

And working your way down based on what is significant.

For example, maybe in your "master model" only gender and education are significant.

You should then create a reduced model:

bearbile ~ gender + education

Which will give you more precise estimates of significance.
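
The bear bile data isn't available here, so this sketch of a "master" model and a reduced model uses R's built-in mtcars data instead. The model names are placeholders, and comparing the two with AIC() and anova() is one common approach that I am assuming for illustration, not necessarily the only way to do it.

```r
# "Master" model with both candidate predictors
full_model <- glm(vs ~ wt + disp, data = mtcars, family = binomial)

# Reduced model keeping only one predictor (an illustrative choice)
reduced_model <- glm(vs ~ disp, data = mtcars, family = binomial)

# Lower AIC suggests a better trade-off between fit and complexity
AIC(full_model, reduced_model)

# Likelihood-ratio test: does dropping wt make the fit noticeably worse?
anova(reduced_model, full_model, test = "Chisq")
```

Running summary() on each model lets you compare the estimates and standard errors of the predictors you kept.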

Interpreting R's GLM output

Let's say our GLM is mylogit <- glm(admit ~ gre + gpa + rank, data = mydata), where admit is whether a student got admitted to university (0,1), gre is their GRE score, gpa is of course their undergrad GPA, and rank is the relative prestige of the university.

What family should this GLM use?

Correct! It is binomial, because admit is a 0,1 variable. Note that the family has to be stated in the call, as family = binomial; left out, glm() assumes gaussian.

output

Running your own GLM

Go ahead and explore the mtcars dataset, which comes built into R. What are the variables? What do the data distributions look like? Note: "vs" is the engine type (0 = V-shaped, 1 = straight), "wt" is weight, and "disp" is "displacement".

Now create a GLM with the data. What will you be exploring? (You can create different versions if you like).

For example, you could have your model be glm1 <- glm(vs ~ wt + disp, mtcars, family = binomial).

If that was your model, you would then get the following output, when you ran summary:

glm1 <- glm(vs ~ wt + disp, mtcars, family = binomial) #making the model
summary(glm1)
##
## Call:
## glm(formula = vs ~ wt + disp, family = binomial, data = mtcars)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -1.67506  -0.28444  -0.08401   0.57281   2.08234
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.60859    2.43903   0.660    0.510
## wt           1.62635    1.49068   1.091    0.275
## disp        -0.03443    0.01536  -2.241    0.025 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.86 on 31 degrees of freedom
## Residual deviance: 21.40 on 29 degrees of freedom
## AIC: 27.4
##
## Number of Fisher Scoring iterations: 6

Hopefully you felt confident in interpreting that output! From this, we see that displacement predicts what type of engine a car will have, while weight does not. Specifically, the coefficient for disp is negative, so higher displacement reduces the probability that vs = 1 (a straight engine). In other words, cars with higher displacement are more likely to have a "v-shaped" engine.
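
If the raw coefficients feel abstract, here is a sketch of two ways to make them more readable: odds ratios and predicted probabilities. The two cars in new_cars are hypothetical, with weights and displacements I picked only so that they differ in displacement and nothing else.

```r
glm1 <- glm(vs ~ wt + disp, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale; exponentiate for odds ratios
odds_ratios <- exp(coef(glm1))
odds_ratios

# Predicted probability that vs = 1 for two hypothetical cars that
# differ only in displacement (weights and displacements are made up)
new_cars <- data.frame(wt = c(3, 3), disp = c(120, 300))
predicted <- predict(glm1, newdata = new_cars, type = "response")
predicted
```

An odds ratio below 1 means the odds of vs = 1 go down as that variable goes up, which matches the negative sign on disp in the summary output.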

Interaction terms

Another aspect of GLMs is "interaction terms". This is denoted by

*

e.g. gender ~ height * weight, where you are asking R to investigate height, weight, and their interaction (whether the effect of height depends on weight) as predictors of gender.
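
Here is a short sketch of what the * does behind the scenes. The first part only inspects the formula, so height and weight don't need to exist as data; the second part fits an interaction on the built-in mtcars data, with variables I chose purely as an example.

```r
# terms() shows how R expands a formula; the variable names are placeholders
expanded <- attr(terms(y ~ height * weight), "term.labels")
expanded

# The same thing on real data: does the effect of wt on mpg depend on hp?
fit_interaction <- glm(mpg ~ wt * hp, data = mtcars, family = gaussian)
names(coef(fit_interaction))
```

The term written with a colon (height:weight, or wt:hp in the fitted model) is the interaction itself; the * is shorthand for both main effects plus that term.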

This is yet another example in R where the onus is on the operator ( i.e. you 👈 ) to have a good grasp of your data, as well as your end goal. What are you trying to prove? What exactly are you trying to understand?

One of the many wonderful aspects of R is that it really facilitates the kind of exploration that is necessary for understanding data and wriggling your way 🐛 towards enlightenment. It's easy to try different things and make different models- so you may as well do exactly that!

There is so much that can be done in R. Don't be afraid of exploring, getting things wrong, and googling.

Google will be your best friend when learning R. I still have to Google things ALL the time!

Also, there are tons of free resources for learning R! One great example is: https://learningstatisticswithr.com/book/introR.html.

Cleaning data is really important and something I STILL do in Excel... when it can be done so much more easily in R! This is a great resource for learning how to clean data, step by step: https://rladiessydney.org/courses/ryouwithme/

Great job! You now know the basics of R!

Unfortunately, we aren't quite done.

Visualization is a really important component of analysis and how you communicate your data to the world.


The easiest way to do this is by using ggplot2, Hadley Wickham's amazing "grammar of graphics".

hadley

How cute is Hadley? He's also a super nice guy and has been instrumental in making R the inclusive environment it is today 🎉

Some important concepts:

1) "Data visualization is part art and part science." - Claus Wilke

2) Making a good visualization requires lots of practice and "material osmosis". Explore what other people are doing, get their code, and tinker!

3) ^ Tied with the above, have fun with it! Make a graph showing all the best Friends episodes or one-liners (as several people have done).

86 / 108

Types of graphs

1) Scatterplots/line charts ->

Generally, this is when you have two continuous/numeric variables that you want to plot against one another, to view a relationship. These are not as common in social science.

2) Barplots ->

Used for visualizing frequencies and/or proportions. Very common, and not always good, but they CAN be amazing!

3) Boxplots ->

Used to show the spread of a distribution and to indicate whether there are outliers. Notched boxplots can also give a rough picture of the confidence interval around the median.
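
As a rough sketch of how these three graph types look in ggplot2 code, here is one of each, built from the mpg dataset that ships with ggplot2. The dataset and the variables (displ, hwy, class) are my choice, purely for illustration.

```r
library(ggplot2)

# 1) Scatterplot: two numeric variables plotted against one another
scatter <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

# 2) Barplot: how many cars fall into each class (frequencies)
bars <- ggplot(data = mpg, aes(x = class)) +
  geom_bar()

# 3) Boxplot: spread of highway mileage within each class,
#    with outliers drawn as individual points
boxes <- ggplot(data = mpg, aes(x = class, y = hwy)) +
  geom_boxplot()
```

Type the name of any of these objects (for example, scatter) in the console to draw it.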

87 / 108

ugly pie chart

NEVER make one of these abominations.

88 / 108

🍷

89 / 108

Install and/or load the following packages:

library(tidyverse) # All the fun tidy functions!
library(paletteer)
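
If library() complains that a package does not exist, it needs to be installed first. One possible way to check is sketched below. The helper missing_packages is a name I made up for this example; it is not part of any package.

```r
# Hypothetical helper: return the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "paletteer"))

# Install only what is missing (uncomment to run; needs an internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```

Once everything is installed, the library() calls above should run without errors.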

Now load the data: Google "Allison Horst github tidy_tuesday_5_28_19", and a link to the data will come up.

Or copy this long freaking thing out:

wine_ratings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## country = col_character(),
## description = col_character(),
## designation = col_character(),
## points = col_double(),
## price = col_double(),
## province = col_character(),
## region_1 = col_character(),
## region_2 = col_character(),
## taster_name = col_character(),
## taster_twitter_handle = col_character(),
## title = col_character(),
## variety = col_character(),
## winery = col_character()
## )
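
The warning at the top of that output appears because the first column of the file has no name; it is just a row counter. Here is a minimal sketch of what is going on, using a made-up three-line stand-in for the real file. Note that recent versions of readr name such a column ...1 rather than X1, so the sketch drops it by position instead of by name.

```r
library(readr)

# A tiny made-up CSV with the same quirk as the wine file: the first header is empty
csv_text <- ",country,points\n0,Italy,87\n1,Portugal,87\n"

# I() tells read_csv to treat the string as data rather than a file path
# (this needs readr 2.0 or later)
wine_small <- read_csv(I(csv_text), show_col_types = FALSE)

# Drop the unnamed row-counter column by position, whatever readr called it
wine_small <- wine_small[, -1]
names(wine_small)
```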
90 / 108

Follow along with Allison's data manipulation in her GitHub file.

You should end up with this plot!

wine plot

91 / 108

That probably felt pretty intense, with lots of data manipulation and strange parameters while using ggplot.


But this is a "new" concept gaining traction among R teachers: maybe we should "let our students eat cake" first, rather than making them learn the boring, basic recipes!

92 / 108

Alison wisdom

Alison Hill, R Guru

Yeah, it's about R Markdown, but it still applies!

93 / 108

So, I've shown you a hint of what you can do with ggplot2. Like any good ggplotter, it's time for you to slice and dice code and create a Frankenstein monster of your own 🧟[3]

[3] Yes, it's a zombie. Close enough.

94 / 108

As you've built your graphs, we've already discussed bits and pieces of the ggplot syntax. However, I'll go through it again/in a little 🤏 more detail here.

95 / 108

ggplot2 syntax

You start with your "foundational" code, which is where you tell ggplot:

  • the dataset you're using

  • what variable goes on the x axis

  • what variable goes on the y axis

  • what plot you want ggplot to generate
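
Put together, those four pieces map onto code roughly like this. The sketch uses the mpg dataset that ships with ggplot2; the variable choices are only an illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg,            # the dataset you're using
            aes(x = displ,         # the variable on the x axis
                y = hwy)) +        # the variable on the y axis
  geom_point()                     # the kind of plot to draw (here, a scatterplot)
```

Swapping geom_point() for another geom changes the type of plot while the foundation stays the same.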

96 / 108

Up until yesterday, I was going to have you guys explore the iris dataset... until I saw on Twitter that this benign, kind of boring dataset was created by a total racist.

iris sucks

97 / 108

So we'll use something that isn't racist, and has the added benefit of being about penguins ❤️🐧.

Load the following code:

devtools::install_github("allisonhorst/palmerpenguins")
## Using github PAT from envvar GITHUB_PAT
## Downloading GitHub repo allisonhorst/palmerpenguins@master
##
checking for file ‘/private/var/folders/lw/rzb9nnl16xq73y25y8v73zfw0000gn/T/RtmpwktobE/remotescfc46620b41c/allisonhorst-palmerpenguins-f13c212/DESCRIPTION’ ...
✓ checking for file ‘/private/var/folders/lw/rzb9nnl16xq73y25y8v73zfw0000gn/T/RtmpwktobE/remotescfc46620b41c/allisonhorst-palmerpenguins-f13c212/DESCRIPTION’
##
─ preparing ‘palmerpenguins’:
##
checking DESCRIPTION meta-information ...
✓ checking DESCRIPTION meta-information
##
─ checking for LF line-endings in source and make files and shell scripts
##
─ checking for empty or unneeded directories
##
─ looking to see if a ‘data/datalist’ file should be added
##
─ building ‘palmerpenguins_0.1.0.tar.gz’
##
##
require(palmerpenguins)
## Loading required package: palmerpenguins
98 / 108

Let's make a plot!

99 / 108

Let's make a plot!

First, go ahead and explore the data (remember, to load the data from a package you will type data("penguins")). Use head, str, and whatever else you want that will help you understand what's going on in this dataset.

99 / 108
data("penguins")
require(ggplot2)
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species))
100 / 108

101 / 108

As you saw in the wine price graph, you can spring off this base and add in beautiful color scales, custom themes (or lovely themes people have already created), change your font size, etc.


This is done by simply adding to your foundational base with "+" and the parameter you want to manipulate.

102 / 108

For example, let's say you want to make the text bigger on your little graph above.

Here is what you have done:

g1 <- ggplot(data = penguins,
             aes(x = flipper_length_mm,
                 y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species))

Here is what you will now add:

ggplot(data = penguins,
       aes(x = flipper_length_mm,
           y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species)) +
  theme(text = element_text(size = 15)) +  # make all of the text bigger
  labs(caption = "Data by Allison Horst")  # add a caption below the plot
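
Because the base plot was saved as g1, one alternative (a sketch, not the only way) is to add the new pieces to that saved object rather than retyping everything. This assumes ggplot2 and palmerpenguins are installed, as in the earlier slides; g2 is just a name chosen here.

```r
library(ggplot2)
library(palmerpenguins)

# The foundational plot, saved so it can be reused
g1 <- ggplot(data = penguins,
             aes(x = flipper_length_mm,
                 y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species))

# Add to the saved plot with "+": bigger text and a caption
g2 <- g1 +
  theme(text = element_text(size = 15)) +
  labs(caption = "Data by Allison Horst")
```

Typing g2 in the console draws the updated plot, and g1 is left untouched in case you want to try a different tweak.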
103 / 108

Easy!

Building ggplots is an iterative process, where you build and tweak according to your needs.

104 / 108

Let's move into practicing on your own. You can either stick with the penguins, or explore a dataset that comes with ggplot2, such as the mpg dataset.

e.g.

data(mpg)

Remember what we did earlier. How would you explore this dataset?


Correct! head, summary, and str are all good options
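
As a quick sketch, those exploration functions look like this when applied to mpg. The dataset ships with ggplot2, so that package needs to be loaded first; dim is an extra function I added to show the number of rows and columns.

```r
library(ggplot2)  # mpg ships with ggplot2

head(mpg)      # the first six rows
str(mpg)       # each column's type, plus a preview of its values
summary(mpg)   # quick summary statistics for every column
dim(mpg)       # number of rows and columns
```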

105 / 108

Now go ahead and think through how you would visualize aspects of this dataset, and what you're interested in. Does it make sense to do a scatterplot? Would a bar chart be useful?


Use the code we used for the wine price graph, but also search Google [hint: lots of people have made graphs of the mpg dataset].

106 / 108

When you think you're done and you've made a pretty graph, show your work!

107 / 108