1) Establishing good workflow habits
2) Learning how to R
3) Performing data analysis
4) Graphing
Projects are R's way of helping you keep all of your files in one place. It may seem like an unnecessary step, but trust me that it will make your life so much easier to get in the habit of working in R projects.
(or similar)
# 12 March 2020: This R Script is an example of what a good R script looks like
##
1+1 # I can't remember what 1+1 is- this way, R will tell me!
## [1] 2
You can see here where R is telling you that your code is faulty.
You can also see that R Studio color-codes your script according to what each piece is: one color for non-code "objects", another for comments, and another for "significant" commands like telling R which library to use.
Read the code. This seems self-explanatory but it is where you can gain some understanding of what went wrong.
1) Check out the help file for the function/command you are using, if applicable
2) Google it!
3) Ask for help from Stack Overflow (bearing in mind that it can be challenging) or from R-Ladies (if you are a woman)
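The first step in that list can be done without leaving R. Here is a minimal sketch; read.csv and median are only stand-in examples of functions you might be stuck on:

```r
# Open the help file for a function (both forms are equivalent)
?read.csv
help("read.csv")

# If you don't know the exact function name, search by keyword instead:
# ??histogram

# Peek at a function's arguments without opening the help file
args(median)
arg_names <- names(formals(read.csv))
arg_names
```

Checking the argument names is often enough to spot a misspelled or missing argument.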
Packages are extra functions or groups of functions that other people have created to help make everyone's life easier. They can be downloaded from CRAN (the R repository) or from GitHub.
This is very easy to do. Go ahead and run the following command:
#install.packages("devtools")
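If you are not sure whether a package is already installed, you can check first and install only what is missing. This is a sketch, not the only way to do it; the package names in `needed` are just examples, and the install line is left commented out on purpose:

```r
# Packages this session needs. "stats" ships with R; "devtools" may or may not be installed.
needed <- c("stats", "devtools")

# requireNamespace() returns TRUE/FALSE instead of stopping with an error
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]
to_install

# Uncomment to actually install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```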
Let's install a package from GitHub. Go ahead and run the following code:
#I have hashed it out so that I don't install it
#devtools::install_github("hadley/emo", force = TRUE)
You now have the "emo" (emoji) package downloaded! Try the following:
emo::ji("party")
## 🎉
Go to the following site:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
and download the dataset “survival”. We will read this into R using the following command:
Hanoi<-read.csv("survival.csv")
When you read this into R, you should see the following:
As you can see, we use the arrow to assign our dataset to a name. Therefore, when you type "Hanoi" it will show you the dataset that you have loaded into R.
However, R Studio makes things very simple by cutting out this extra step. If you look at R Studio, you’ll note that in the top right corner there’s a box that has the tab “Environment”.
You’ll see that there’s a button that says “Import dataset”. You can use this button to quickly import any dataset you desire. Try it now with the "survival" dataset you just imported into R as “Hanoi”.
Before we go any further, I must stress that R is very, very particular. If a command goes wrong, it is 99%[1] probable that you did not write the name of the dataset/vector/variable properly, OR that you did not put a parenthesis, a comma, or some other form of punctuation where there should be one. The beauty of R Studio is that it’s very good at putting parentheses in the right places, but errors still do occur.
[1] Accurate math
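To see what "not writing the name properly" looks like in practice, here is a small sketch. The two-row data frame is a toy stand-in made up for illustration, not the real dataset:

```r
Hanoi <- data.frame(dose = c(117.5, 235.0), surv = c(44, 16))  # toy stand-in

# R is case-sensitive: "hanoi" is not the same object as "Hanoi".
# tryCatch() captures the error message so the script keeps running.
msg <- tryCatch(
  head(hanoi),
  error = function(e) conditionMessage(e)
)
msg

# The correctly spelled name works
head(Hanoi)
```

When you hit an error like this, checking capitalization and spelling of the object name is usually the fastest first step.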
R is, quite simply, an entirely new language. That being said, it operates in an intuitive manner. For instance, try the command we had previously in our R Script:
1+1
If you press enter, you'll see that it executes the code and gives you the answer, which is
## [1] 2
Of course, we all know what 1 + 1 is (but imagine if we didn’t!).
Now, what if we wanted to assign the value of the equation 1+1 to an object in R?
a <- 1+1
a
## [1] 2
You can see that the value of this equation is stored in the object a. Now, perhaps you can begin to see what kinds of things you can do with this object, e.g.
a + 56
## [1] 58
a <- 1+1, being put together to comprise a whole (the action you are trying to accomplish).
Before starting analysis on a dataset, you need to familiarize yourself with the data. This can be done in R very easily. One command that's useful is the following, on our "Hanoi" dataset:
head(Hanoi)
##   X  dose  surv
## 1 1 117.5 44.00
## 2 2 117.5 55.00
## 3 3 235.0 16.00
## 4 4 235.0 13.00
## 5 5 470.0  4.00
## 6 6 470.0  1.96
As you can see, this shows you the top of the dataset (the "head", if you will), which is helpful if one wants to get a quick snapshot of the dataset.
The next command is:
names(Hanoi)
## [1] "X" "dose" "surv"
Our variables here are "dose" and "surv" ("X" is just the row number). To give you some context, this dataset is about the survival of rats after being given radiation doses. "surv" is the survival rate of the batches expressed as a percentage, while "dose" is the dose of radiation administered (rads).
Now, it doesn't seem to make sense for our dataset to be called "Hanoi". How do you think we would go about renaming it?
survival <- Hanoi
Now, you should see in your top right pane, under "Environment", the dataset survival, which is exactly the same as the Hanoi dataset.
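Strictly speaking, the arrow makes a copy under a new name rather than renaming, so both objects exist afterwards. If you want only the new name to remain, you can remove the old one. A sketch, using a made-up two-row stand-in for the real data:

```r
Hanoi <- data.frame(dose = c(117.5, 235.0), surv = c(44, 16))  # toy stand-in for the real data

survival <- Hanoi                    # copy the data under the new name
same <- identical(survival, Hanoi)   # check the two objects match
same

rm(Hanoi)                            # drop the old name so only "survival" remains
still_there <- exists("Hanoi", inherits = FALSE)
still_there
```

Whether to remove the old object is a matter of taste; keeping the Environment pane tidy just makes it harder to use the wrong one by accident.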
This is very easy to do in R. Simply type the following:
summary(survival)
##        X              dose             surv        
##  Min.   : 1.00   Min.   : 117.5   Min.   : 0.0060  
##  1st Qu.: 4.25   1st Qu.: 293.8   1st Qu.: 0.1625  
##  Median : 7.50   Median : 587.5   Median : 1.3300  
##  Mean   : 7.50   Mean   : 654.6   Mean   :10.1250  
##  3rd Qu.:10.75   3rd Qu.: 940.0   3rd Qu.:11.2800  
##  Max.   :14.00   Max.   :1410.0   Max.   :55.0000
As you can see, this command gives you a nice little snapshot of the data, with means and medians. You can also find the mean for each specific variable by simply typing the following:
mean(survival$dose)
## [1] 654.6429
summary(survival)
.
Now on your own, try finding the median of "surv"
You should have typed in the following, and gotten the following result:
median(survival$surv)
## [1] 1.33
In earlier versions of this lecture, I would tell students to investigate three key assumptions (namely, normality, homogeneity of variances, and whether samples were independent).
Nowadays, I don't think those are as important. You absolutely should understand your data, but for social scientists our data is 99%[1] of the time going to be non-normal, and often with wide variances. And as social scientists, we should already know whether our data is dependent or independent.
[1] Definitely not accurate math.
We can easily visualize our data with a histogram, using the hist command:
hist(survival$surv)
As you can see, the data really aren’t normally distributed. In fact, the data are right-skewed (i.e. a long right tail).
This tells us already that there is something going on with our data, so we should expect to see some kind of effect, depending on what we are investigating.
Null hypothesis significance tests (NHSTs) are now rightly criticized for obscuring the "true" effect we can see, and/or being a "lazy" way to interrogate data. Also, there are plenty of instances where it is simply unnecessary to perform NHSTs.
e.g....
We will install the package gmodels, and run the following code:
require(gmodels)
survival$surv <- as.numeric(survival$surv)
ci(survival$surv)
## Warning in ci.numeric(survival$surv): No class or unkown class. Using default
## calcuation.
##    Estimate    CI lower    CI upper  Std. Error 
## 10.12500000 -0.01426651 20.26426651  4.69330384
mean(survival$surv)
## [1] 10.125
What this tells us is that the mean is 10.1, with a 95% confidence interval of (-0.01, 20.26), indicating that there is a huge amount of variance within this sample!
Although of course we already knew that, after seeing how skewed our distribution was.
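If you are curious where those interval numbers come from, they can be rebuilt with base R. This sketch assumes two things that are not stated in the slides: that ci() uses a standard t-based 95% interval, and that the "survival" CSV is the same data as the survival dataset in the boot package (which comes bundled with standard R installations). If both assumptions hold, the result should match the ci() output shown earlier.

```r
# Assumption: boot::survival holds the same rat survival data as the downloaded CSV
surv <- boot::survival$surv

n     <- length(surv)
m     <- mean(surv)
se    <- sd(surv) / sqrt(n)        # standard error of the mean
tcrit <- qt(0.975, df = n - 1)     # two-sided 95% critical value from the t distribution

ci_by_hand <- c(lower = m - tcrit * se, estimate = m, upper = m + tcrit * se)
ci_by_hand
```

Because the interval is symmetric around the mean, it can dip below zero even though a survival percentage never can, which is one more hint that the data are strongly skewed.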
binomial: if you have a 0,1 dependent variable. For example, "yes" or "no" to using bear bile in the last year
poisson: count data, also sometimes called frequency. Used if you're exploring the number of times individuals in your sample used bear bile last year
gaussian: a "normal" distribution. If your dependent variable is something like height, you would use this family
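To make the three families concrete, here is a sketch that fits one model of each kind. All of the data are simulated and the variable names (age, used_bile, times_used, height_cm) are hypothetical; nothing here comes from a real survey.

```r
set.seed(42)                       # reproducible fake data
n   <- 200
age <- round(runif(n, 18, 70))     # hypothetical predictor

# Hypothetical outcomes, one per family
used_bile  <- rbinom(n, size = 1, prob = plogis(-2 + 0.03 * age))  # yes/no coded as 0/1
times_used <- rpois(n, lambda = exp(-1 + 0.02 * age))              # counts
height_cm  <- rnorm(n, mean = 165, sd = 8)                         # continuous, roughly normal

fit_binomial <- glm(used_bile  ~ age, family = binomial)
fit_poisson  <- glm(times_used ~ age, family = poisson)
fit_gaussian <- glm(height_cm  ~ age, family = gaussian)

# Confirm which family each model was fitted with
families <- sapply(list(fit_binomial, fit_poisson, fit_gaussian),
                   function(m) family(m)$family)
families
```

The formula looks the same each time; only the type of dependent variable and the family argument change.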
With GLMs, you really need to think through what you're exploring. I have made many mistakes with this, and it has always come down to overcomplicating the model and/or not starting at the "top". It's best to start with ALL of the variables you want to interrogate, something like:
bearbile ~ gender + age + province + education + religion
And working your way down based on what is significant.
You should then create a reduced model:
bearbile ~ gender + education
Which will give you more precise estimates of significance.
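The full-then-reduced workflow can be sketched with data that ships with R. The bear bile data are not available here, so this uses the built-in mtcars dataset instead, and the choice of which predictors to keep in the reduced model is purely illustrative:

```r
# Full model: start with every predictor you want to interrogate
full <- glm(mpg ~ wt + hp + qsec + drat, data = mtcars, family = gaussian)
summary(full)$coefficients          # inspect the p-value column for each predictor

# Reduced model: keep a smaller set of predictors (hypothetical choice)
reduced <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian)

# Compare the two fits side by side
AIC(full, reduced)
anova(reduced, full, test = "F")
```

Comparing the models, rather than only reading p-values off the full model, gives a second check on whether the dropped variables were contributing anything.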
Let's say our GLM is mylogit <- glm(admit ~ gre + gpa + rank, data = mydata), where admit is whether a student got admitted to university (0,1), gre is their GRE score, gpa is of course their undergrad GPA, and rank is the relative prestige of the university.
What is the family of this GLM?
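Correct! It is binomial. Note that glm() assumes family = gaussian unless told otherwise, so the call above would need family = binomial added to actually fit this model.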
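It is worth checking what R does when no family is given. In this sketch, mydata is simulated (the real admissions data are not included in these slides), so the coefficients mean nothing; the point is only which family each call ends up using.

```r
set.seed(1)
n <- 300

# Simulated stand-in for mydata; all values are made up
mydata <- data.frame(
  gre  = round(rnorm(n, mean = 580, sd = 100)),
  gpa  = round(runif(n, min = 2.2, max = 4.0), 2),
  rank = sample(1:4, n, replace = TRUE)
)
mydata$admit <- rbinom(
  n, size = 1,
  prob = plogis(-4 + 0.003 * mydata$gre + 0.8 * mydata$gpa - 0.4 * mydata$rank)
)

default_fit <- glm(admit ~ gre + gpa + rank, data = mydata)                     # no family given
mylogit     <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial)  # family stated

family(default_fit)$family   # "gaussian"
family(mylogit)$family       # "binomial"
```

R will not warn you that a 0/1 outcome was fitted as gaussian, so it is a good habit to always write the family out.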
Go ahead and explore the mtcars dataset in R. What are the variables? What do the data distributions look like? Note: "vs" is the engine type (0 = V-shaped, 1 = straight), "wt" is weight, and "disp" is "displacement".
Now create a GLM with the data. What will you be exploring? (You can create different versions if you like).
For example, you could have your model be glm1 <- glm(vs ~ wt + disp, mtcars, family = binomial).
summary:
glm1 <- glm(vs ~ wt + disp, mtcars, family = binomial) #making the model
summary(glm1)
## 
## Call:
## glm(formula = vs ~ wt + disp, family = binomial, data = mtcars)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.67506  -0.28444  -0.08401   0.57281   2.08234  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.60859    2.43903   0.660    0.510  
## wt           1.62635    1.49068   1.091    0.275  
## disp        -0.03443    0.01536  -2.241    0.025 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.86  on 31  degrees of freedom
## Residual deviance: 21.40  on 29  degrees of freedom
## AIC: 27.4
## 
## Number of Fisher Scoring iterations: 6
Another aspect of GLMs is "interaction terms". This is denoted by *, e.g. gender ~ height * weight, where you are asking R to investigate the relationship of height and weight and how it predicts gender.
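In R's formula syntax, a * b is shorthand for both main effects plus their interaction, which R writes as a:b. The sketch below shows that expansion and then fits a model with an interaction; it uses mtcars (with mpg, wt and hp) only because it ships with R, not because it relates to the gender example.

```r
# term.labels lists the terms R will actually fit for a formula
expanded <- attr(terms(mpg ~ wt * hp), "term.labels")
expanded

# Fit a model with the interaction and look at the coefficient names
fit_interaction <- glm(mpg ~ wt * hp, data = mtcars)
names(coef(fit_interaction))
```

The "wt:hp" coefficient is the interaction itself: it describes how the effect of one variable changes as the other changes.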
One of the many wonderful aspects of R is that it really facilitates the kind of exploration that is necessary for understanding data and wriggling your way 🐛 towards enlightenment. It's easy to try different things and make different models- so you may as well do exactly that!
Also, there are tons of free resources for learning R! One great example is: https://learningstatisticswithr.com/book/introR.html.
Cleaning data is really important and something I STILL do in Excel... when it can be done so much more easily in R! This is a great resource for learning how to clean data, step by step: https://rladiessydney.org/courses/ryouwithme/
ggplot2, Hadley Wickham's amazing "grammar of graphics".
How cute is Hadley? He's also a super nice guy and has been instrumental in making R the inclusive environment it is today 🎉
1) "Data visualization is part art and part science." - Claus Wilke
2) Making a good visualization requires lots of practice and "material osmosis". Explore what other people are doing, get their code, and tinker!
3) ^ Tied with the above, have fun with it! Make a graph showing all the best Friends episodes or one-liners (as several people have done).
1) Scatterplots/line charts ->
Generally, this is when you have two continuous/numeric variables that you want to plot against one another, to view a relationship. These are not as common in social science.
Used for visualizing frequencies and/or proportions. Very common, and not always good- but they CAN be amazing!
Used to show the spread of a distribution and to indicate whether there are outliers. Also useful for visualizing confidence intervals.
For #tidytuesday this week, I explored the sweet deal metric (points:price ratio) across all wine types by country, for the 15 countries with the most observations in the dataset. Yeah Chilean wine.
— Allison Horst (@allison_horst) May 29, 2019
Code: https://t.co/WvPS1UtFVE pic.twitter.com/NzMv4jl7Fb
library(tidyverse) # All the fun tidy functions!
library(paletteer)
Now load the data: google "Allison Horst github tidy_tuesday_5_28_19", and a link to the data will come up
Or copy this long freaking thing out:
wine_ratings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   country = col_character(),
##   description = col_character(),
##   designation = col_character(),
##   points = col_double(),
##   price = col_double(),
##   province = col_character(),
##   region_1 = col_character(),
##   region_2 = col_character(),
##   taster_name = col_character(),
##   taster_twitter_handle = col_character(),
##   title = col_character(),
##   variety = col_character(),
##   winery = col_character()
## )
Alison Hill, R Guru
Yeah, it's about R markdown, but it still applies!
[3] Yes, it's a zombie. Close enough.
ggplot2 syntax
You start with your "foundational" code, which is where you tell ggplot:
the dataset you're using
what variable goes on the x axis
what variable goes on the y axis
what plot you want ggplot to generate
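Those four pieces map directly onto a ggplot call. A minimal sketch, using the built-in mtcars dataset and its wt and mpg columns as arbitrary stand-ins:

```r
library(ggplot2)

p <- ggplot(data = mtcars,        # 1. the dataset you're using
            aes(x = wt,           # 2. the variable on the x axis
                y = mpg)) +       # 3. the variable on the y axis
  geom_point()                    # 4. the kind of plot to draw (here, a scatterplot)

p
```

Swapping geom_point() for another geom changes the plot type while the foundation stays the same.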
iris dataset... until I saw on Twitter that this benign, kind of boring dataset was created by a total racist.
Load the following code:
devtools::install_github("allisonhorst/palmerpenguins")
## Using github PAT from envvar GITHUB_PAT
## Downloading GitHub repo allisonhorst/palmerpenguins@master
require(palmerpenguins)
## Loading required package: palmerpenguins
First, go ahead and explore the data (remember, to load the data from a package you will type data("penguins")). Use head, str, and whatever else you want that will help you understand what's going on in this dataset.
data("penguins")
require(ggplot2)
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species))
Here is what you have done:
g1 <- ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species))
Here is what you now will add:
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  theme(text = element_text(size=15)) +
  labs(caption = "Data by Allison Horst")
mpg dataset.
e.g.
data(mpg)
Remember what we did earlier. How would you explore this dataset?
Correct! head, summary, and str are all good options
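Put together, a first look at the dataset might go something like this. The only assumption is that ggplot2 is installed, since that is the package mpg ships with:

```r
library(ggplot2)   # mpg ships with ggplot2
data(mpg)

head(mpg)      # first six rows
str(mpg)       # column types at a glance
summary(mpg)   # per-column summaries
dim(mpg)       # number of rows and columns
```

Running all of these takes seconds and tells you which variables are numeric and which are categorical before you start plotting.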