class: center, middle, inverse, title-slide

<style>
pre {
  background-color: lightyellow;
  white-space: pre-wrap;
  line-height: 100%;
}
</style>

## Programming in R: a brief introduction
#### Jeremy Mack
#### Lehigh University - Digital Scholarship Team

<img src="./images/notes.png" alt="RStudio" height=242/>

---
class: center, middle, inverse, title-slide

## Programming in R: a brief introduction
#### Jeremy Mack
#### Lehigh University - Digital Scholarship Team

<img src="./images/rlang.png" alt="RStudio" height=150/>
<img src="./images/rstudio5.png" alt="RStudio" height=150/>
<img src="./images/tidyverse5.png" alt="RStudio" height=150/>

<br/><br/>
<br/><br/>
<br/>

---
### About this presentation

* This course is a **brief introduction** to R.

--

* It is targeted at people who have little to no programming experience.

--

* It could be useful for people who learned R some time ago and forgot it, or who are not familiar with modern R programming (`tidyverse`).

--

* It focuses on a bit of history, an introduction to the R environment, and some hands-on experience with **data wrangling**.

--

* Slides are available on [Lehigh's Research Computing site](https://confluence.cc.lehigh.edu/display/hpc/Seminars) and GitHub ([slides](https://jeremymack-lu.github.io/rprog/) and [raw code](https://github.com/jeremymack-LU/rprog))

---
### Structure of the presentation

The presentation is split into seven topics:

* [**Topic 1:**](<https://jeremymack-lu.github.io/rprog/#10>) What is R? Why use it?
* [**Topic 2:**](<https://jeremymack-lu.github.io/rprog/#22>) What is RStudio? Why use it?
* [**Topic 3:**](<https://jeremymack-lu.github.io/rprog/#32>) Getting started with R and RStudio * [**Topic 4:**](<https://jeremymack-lu.github.io/rprog/#58>) Objects in R * [**Topic 5:**](<https://jeremymack-lu.github.io/rprog/#82>) Functions in R * [**Topic 6:**](<https://jeremymack-lu.github.io/rprog/#100>) Data and Data Wrangling * [**Topic 7:**](<https://jeremymack-lu.github.io/rprog/#163>) Extras - RStudio projects, Other things to do in R, and Resources --- #### Programming in R [<center><img src="./images/r_rollercoaster.png" alt="RStudio" height=350/></center>](https://github.com/allisonhorst/stats-illustrations) --- class: center, middle, inverse #### Topic 1: What is R? Why use it? <br/><br/> <br/><br/> <br/><br/> <br/><br/> <br/><br/> <br/><br/> --- #### Topic 1: What is R? Why use it? .pull-left[ <center><img src="./images/R_logo.png" alt="R logo" height=230</></center> ] .pull-right[ - R is a **programming language** ([one of many](<https://www.tiobe.com/tiobe-index/>)) and an **environment** for statistical computing. {{content}} ] -- - Developed by Ross Ihaka and Robert Gentleman 29 years ago; now maintained by a core team supported by the R Foundation. {{content}} -- - Dialect of the S language (S-Plus) {{content}} --- #### Topic 1: What is R? Why use it? .pull-left[ <center><img src="./images/R_logo.png" alt="R logo" height=230</></center> ] .pull-right[ - R is a **programming language** ([one of many](<https://www.tiobe.com/tiobe-index/>)) and an **environment** for statistical computing. - Developed by Ross Ihaka and Robert Gentleman 29 years ago; now maintained by a core team supported by the R Foundation. - Dialect of the S language (S-Plus) ] <img src="./images/timeline.png" align="center" alt="R timeline" /> --- #### Topic 1: What is R? Why use it? 
.pull-left[ <center><img src="./images/R_logo.png" alt="R logo" height=230</></center> ] .pull-right[ - **Free!** {{content}} ] -- - Rich data analysis and visualization options {{content}} -- - Available on most platforms/OS {{content}} -- - **Very active development community** + CRAN: The Comprehensive R Archive Network + User contributed packages <br/> (> 18,000) {{content}} -- - **Reproducibility** --- #### Topic 1: What is R? Why use it? <img src="./images/reproducibility.png" alt="R logo" height=265</> <img src="./images/reproducibility2.png" alt="R logo" height=265</> ##### - 2019 report by The National Academies of Sciences, Engineering, and Medicine --- #### Topic 1: What is R? Why use it? <img src="./images/reproducibility3.png" alt="R logo" height=265</> <img src="./images/reproducibility4b.png" alt="R logo" height=265</> ##### - 2018 series featured in Nature ###### - "There's nothing more reproducible than code. And unfortunately, there are few things LESS reproducible than pointing and clicking, then trying to tell someone how you did it." --- class: center, middle, inverse #### Topic 2: What is RStudio? Why use it? <br/><br/> <br/><br/> <br/><br/> <br/><br/> <br/><br/> <br/><br/> --- #### Topic 2: What is RStudio? Why use it? .pull-left[ ![RStudio logo](https://rstudio.com/wp-content/uploads/2018/10/RStudio-Logo.png) ] .pull-right[ * RStudio is a **company** that develops **free and open tools** for R, and enterprise-ready professional products. {{content}} ] -- * **Integrated Development Environment** (IDE), or a front end platform to run R. {{content}} --- #### Topic 2: What is RStudio? Why use it? .pull-left[ ![RStudio logo](https://rstudio.com/wp-content/uploads/2018/10/RStudio-Logo.png) ] .pull-right[ * RStudio is a **company** that develops **free and open tools** for R, and enterprise-ready professional products. * **Integrated Development Environment** (IDE), or a front end platform to run R. 
]

<img src="./images/rstudio4.png" align="center" alt="Rstudio timeline" />

---
#### Topic 2: What is RStudio? Why use it?

.pull-left[
![RStudio logo](https://rstudio.com/wp-content/uploads/2018/10/RStudio-Logo.png)
]

.pull-right[
* Like R, it's **free**!

{{content}}

]

--

* It can flatten R's learning curve by providing **organization**.

{{content}}

--

* Integrates nicely with other R features/applications:
  + Projects
  + Version control
  + R Markdown
  + ShinyApps

---
#### Topic 2: What is RStudio? Why use it?

How do R and RStudio work together? Consider a car analogy.

--

.pull-left[
**RStudio - the body**

- RStudio provides a frame that keeps things organized and finishes that make it visually appealing.

**R - the engine**

- R runs things under the hood - it's the engine that allows the car to drive.
]

.pull-right[
<img src="./images/car_engine.jpg" alt="R logo" width=400/>
]

---
class: inverse

#### Review - R and RStudio:

* **R** is a programming language built for statistical computing ("Engine").
  + It's open source and it's free.
* **RStudio** is an integrated development environment that makes working with R easier ("Body").
  + It's developed by a company, but it's also free.

---
class: center, middle, inverse

#### Topic 3: Getting started with R and RStudio

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---
#### Topic 3: Getting started with R and RStudio

How do I get R and RStudio?

--

* Download and local install:
  + You can download R on its own through the [R Project website](https://www.r-project.org).
  + You can download RStudio at the [RStudio website](https://rstudio.com/products/rstudio/download/). RStudio needs R, which is installed separately.

--

* R and RStudio at Lehigh:
  + Both R and RStudio are available on [LUapps](https://luapps.lehigh.edu).
  + LUapps can be accessed both on campus and off campus (over VPN).
<center><img src="./images/luapps.png" alt="LUApps website" height=200/></center>

---
#### Topic 3: Getting started with R and RStudio

First, let's explore RStudio.

[<img src="./images/rstudio.png" align="center" alt="RStudio" height=500/>](https://luapps.lehigh.edu)

---
#### Topic 3: Getting started with R and RStudio

First, let's explore RStudio.

<img src="./images/rstudio2.png" align="center" alt="RStudio" height=500/>

---
#### Topic 3: Getting started with R and RStudio

<style>
pre {
  background-color: lightyellow;
  white-space: pre-wrap;
  line-height: 100%;
}
</style>

Next, let's explore R - the engine under the hood.

In true computer science fashion, let's first try typing:

.tiny[
```r
print("Hello world!")
```
]

What happened?

--

.tiny[
```
[1] "Hello world!"
```
]

---
#### Topic 3: Getting started with R and RStudio

Two things to note:

- We didn't just get "Hello world!"; we also got `[1]`. This is how R prints to the screen: `[1]` is the position (index) of the first value shown on that line.

- We didn't need to put anything at the end of the line; we just hit return.

---
#### Topic 3: Getting started with R and RStudio

Now, let's try three things...

Try capitalizing `Print(...)`:

.tiny[
```r
Print("Hello world!")
```
]

Try putting a space between `print` and `("Hello world!")`:

.tiny[
```r
print ("Hello world!")
```
]

Try just entering `"Hello world!"`:

.tiny[
```r
"Hello world!"
```
]

What happened?

--

.tiny[
```
Error in Print("Hello world!"): could not find function "Print"
```

```
[1] "Hello world!"
```

```
[1] "Hello world!"
```
]

---
#### Topic 3: Getting started with R and RStudio

Three things you just learned:

- R is **case-sensitive**.

- R generally ignores extra **whitespace**.

- R will **print** results by default.

---
#### Topic 3: Getting started with R and RStudio

You can also use R as a calculator.
Let's try the following:

.tiny[
```r
2 + 2
```
]

--

.tiny[
```
[1] 4
```
]

--

.tiny[
```r
4 * 2
```
]

--

.tiny[
```
[1] 8
```
]

--

.tiny[
```r
8 / 3
```
]

--

.tiny[
```
[1] 2.666667
```
]

--

.tiny[
```r
exp(log(8)-log(3))
```
]

--

.tiny[
```
[1] 2.666667
```
]

---
#### Topic 3: Getting started with R and RStudio
#### Assignments

We often want to save the results of our calculations, rather than print them to the screen.

To do so, we'll use the **assignment operator**, `<-`.

Here's an example:

.tiny[
```r
x <- log(8)
y <- log(3)
```
]

--

Now we can redo our last calculation using the assignments:

.tiny[
```r
exp(x-y)
```

```
[1] 2.666667
```
]

*Shortcut in RStudio: Option + - (macOS), Alt + - (Windows)

---
#### Topic 3: Getting started with R and RStudio
#### Concatenation

We will often want to work on sequences of values, rather than single values.

To do so, we'll use the **concatenation function**, `c(...)`, which combines values into a vector.

Here's an example:

.tiny[
```r
n <- c(2, 3, 5, 8, 13, 21, 34, 55)
```
]

--

We can now apply operations across the entire vector. For example:

.tiny[
```r
n * 2
```

```
[1]   4   6  10  16  26  42  68 110
```
]

---
#### Topic 3: Getting started with R and RStudio
#### Logicals

It can be useful to know whether our values meet certain conditions.

In addition to **character values** (which we saw when we called `print("Hello world!")`), R also has **logical values**: `TRUE` and `FALSE`.

For example, we can check which numbers in our "n" vector are double-digit:

.tiny[
```r
n
```

```
[1]  2  3  5  8 13 21 34 55
```

```r
is_double_digit <- n > 9
is_double_digit
```

```
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```
]

---
class: inverse

#### Review - Getting started with R and RStudio:

* RStudio has four main windows to keep things **organized**.
* R is **case-sensitive**.
* R can be used as a calculator.
* R can be used to store objects (**assignments**) that can be used with other functions, or compared to other objects.
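
The review points above can be tried together in a few lines. This is only a minimal sketch, not part of the original exercises: the object names `temps_f`, `temps_c`, and `is_warm`, and the temperature values, are made up for illustration.

.tiny[
```r
# Hypothetical values: daily high temperatures in degrees Fahrenheit
temps_f <- c(50, 59, 68, 77, 86)

# R as a calculator: the arithmetic is applied to every element of the vector,
# and the result is stored with the assignment operator
temps_c <- (temps_f - 32) * 5 / 9
temps_c

# Logical comparison: which days were warmer than 20 degrees Celsius?
is_warm <- temps_c > 20
is_warm

# Case sensitivity: `temps_c` exists, but `Temps_C` is a different (undefined) name
exists("temps_c")
exists("Temps_C")
```
]

Storing intermediate results in named objects, as done here, makes each step easy to inspect on its own.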
---
class: center, middle, inverse

#### Topic 4: Objects in R

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---
#### Topic 4: Objects in R

R largely revolves around two things: **objects** and **functions**.

<center><b>Define objects > Apply functions > Repeat!</b></center>

--

For example, we can define a simple object called "n":

.tiny[
```r
n <- c(2, 3, 5, 8, 13, 21, 34, 55)
```
]

--

We can then apply a function to our object. Let's say we're interested in the average, so we'll apply the mean function:

.tiny[
```r
avg.n <- mean(n)
avg.n
```

```
[1] 17.625
```
]

---
#### Topic 4: Objects in R

* Objects come in many different shapes and sizes - like a number, a dataset, or the results of a statistical test.

* Objects are essentially *data* that have a particular **type** and **structure**.

--

* There are six basic **types** (classes) of data in R:
  1. Logical
  2. Double
  3. Integer
  4. Complex
  5. Character
  6. Factors - special case of Integer with Character labels

---
#### Topic 4: Objects in R

* Six basic **types** (classes) of data in R:

.panelset[
.panel[.panel-name[Logical]

Objects often created via comparison(s).

.tiny2[
```r
x <- 1; y <- 2   # Create sample values x and y
z <- x > y       # Is x larger than y?
z                # Print the logical value
```

```
[1] FALSE
```
]

.tiny2[
```r
typeof(z)        # Print the data type of z
```

```
[1] "logical"
```
]
]

.panel[.panel-name[Double*]

Numbers stored in floating point (often approximated); the default numeric type in R*.

.tiny2[
```r
x <- 10.5        # Define object x
x                # Print x
```

```
[1] 10.5
```
]

.tiny2[
```r
typeof(x)        # Print the data type of x
```

```
[1] "double"
```
]
]

.panel[.panel-name[Integer]

Whole numbers (numbers with no fractional part).
.tiny2[
```r
x <- 10          # Define object x
x                # Print x
```

```
[1] 10
```
]

.tiny2[
```r
typeof(x)        # Print the data type of x
```

```
[1] "double"
```
]

.tiny2[
```r
y <- as.integer(10)                    # Declare as integer
z <- 10L                               # Declare as integer by appending "L"
paste(typeof(x),typeof(y),typeof(z))   # Print the data types of x, y, and z
```

```
[1] "double integer integer"
```
]
]

.panel[.panel-name[Complex]

Any number that can be written as a + bi, where *i* is the imaginary unit and a and b are real numbers.

.tiny2[
```r
x <- 1 + 2i      # Create a complex number x
x                # Print the value of x
```

```
[1] 1+2i
```
]

.tiny2[
```r
typeof(x)        # Print the data type of x
```

```
[1] "complex"
```
]
]

.panel[.panel-name[Character]

Used to represent string values in R.

.tiny2[
```r
name <- "Jeremy Mack"    # Assign character string
name                     # Print the character string
```

```
[1] "Jeremy Mack"
```
]

.tiny2[
```r
x <- as.character(3.14)  # Declare number as character string
y <- "3.14"              # Declare as character with " "
c(x,y)                   # Print the character strings
```

```
[1] "3.14" "3.14"
```
]

.tiny2[
```r
c(typeof(x),typeof(y))   # Print the data types of x and y
```

```
[1] "character" "character"
```
]
]

.panel[.panel-name[Factor]

Fixed set of possible values (categorical variables); displayed as characters but stored as integers.

.tiny[
```r
x <- c("A","B","C","D")  # Create a character vector
x <- as.factor(x)        # Declare as factor
x                        # Print the value of x
```

```
[1] A B C D
Levels: A B C D
```
]

.tiny[
```r
typeof(x)                # Print the data type of x
```

```
[1] "integer"
```
]

.tiny[
```r
str(x)                   # Print the structure of x
```

```
 Factor w/ 4 levels "A","B","C","D": 1 2 3 4
```
]
]
]

---
#### Topic 4: Objects in R

* Objects come in many different shapes and sizes - like a number, a dataset, or the results of a statistical test.

* Objects are essentially *data* that have a particular **type** and **structure**.

* There are six basic **types** (classes) of data in R:
  1. Logical
  2. Double
  3. Integer
  4. Complex
  5. Character
  6. Factors - special case of Integer with Character labels

---
#### Topic 4: Objects in R

* Objects come in many different shapes and sizes - like a number, a dataset, or the results of a statistical test.

* Objects are essentially *data* that have a particular **type** and **structure**.

* There are four basic **structures** of data in R:
  1. Scalar
  2. Vector
  3. Matrix
  4. Data frames (and Tibbles)

---
#### Topic 4: Objects in R

* Four basic **structures** of data in R:

.pull-left[
![R objects](./images/objects.png)
]

.pull-right[
* Scalar
* Vector
* Matrix
* Data frames (and Tibbles)
]

---
#### Topic 4: Objects in R

* Four basic **structures** of data in R:

.pull-left[
![R objects](./images/objects2.png)
]

.pull-right[
**Scalar objects:**

1. Hold only one value at a time.
2. Can be used to build more complex objects.
]

---
#### Topic 4: Objects in R

* Four basic **structures** of data in R:

.pull-left[
![R objects](./images/objects2.png)
]

.tiny6.pull-right[
**Scalar objects:**

```r
x <- 10.5
x
```

```
[1] 10.5
```

```r
str(x)
```

```
 num 10.5
```
]

---
#### Topic 4: Objects in R

* Four basic **structures** of data in R:

.pull-left[
![R objects](./images/objects3.png)
]

.pull-right[
**Vector objects:**

1. Hold several values stored as a single object.
2. Hold values of a single type, such as numeric or character (not both!).
] --- #### Topic 4: Objects in R * Four basic **structures** of data in R: .pull-left[ ![R objects](./images/objects3.png) ] .tiny6.pull-right[ **Vector objects:** ```r n <- c(2,3,5,8,13,21,34,55) n ``` ``` [1] 2 3 5 8 13 21 34 55 ``` ```r str(n) ``` ``` num [1:8] 2 3 5 8 13 21 34 55 ``` ] --- #### Topic 4: Objects in R * Four basic **structures** of data in R: .pull-left[ ![R objects](./images/objects3.png) ] .tiny6.pull-right[ **Vector objects:** ```r n <- c(2,3,5,8,13,21,34,"55") n ``` ``` [1] "2" "3" "5" "8" "13" "21" "34" "55" ``` ```r str(n) ``` ``` chr [1:8] "2" "3" "5" "8" "13" "21" "34" "55" ``` ] --- #### Topic 4: Objects in R * Four basic **structures** of data in R: .pull-left[ ![R objects](./images/objects4.png) ] .pull-right[ **Matrix objects:** 1. Large data structure. 2. Has 2-dimensions, representing its height (rows) and width (columns). 3. Can be either numeric or character (not both!). ] --- #### Topic 4: Objects in R * Four basic **structures** of data in R: .pull-left[ ![R objects](./images/objects4.png) ] .tiny6.pull-right[ **Matrix objects:** ```r x <- 1:5 y <- 6:10 z <- 11:15 m <- cbind(x,y,z) class(m) ``` ``` [1] "matrix" "array" ``` ```r str(m) ``` ``` int [1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:3] "x" "y" "z" ``` ] --- #### Topic 4: Objects in R * Four basic **structures** of data in R: .pull-left[ ![R objects](./images/objects4.png) ] .pull-right[ **Data frame objects:** 1. Large data structure. 2. Has 2-dimensions, representing its height (rows) and width (columns). 3. Can be a mix of data types. ] --- #### Topic 4: Objects in R * Four basic **structures** of data in R: .pull-left[ ![R objects](./images/objects4.png) ] .tiny6.pull-right[ **Data frame objects:** ```r survey <- data.frame( "id" = c(1,2,3,4,5), "sex" = c("m","m","m","f","f"), "age" = c(99,46,23,54,23)) class(survey) ``` ``` [1] "data.frame" ``` ```r str(survey) ``` ``` 'data.frame': 5 obs. 
of 3 variables: $ id : num 1 2 3 4 5 $ sex: chr "m" "m" "m" "f" ... $ age: num 99 46 23 54 23 ``` ] --- #### Topic 4: Objects in R * Four basic **structures** of data in R: .pull-left[ ![R objects](./images/objects4.png) ] .pull-right[ **Tibble objects:** 1. Large data structures. 2. Has 2-dimensions, representing its height (rows) and width (columns). 3. Can be a mix of data types. {{content}} ] -- 4. "Lazy data frames" {{content}} * Do less and complain more. --- #### Topic 4: Objects in R * Four basic **structures** of data in R: .pull-left[ ![R objects](./images/objects4.png) ] .tiny6.pull-right[ **Tibble objects:** ```r pacman::p_load(tibble) survey <- tibble( "id" = c(1,2,3,4,5), "sex" = c("m","m","m","f","f"), "age" = c(99,46,23,54,23)) class(survey) ``` ``` [1] "tbl_df" "tbl" "data.frame" ``` ```r str(survey) ``` ``` tibble [5 × 3] (S3: tbl_df/tbl/data.frame) $ id : num [1:5] 1 2 3 4 5 $ sex: chr [1:5] "m" "m" "m" "f" ... $ age: num [1:5] 99 46 23 54 23 ``` ] --- #### Topic 4: Objects in R * Four basic **structures** of data in R: <center><img src="./images/tidydata_1.jpg" alt="RStudio" height=350/></center> --- class: inverse #### Review - Objects in R: * Define objects > Apply functions > Repeat! * Objects are data that have a type and structure. * There are six basic types of data: 1. Logical 2. Double - default data type in R 3. Integer 4. Complex 5. Character 6. Factors - special case of Integer with Character labels * There are four basic structures: 1. Scalar 2. Vector 3. Matrix 4. Data frames (and Tibbles) --- class: center, middle, inverse #### Topic 5: Functions in R <br/><br/> <br/><br/> <br/><br/> <br/><br/> <br/><br/> <br/><br/> --- #### Topic 5: Functions in R R largely revolves around two things: **objects** and **functions**. <center><b>Define objects > Apply functions > Repeat!</b></center> For example, we can define a simple object called "n": .tiny[ ```r n <- c(2, 3, 5, 8, 13, 21, 34, 55) ``` ] We can then apply a function to our object. 
Let's say we're interested in the average, so we'll apply the mean function:

.tiny[
```r
avg.n <- mean(n)
avg.n
```

```
[1] 17.625
```
]

---
#### Topic 5: Functions in R

* Functions are procedures that typically take one or more objects as **arguments** (i.e., inputs), do something with them, and then return a new object (i.e., result).

* Cooking analogy:
  * Functions (recipe) + Arguments (ingredients) = Result (meal)

--

* There are two basic types of functions in R:
  1. User-defined functions

--

.tiny6[
```r
ages <- c(99,46,23,54,23)
age_mean <- function(x) {
  summation <- sum(x)
  summation / length(x)
}
age_mean(ages)
```

```
[1] 49
```
]

---
#### Topic 5: Functions in R

* Functions are procedures that typically take one or more objects as **arguments** (i.e., inputs), do something with them, and then return a new object (i.e., result).

* Cooking analogy:
  * Functions (recipe) + Arguments (ingredients) = Result (meal)

* There are two basic types of functions in R:
  1. User-defined functions
  2. Built-in functions that are loaded via **Packages** (Community development!)

---
#### Topic 5: Functions (and Packages) in R

.pull-right5.footnote[
<img src="./images/Rpackages.png" width=200 alt="ggplot" align="right"/>
]

Packages in R:

* R packages bundle ready-made functions that can be installed into the R environment.

* In addition to functions, packages also include data sets, help documentation, and how-to examples (i.e., vignettes).

--

* When you install R for the first time, you are installing [base R](<https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html>), which includes functions written by the original authors of the R language.

--

* Additional packages are developed by the [R community](<https://cran.case.edu>).

--

* How do we get packages loaded into R?

---
#### Topic 5: Functions (and Packages) in R

How do we get packages loaded into R?

* Two-step process:
  1. Install the package - do this once
  2. Load the package into R - do this every time

.panelset[
.panel[.panel-name[Programmatically]

.tiny6[
```r
# Traditional methods
install.packages("ggplot2")   # Install - quotes are necessary
library("ggplot2")            # Load into R environment - quotes are optional
library(ggplot2)

# More efficient way
install.packages("pacman")    # Install pacman package
library(pacman)               # Load into R environment
p_load(ggplot2, dplyr)        # Use p_load function to install and load multiple packages
```
]
]

.panel[.panel-name[RStudio IDE]

<center><img src="./images/explorer1.png" alt="R timeline" height=300/></center>

]
]

---
#### Topic 5: Functions (and Packages) in R

* Tidyverse - collection of R packages for data science
* Underlying design philosophy, grammar, and data structures
* [Supported by RStudio](<https://www.tidyverse.org>)

<br/><br/>

.right[
<img src="./images/tidyverse.png" alt="Tidyverse" height=300/>
<img src="./images/tidyverse2.png" alt="Tidyverse" height=300/>
]

---
#### Topic 5: Functions (and Packages) in R

<center><img src="./images/tidy1.png" alt="RStudio" height=350/></center>

---
#### Topic 5: Functions (and Packages) in R

<center><img src="./images/tidy2.png" alt="RStudio" height=350/></center>

---
#### Topic 5: Functions (and Packages) in R

<center><img src="./images/tidy3.png" alt="RStudio" height=350/></center>

---
#### Topic 5: Functions (and Packages) in R

* Tidyverse packages can be installed and loaded individually:

.tiny6[
```r
# Install and load dplyr package (classic, two-step way)
install.packages("dplyr"); library(dplyr)
```
]

--

* Or, in bulk, with the **tidyverse** package:

.tiny6[
```r
# Install and load tidyverse packages (classic, two-step way)
install.packages("tidyverse"); library(tidyverse)

# List tidyverse packages
tidyverse_packages(include_self=FALSE)
```

```
 [1] "broom"         "cli"           "crayon"        "dbplyr"
 [5] "dplyr"         "dtplyr"        "forcats"       "googledrive"
 [9] "googlesheets4" "ggplot2"       "haven"         "hms"
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"
[17] "modelr"
"pillar" "purrr" "readr" [21] "readxl" "reprex" "rlang" "rstudioapi" [25] "rvest" "stringr" "tibble" "tidyr" [29] "xml2" ``` ] --- class: inverse #### Review - Functions (and Packages) in R: * Define objects > Apply functions > Repeat! * Functions are used to work with objects in R and are loaded via packages. * Standard functions (**base R**) load automatically when R is opened. * There is also a **large community** of users that develop packages for R (18,900+ and counting!). * Tidyverse - collection of R packages for data science, supported by RStudio. * Packages can be loaded both programmatically and with the RStudio IDE. * Two step process: 1. Install package 2. Load package into library --- class: center, middle, inverse #### Topic 6: Data and Data Wrangling <br/><br/> <br/><br/> <br/><br/> <br/><br/> <br/><br/> <br/><br/> --- #### Topic 6: Data and Data Wrangling <center><img src="./images/data-science.png" alt="RStudio" height=350/></center> --- #### Topic 6: Data and Data Wrangling <center><img src="./images/data-science2.png" alt="RStudio" height=350/></center> --- #### Topic 6: Data and Data Wrangling Basic steps to working with data in R: * Check and/or set a working directory. * Load data. * Wrangle data (Explore, Summarize, and Analyze)! --- #### Topic 6: Data and Data Wrangling Basic steps to working with data in R: * Check and/or set a working directory. .panelset[ .panel[.panel-name[Programmatically] .tiny6[ ```r getwd() # Prints current working directory ``` ``` [1] "h:/" ``` ] .tiny6[ ```r setwd("h:/Lehigh") # Sets path to working directory getwd() # Prints current working directory ``` ``` [1] "h:/Lehigh" ``` ] ] .panel[.panel-name[RStudio IDE] <center><img src="./images/explorer2.png" alt="R timeline" height=300 </></center> ] ] --- #### Topic 6: Data and Data Wrangling Basic steps to working with data in R: * Check and/or set a working directory. * Load data. 
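
The two steps above can be sketched together as follows. This is only an illustration under stated assumptions: the folder name `data`, the file name `example.csv`, and the small data frame are hypothetical, and `tempdir()` is used so the sketch runs without touching your own files.

.tiny6[
```r
# Step 1: check the current working directory
getwd()

# Build file paths with file.path() instead of pasting strings together
data_dir  <- file.path(tempdir(), "data")         # hypothetical folder
data_file <- file.path(data_dir, "example.csv")   # hypothetical file name

# Create the folder if it does not exist yet
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)

# Write a small made-up data set so there is something to load
example <- data.frame(id = 1:3, score = c(90, 85, 77))
write.csv(example, data_file, row.names = FALSE)

# Step 2: load the data and confirm it arrived as expected
loaded <- read.csv(data_file)
str(loaded)
```
]

Checking the result with `str()` right after loading is a quick way to catch a wrong path or an unexpected column type.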
---
#### Topic 6: Data and Data Wrangling

Basic steps to working with data in R:

* Check and/or set a working directory.

* Load data.

.panelset[
.panel[.panel-name[Programmatically]

.tiny6[
```r
# base R options (read to a data frame)
read.table()  # Reads tabular data
read.csv()    # Reads comma separated files
read.delim()  # Reads tab separated files

# readr package options (read to a tibble)
read_table()  # Reads tabular data
read_csv()    # Reads comma separated files
read_delim()  # Reads files with any delimiter (e.g., tabs)

# readxl package options (read to a tibble)
read_xlsx()   # Reads .xlsx files
read_xls()    # Reads .xls files
```
]
]

.panel[.panel-name[RStudio IDE]

<img src="./images/environment2.png" alt="RStudio" height=288/>
<img src="./images/environment1.png" alt="RStudio" height=288/>

]
]

---
#### Topic 6: Data and Data Wrangling

Basic steps to working with data in R:

* Check and/or set a working directory.

* Load data.

* Wrangle data (Explore, Summarize, and Analyze)!

<center><img src="./images/data_cowboy.png" alt="R timeline" height=300/></center>

---
#### Topic 6: Data and Data Wrangling

Exploring data in R

* Let's use a dataset from the web to practice saving and loading data:

.tiny6[
```r
# Set url link for data download
url <- "https://raw.githubusercontent.com/jeremymack-LU/rprog/master/mpg.csv"

# Make sure a "data" folder exists in the working directory
dir.create("data", showWarnings = FALSE)

# Download file to the "data" folder in the working directory
download.file(url, "data/mpg.csv")

# Read data into R
mpg <- read.csv("data/mpg.csv")
```
]

---
#### Topic 6: Data and Data Wrangling

Exploring data in R

* Helpful functions for exploring data in R:

.tiny6[
```r
View(x)       # View the dataset in a spreadsheet
str(x)        # Print the structure of the data frame
head(x)       # Print the first few rows
tail(x)       # Print the last few rows
nrow(x)       # Print the number of rows
ncol(x)       # Print the number of columns
dim(x)        # Print the dimensions (rows x columns)
rownames(x)   # Print row names
colnames(x)   # Print column names
```
]

---
#### Topic 6: Data and Data Wrangling

Exploring data in R

.tiny6[
* Let's
check to make sure our data loaded correctly: ```r head(mpg, 10) ``` ``` manufacturer model displ year cyl trans drv cty hwy fl class 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact 7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28 p compact ``` ] --- #### Topic 6: Data and Data Wrangling Exploring data in R .tiny6[ * Next, we'll check our data structure: ```r str(mpg) ``` ``` 'data.frame': 234 obs. of 11 variables: $ manufacturer: chr "audi" "audi" "audi" "audi" ... $ model : chr "a4" "a4" "a4" "a4" ... $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ... $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ... $ cyl : int 4 4 4 4 6 6 6 4 4 4 ... $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ... $ drv : chr "f" "f" "f" "f" ... $ cty : int 18 21 20 21 16 18 18 18 16 20 ... $ hwy : int 29 29 31 30 26 26 27 26 25 28 ... $ fl : chr "p" "p" "p" "p" ... $ class : chr "compact" "compact" "compact" "compact" ... 
```
]

---
#### Topic 6: Data and Data Wrangling
Exploring data in R

.tiny6[
* Selecting specific data - R reads data as row x column, using brackets [r,c]:

```r
mpg[1,1] # Print value in row 1, column 1
```

```
[1] "audi"
```

```r
mpg[6,10] # Print value in row 6, column 10
```

```
[1] "p"
```
]

---
#### Topic 6: Data and Data Wrangling
Exploring data in R

.tiny6[
* Selecting specific data - R reads data as row x column, using brackets [r,c]:

```r
mpg[1,] # Print the first row
```

```
  manufacturer model displ year cyl    trans drv cty hwy fl   class
1         audi    a4   1.8 1999   4 auto(l5)   f  18  29  p compact
```
]

.tiny6[
```r
mpg[c(1,5),] # Print the first and fifth rows
```

```
  manufacturer model displ year cyl    trans drv cty hwy fl   class
1         audi    a4   1.8 1999   4 auto(l5)   f  18  29  p compact
5         audi    a4   2.8 1999   6 auto(l5)   f  16  26  p compact
```
]

.tiny6[
```r
mpg[1:5,] # Print the first through fifth rows
```

```
  manufacturer model displ year cyl      trans drv cty hwy fl   class
1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
```
]

---
#### Topic 6: Data and Data Wrangling
Exploring data in R

.tiny6[
* Selecting specific data - R reads data as row x column, using brackets [r,c]:

```r
mpg[,1] # Print column 1 by number (as a vector)
```
]

.tiny6[
```
  [1] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
  [7] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
 [13] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
 [19] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [25] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [31] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [37] "chevrolet" "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [43] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [49] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [55] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [61] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [67] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [73] "dodge"     "dodge"     "ford"      "ford"      "ford"      "ford"
 [79] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [85] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [91] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [97] "ford"      "ford"      "ford"      "honda"     "honda"     "honda"
[103] "honda"     "honda"     "honda"
```
]

---
#### Topic 6: Data and Data Wrangling
Exploring data in R

.tiny6[
* Selecting specific data - R reads data as row x column, using brackets [r,c]:

```r
mpg[,'manufacturer'] # Print column 1 by name (as a vector)
```
]

.tiny6[
```
  [1] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
  [7] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
 [13] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
 [19] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [25] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [31] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [37] "chevrolet" "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [43] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [49] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [55] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [61] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [67] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [73] "dodge"     "dodge"     "ford"      "ford"      "ford"      "ford"
 [79] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [85] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [91] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [97] "ford"      "ford"      "ford"      "honda"     "honda"     "honda"
[103] "honda"     "honda"     "honda"
```
]

---
#### Topic 6: Data and Data Wrangling
Exploring data in R

.tiny6[
* Selecting specific data - R reads data as row x column, using brackets [r,c]:

```r
mpg$manufacturer # Print column 1 by name (as a vector)
```
]

.tiny6[
```
  [1] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
  [7] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
 [13] "audi"      "audi"      "audi"      "audi"      "audi"      "audi"
 [19] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [25] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [31] "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet" "chevrolet"
 [37] "chevrolet" "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [43] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [49] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [55] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [61] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [67] "dodge"     "dodge"     "dodge"     "dodge"     "dodge"     "dodge"
 [73] "dodge"     "dodge"     "ford"      "ford"      "ford"      "ford"
 [79] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [85] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [91] "ford"      "ford"      "ford"      "ford"      "ford"      "ford"
 [97] "ford"      "ford"      "ford"      "honda"     "honda"     "honda"
[103] "honda"     "honda"     "honda"
```
]

---
#### Topic 6: Data and Data Wrangling
Exploring data in R

.tiny6[
* Selecting specific data - R reads data as row x column, using brackets [r,c]:

```r
mpg[1] # Print column 1 by number (as a data frame)
```
]

.tiny6[
```
   manufacturer
1          audi
2          audi
3          audi
4          audi
5          audi
6          audi
7          audi
8          audi
9          audi
10         audi
11         audi
12         audi
13         audi
14         audi
15         audi
16         audi
17         audi
```
]

---
#### Topic 6: Data and Data Wrangling
Exploring data in R

.tiny6[
* Selecting specific data - R reads data as row x column, using brackets [r,c]:

```r
mpg['manufacturer'] # Print column 1 by name (as a data frame)
```
]

.tiny6[
```
   manufacturer
1          audi
2          audi
3          audi
4          audi
5          audi
6          audi
7          audi
8          audi
9          audi
10         audi
11         audi
12         audi
13         audi
14         audi
15         audi
16         audi
17         audi
```
]

---
#### Topic 6: Data and Data Wrangling
Basic steps to working with data in R:

* Check and/or set a working directory.
* Load data.
* Wrangle data (~~Explore,~~ Summarize, and Analyze)!

<center><img src="./images/data_cowboy.png" alt="R timeline" height=300/></center>

---
#### Topic 6: Data and Data Wrangling
Summarizing data in R

* Helpful functions for summarizing data in R:

.tiny6[
```r
mean(x)   # Calculate and return the average of the input values
max(x)    # Return the maximum of the input values
min(x)    # Return the minimum of the input values
sd(x)     # Calculate and return the standard deviation of the input values
length(x) # Return the number of input values
```
]

---
#### Topic 6: Data and Data Wrangling
Summarizing data in R

* What if we wanted to summarize the highway mpg data in our dataset?

.tiny6[
```r
hwy.avg <- mean(mpg$hwy) # Average value for hwy
hwy.max <- max(mpg$hwy)  # Maximum value for hwy
hwy.min <- min(mpg$hwy)  # Minimum value for hwy
hwy.sd  <- sd(mpg$hwy)   # Standard deviation for hwy
data.frame(hwy.avg, hwy.max, hwy.min, hwy.sd) # Combine objects into a data frame
```

```
   hwy.avg hwy.max hwy.min   hwy.sd
1 23.44017      44      12 5.954643
```
]

--

.tiny6[
```r
summary(mpg$hwy) # Quick summary for hwy
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  12.00   18.00   24.00   23.44   27.00   44.00
```
]

---
#### Topic 6: Data and Data Wrangling
Summarizing data in R

* What happens if there are missing data?

.tiny6[
```r
mpg2 <- mpg            # Copy the data
mpg2[1,"hwy"] <- NA    # Replace the first hwy value with a missing value
mpg2[1,]               # Print the first row
```

```
  manufacturer model displ year cyl    trans drv cty hwy fl   class
1         audi    a4   1.8 1999   4 auto(l5)   f  18  NA  p compact
```
]

--

.tiny6[
```r
mean(mpg2$hwy) # Average value for hwy
```

```
[1] NA
```
]

--

.tiny6[
* Pay attention to additional arguments, such as `na.rm`:

```r
mean(mpg2$hwy, na.rm=TRUE)
```

```
[1] 23.41631
```
]

---
#### Topic 6: Data and Data Wrangling
Summarizing data in R

* What if we wanted to add a new variable (i.e., column)?

.tiny6[
```r
# Let's add a new column of the average mpg
mpg$avg <- (mpg$cty+mpg$hwy)/2
head(mpg, n=5) # Print the first five rows
```

```
  manufacturer model displ year cyl      trans drv cty hwy fl   class  avg
1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact 23.5
2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact 25.0
3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact 25.5
4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact 25.5
5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact 21.0
```
]

---
#### Topic 6: Data and Data Wrangling
Basic steps to working with data in R:

* Check and/or set a working directory.
* Load data.
* Wrangle data (~~Explore, Summarize,~~ and Analyze)!

<center><img src="./images/data_cowboy.png" alt="R timeline" height=300/></center>

---
#### Topic 6: Data and Data Wrangling
Analyzing data in R

* Helpful functions for analyzing data in R:

.tiny6[
```r
lm(x)        # Fit a linear model
glm(x)       # Fit a generalized linear model
t.test(x)    # Perform a t-test for a difference between means
aov(x)       # Fit an analysis of variance model
prop.test(x) # Test for a difference between proportions
```
]

---
#### Topic 6: Data and Data Wrangling
Analyzing data in R

* How does highway mileage change with engine displacement?

.tiny6[
```r
# Quick plot of the data
with(mpg, plot(x=displ, y=hwy))
```
]

<center><img src="./images/Rplot1.jpeg" alt="RStudio" height=300/></center>

---
#### Topic 6: Data and Data Wrangling
Analyzing data in R

* How does highway mileage change with engine displacement?

.tiny6[
```r
# Apply a simple linear model
mod <- lm(hwy~displ, data=mpg)
summary(mod)
```
]

.tiny6[
```
Call:
lm(formula = hwy ~ displ, data = mpg)

Residuals:
    Min      1Q  Median      3Q     Max
-7.1039 -2.1646 -0.2242  2.0589 15.0105

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  35.6977     0.7204   49.55   <2e-16 ***
displ        -3.5306     0.1945  -18.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.836 on 232 degrees of freedom
Multiple R-squared:  0.5868, Adjusted R-squared:  0.585
F-statistic: 329.5 on 1 and 232 DF,  p-value: < 2.2e-16
```
]

---
#### Topic 6: Data and Data Wrangling
Analyzing data in R

* Does highway mileage significantly differ between car classes?

.tiny6[
```r
# Quick plot of the data
with(mpg, plot(x=class, y=hwy))
```
]

<center><img src="./images/Rplot2.jpeg" alt="RStudio" height=300/></center>

---
#### Topic 6: Data and Data Wrangling
Analyzing data in R

* Does highway mileage significantly differ between car classes?

.tiny6[
```r
# Apply a simple linear model with an ANOVA test
mod2 <- lm(hwy~class, data=mpg)
aov2 <- aov(mod2)
summary(aov2)
```
]

.tiny6[
```
             Df Sum Sq Mean Sq F value Pr(>F)
class         6   5683   947.2   83.39 <2e-16 ***
Residuals   227   2578    11.4
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

---
class: inverse
#### Exercise - Summarizing Data:

* The *USArrests* dataset provides data on the number of arrests per 100,000 residents for violent crimes (assault, murder, and rape) in each of the 50 US states in 1973.
* First assign that data to an object called "crime":

.tiny[
```r
crime <- USArrests
```
]

* Using that data, try the following:
  1. Calculate average number of arrests for assault.
  2. Identify maximum number of arrests for assault.
  3. Print the statistics for Pennsylvania.
  4. Was there a linear relationship between murders and assaults?
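
* Not sure where to begin? A minimal sketch, assuming only base R and the built-in `USArrests` data, for checking how the data are laid out before answering:

.tiny[
```r
crime <- USArrests          # Built-in dataset, assigned as on this slide
dim(crime)                  # 50 rows (one per state) and 4 columns
names(crime)                # Column names: Murder, Assault, UrbanPop, Rape
head(rownames(crime), n=3)  # States are stored as row names, not in a column
```
]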

---
class: inverse
#### Exercise - Summarizing Data:

* The *USArrests* dataset provides data on the number of arrests per 100,000 residents for violent crimes (assault, murder, and rape) in each of the 50 US states in 1973.
* First assign that data to an object called "crime"

.tiny[
```r
crime <- USArrests
```
]

* Using that data, try the following:
  1. Calculate average number of arrests for assault.
  2. Identify maximum number of arrests for assault.
  3. Print the statistics for Pennsylvania.
  4. Was there a linear relationship between murders and assaults?
* How did you do?

---
#### Exercise - Summarizing Data:

* Calculate average number of arrests for assault.

--

.tiny[
```r
mean(crime$Assault)
```

```
[1] 170.76
```
]

--

.tiny[
```r
mean(crime[,2])
```

```
[1] 170.76
```
]

--

.tiny[
```r
sum(crime$Assault)/length(crime$Assault)
```

```
[1] 170.76
```
]

---
#### Exercise - Summarizing Data:

* Identify maximum number of arrests for assault.

--

.tiny[
```r
max(crime$Assault)
```

```
[1] 337
```
]

--

.tiny[
```r
x <- sort(crime$Assault, decreasing=TRUE)
x[1]
```

```
[1] 337
```
]

--

.tiny[
```r
x <- sort(crime$Assault)
x[50]
```

```
[1] 337
```
]

---
#### Exercise - Summarizing Data:

* Print the statistics for Pennsylvania.

--

.tiny[
```r
crime["Pennsylvania",]
```

```
             Murder Assault UrbanPop Rape
Pennsylvania    6.3     106       72 14.9
```
]

--

.tiny[
```r
crime[38,]
```

```
             Murder Assault UrbanPop Rape
Pennsylvania    6.3     106       72 14.9
```
]

---
#### Exercise - Summarizing Data:

* Was there a linear relationship between murders and assaults?

--

.tiny[
```r
# Quick plot of the data
with(crime, plot(x=Assault, y=Murder))
```
]

<center><img src="./images/Rplot3.jpeg" alt="RStudio" height=300/></center>

---
#### Exercise - Summarizing Data:

* Was there a linear relationship between murders and assaults?

.tiny[
```r
# Apply a simple linear model
mod <- lm(Murder~Assault, data=crime)
summary(mod)
```
]

.tiny2[
```
Call:
lm(formula = Murder ~ Assault, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max
-4.8528 -1.7456 -0.3979  1.3044  7.9256

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.631683   0.854776   0.739    0.464
Assault     0.041909   0.004507   9.298  2.6e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.629 on 48 degrees of freedom
Multiple R-squared:  0.643, Adjusted R-squared:  0.6356
F-statistic: 86.45 on 1 and 48 DF,  p-value: 2.596e-12
```
]

---
class: center, middle, inverse, title-slide

#### Topic 6: Data and Data Wrangling
#### Tidyverse
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---
#### Topic 6: Data and Data Wrangling - Tidyverse

* Core packages - dplyr, forcats, ggplot2, purrr, readr, tibble, tidyr, stringr

.pull-left4[
* **dplyr** package
  + Introduces a consistent set of functions (verbs)
  + Applied across all Tidyverse packages
]

.pull-right4[
<br/>
<center><img src="./images/dplyr.png" alt="RStudio" height=150/></center>
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse

* Helpful functions - **dplyr**:

.tiny6.pull-left2[
```r
filter(x)    # picks cases based on their values
select(x)    # picks columns based on their names
slice(x)     # picks rows by position
arrange(x)   # changes the ordering of rows
group_by(x)  # allows operations by groups
mutate(x)    # adds new variables to a dataset
summarise(x) # reduces multiple values to a summary
count(x)     # counts number of rows in a group
add_row(x)   # adds a row of data to a data frame (from tibble)
```
]

.pull-right2[
<center><img src="./images/dplyr_filter.jpg" alt="RStudio" width=400/></center>
<center><img src="./images/dplyr_mutate.png" alt="RStudio" width=250/></center>
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse
Exploring data in R

* Let's check to make sure our data loaded correctly:

.panelset[
.panel[.panel-name[Tidyverse]
.tiny6[
```r
# Read in data and check first ten rows w/ Tidyverse functions
url <- "https://raw.githubusercontent.com/jeremymack-LU/rprog/master/mpg.csv"
mpg <- read_csv(url)
slice_head(mpg, n=10)
```

```
# A tibble: 10 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
```
]
]
.panel[.panel-name[base R]
.tiny6[
```r
# Read in data and check first ten rows w/ base R functions
url <- "https://raw.githubusercontent.com/jeremymack-LU/rprog/master/mpg.csv"
mpg <- read.csv(url)
head(mpg, n=10)
```

```
   manufacturer      model displ year cyl      trans drv cty hwy fl   class
1          audi         a4   1.8 1999   4   auto(l5)   f  18  29  p compact
2          audi         a4   1.8 1999   4 manual(m5)   f  21  29  p compact
3          audi         a4   2.0 2008   4 manual(m6)   f  20  31  p compact
4          audi         a4   2.0 2008   4   auto(av)   f  21  30  p compact
5          audi         a4   2.8 1999   6   auto(l5)   f  16  26  p compact
6          audi         a4   2.8 1999   6 manual(m5)   f  18  26  p compact
7          audi         a4   3.1 2008   6   auto(av)   f  18  27  p compact
8          audi a4 quattro   1.8 1999   4 manual(m5)   4  18  26  p compact
9          audi a4 quattro   1.8 1999   4   auto(l5)   4  16  25  p compact
10         audi a4 quattro   2.0 2008   4 manual(m6)   4  20  28  p compact
```
]
]
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse
Summarizing data in R

* What if we wanted to summarize the highway mpg data in our dataset?

.panelset[
.panel[.panel-name[Tidyverse]
.tiny6[
```r
summarize(mpg,                # Data
          hwy.avg=mean(hwy),  # Average value for hwy
          hwy.max=max(hwy),   # Maximum value for hwy
          hwy.min=min(hwy),   # Minimum value for hwy
          hwy.sd=sd(hwy))     # Standard deviation for hwy
```

```
# A tibble: 1 × 4
  hwy.avg hwy.max hwy.min hwy.sd
    <dbl>   <dbl>   <dbl>  <dbl>
1    23.4      44      12   5.95
```
]
]
.panel[.panel-name[base R]
.tiny6[
```r
hwy.avg <- mean(mpg$hwy) # Average value for hwy
hwy.max <- max(mpg$hwy)  # Maximum value for hwy
hwy.min <- min(mpg$hwy)  # Minimum value for hwy
hwy.sd  <- sd(mpg$hwy)   # Standard deviation for hwy
data.frame(hwy.avg, hwy.max, hwy.min, hwy.sd) # Combine objects into a data frame
```

```
   hwy.avg hwy.max hwy.min   hwy.sd
1 23.44017      44      12 5.954643
```
]
]
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse
Summarizing data in R

* What if we wanted to add a new variable (i.e., column)?

.panelset[
.panel[.panel-name[Tidyverse]
.tiny6[
```r
# Let's add a new column of the average mpg
mpg <- mutate(mpg, avg=(cty+hwy)/2)
slice_head(mpg, n=5) # Print the first five rows
```

```
# A tibble: 5 × 12
  manufacturer model displ  year   cyl trans drv     cty   hwy fl    class   avg
  <chr>        <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…  23.5
2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…  25
3 audi         a4      2    2008     4 manu… f        20    31 p     comp…  25.5
4 audi         a4      2    2008     4 auto… f        21    30 p     comp…  25.5
5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…  21
```
]
]
.panel[.panel-name[base R]
.tiny6[
```r
# Let's add a new column of the average mpg
mpg$avg <- (mpg$cty+mpg$hwy)/2
head(mpg, n=5) # Print the first five rows
```

```
  manufacturer model displ year cyl      trans drv cty hwy fl   class  avg
1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact 23.5
2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact 25.0
3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact 25.5
4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact 25.5
5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact 21.0
```
]
]
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse

* Core packages - dplyr, forcats, ggplot2, purrr, readr, tibble, tidyr, stringr

.pull-left4[
* **dplyr** package
  + Introduces a consistent set of functions (verbs)
  + Applied across all Tidyverse packages
<br/><br/>
<br/><br/>
<br/><br/>
  + Imports pipe operator (%>%) from **magrittr** package
  + Forwards an object into a function
]

.pull-right4[
<br/>
<center><img src="./images/dplyr.png" alt="RStudio" height=150/></center>
<br/><br/>
<center><img src="./images/magrittr.jpg" alt="RStudio" height=170/></center>
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse
Summarizing data in R

* What if we wanted to summarize the data frame we created earlier?

.tiny6.pull-left[
```r
# Summarise the hwy variable
summarize(mpg,                # Data
          hwy.avg=mean(hwy),  # Average hwy
          hwy.max=max(hwy),   # Maximum hwy
          hwy.min=min(hwy),   # Minimum hwy
          hwy.sd=sd(hwy))     # Std. deviation
```

```
# A tibble: 1 × 4
  hwy.avg hwy.max hwy.min hwy.sd
    <dbl>   <dbl>   <dbl>  <dbl>
1    23.4      44      12   5.95
```
]

--

.tiny6.pull-right[
```r
# Summarise the hwy variable
# Pipe mpg object into the summarize function
mpg %>%
  summarize(hwy.avg=mean(hwy),
            hwy.max=max(hwy),
            hwy.min=min(hwy),
            hwy.sd=sd(hwy))
```

```
# A tibble: 1 × 4
  hwy.avg hwy.max hwy.min hwy.sd
    <dbl>   <dbl>   <dbl>  <dbl>
1    23.4      44      12   5.95
```
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse

* Core packages - dplyr, forcats, ggplot2, purrr, readr, tibble, tidyr, stringr

.pull-left4[
* **dplyr** package
  + Introduces a consistent set of functions (verbs)
  + Applied across all Tidyverse packages
<br/><br/>
<br/><br/>
<br/><br/>
  + Imports pipe operator (%>%) from **magrittr** package
  + Forwards an object into a function
  + Chains multiple functions without nesting or creating multiple objects
]

.pull-right4[
<br/>
<center><img src="./images/dplyr.png" alt="RStudio" height=150/></center>
<br/><br/>
<center><img src="./images/magrittr.jpg" alt="RStudio" height=170/></center>
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse

* For example, in our *mpg* dataset, let's say we're interested in the average highway mpg of cars in 2008, grouped by their number of cylinders.
  + Multiple objects approach:

.tiny6.pull-left[
```r
a <- filter(mpg, year==2008)
b <- group_by(a, cyl)
c <- summarize(b, Avg=mean(hwy))
d <- arrange(c, desc(Avg))
print(d)
```
]

.tiny6.pull-right[
```
# A tibble: 4 × 2
    cyl   Avg
  <dbl> <dbl>
1     4  29.3
2     5  28.8
3     6  23.5
4     8  18
```
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse

* For example, in our *mpg* dataset, let's say we're interested in the average highway mpg of cars in 2008, grouped by their number of cylinders.
  + Nested approach:

.tiny6.pull-left[
```r
arrange(
  summarize(
    group_by(
      filter(mpg, year==2008),
      cyl),
    Avg = mean(hwy)),
  desc(Avg)
)
```
]

.tiny6.pull-right[
```
# A tibble: 4 × 2
    cyl   Avg
  <dbl> <dbl>
1     4  29.3
2     5  28.8
3     6  23.5
4     8  18
```
]

---
#### Topic 6: Data and Data Wrangling - Tidyverse

* For example, in our *mpg* dataset, let's say we're interested in the average highway mpg of cars in 2008, grouped by their number of cylinders.
  + Piping approach:

.tiny6.pull-left[
```r
mpg %>%
  filter(year==2008) %>%
  group_by(cyl) %>%
  summarize(Avg=mean(hwy)) %>%
  arrange(desc(Avg))
```
]

.tiny6.pull-right[
```
# A tibble: 4 × 2
    cyl   Avg
  <dbl> <dbl>
1     4  29.3
2     5  28.8
3     6  23.5
4     8  18
```
]

---
class: inverse
#### Review - Data and Data Wrangling:

* Working directories and data can be set and loaded programmatically, or with the RStudio IDE.
* R identifies data by row (observation) then column (variable).
* Pay attention to function arguments - missing values can cause problems!
* Tidyverse packages provide a consistent language (functions) and grammar (arguments) that integrate nicely with a piping workflow.
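
* As a reminder of the third point, a minimal sketch with a made-up vector `x` (an assumption for illustration, not part of the *mpg* data): other base R summary functions also return `NA` when a value is missing unless `na.rm=TRUE` is set, and `length()` counts missing values too.

.tiny6[
```r
x <- c(12, NA, 44)   # Hypothetical values with one missing entry
max(x)               # NA, because one value is missing
max(x, na.rm=TRUE)   # 44, the missing value is dropped first
length(x)            # 3, length() counts the NA as well
sum(!is.na(x))       # 2, the number of non-missing values
```
]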

<br/><br/>
<br/><br/>
.pull-right6[<img src="./images/tidyverse5.png" alt="RStudio" height=200/> <img src="./images/magrittr2.png" alt="RStudio" height=200/>]

---
class: center, middle, inverse

#### Topic 7: Extras - RStudio Projects,
#### Other things to do in R, and Resources
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---
#### Topic 7: Extras - RStudio Projects, Other things to do in R, and Resources
Basic steps to working with data in R:

* Check and/or set a working directory.
* Load data.
* Wrangle data (Explore, Summarize, and Analyze)!

---
#### Topic 7: Extras - RStudio Projects, Other things to do in R, and Resources
Basic steps to working with data in R:

* ~~Check and/or set a working directory.~~ Set up an RStudio Project.
* Load data.
* Wrangle data (Explore, Summarize, and Analyze)!

---
#### Topic 7: Extras - RStudio Projects, Other things to do in R, and Resources
RStudio Projects:

.pull-right2[
<center><img src="./images/cracked_setwd.png" alt="R timeline" height=125/></center>
<center><img src="./images/rproject.png" alt="R timeline" width=200/></center>
]

.pull-left2[
* **Projects** keep all files associated with a project together.
* The "home" directory of the project becomes the current working directory.
* Projects can **enhance reproducibility** if *paths within scripts are kept relative and not absolute*.
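
* A minimal sketch of the last point, assuming a hypothetical project folder created under `tempdir()` (a real project would be the folder that holds the `.Rproj` file). The relative path `data/cars.csv` is resolved against the project's home directory, so the script should not need editing when the folder is moved to another computer:

.tiny6[
```r
# Hypothetical stand-in for an RStudio Project folder
proj <- file.path(tempdir(), "my_project")
dir.create(file.path(proj, "data"), recursive=TRUE, showWarnings=FALSE)
write.csv(head(mtcars), file.path(proj, "data", "cars.csv"))

old <- setwd(proj)  # Stand-in for opening the Project, which sets the working directory
cars <- read.csv(file.path("data", "cars.csv"), row.names=1)  # Relative path
setwd(old)          # Restore the previous working directory
dim(cars)           # 6 rows and 11 columns read back
```
]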
]

---
#### Topic 7: Extras - RStudio Projects, Other things to do in R, and Resources
Other things to do in R:

.right-column2[
<center><img src="./images/rmarkdown.png" height=200 alt="RStudio"/></center>
<br/><br/>
<center><img src="./images/shiny.png" height=200 alt="RStudio"/></center>
]

.left-column2[
* [R Markdown documents](<https://rmarkdown.rstudio.com>)
  - Documents
  - Websites
  - Books
<br/><br/>
<br/><br/>
<br/><br/>
* [Shiny Apps](<https://jeremymack.shinyapps.io/purpleair/>)
  - Web applications
  - Websites
  - Dashboards
]

---
#### Topic 7: Extras - RStudio Projects, Other things to do in R, and Resources
Resources:

.pull-left4[
1. [R for Data Science](https://r4ds.had.co.nz/)
2. [RStudio Cheat Sheets](https://www.rstudio.com/resources/cheatsheets/)
3. [Twitter for R Programmers](https://www.t4rstats.com/follow-some-folks.html)
]

.pull-right4[
<img src="./images/tidyverse2.png" alt="Tidyverse" height=250/>
<img src="./images/rtwitter_blank.png" alt="Tidyverse"/>
]

---
class: center, middle, inverse, title-slide

## Questions?

<img src="./images/contact.png" alt="RStudio" height=400/>