class: center, middle, inverse, title-slide

## R tutorial: data scraping

#### Jeremy Mack

#### Lehigh University - Digital Scholarship Team

<br/>
<center><img src="./images/webscrape4.png" alt="RStudio" height=150/></center>
<br/><br/><br/>

---

### About this tutorial

* This tutorial will focus on **data scraping** using R.

--

* It uses an example of extracting COVID-19 data from daily reports issued by the [PA Department of Health](https://www.health.pa.gov/topics/disease/coronavirus/Pages/Coronavirus.aspx).

--

* Slides from an intro course on **Programming in R** can be found on [GitHub](https://jeremymack-lu.github.io/rprog/).
  + Topics of interest include:
      - Topic 3 - [Getting started with R and R Studio](https://jeremymack-lu.github.io/rprog/#20)
      - Topic 4 - [Installing functions in R](https://jeremymack-lu.github.io/rprog/#72)
      - Topic 5 - [Working with data in R](https://jeremymack-lu.github.io/rprog/#83)

---

class: center, middle, inverse

#### First, what is data scraping?

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### What is data scraping?

<center><img src="./images/webscrape1.png" alt="RStudio" height=175/></center>

* **Data scraping** is a technique in which a computer program extracts data from human-readable output coming from another program.

--

* Two forms of data scraping include **web scraping** and **document scraping**.

---

#### What is data scraping?

<center><img src="./images/webscrape2.png" alt="RStudio" height=175/></center>

* **Data scraping** is a technique in which a computer program extracts data from human-readable output coming from another program.

* Two forms of data scraping include **web scraping** and **document scraping**.

  1. Web scraping uses tools to extract data from a web page by accessing its **text-based mark-up language** (e.g., HTML or XHTML).

--

  2. Document scraping uses tools to extract data from a document file, often one whose **format is less accessible** (e.g., PDF).
--

* In this tutorial, we'll extract data from both an **HTML and a PDF source** using a data scraper (**R**) and convert it into a useful format (**.txt file**).

---

class: center, middle, inverse

#### Data scraping using R

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Data scraping using R

.pull-right2[
<center><img src="./images/rvest.png" alt="RStudio" height=150/> <img src="./images/dplyr.png" alt="RStudio" height=150/></center>
<center><img src="./images/tidyverse.png" alt="RStudio" height=150/></center>
<center><img src="./images/readr.png" alt="RStudio" height=150/> <img src="./images/stringr.png" alt="RStudio" height=150/></center>
]

.pull-left2[
Data scraping using R can be broken down into three basic steps:

{{content}}
]

--

1. Identify data source
   + Web scraping - XPath
   + Document scraping - file location

{{content}}

--

2. Extract data in R

{{content}}

--

3. Create and export data frame in R

{{content}}

--

Most of the programming in R will use a collection of packages and functions known as the [tidyverse](https://www.tidyverse.org).

---

class: center, middle, inverse

#### Data scraping using R

Example 1 - Web scraping

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Example 1 - Web scraping

.pull-right2[
<center><img src="./images/rvest.png" alt="RStudio" height=150/> <img src="./images/dplyr.png" alt="RStudio" height=150/></center>
<center><img src="./images/tidyverse.png" alt="RStudio" height=150/></center>
]

.pull-left2[
Basic steps:

1. Identify data source
   + Web scraping - XPath
2. Extract data in R - **rvest**
3. Create and export data frame in R - **dplyr**
]

---

class: center, middle, inverse

#### Step 1: Identify data source

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Step 1: Identify data source

<center><img src="./images/xpath.png" alt="RStudio" height=175/></center>

* R uses an **XPath** to locate elements on a web page.
--

* XPath, short for XML Path Language, uses a **path expression** to locate items in a document.

--

* We could write our own XPath, but instead we'll use a browser window to identify the exact XPath we're interested in.

---

#### Step 1: Identify data source

.pull-right4[
<center><img src="./images/xpath2.png" alt="RStudio" width=400/></center>
<br/>
<center><img src="./images/xpath3.png" alt="RStudio" width=400/></center>
]

.pull-left4[
Basic steps (video on next slide):

1. Open the web page in Google Chrome or Safari.

{{content}}
]

--

2. Right click on the item you're interested in scraping and click **Inspect**.

{{content}}

--

3. In the **Elements portion of the Inspector window**, mouse over each line until the entire table is highlighted.

{{content}}

--

4. Right click on the line, click **Copy**, and then click **Copy XPath**.

---

#### Step 1: Identify data source

<iframe width="750" height="500" src="https://www.youtube.com/embed/KAqxWsE29_c?VQ=HD1080" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---

class: center, middle, inverse

#### Step 2: Extract data in R

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Step 2: Extract data in R

.tiny[
* Load necessary packages

```r
library(tidyverse) # Load core Tidyverse packages, including dplyr
library(rvest)     # Additional Tidyverse package for web scraping
library(xml2)      # Package to work with XML files
```
]

--

.tiny[
* Set URLs for PA Department of Health web pages (cases and deaths)

```r
url1 <- 'https://www.health.pa.gov/topics/disease/coronavirus/Pages/June-Archive.aspx'
url2 <- 'https://www.health.pa.gov/topics/disease/coronavirus/Pages/Death-Data.aspx'
```
]

--

.tiny3[
* Set XPath for web page table(s)

```r
xpath1 <- '//*[@id="ctl00_PlaceHolderMain_PageContent__ControlWrapper_RichHtmlField"]/span/table'
xpath2 <- '//*[@id="ctl00_PlaceHolderMain_PageContent__ControlWrapper_RichHtmlField"]/table'
```
]

---

#### Step 2: Extract data in R

.tiny[
* Scrape and create data frame for COVID-19 case data

```r
cases <- url1 %>%                     # Scrape data
  read_html() %>%
  html_nodes(xpath=xpath1) %>%
  html_table()
cases <- cases[[8]]                   # Select table number
cases <- cases[-1,]                   # Remove first row, which contains table headers
cases <- cases[,c(1:2)]               # Select columns for County and Cases
names(cases) <- c("County", "Cases")  # Add column names
head(cases, 10)                       # View first 10 rows of data frame
```

```
##       County Cases
## 2      Adams   273
## 3  Allegheny  2003
## 4  Armstrong    65
## 5     Beaver   603
## 6    Bedford    44
## 7      Berks  4201
## 8      Blair    53
## 9   Bradford    46
## 10     Bucks  5261
## 11    Butler   247
```
]

---

#### Step 2: Extract data in R

.tiny[
* Scrape and create data frame for COVID-19 death data

```r
deaths <- url2 %>%                      # Scrape data
  read_html() %>%
  html_nodes(xpath=xpath2) %>%
  html_table()
deaths <- deaths[[1]]                   # Select table number
deaths <- deaths[-1,]                   # Remove first row, which contains table headers
deaths <- deaths[,c(1:2)]               # Select columns for County and Deaths
names(deaths) <- c("County", "Deaths")  # Add column names
deaths[35,1] <- "McKean"                # Fix name for McKean County (not Mckean)
head(deaths, 10)                        # View first 10 rows of data frame
```

```
##       County Deaths
## 2      Adams      8
## 3  Allegheny    168
## 4  Armstrong      5
## 5     Beaver     74
## 6    Bedford      2
## 7      Berks    333
## 8      Blair      1
## 9   Bradford      3
## 10     Bucks    529
## 11    Butler     12
```
]

---

class: center, middle, inverse

#### Step 3: Create and export data frame in R

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Step 3: Create and export data frame in R

.tiny[
* Merge both data frames and set variable types

```r
df <- merge(cases, deaths, by="County", all.x=TRUE)  # Merge by County
df <- df %>%                                         # Set data structure for variables
  mutate(County=as.factor(County),                   # Set County to factor
         Cases=as.numeric(Cases),                    # Set Cases to numeric
         Deaths=as.numeric(gsub(",","", Deaths))) %>% # Set Deaths to numeric
  mutate(Deaths=ifelse(is.na(Deaths),0, Deaths))     # Change NAs to 0
head(df,
10) # View first 10 rows of data frame
```

```
##       County Cases Deaths
## 1      Adams   273      8
## 2  Allegheny  2003    168
## 3  Armstrong    65      5
## 4     Beaver   603     74
## 5    Bedford    44      2
## 6      Berks  4201    333
## 7      Blair    53      1
## 8   Bradford    46      3
## 9      Bucks  5261    529
## 10    Butler   247     12
```
]

---

#### Step 3: Create and export data frame in R

.tiny[
* Export to text file

```r
write.table(df, "/Users/jeremymack/Desktop/COVID19_data.txt", sep=",", row.names=FALSE)
```

* Note: you'll need to change <font size="3" color="red">"/Users/jeremymack/Desktop/"</font> in the file path above to your own working directory.
]

---

class: center, middle, inverse

#### Data scraping using R

Example 2 - Document scraping

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Example 2 - Document scraping

<center><img src="./images/webscrape2.png" alt="RStudio" height=175/></center>

* On June 7, the PA Department of Health changed its method of reporting data.

--

* Reporting moved from an HTML-based table to a PDF document.

--

* Data scraping methods are (**and need to be**) flexible!

---

#### Example 2 - Document scraping

.pull-right2[
<center><img src="./images/rvest.png" alt="RStudio" height=150/> <img src="./images/dplyr.png" alt="RStudio" height=150/></center>
<center><img src="./images/tidyverse.png" alt="RStudio" height=150/></center>
<center><img src="./images/readr.png" alt="RStudio" height=150/> <img src="./images/stringr.png" alt="RStudio" height=150/></center>
]

.pull-left2[
Basic steps:

1. Identify data source
   + Document scraping - file location
2. Extract data in R - **rvest**, **stringr**, **readr**
3. Create and export data frame in R - **dplyr**
]

---

class: center, middle, inverse

#### Step 1: Identify data source

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Step 1: Identify data source

.tiny[
* Load necessary packages

```r
library(tidyverse) # Load core Tidyverse packages, including dplyr
library(rvest)     # Additional Tidyverse package for web scraping
library(readr)     # Additional Tidyverse package for reading table data
library(pdftools)  # Package to work with PDF files
```
]

--

.tiny[
* Read the PA Department of Health web page

```r
page <- read_html("https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx")
```
]

--

.tiny[
* Identify PDF links on page and set URLs for data tables (cases and deaths)

```r
raw_list <- page %>%     # Use the page read above
  html_nodes("a") %>%    # Select all link elements with CSS selector "a"
  html_attr("href") %>%  # Extract "href" attributes (i.e., link destinations)
  str_subset("\\.pdf")   # Keep "href" values that contain .pdf
raw_list
```
]

.tiny2[
```
[1] "/topics/Documents/Diseases%20and%20Conditions/COVID-19%20County%20Data/County%20Case%20Counts_7-7-2020.pdf"
[2] "/topics/Documents/Diseases%20and%20Conditions/COVID-19%20Death%20Data/Death%20by%20County%20of%20Residence%20--%202020-07-07.pdf"
```
]

.tiny[
```r
url1 <- paste("https://www.health.pa.gov", raw_list[1], sep="")
url2 <- paste("https://www.health.pa.gov", raw_list[2], sep="")
```
]

---

class: center, middle, inverse

#### Step 2: Extract data in R

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Step 2: Extract data in R

.tiny[
* Scrape and create data frame for COVID-19 case data

```r
pdf1 <- pdf_text(url1) %>%  # Extract text from the linked PDF with case data
  read_lines()              # Split the text into a character vector of lines
cases <- pdf1 %>%           # Start from the vector of lines
  str_squish() %>%          # Remove extra whitespace between elements
  str_split(pattern=" ")    # Split each line into pieces (i.e., columns)
cases <-
do.call(rbind, Filter(function(x) length(x)==6, cases)) # Combine list elements with 6 items
cases <- cases[-1,]                              # Remove first row
cases <- cases %>%
  as.data.frame() %>%                            # Convert to a data frame
  mutate(County=as.factor(V1),                   # Set County to factor
         Cases=as.numeric(as.character(V3))) %>% # Set Cases to numeric
  mutate(County=str_to_sentence(County)) %>%     # Change County from all caps
  select(County, Cases)                          # Keep the County and Cases columns
head(cases, 5)                                   # View first 5 rows of data frame
```

```
##      County Cases
## 1     Adams   358
## 2 Allegheny  3979
## 3 Armstrong    79
## 4    Beaver   770
## 5   Bedford    90
```
]

---

#### Step 2: Extract data in R

.tiny[
* Scrape and create data frame for COVID-19 death data

```r
pdf2 <- pdf_text(url2) %>%  # Extract text from the linked PDF with death data
  read_lines()              # Split the text into a character vector of lines
deaths <- pdf2 %>%          # Start from the vector of lines
  str_squish() %>%          # Remove extra whitespace between elements
  str_split(pattern=" ")    # Split each line into pieces (i.e., columns)
deaths <- do.call(rbind, Filter(function(x) length(x)==4, deaths)) # Combine list elements with 4 items
deaths <- deaths %>%
  as.data.frame() %>%                                          # Convert to a data frame
  mutate(County=as.factor(V1),                                 # Set County to factor
         Deaths=as.numeric(gsub(",","", as.character(V2)))) %>% # Set Deaths to numeric (as.character() avoids factor codes)
  select(County, Deaths)                                       # Keep the County and Deaths columns
head(deaths, 5)                                                # View first 5 rows of data frame
```

```
##      County Deaths
## 1     Adams      5
## 2 Allegheny      8
## 3 Armstrong     30
## 4    Beaver     34
## 5   Bedford     23
```
]

---

class: center, middle, inverse

#### Step 3: Create and export data frame in R

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

---

#### Step 3: Create and export data frame in R

.tiny[
* Merge both data frames and set variable types

```r
df <- merge(cases, deaths, by="County", all.x=TRUE)  # Merge by County
df <- df %>%                                         # Set data structure for variables
  mutate(County=as.factor(County),                   # Set County to factor
         Cases=as.numeric(Cases),                    # Set Cases to numeric
         Deaths=as.numeric(gsub(",","", Deaths))) %>% # Set Deaths to numeric
  mutate(Deaths=ifelse(is.na(Deaths),0, Deaths))
# Change NAs to 0
head(df, 10) # View first 10 rows of data frame
```

```
##       County Cases Deaths
## 1      Adams   358      5
## 2  Allegheny  3979      8
## 3  Armstrong    79     30
## 4     Beaver   770     34
## 5    Bedford    90     23
## 6      Berks  4606     20
## 7      Blair    89      1
## 8   Bradford    59     16
## 9      Bucks  5948     29
## 10    Butler   370      5
```
]

---

#### Step 3: Create and export data frame in R

.tiny[
* Export to text file

```r
write.table(df, "/Users/jeremymack/Desktop/COVID19_data.txt", sep=",", row.names=FALSE)
```

* Note: you'll need to change <font size="3" color="red">"/Users/jeremymack/Desktop/"</font> in the file path above to your own working directory.
]

---

class: center, middle, inverse

#### Data scraping using R

<img src="./images/PA_county_cases.jpeg" alt="RStudio" height=250/> <img src="./images/PA_LV_cases.jpeg" alt="RStudio" height=250/>
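
---

#### Example 1 - Web scraping (offline sketch)

The web scraping steps from Example 1 can be tried without a network connection. The sketch below is only an illustration: the HTML fragment, the `content` id, and the object names `html` and `toy` are made up for this slide and do not come from the PA Department of Health page.

```r
library(rvest)  # html_nodes(), html_table(), and the %>% pipe
library(xml2)   # read_html()

# Hypothetical page fragment standing in for a real web page
html <- '<div id="content"><table>
<tr><td>County</td><td>Cases</td></tr>
<tr><td>Adams</td><td>273</td></tr>
<tr><td>Allegheny</td><td>2,003</td></tr>
</table></div>'

toy <- read_html(html) %>%                          # Parse the HTML text
  html_nodes(xpath='//*[@id="content"]/table') %>%  # Locate the table by XPath
  html_table()                                      # Convert to a list of data frames
toy <- toy[[1]]                                     # Select the first (only) table
toy <- toy[-1,]                                     # Remove first row, which contains table headers
names(toy) <- c("County", "Cases")                  # Add column names
toy$Cases <- as.numeric(gsub(",", "", toy$Cases))   # Strip thousands separators, set to numeric
toy                                                 # View the data frame
```

The same pattern should carry over to a live page: swap `html` for the page URL and the XPath for the one copied from the browser.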