Data Scraping with R

class: center, middle, inverse, title-slide

## R tutorial: data scraping
#### Jeremy Mack
#### Lehigh University - Digital Scholarship Team
 
<center><img src="./images/webscrape4.png" alt="RStudio" height=150/></center>

---
### About this tutorial

* This tutorial will focus on **data scraping** using R.
 
--
 
 * It uses an example of extracting COVID-19 data from daily reports issued by the [PA Department of Health](https://www.health.pa.gov/topics/disease/coronavirus/Pages/Coronavirus.aspx).

* Slides from an intro course on **Programming in R** can be found on [GitHub](https://jeremymack-lu.github.io/rprog/).
  + Topics of interest include:
    
     - Topic 3 - [Getting started with R and R Studio](https://jeremymack-lu.github.io/rprog/#20)
    
     - Topic 4 - [Installing functions in R](https://jeremymack-lu.github.io/rprog/#72)
    
     - Topic 5 - [Working with data in R](https://jeremymack-lu.github.io/rprog/#83)
 
---
class: center, middle, inverse

#### First, what is data scraping?

---
#### What is data scraping?

* **Data scraping** is a technique in which a computer program extracts data from human-readable output coming from another program.
 
--

* Two forms of data scraping include **web scraping** and **document scraping**.

---
#### What is data scraping?

* **Data scraping** is a technique in which a computer program extracts data from human-readable output coming from another program.

* Two forms of data scraping include **web scraping** and **document scraping**.
 
  1. Web scraping uses tools to extract data from a web page by accessing its **text-based mark-up language** (i.e., HTML and XHTML).
 
--

2. Document scraping uses tools to extract data from a document file, often one whose **format is less accessible** (e.g., PDF).
 
--
 
 * In this tutorial, we'll extract data from both an **HTML and PDF source**, using a data scraper (**R**), and convert it into a useful format (**.txt file**).
 
---
class: center, middle, inverse

#### Data scraping using R

---
#### Data scraping using R

.pull-right2[
<center><img src="./images/rvest.png" alt="RStudio" height=150/> <img src="./images/dplyr.png" alt="RStudio" height=150/></center>
<center><img src="./images/tidyverse.png" alt="RStudio" height=150/></center>
<center><img src="./images/readr.png" alt="RStudio" height=150/> <img src="./images/stringr.png" alt="RStudio" height=150/></center>
]

.pull-left2[
Data scraping using R can be broken down into three basic steps:
{{content}}]

1. Identify data source
 
  + Web scraping - XPath
  
  + Document scraping - file location
 {{content}}
 
--

2. Extract data in R
 {{content}}
 
--
 
 3. Create and export data frame in R
 {{content}}
 
--

Most of the programming in R will utilize a collection of packages and functions known as the [tidyverse](https://www.tidyverse.org).

---
class: center, middle, inverse

#### Data scraping using R
Example 1 - Web scraping

---
#### Example 1 - Web scraping

.pull-left2[
Basic steps:
 1. Identify data source
 
  + Web scraping - XPath

2. Extract data in R - **rvest**
 
 3. Create and export data frame in R - **dplyr**
]

---
class: center, middle, inverse

#### Step 1: Identify data source

---
#### Step 1: Identify data source

* R uses an **XPath** to locate elements on a web page.
 
--
 
 * XPath, short for XML path, uses an **XML path expression** to locate items.
 
--

* We could write our own XPath, but instead we'll use the browser's inspector to identify the exact XPath we're interested in.
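
---
#### Step 1: Identify data source

.tiny[
* For reference, a hand-written XPath can be tried against a small inline HTML snippet. This is only a sketch: the HTML string, the `id` values `content` and `footer`, and the object names are made up for illustration and are not taken from the PA Department of Health page.

```r
library(rvest) # Provides read_html() and html_nodes()

# Hypothetical page with one table inside <div id="content">
html <- '<html><body>
  <div id="content"><table><tr><td>Adams</td><td>273</td></tr></table></div>
  <div id="footer"><p>No table here</p></div>
</body></html>'

page <- read_html(html) # Parse the HTML string

all_tables <- html_nodes(page, xpath = '//table')                   # Any table on the page
one_table  <- html_nodes(page, xpath = '//*[@id="content"]/table')  # Table directly under id="content"
no_table   <- html_nodes(page, xpath = '//*[@id="footer"]/table')   # Nothing matches here

length(all_tables) # Number of tables found anywhere
length(one_table)  # Number of tables found under id="content"
length(no_table)   # Length 0 means the XPath matched nothing
```

* Checking the length of the result is a quick way to confirm that a copied XPath still matches the page.
]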

---
#### Step 1: Identify data source

.pull-right4[
<center><img src="./images/xpath2.png" alt="RStudio" width=400/></center>
 
<center><img src="./images/xpath3.png" alt="RStudio" width=400/></center>
]

.pull-left4[
Basic steps (video on next slide):

1. Open the web page in Google Chrome or Safari.
  {{content}}
]

2. Right click on the item you're interested in scraping and click **Inspect**.
  {{content}}
  
--

3. In the **Elements portion of the Inspector window**, mouse over each line until the entire table is highlighted.
  {{content}}
  
--

4. Right click on the line, click **Copy**, and then click **Copy XPath**.

---
#### Step 1: Identify data source

---
class: center, middle, inverse

#### Step 2: Extract data in R

---
#### Step 2: Extract data in R

.tiny[
* Load necessary packages

```r
library(tidyverse) # Load core Tidyverse packages, including dplyr
library(rvest)     # Additional Tidyverse packages for web scraping
library(xml2)      # Package to work with XML files
```
]

.tiny[
* Set urls for PA Department of Health web pages (cases and deaths)

```r
url1 <- 'https://www.health.pa.gov/topics/disease/coronavirus/Pages/June-Archive.aspx'
url2 <- 'https://www.health.pa.gov/topics/disease/coronavirus/Pages/Death-Data.aspx'
```
]

.tiny3[
* Set XPath for web page table(s)

```r
xpath1 <- '//*[@id="ctl00_PlaceHolderMain_PageContent__ControlWrapper_RichHtmlField"]/span/table'
xpath2 <- '//*[@id="ctl00_PlaceHolderMain_PageContent__ControlWrapper_RichHtmlField"]/table'
```
]

---
#### Step 2: Extract data in R
.tiny[
* Scrape and create data frame for COVID-19 case data

```r
cases <- url1 %>% # Scrape data
 read_html() %>%
 html_nodes(xpath=xpath1) %>%
 html_table()
cases <- cases[[8]] # Select table number
cases <- cases[-1,] # Remove first row, which contains table headers
cases <- cases[,c(1:2)] # Select columns for County and Cases
names(cases) <- c("County",
 "Cases") # Add column names

head(cases, 10)                  # View first 10 rows of data frame
```

```
##       County Cases
## 2      Adams   273
## 3  Allegheny  2003
## 4  Armstrong    65
## 5     Beaver   603
## 6    Bedford    44
## 7      Berks  4201
## 8      Blair    53
## 9   Bradford    46
## 10     Bucks  5261
## 11    Butler   247
```
]
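
---
#### Step 2: Extract data in R

.tiny[
* The same scrape-then-tidy pattern can be tested offline on a small inline table. This is a minimal sketch under stated assumptions: the HTML, the counts, and the names `demo_html` and `demo` are hypothetical, and the table index is 1 only because the snippet holds a single table.

```r
library(rvest) # Provides read_html(), html_nodes(), html_table() and the pipe

# Hypothetical table whose first row holds the header text in <td> cells
demo_html <- '<table>
  <tr><td>County</td><td>Cases</td><td>Negative</td></tr>
  <tr><td>Adams</td><td>10</td><td>100</td></tr>
  <tr><td>Berks</td><td>20</td><td>200</td></tr>
</table>'

demo <- demo_html %>%
  read_html() %>%
  html_nodes(xpath = '//table') %>%
  html_table()                       # Returns a list with one data frame per matched table

length(demo)                         # How many tables were matched
demo <- demo[[1]]                    # Select table by its position in the list
demo <- demo[-1, ]                   # Remove first row, which holds the header text
demo <- demo[, 1:2]                  # Keep the County and Cases columns
names(demo) <- c("County", "Cases")  # Add column names
demo$Cases <- as.numeric(demo$Cases) # Counts are read as text because of the header row
demo
```
]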

---
#### Step 2: Extract data in R

.tiny[
* Scrape and create data frame for COVID-19 death data

```r
deaths <- url2 %>% # Scrape data
 read_html() %>%
 html_nodes(xpath=xpath2) %>%
 html_table()
deaths <- deaths[[1]] # Select table number
deaths <- deaths[-1,] # Remove first row, which contains table headers
deaths <- deaths[,c(1:2)] # Select columns for County and Deaths
names(deaths) <- c("County",
 "Deaths") # Add column names
deaths[35,1] <- "McKean" # Fix name for McKean County (not Mckean)

head(deaths, 10)                   # View first 10 rows of data frame
```

```
##       County Deaths
## 2      Adams      8
## 3  Allegheny    168
## 4  Armstrong      5
## 5     Beaver     74
## 6    Bedford      2
## 7      Berks    333
## 8      Blair      1
## 9   Bradford      3
## 10     Bucks    529
## 11    Butler     12
```
]

---
class: center, middle, inverse

#### Step 3: Create and export data frame in R

---
#### Step 3: Create and export data frame in R

.tiny[
* Merge both data frames and set variable types

```r
df <- merge(cases, deaths, by="County", all.x=TRUE) # Merge by County

df <- df %>% # Set data structure for variables
 mutate(County=as.factor(County), # Set County to factor
 Cases=as.numeric(Cases), # Set Cases to numeric
 Deaths=as.numeric(gsub(",","", Deaths))) %>% # Set Deaths to numeric
 mutate(Deaths=ifelse(is.na(Deaths),0, Deaths)) # Change NAs to 0

head(df, 10)                                           # View first 10 rows of data frame
```

```
##       County Cases Deaths
## 1      Adams   273      8
## 2  Allegheny  2003    168
## 3  Armstrong    65      5
## 4     Beaver   603     74
## 5    Bedford    44      2
## 6      Berks  4201    333
## 7      Blair    53      1
## 8   Bradford    46      3
## 9      Bucks  5261    529
## 10    Butler   247     12
```

]
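
---
#### Step 3: Create and export data frame in R

.tiny[
* The effect of `all.x=TRUE` and of the comma and `NA` clean-up can be seen on a toy example. The data frames below are hypothetical (made-up counts, base R only) and are meant as a sketch of the idea, not as the tutorial's data.

```r
# Hypothetical toy data; one county has no death record
cases_demo  <- data.frame(County = c("Adams", "Berks", "Elk"),
                          Cases  = c("273", "4201", "6"),
                          stringsAsFactors = FALSE)
deaths_demo <- data.frame(County = c("Adams", "Berks"),
                          Deaths = c("8", "1,333"),
                          stringsAsFactors = FALSE)

# all.x = TRUE keeps every county in cases_demo, even without a match
merged <- merge(cases_demo, deaths_demo, by = "County", all.x = TRUE)

merged$Cases  <- as.numeric(merged$Cases)                 # Text to numeric
merged$Deaths <- as.numeric(gsub(",", "", merged$Deaths)) # Strip thousands separators first
merged$Deaths[is.na(merged$Deaths)] <- 0                  # Counties with no death record get 0

merged
```

* Without `all.x = TRUE`, the unmatched county would be dropped from the merged data frame.
]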

---
#### Step 3: Create and export data frame in R

.tiny[
 * Export to text file

```r
write.table(df,
            "/Users/jeremymack/Desktop/COVID19_data.txt",
            sep=",",
            row.names=FALSE)
```

* Note: you'll need to change "/Users/jeremymack/Desktop/" in the file path above to a folder on your own computer.

]
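
---
#### Step 3: Create and export data frame in R

.tiny[
* One way to avoid a hard-coded path is to build it with `file.path()`. The sketch below writes to R's temporary folder so it runs anywhere; `df_demo`, `out_file`, and the file name are hypothetical stand-ins.

```r
# Hypothetical data frame standing in for df
df_demo <- data.frame(County = c("Adams", "Berks"),
                      Cases  = c(273, 4201),
                      Deaths = c(8, 333))

# tempdir() is used so the example runs anywhere; swap in your own folder
out_file <- file.path(tempdir(), "COVID19_data_demo.txt")

write.table(df_demo, out_file, sep = ",", row.names = FALSE)

# Read the file back to confirm it is comma-separated with a header row
check <- read.csv(out_file)
check
```
]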

---
class: center, middle, inverse

#### Data scraping using R
Example 2 - Document scraping

---
#### Example 2 - Document scraping

* On June 7, PA Department of Health changed their method of reporting data.
 
--
 
 * Reporting moved from an HTML-based table to a PDF document.
 
--
 
 * Data scraping methods are (**and need to be**) flexible!

---
#### Example 2 - Document scraping

.pull-left2[
Basic steps:
 1. Identify data source
 
  + Document scraping - file location

2. Extract data in R - **rvest**, **pdftools**, **stringr**, **readr**
 
 3. Create and export data frame in R - **dplyr**
]

---
class: center, middle, inverse

#### Step 1: Identify data source

---
#### Step 1: Identify data source

.tiny[
* Load necessary packages

```r
library(tidyverse) # Load core Tidyverse packages, including dplyr
library(rvest)     # Additional Tidyverse package for web scraping
library(readr)     # Additional Tidyverse package for reading table data
library(pdftools)  # Package to work with PDF files
```
]

.tiny[
* Set url for PA Department of Health web page

```r
page <- read_html("https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx")
```
]

.tiny[
* Identify PDF links on page and set urls for data tables (cases and deaths)

```r
raw_list <- page %>% # Use url set as "page"
 html_nodes("a") %>% # Select all link elements with the CSS selector "a"
 html_attr("href") %>% # Extract each link's "href" attribute (i.e., link destination)
 str_subset("\\.pdf") # Keep "href" values that contain .pdf
raw_list
```
]

.tiny2[

```
[1] "/topics/Documents/Diseases%20and%20Conditions/COVID-19%20County%20Data/County%20Case%20Counts_7-7-2020.pdf"                      
[2] "/topics/Documents/Diseases%20and%20Conditions/COVID-19%20Death%20Data/Death%20by%20County%20of%20Residence%20--%202020-07-07.pdf"
```
]

.tiny[

```r
url1 <- paste("https://www.health.pa.gov", raw_list[1], sep="")
url2 <- paste("https://www.health.pa.gov", raw_list[2], sep="")
```
]
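
---
#### Step 1: Identify data source

.tiny[
* The link-finding step can be tried offline on an inline HTML fragment. Everything below is an assumption for illustration: the paths, the site root `https://www.example.com`, and the object names. Base R's `grep()` is used here in place of `str_subset()`, with a `$` anchor so that only links ending in .pdf are kept.

```r
library(rvest) # Provides read_html(), html_nodes(), html_attr() and the pipe

# Hypothetical page fragment; the paths are invented for illustration
links_html <- '<html><body>
  <a href="/docs/County%20Cases.pdf">Cases</a>
  <a href="/docs/Deaths.pdf">Deaths</a>
  <a href="/pages/about.aspx">About</a>
  <a>No href</a>
</body></html>'

hrefs <- links_html %>%
  read_html() %>%
  html_nodes("a") %>% # All link elements
  html_attr("href")   # Their href values (NA when a link has no href)

pdf_links <- grep("\\.pdf$", hrefs, value = TRUE)         # Keep only hrefs ending in .pdf
pdf_urls  <- paste0("https://www.example.com", pdf_links) # Prepend the site root

pdf_urls
```
]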

---
class: center, middle, inverse

#### Step 2: Extract data in R

---
#### Step 2: Extract data in R
.tiny[
* Scrape and create data frame for COVID-19 case data

```r
pdf1 <- pdf_text(url1) %>% # Extract text from the linked PDF with case data
 read_lines() # Read the text into a character vector of lines

cases <- pdf1 %>% # Start from the vector of lines
 str_squish() %>% # Remove extra whitespace between elements
 str_split(pattern=" ") # Split each line into pieces (i.e., columns)

cases <- do.call(rbind,
 Filter(function(x) length(x)==6, cases)) # Combine list elements with 6 items
cases <- cases[-1,] # Remove first row

cases <- cases %>%
 as.data.frame() %>% # Convert to a data frame
 mutate(County=as.factor(V1), # Set County to factor
 Cases=as.numeric(as.character(V3))) %>% # Set Cases to numeric
 mutate(County=str_to_sentence(County)) %>% # Change County from all caps
 select(7:8)

head(cases, 5)  # View first 5 rows of data frame
```

```
##      County Cases
## 1     Adams   358
## 2 Allegheny  3979
## 3 Armstrong    79
## 4    Beaver   770
## 5   Bedford    90
```
]
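
---
#### Step 2: Extract data in R

.tiny[
* The squish, split, filter, and bind steps can be followed on a few made-up lines. This is a sketch under stated assumptions: `lines_demo` only mimics text pulled from a PDF table, and the rows here have 3 pieces, not 6.

```r
library(stringr) # Provides str_squish(), str_split() and str_to_title()

# Hypothetical lines mimicking text extracted from a PDF table
lines_demo <- c("County      Cases   Deaths",
                "ADAMS         358       5",
                "Data updated daily at noon",
                "ALLEGHENY    3979     168")

squished <- str_squish(lines_demo)             # Collapse runs of spaces
pieces   <- str_split(squished, pattern = " ") # One character vector per line

rows <- Filter(function(x) length(x) == 3, pieces) # Keep lines with exactly 3 pieces
mat  <- do.call(rbind, rows)                       # Stack them into a character matrix
mat  <- mat[-1, , drop = FALSE]                    # Remove the header row

demo <- data.frame(County = str_to_title(mat[, 1]), # Change County from all caps
                   Cases  = as.numeric(mat[, 2]),   # Set Cases to numeric
                   stringsAsFactors = FALSE)
demo
```

* Filtering on the number of pieces is what drops titles, footnotes, and other lines that are not table rows.
]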

---
#### Step 2: Extract data in R
.tiny[
* Scrape and create data frame for COVID-19 death data

```r
pdf2 <- pdf_text(url2) %>% # Extract text from the linked PDF with death data
 read_lines() # Read the text into a character vector of lines

deaths <- pdf2 %>% # Start from the vector of lines
 str_squish() %>% # Remove extra whitespace between elements
 str_split(pattern=" ") # Split each line into pieces (i.e., columns)

deaths <- do.call(rbind,
 Filter(function(x) length(x)==4, deaths)) # Combine list elements with 4 items

deaths <- deaths %>%
 as.data.frame() %>% # Convert to a data frame
 mutate(County=as.factor(V1), # Set County to factor
 Deaths=as.numeric(as.character(V2))) %>% # Set Deaths to numeric (via character, so factor codes are not returned)
 select(5:6)

head(deaths, 5)  # View first 5 rows of data frame
```

```
##      County Deaths
## 1     Adams      5
## 2 Allegheny      8
## 3 Armstrong     30
## 4    Beaver     34
## 5   Bedford     23
```
]
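
---
#### Step 2: Extract data in R

.tiny[
* When a count column is stored as a factor, `as.numeric()` returns the factor's level codes, not the counts, which is why the conversion goes through `as.character()` first. The values below are hypothetical and only illustrate the difference; versions of R before 4.0 create factors by default when a character matrix is converted to a data frame.

```r
# Hypothetical counts stored as text, then as a factor
deaths_chr <- c("5", "168", "30")
deaths_fct <- factor(deaths_chr)

codes  <- as.numeric(deaths_fct)               # Level codes, not the counts
values <- as.numeric(as.character(deaths_fct)) # The counts themselves

codes  # Position of each value among the sorted levels
values # Original numbers
```
]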

---
class: center, middle, inverse

#### Step 3: Create and export data frame in R

---
#### Step 3: Create and export data frame in R

.tiny[
* Merge both data frames

```r
df <- merge(cases, deaths, by="County", all.x=TRUE) # Merge by County

head(df, 10)                                           # View first 10 rows of data frame
```

```
##       County Cases Deaths
## 1      Adams   358      5
## 2  Allegheny  3979      8
## 3  Armstrong    79     30
## 4     Beaver   770     34
## 5    Bedford    90     23
## 6      Berks  4606     20
## 7      Blair    89      1
## 8   Bradford    59     16
## 9      Bucks  5948     29
## 10    Butler   370      5
```

]

---
#### Step 3: Create and export data frame in R

.tiny[
 * Export to text file

```r
write.table(df,
            "/Users/jeremymack/Desktop/COVID19_data.txt",
            sep=",",
            row.names=FALSE)
```

* Note: you'll need to change "/Users/jeremymack/Desktop/" in the file path above to a folder on your own computer.

]

---
class: center, middle, inverse
#### Data scraping using R