Extracting Data from PDFs into Tibbles
Suppose you want to do research, but instead of finding a data source that has already been beaten to death by a band of graduate students and assistant professors, you decide to use the Freedom of Information Act to get information that, presumably, no one else has. Terrific, right? Well, as the irritating sleazebag economist would say, “there’s no free lunch”. And they are correct; there is always a cost to getting new, exciting data. In my particular case, the cost was learning how to extract police data from PDF documents…a lot of PDF documents.
After extracting data from over 50k PDFs, I learned a few tips and tricks that can make this process much easier. In most tutorials I used (see here and here), the authors did not focus enough (or at all) on what I dubbed “unholy PDFs” (i.e., PDFs whose information is not in a table format). There are plenty of great, easy ways to get information from PDFs that have a nice table structure into tibbles, such as the tabulizer package, but today I’m going to focus on these “unholy PDFs”, where you cannot simply use a package’s function to get exactly what you want.
For reference, here is an example of the PDF we will be extracting data from: it is a page from a police crime log.
The goal in my research is to connect crimes to the date they occurred; however, if I can get more information (such as location), even better. So for this post, I’m going to extract the following information from this PDF: Date Reported, Location, Event #, Incident, and Disposition. However, I’ll only go into depth on how to extract the first three categories, and “the remainder is left to the reader as an exercise”.
Getting Started
First, we need to load the necessary packages. In this case, I will be using the tidyverse and pdftools packages. Let’s go ahead and load them (and install them if needed). Note that loading the tidyverse package also loads the stringr, dplyr, tidyr, tibble, and ggplot2 packages, among others. Hence, we only need to load tidyverse and pdftools with the library function.
library(tidyverse)
library(pdftools)
Overview
I’m going to lay out the steps we are about to take so there is a clear path to what we plan on accomplishing:
- Load in the PDF (using pdftools::pdf_text) into a character vector.
- Split the PDF into lines and unlist (using stringr::str_split and base::unlist).
- Find patterns and break the text into smaller objects that feature the characteristics we want.
- Convert these smaller objects into tibbles.
- Append these tibbles together into a main tibble.
Let’s go!
Step 1: Loading
We’re going to load in the data using pdftools::pdf_text. We’ll save this object with the name pdf_data.
## load in the data
pdf_data <- pdf_text("crime_log.pdf")
## get a glimpse of your data
pdf_data %>%
head()
## [1] " Indiana University, Bloomington\n Police Department\n Student Right To Know CAD Daily Log\n\n From Jan 20, 2014 to Jan 20, 2014.\n\nDate Reported: 01/20/14 - MON at 12:22 Location : EIGENMANN HALL Event #: 14-01-20-001434\nDate and Time Occurred From - Occurred To\nIncident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:\nDisposition: FAILED TO LOCATE\nDate Reported: 01/20/14 - MON at 17:03 Location : ALL OTHER ROADWAYS/INTERS Event #: 14-01-20-001446\nDate and Time Occurred From - Occurred To 01/20/14 - MON at 17:02 - 01/20/14 - MON at 17:03\nIncident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #: 140154\nDisposition: CLOSED BY ARREST\nDate Reported: 01/20/14 - MON at 19:30 Location : EIGENMANN HALL Event #: 14-01-20-001464\nDate and Time Occurred From - Occurred To\nIncident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:\nDisposition: FAILED TO LOCATE\nDate Reported: 01/20/14 - MON at 20:22 Location : EIGENMANN HALL Event #: 14-01-20-001466\nDate and Time Occurred From - Occurred To\nIncident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:\nDisposition: FAILED TO LOCATE\nDate Reported: 01/20/14 - MON at 20:45 Location : FOSTER HARPER HALL Event #: 14-01-20-001468\nDate and Time Occurred From - Occurred To\nIncident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:\nDisposition: FAILED TO LOCATE\nDate Reported: 01/20/14 - MON at 21:38 Location : ALL OTHER NON-UNIVERSITY Event #: 14-01-20-001476\nDate and Time Occurred From - Occurred To\nIncident : ALL OTHER OFFENSES - HARASSMENT/INTIMIDATION Report #:\nDisposition: NO CASE REPORT\nDate Reported: 01/20/14 - MON at 21:53 Location : ROSE AVE RESIDENCE HALL Event #: 14-01-20-001479\nDate and Time Occurred From - Occurred To\nIncident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:\nDisposition: FAILED TO LOCATE\nDate Reported: 01/20/14 - MON at 22:30 Location : COLLINS COMMON AREA Event #: 14-01-20-001486\nDate and Time Occurred From - Occurred To\nIncident : 
NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:\nDisposition: FAILED TO LOCATE\nDate Reported: 01/20/14 - MON at 23:02 Location : FOREST QUAD Event #: 14-01-20-001487\nDate and Time Occurred From - Occurred To 01/20/14 - MON at 22:45 - 01/20/14 - MON at 23:02\nIncident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #: 140157\nDisposition: CLOSED NO ARREST.\nDate Reported: 01/20/14 - MON at 23:07 Location : FOSTER JENKINSON HALL Event #: 14-01-20-001491\nDate and Time Occurred From - Occurred To\nIncident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:\nDisposition: FAILED TO LOCATE\nDate Reported: 01/20/14 - MON at 23:35 Location : ALL OTHER OPEN AREAS Event #: 14-01-20-001494\nDate and Time Occurred From - Occurred To 01/20/14 - MON at 23:35 - 01/20/14 - MON at 23:41\nIncident : ASSAULT - OTHER ASSAULTS - SIMPLE, NOT AGGRAVATED Report #: 140159\nDisposition: CLOSED BY ARREST.\n 11 Incidents Listed.\n\n\n\n\n Print Date and Time 1/21/2014 12:23:52PM at Page No. 1\n"
Looks awful, right? Take a second to look over the text and compare it to the raw PDF. The text is the same, but the formatting is ALL WRONG. Essentially, we have imported our data as one big element in a character vector…not so desirable. However, pay close attention to the \n’s that occur. These \n are special characters that mean “new line”. So take another look at the data and you can start to see that our text just needs to be split by these new-line characters to get a format we want.
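To make the \n point concrete, here is a minimal sketch using a made-up string (toy_page is my own stand-in, not the real crime log). It shows that the same string looks like one long mess when printed, but like separate lines once the newline characters are rendered.

```r
## a tiny, made-up stand-in for one page of pdf_text() output
toy_page <- "Date Reported: 01/20/14\nIncident : EXAMPLE\nDisposition: NONE"

## print() shows the \n escape sequences as typed
print(toy_page)

## writeLines() renders each \n as an actual line break
writeLines(toy_page)

## either way, the whole page is still a single string
length(toy_page)
```

The takeaway is that nothing is lost in the import; the line structure is still there, just encoded as \n.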
Step 2: Split the PDF lines
I am now going to use the function stringr::str_split to split the text into lines, with our “splitter” being the \n character.
## splitting the data by the newline character
pdf_data <- pdf_data %>%
str_split('\n')
## taking a look at the output - notice you can scroll right!
pdf_data %>%
head()
## [[1]]
## [1] " Indiana University, Bloomington"
## [2] " Police Department"
## [3] " Student Right To Know CAD Daily Log"
## [4] ""
## [5] " From Jan 20, 2014 to Jan 20, 2014."
## [6] ""
## [7] "Date Reported: 01/20/14 - MON at 12:22 Location : EIGENMANN HALL Event #: 14-01-20-001434"
## [8] "Date and Time Occurred From - Occurred To"
## [9] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:"
## [10] "Disposition: FAILED TO LOCATE"
## [11] "Date Reported: 01/20/14 - MON at 17:03 Location : ALL OTHER ROADWAYS/INTERS Event #: 14-01-20-001446"
## [12] "Date and Time Occurred From - Occurred To 01/20/14 - MON at 17:02 - 01/20/14 - MON at 17:03"
## [13] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #: 140154"
## [14] "Disposition: CLOSED BY ARREST"
## [15] "Date Reported: 01/20/14 - MON at 19:30 Location : EIGENMANN HALL Event #: 14-01-20-001464"
## [16] "Date and Time Occurred From - Occurred To"
## [17] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:"
## [18] "Disposition: FAILED TO LOCATE"
## [19] "Date Reported: 01/20/14 - MON at 20:22 Location : EIGENMANN HALL Event #: 14-01-20-001466"
## [20] "Date and Time Occurred From - Occurred To"
## [21] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:"
## [22] "Disposition: FAILED TO LOCATE"
## [23] "Date Reported: 01/20/14 - MON at 20:45 Location : FOSTER HARPER HALL Event #: 14-01-20-001468"
## [24] "Date and Time Occurred From - Occurred To"
## [25] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:"
## [26] "Disposition: FAILED TO LOCATE"
## [27] "Date Reported: 01/20/14 - MON at 21:38 Location : ALL OTHER NON-UNIVERSITY Event #: 14-01-20-001476"
## [28] "Date and Time Occurred From - Occurred To"
## [29] "Incident : ALL OTHER OFFENSES - HARASSMENT/INTIMIDATION Report #:"
## [30] "Disposition: NO CASE REPORT"
## [31] "Date Reported: 01/20/14 - MON at 21:53 Location : ROSE AVE RESIDENCE HALL Event #: 14-01-20-001479"
## [32] "Date and Time Occurred From - Occurred To"
## [33] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:"
## [34] "Disposition: FAILED TO LOCATE"
## [35] "Date Reported: 01/20/14 - MON at 22:30 Location : COLLINS COMMON AREA Event #: 14-01-20-001486"
## [36] "Date and Time Occurred From - Occurred To"
## [37] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:"
## [38] "Disposition: FAILED TO LOCATE"
## [39] "Date Reported: 01/20/14 - MON at 23:02 Location : FOREST QUAD Event #: 14-01-20-001487"
## [40] "Date and Time Occurred From - Occurred To 01/20/14 - MON at 22:45 - 01/20/14 - MON at 23:02"
## [41] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #: 140157"
## [42] "Disposition: CLOSED NO ARREST."
## [43] "Date Reported: 01/20/14 - MON at 23:07 Location : FOSTER JENKINSON HALL Event #: 14-01-20-001491"
## [44] "Date and Time Occurred From - Occurred To"
## [45] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:"
## [46] "Disposition: FAILED TO LOCATE"
## [47] "Date Reported: 01/20/14 - MON at 23:35 Location : ALL OTHER OPEN AREAS Event #: 14-01-20-001494"
## [48] "Date and Time Occurred From - Occurred To 01/20/14 - MON at 23:35 - 01/20/14 - MON at 23:41"
## [49] "Incident : ASSAULT - OTHER ASSAULTS - SIMPLE, NOT AGGRAVATED Report #: 140159"
## [50] "Disposition: CLOSED BY ARREST."
## [51] " 11 Incidents Listed."
## [52] ""
## [53] ""
## [54] ""
## [55] ""
## [56] " Print Date and Time 1/21/2014 12:23:52PM at Page No. 1"
## [57] ""
Now this looks much better! Our new pdf_data looks very similar to our raw PDF. However, notice in your environment that pdf_data is now a list containing a single character vector. A list wrapping a vector is going to make our life harder in this case, so I unlist() it.
## unlisting the data
pdf_data <- pdf_data %>% unlist()
pdf_data %>%
head(10)
## [1] " Indiana University, Bloomington"
## [2] " Police Department"
## [3] " Student Right To Know CAD Daily Log"
## [4] ""
## [5] " From Jan 20, 2014 to Jan 20, 2014."
## [6] ""
## [7] "Date Reported: 01/20/14 - MON at 12:22 Location : EIGENMANN HALL Event #: 14-01-20-001434"
## [8] "Date and Time Occurred From - Occurred To"
## [9] "Incident : NARCOTIC/DRUG LAWS - POSSESSION - MARIJUANA Report #:"
## [10] "Disposition: FAILED TO LOCATE"
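If the list-versus-vector distinction feels abstract, this small sketch (again on a made-up toy_page, not the real log) shows what str_split hands back and what unlist does to it. For a multi-page PDF, pdf_text returns one string per page, so str_split returns one list element per page and unlist stacks all the lines into one long vector.

```r
library(stringr)

## made-up text standing in for a single page
toy_page <- "header\nDate Reported: 01/20/14\nDisposition: NONE"

## str_split() always returns a list, with one element per input string
split_page <- str_split(toy_page, "\n")
class(split_page)        # "list"
length(split_page)       # 1 (one page went in)
length(split_page[[1]])  # 3 (that page has three lines)

## unlist() flattens the list into a plain character vector of lines
toy_lines <- unlist(split_page)
class(toy_lines)   # "character"
length(toy_lines)  # 3
```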
Step 3: Look for Patterns
Note that the information we want all starts on new lines. For example, Date Reported, Location, and Event # all start on a newline that begins with “Date Reported:”, Incident always starts on a new line with “Incident” etc. We are going to take advantage of this fact. For demonstration purposes, I will execute my code first, and then explain after:
## converting all the data to lowercase and trimming surrounding whitespace
pdf_data <- pdf_data %>% str_to_lower() %>% str_trim()
## finding the indices of the lines that begin with "date reported"
date_reported_lines <- pdf_data %>% str_detect("^date reported") %>% which()
What happened: the first line of code converts all of the data to lower case and trims leading and trailing whitespace (I lowercase to minimize common document errors such as forgetting to capitalize an entry), and the second line of code finds all lines that start with the phrase “date reported” and stores their indices in a new object. Notice that "^date reported" is a regular expression with the anchor “^”, which tells R to match “date reported” only at the very beginning of the line. which tells R to take the elements that evaluated to TRUE and report their indices in the vector. So in our example, lines 7, 11, 15, 19, 23, …, 47 all start with the expression “date reported”. This is convenient because we can now tell R to just “show us these indices” with the next step:
## showing the first five lines of all lines in pdf_data that start with "date reported"
pdf_data[date_reported_lines] %>% head(5)
## [1] "date reported: 01/20/14 - mon at 12:22 location : eigenmann hall event #: 14-01-20-001434"
## [2] "date reported: 01/20/14 - mon at 17:03 location : all other roadways/inters event #: 14-01-20-001446"
## [3] "date reported: 01/20/14 - mon at 19:30 location : eigenmann hall event #: 14-01-20-001464"
## [4] "date reported: 01/20/14 - mon at 20:22 location : eigenmann hall event #: 14-01-20-001466"
## [5] "date reported: 01/20/14 - mon at 20:45 location : foster harper hall event #: 14-01-20-001468"
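Here is the same detect-then-index idea on a made-up vector (toy_lines is hypothetical), split into its two halves so you can see the intermediate logical vector. Note the last element: it contains “date reported” but does not start with it, so the “^” anchor rejects it.

```r
library(stringr)

## made-up lines; only elements 2 and 4 START with "date reported"
toy_lines <- c(
  "header",
  "date reported: 01/20/14",
  "incident : example",
  "date reported: 01/21/14",
  "note: date reported late"
)

## str_detect() gives one TRUE/FALSE per line
is_match <- str_detect(toy_lines, "^date reported")
is_match

## which() turns the TRUEs into their positions
hits <- which(is_match)
hits

## indexing with those positions keeps only the matching lines
toy_lines[hits]
```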
Step 4: Put this data into a tibble
Next we are going to use the tibble::as_tibble function to convert this character vector into a tibble:
## Observe what the as_tibble() function output does
pdf_data[date_reported_lines] %>%
as_tibble() %>%
head(5)
## # A tibble: 5 × 1
## value
## <chr>
## 1 date reported: 01/20/14 - mon at 12:22 location : eigenmann hall …
## 2 date reported: 01/20/14 - mon at 17:03 location : all other roadways/int…
## 3 date reported: 01/20/14 - mon at 19:30 location : eigenmann hall …
## 4 date reported: 01/20/14 - mon at 20:22 location : eigenmann hall …
## 5 date reported: 01/20/14 - mon at 20:45 location : foster harper hall …
Notice how the tibble::as_tibble function puts all our information into a tibble with only one column named value. Hence, we’re going to use tidyr::extract on the value column to get our desired information. I’m going to start with the date reported so we can get a general feel. Note: I am going to set remove = FALSE for the first extract of date reported merely for demonstration purposes. I will not do this in the ones that follow.
date_reported <- pdf_data[date_reported_lines] %>%
as_tibble() %>%
extract(value, into = "date_reported", "(\\d{1,2}/\\d{1,2}/\\d{2})", remove = FALSE)
date_reported %>%
head()
## # A tibble: 6 × 2
## value date_…¹
## <chr> <chr>
## 1 date reported: 01/20/14 - mon at 12:22 location : eigenmann hall … 01/20/…
## 2 date reported: 01/20/14 - mon at 17:03 location : all other road… 01/20/…
## 3 date reported: 01/20/14 - mon at 19:30 location : eigenmann hall … 01/20/…
## 4 date reported: 01/20/14 - mon at 20:22 location : eigenmann hall … 01/20/…
## 5 date reported: 01/20/14 - mon at 20:45 location : foster harper ha… 01/20/…
## 6 date reported: 01/20/14 - mon at 21:38 location : all other non-un… 01/20/…
## # … with abbreviated variable name ¹date_reported
Now, review R for Data Science Chapter 14.3 on regular expressions (because no one remembers these) and I’ll give you the rundown on the "(\\d{1,2}/\\d{1,2}/\\d{2})" expression. This says to match any digit one to two times, followed by a forward slash, followed by any digit one to two times, followed by a forward slash, followed by exactly two digits. Recall that the parentheses tell the extract function that whatever is inside them is what you want to extract (hence, you can ignore them in your interpretation). This is the format that our date reported is in.
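If you want to poke at the date pattern outside of a tibble, here is a small sketch using stringr::str_extract on a single made-up line (toy_line is hypothetical). The conversion to a Date at the end is my own addition, and it assumes every date is month/day/two-digit-year like the ones in this log.

```r
library(stringr)

## one made-up line in the same layout as the crime log
toy_line <- "date reported: 01/20/14 - mon at 12:22 location : eigenmann hall event #: 14-01-20-001434"

## str_extract() returns the first piece of text that matches the pattern
date_chr <- str_extract(toy_line, "\\d{1,2}/\\d{1,2}/\\d{2}")
date_chr

## optional: turn the text into a real Date (assumes month/day/2-digit year)
date_parsed <- as.Date(date_chr, format = "%m/%d/%y")
date_parsed
```

A real Date column is handy later if you want to sort or filter by day, but keeping the text version is perfectly fine for a first pass.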
I can similarly do this with location:
## saving a tibble named location that contains our location data
location <- pdf_data[date_reported_lines] %>%
as_tibble() %>%
extract(value, into = "location", ".{1,}location\\s:\\s(.{1,})event", remove = TRUE)
location %>%
head()
## # A tibble: 6 × 1
## location
## <chr>
## 1 "eigenmann hall "
## 2 "all other roadways/inters "
## 3 "eigenmann hall "
## 4 "eigenmann hall "
## 5 "foster harper hall "
## 6 "all other non-university "
Here, the regular expression reads ".{1,}location\\s:\\s(.{1,})event". In plain English, this tells R to match any character one or more times, followed by the word location, followed by a whitespace character, a colon, and another whitespace character, and then extract the one or more characters that occur before the word event shows up. Yikes.
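To see what the capture group actually grabs, here is a sketch with stringr::str_match on one made-up line (toy_line is hypothetical, and I padded it with extra spaces on purpose). Because the group runs right up to the word event, it drags the padding along, so a str_trim afterwards is a cheap way to clean it up.

```r
library(stringr)

## made-up line with extra spaces between the location and "event"
toy_line <- "date reported: 01/20/14 - mon at 12:22 location : eigenmann hall   event #: 14-01-20-001434"

## str_match() returns a matrix: column 1 is the full match, column 2 the first group
m <- str_match(toy_line, ".{1,}location\\s:\\s(.{1,})event")

## the raw capture keeps the trailing spaces
raw_location <- m[, 2]
raw_location

## trimming gives a cleaner value to group or join on later
clean_location <- str_trim(raw_location)
clean_location
```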
And with event number:
## saving our event number data into a tibble
event_number <- pdf_data[date_reported_lines] %>%
as_tibble() %>%
extract(value, into = "event_number", ".{1,}event\\s#:\\s(.{1,})", remove = TRUE)
event_number %>%
head()
## # A tibble: 6 × 1
## event_number
## <chr>
## 1 14-01-20-001434
## 2 14-01-20-001446
## 3 14-01-20-001464
## 4 14-01-20-001466
## 5 14-01-20-001468
## 6 14-01-20-001476
The regular expression here is similar to the one above, although a little simpler: match any character one or more times, followed by the word event, followed by a whitespace character, followed by a hash sign, followed by a colon, followed by a whitespace character, and then extract the one or more characters that follow.
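As a sanity check, here is a sketch comparing the “everything after event #:” approach with a stricter pattern that spells out the shape of the number itself. The strict pattern is my own assumption that every event number looks like two digits, two digits, two digits, six digits (as in this page of the log); if other pages differ, it would return NA for them, which is actually a nice way to spot odd entries.

```r
library(stringr)

## one made-up line in the same layout as the crime log
toy_line <- "date reported: 01/20/14 - mon at 12:22 location : eigenmann hall event #: 14-01-20-001434"

## approach from the post: capture whatever follows "event #: "
loose <- str_match(toy_line, ".{1,}event\\s#:\\s(.{1,})")[, 2]
loose

## stricter alternative (assumed format): dd-dd-dd-dddddd
strict <- str_extract(toy_line, "\\d{2}-\\d{2}-\\d{2}-\\d{6}")
strict
```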
Step 5: Appending
Now that each of these is saved as its own tibble, I will bind them all together using dplyr::bind_cols.
## binding together the tibbles that contain all of our information
crime_log <- bind_cols(date_reported, location, event_number) %>%
select(-value) ## getting rid of the value column we saved in the first instance
crime_log
## # A tibble: 11 × 3
## date_reported location event_number
## <chr> <chr> <chr>
## 1 01/20/14 "eigenmann hall " 14-01-20-001434
## 2 01/20/14 "all other roadways/inters " 14-01-20-001446
## 3 01/20/14 "eigenmann hall " 14-01-20-001464
## 4 01/20/14 "eigenmann hall " 14-01-20-001466
## 5 01/20/14 "foster harper hall " 14-01-20-001468
## 6 01/20/14 "all other non-university " 14-01-20-001476
## 7 01/20/14 "rose ave residence hall " 14-01-20-001479
## 8 01/20/14 "collins common area " 14-01-20-001486
## 9 01/20/14 "forest quad " 14-01-20-001487
## 10 01/20/14 "foster jenkinson hall " 14-01-20-001491
## 11 01/20/14 "all other open areas " 14-01-20-001494
Done! We managed to take this crime log and extract the date reported, location, and event number from the PDF. Notice that you can follow a similar pattern for the other columns of interest.
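If you want a nudge on the exercise, here is one possible sketch for Incident and Disposition, run on made-up lines (toy_lines is hypothetical and already lowercased and trimmed). The patterns are my own guesses at what works for this layout, and the final tibble assumes every record has exactly one incident line and one disposition line, in the same order; if a record is missing one, the columns would silently misalign.

```r
library(stringr)
library(tibble)

## made-up lines in the same layout as the log
toy_lines <- c(
  "date reported: 01/20/14 - mon at 12:22 location : eigenmann hall event #: 14-01-20-001434",
  "date and time occurred from - occurred to",
  "incident : narcotic/drug laws - possession - marijuana report #:",
  "disposition: failed to locate",
  "date reported: 01/20/14 - mon at 17:03 location : all other roadways/inters event #: 14-01-20-001446",
  "date and time occurred from - occurred to 01/20/14 - mon at 17:02 - 01/20/14 - mon at 17:03",
  "incident : narcotic/drug laws - possession - marijuana report #: 140154",
  "disposition: closed by arrest"
)

## same trick as before: keep only the lines that start with the label
incident_lines <- toy_lines[str_detect(toy_lines, "^incident")]
disposition_lines <- toy_lines[str_detect(toy_lines, "^disposition")]

## capture the text between the label and "report #:", then trim the padding
incident <- str_trim(str_match(incident_lines, "^incident\\s*:\\s*(.+)report\\s#:")[, 2])

## capture everything after the label
disposition <- str_trim(str_match(disposition_lines, "^disposition:\\s*(.+)$")[, 2])

## assumes one incident and one disposition per record, in matching order
other_columns <- tibble(incident = incident, disposition = disposition)
other_columns
```

It is worth comparing the number of rows here against the number of “date reported” lines before binding anything together.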
Summary
We managed to take a raw PDF file and convert it to a tidy tibble just by using string manipulations and regular expressions. If you are comfortable with the tidyr::extract function and regular expressions, you could actually do this in even fewer lines. Here is what my code might look like if I were doing this in my own .R file:
## load in the data
pdf_data <- pdf_text("crime_log.pdf") %>% str_split('\n') %>% unlist() %>% str_to_lower() %>% str_trim()
## extract the indices of interest
date_reported_lines <- pdf_data %>% str_detect("^date reported") %>% which()
## put into a tibble
crime_log <- pdf_data[date_reported_lines] %>%
as_tibble() %>%
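## anchor the pattern at the start of the line: a leading greedy .{1,} would
## swallow the first digit of the date and return "1/20/14" instead of "01/20/14"
extract(value, into = c("date_reported", "location", "event_number"),
regex = "^date reported:\\s*(\\d{1,2}/\\d{1,2}/\\d{2}).{1,}location\\s:\\s(.{1,})event\\s#:\\s(.{1,})")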
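Since I had thousands of these files, in practice it helps to wrap the whole pipeline in a function. Below is a rough sketch, not my exact production code: parse_crime_log is a hypothetical name, it takes the text that pdf_text would return, and it anchors the pattern at the start of the line so the date keeps its leading zero. I test it here on a made-up page rather than a real PDF.

```r
library(stringr)
library(tibble)
library(tidyr)
library(dplyr)

## hypothetical helper: page text in, one row per "date reported" line out
parse_crime_log <- function(page_text) {
  lines <- page_text %>%
    str_split("\n") %>%
    unlist() %>%
    str_to_lower() %>%
    str_trim()

  tibble(value = lines[str_detect(lines, "^date reported")]) %>%
    extract(
      value,
      into = c("date_reported", "location", "event_number"),
      regex = "^date reported:\\s*(\\d{1,2}/\\d{1,2}/\\d{2}).*location\\s:\\s(.+)event\\s#:\\s*(.+)$"
    ) %>%
    ## the location capture keeps the padding before "event", so trim it
    mutate(location = str_trim(location))
}

## made-up page text standing in for pdf_text() output
toy_page <- paste(
  "   Police Department",
  "Date Reported: 01/20/14 - MON at 12:22 Location : EIGENMANN HALL Event #: 14-01-20-001434",
  "Disposition: FAILED TO LOCATE",
  "Date Reported: 01/20/14 - MON at 17:03 Location : FOREST QUAD Event #: 14-01-20-001446",
  sep = "\n"
)

toy_log <- parse_crime_log(toy_page)
toy_log
```

With real files, the idea would be to call something like parse_crime_log(pdf_text("crime_log.pdf")) for each file and stack the results; lines that do not fit the pattern come back as NA, which makes them easy to find and inspect.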