Combining PDF and web data to create a data base of school results

Gathering data from PDF and webpages using R



June 9, 2019

The task

The objective of this post is to generate a data frame with the average marks obtained by the 17-year students of each high school in the university entrance test (EvAU) in the Spanish region of Castilla-La Mancha. The average grades will be enriched with some school characteristics (e.g. public/private, number of students or location).

  1. The EvAU results by school and subject are published in PDF here (for 2017/2018). The data needs to be extracted form the pdf and sorted.

  2. There is no list with the schools characteristics but there is database that can be query here. The data needs to be queried from the database and scraped.

  3. The information from both sources need to be merged in a single file with long format.


These are the tools to be used:

  • tidyverse: Data wrangling, specially the package stringr
  • tabulizer: Extrat data from PDFs
  • rvest: Web scraping
  • stringdist: Probabilistic match using strings

Step 1: Extract data from PDF

First, we extract the tables from the PDF:

tables <- extract_tables("data/EvAU_18_CLM.pdf", encoding = "UTF-8")
tables <- read_rds("data/tables.RDS")

Then, we put together all matrices in the list tables, the output of extract_tables, into a data frame:

data_evau <- map(tables, as_tibble) %>% 

Before cleaning the data let’s have a look to identify what we need to fix.

head(data_evau, 10)

There are three columns (V1, V2 and V3) instead of one. That’s an indicator that some of the strings might be split. The next step is to put together all text in a single column text and remove all the NAs from that new column. Then remove the rows which are not referred to a center or a subject title.

data_evau <- data_evau %>% 
  mutate(text = paste(V1, V2, V3, sep = " "),
         text = str_remove_all(text, "NA")) %>% 
  filter(str_starts(text, "Asignatura") | str_starts(text, "\\("))

We extract from the column text the fields we need for the analysis (subject, center_code, center_name, pass, fail, average_mark). To extract these fields we use the stringr package which has a set of functions to work with strings in R.

data_evau <- data_evau %>% 
  rowwise() %>% 
         subject = ifelse(str_starts(text, "Asignatura") == T, text, NA_character_),
         subject = str_sub(subject, 12, str_length(subject)),
         center_code = str_sub(text, 2, 5),
         center_name = str_sub(text, 8, str_locate(text, "\\:")[1]-1),
         pass = str_sub(text, str_locate(text, "\\:")[1]+1, str_locate(text, "\\:")[1]+3),
         fail = str_sub(text, str_locate(text, "suspensos")[1]-4, str_locate(text, "suspensos")[1]-1),
         fail = str_remove_all(fail, "y"),
         average_mark = str_sub(text, str_locate(text, "Media:")[2]+1, str_locate(text, "Media:")[2]+6)
         ) %>% 
  select(-starts_with("V"), -text)

The next step has to do with the variable subject. Now the subject is mostly populated with NAs due to the structure of the pdf. The objective now is to convert the NAs to the previous valid value. To do this I use the function zoo:na.locf, you can find an instructive example here.

data_evau$subject <- zoo::na.locf(data_evau$subject)

To finalise this part we delete the rows with only subject names. To do this I filter the rows that have a valid (no NA) value in the average_mark column. Also change the variable type of pass, fail, and average_mark to numeric, substituting the commas with points in the latter case.

data_evau <- data_evau %>% 
  filter(! %>% 
  mutate(pass = as.integer(pass),
         fail = as.integer(fail),
         average_mark = as.double(str_replace(average_mark, "\\,", "\\.")))

Step 2: School characteristics

In the second step we need to create a data base with school characteristics. To do that we use the information published in this database. So in two steps, first we locate the links of each school and, afterwards, we scrape the school information from the web and put it in a data frame. Before scraping we check that this is allowed in the robots.txt of the domain, more info here.

After looking up all schools in the web a list of links to all schools showed up. We need to go one-by-one and extract the link. To do this we create and ad-hoc function which for each of the 1753 results in the list returns the name of the school and a link to access the detailed information.

extract_links <- function(i){
  url <- html_session(paste0("", i, "&idQuery=961"))
  name <- read_html(url) %>% 
    html_node(".campListNOMBRE a") %>% 
  url_school <- read_html(url) %>% 
    html_node(".campListNOMBRE a") %>% 
  return_df <- tibble(name = str_to_lower(name), url_school = url_school)
  return(tryCatch(return_df, error = function(e) NULL))

Once the function is ready we use the purrr::map_df loop over the results of the list and create a data frame binding all 1753 data frames.

list_map <- 1:1753
links_school <- map_df(list_map, extract_links)
links_school <- read_rds("data/links_school.RDS")

For this project we only want the data of the high schools so in order to avoid scraping data that I won’t use, the next step will be to merge the EVaU and the links datasets. The merge will be performed using the school name. As this could be problematic, we use a probability merge strategy.

data_evau_to_match <- data_evau %>% 
  mutate(name_match = str_remove_all(center_name, 'COLEGIO SALESIANO|COLEGIO|I.E.S.|CENTRO F.P. ESP.|ESCUELA DE ARTES|ESCUELA DE ARTE|\\.|\\"'),
         name_match = str_to_lower(name_match),
         name_match = str_trim(name_match, side = "both")) %>% 
  group_by(center_name) %>% 
  summarise(name_match = first(name_match))

links_school$match_index <- as.integer(rownames(links_school))

match_index <- amatch(data_evau_to_match$name_match, links_school$name, maxDist = 4)

data_evau_to_match <- cbind(data_evau_to_match, match_index) %>% 
  left_join(links_school, by = "match_index")

Now that we have a list with all the links to detailed information, we repeat the process to extract the information. First, we define the function to extract the required information, then I will use a map to go through each school.

extract_school_info <- function(url_school){
  url <- html_session(paste0("", url_school))
  fields <- read_html(url) %>% 
    html_nodes(".fieldDetailView") %>% 
    html_text() %>% 
    str_remove_all(": ")
  charact <- read_html(url) %>% 
    html_nodes(".valueDetailView") %>% 
    html_text() %>% 

  return_df <- tibble()
  return_df <- rbind(return_df, charact)
  colnames(return_df) <- fields
  return_df$url_school <- url_school
  return(tryCatch(return_df, error = function(e) NULL))

Now it’s time to escrape the information about the schools.

schools_urls <- unique(data_evau_to_match$url_school)[!$url_school))]
school_data <- map_df(schools_urls, extract_school_info)
school_data <- read_rds("data/school_data.RDS")

Step 3: Merge EVaU results and school characteristics

First, I merge the school characteristics I just scraped with the data_evau_to_match using the url_school

school_data <- left_join(data_evau_to_match, school_data, by = "url_school") %>% 
  select(-name_match, -match_index, -name, -url_school, -Imagen, -`Situación`, -`Teléfono`, -Fax, -Email, -Web, -Nif, -`NIF/CIF`)

Then I match school data and EVaU results and clean the final file dropping those case with NAs for the school characteristics.

final_df <- left_join(data_evau, school_data, by = "center_name") %>% 
  filter(!`Código del Centro`))

head(final_df, 10)