Combining PDF and web data to create a database of school results
The task
The objective of this post is to generate a data frame with the average marks obtained by the 17-year-old students of each high school in the university entrance test (EvAU) in the Spanish region of Castilla-La Mancha. The average grades will be enriched with some school characteristics (e.g. public/private, number of students or location).
The EvAU results by school and subject are published in PDF here (for 2017/2018). The data needs to be extracted from the PDF and sorted.
There is no list with the school characteristics, but there is a database that can be queried here. The data needs to be queried from the database and scraped.
The information from both sources needs to be merged into a single file in long format.
Tools
These are the tools to be used:
library(tidyverse)
library(tabulizer)
library(rvest)
library(stringdist)
tidyverse: Data wrangling, especially the stringr package
tabulizer: Extract data from PDFs
rvest: Web scraping
stringdist: Probabilistic matching using strings
Step 1: Extract data from PDF
First, we extract the tables from the PDF:
# Parse the tables from the PDF
tables <- extract_tables("data/EvAU_18_CLM.pdf", encoding = "UTF-8")
# Load a previously saved copy of the extraction output
tables <- read_rds("data/tables.RDS")
Then, we put together all the matrices in the list tables, the output of extract_tables, into a data frame:
data_evau <- map(tables, as_tibble) %>%
  bind_rows()
Before cleaning the data, let's have a look to identify what we need to fix.
head(data_evau, 10)
There are three columns (V1, V2 and V3) instead of one. That's an indicator that some of the strings might be split. The next step is to put together all the text in a single column text and remove all the NAs from that new column. Then we remove the rows that do not refer to a center or a subject title.
data_evau <- data_evau %>%
  mutate(text = paste(V1, V2, V3, sep = " "),
         text = str_remove_all(text, "NA")) %>%
  filter(str_starts(text, "Asignatura") | str_starts(text, "\\("))
We extract from the column text the fields we need for the analysis (subject, center_code, center_name, pass, fail, average_mark). To extract these fields we use the stringr package, which has a set of functions to work with strings in R.
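To make the positional logic below easier to follow, here is a hypothetical school row in the shape the code expects (the exact wording is invented for illustration, not copied from the real PDF):

x <- "(4500) IES EJEMPLO: 25 aptos y 3 suspensos. Media: 6,54"  # hypothetical line
str_sub(x, 2, 5)                            # "4500", the center code
str_sub(x, 8, str_locate(x, "\\:")[1] - 1)  # "IES EJEMPLO", the center name
str_sub(x, str_locate(x, "Media:")[2] + 1,
        str_locate(x, "Media:")[2] + 6)     # " 6,54", the average mark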
data_evau <- data_evau %>%
  rowwise() %>%
  mutate(
    subject = ifelse(str_starts(text, "Asignatura"), text, NA_character_),
    subject = str_sub(subject, 12, str_length(subject)),
    center_code = str_sub(text, 2, 5),
    center_name = str_sub(text, 8, str_locate(text, "\\:")[1] - 1),
    pass = str_sub(text, str_locate(text, "\\:")[1] + 1, str_locate(text, "\\:")[1] + 3),
    fail = str_sub(text, str_locate(text, "suspensos")[1] - 4, str_locate(text, "suspensos")[1] - 1),
    fail = str_remove_all(fail, "y"),
    average_mark = str_sub(text, str_locate(text, "Media:")[2] + 1, str_locate(text, "Media:")[2] + 6)
  ) %>%
  select(-starts_with("V"), -text)
The next step has to do with the variable subject. Right now subject is mostly populated with NAs due to the structure of the PDF. The objective is to replace each NA with the previous valid value. To do this I use the function zoo::na.locf; you can find an instructive example here.
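As a quick illustration, zoo::na.locf carries the last non-NA observation forward:

zoo::na.locf(c("Maths", NA, NA, "History", NA))
# [1] "Maths"   "Maths"   "Maths"   "History" "History"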
data_evau$subject <- zoo::na.locf(data_evau$subject)
To finalise this part we delete the rows that contain only subject names. To do this I keep the rows that have a valid (non-NA) value in the average_mark column. We also change the type of pass, fail, and average_mark to numeric, substituting the commas with points in the latter case.
data_evau <- data_evau %>%
  filter(!is.na(average_mark)) %>%
  mutate(pass = as.integer(pass),
         fail = as.integer(fail),
         average_mark = as.double(str_replace(average_mark, "\\,", "\\.")))
Step 2: School characteristics
In the second step we need to create a database with school characteristics. To do that we use the information published in this database, in two steps: first we locate the link for each school and, afterwards, we scrape the school information from the web and put it in a data frame. Before scraping, we check that this is allowed in the robots.txt of the domain (more info here).
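One way to run that check from R is the robotstxt package; this snippet is a suggestion and not part of the original workflow:

library(robotstxt)
# TRUE means the path may be fetched according to the site's robots.txt
paths_allowed("http://www.educa.jccm.es/educacion/cm/educa_jccm/")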
Searching for all schools on the site returns a list of links to every school. We need to go through the results one by one and extract each link. To do this we create an ad-hoc function which, for each of the 1753 results in the list, returns the name of the school and a link to access the detailed information.
extract_links <- function(i){
  tryCatch({
    url <- html_session(paste0("http://www.educa.jccm.es/educacion/cm/educa_jccm/BBDD_ACCESS.1.1.tkContent.27265/tkListResults?formName=SQLQueriesSearcher&nshow.sqlResults=1&position.sqlResults=", i, "&idQuery=961"))
    page <- read_html(url)
    # School name shown in the results list
    name <- page %>%
      html_node(".campListNOMBRE a") %>%
      html_text()
    # Relative link to the school's detail page
    url_school <- page %>%
      html_node(".campListNOMBRE a") %>%
      html_attr("href")
    tibble(name = str_to_lower(name), url_school = url_school)
  }, error = function(e) NULL)  # return NULL on failure so map_df skips this result
}
Once the function is ready we use purrr::map_df to loop over the results of the list and create a data frame binding all 1753 data frames.
list_map <- 1:1753
links_school <- map_df(list_map, extract_links)
links_school <- read_rds("data/links_school.RDS")
For this project we only want the data of the high schools, so in order to avoid scraping data that won't be used, the next step is to merge the EvAU and the links datasets. The merge is performed using the school name. As the names do not match exactly across sources, we use a probabilistic matching strategy.
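stringdist::amatch() returns, for each string, the index of the closest entry in a lookup table within maxDist edits, or NA when nothing is close enough. A toy example with made-up names:

schools <- c("san jose", "santa ana", "la paz")
amatch("santa anna", schools, maxDist = 4)           # 2: "santa ana" is one edit away
amatch("instituto cervantes", schools, maxDist = 4)  # NA: nothing within 4 edits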
data_evau_to_match <- data_evau %>%
  mutate(name_match = str_remove_all(center_name, 'COLEGIO SALESIANO|COLEGIO|I.E.S.|CENTRO F.P. ESP.|ESCUELA DE ARTES|ESCUELA DE ARTE|\\.|\\"'),
         name_match = str_to_lower(name_match),
         name_match = str_trim(name_match, side = "both")) %>%
  group_by(center_name) %>%
  summarise(name_match = first(name_match))
links_school$match_index <- as.integer(rownames(links_school))
match_index <- amatch(data_evau_to_match$name_match, links_school$name, maxDist = 4)
data_evau_to_match <- cbind(data_evau_to_match, match_index) %>%
  left_join(links_school, by = "match_index")
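A quick diagnostic, not in the original workflow, is to count how many school names failed to find a counterpart within the distance threshold:

# Schools whose name found no match within maxDist = 4
sum(is.na(data_evau_to_match$match_index))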
Now that we have a list with all the links to the detailed information, we repeat the process to extract it. First, we define the function to extract the required information; then we use a map to go through each school.
extract_school_info <- function(url_school){
  tryCatch({
    url <- html_session(paste0("http://www.educa.jccm.es", url_school))
    page <- read_html(url)
    # Field labels on the detail page
    fields <- page %>%
      html_nodes(".fieldDetailView") %>%
      html_text() %>%
      str_remove_all(": ")
    # Corresponding values, stripped of tabs and line breaks
    charact <- page %>%
      html_nodes(".valueDetailView") %>%
      html_text() %>%
      str_remove_all("\t|\n|\r")
    # Build a one-row data frame with one column per field
    return_df <- rbind(tibble(), charact)
    colnames(return_df) <- fields
    return_df$url_school <- url_school
    return_df
  }, error = function(e) NULL)  # return NULL on failure so map_df skips the school
}
Now it’s time to scrape the information about the schools.
schools_urls <- unique(data_evau_to_match$url_school)[!is.na(unique(data_evau_to_match$url_school))]
school_data <- map_df(schools_urls, extract_school_info)
school_data <- read_rds("data/school_data.RDS")
Step 3: Merge EvAU results and school characteristics
First, I merge the school characteristics I just scraped with data_evau_to_match using the url_school column:
school_data <- left_join(data_evau_to_match, school_data, by = "url_school") %>%
  select(-name_match, -match_index, -name, -url_school, -Imagen, -`Situación`, -`Teléfono`, -Fax, -Email, -Web, -Nif, -`NIF/CIF`)
Then I match the school data and the EvAU results, and clean the final file by dropping the cases with NAs in the school characteristics.
final_df <- left_join(data_evau, school_data, by = "center_name") %>%
  filter(!is.na(`Código del Centro`))
head(final_df, 10)
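Since the stated goal was a single long-format file, the result can be written out, for instance (the file name is an arbitrary choice):

write_csv(final_df, "data/evau_schools_clm.csv")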