Merging election results and census data from Argentina
Full script and data can be found here.
This data comes with no guarantee, please report any bug you may find.
The task
The Argentine general election will take place in October. Last August, after the mandatory simultaneous primaries elections (PASO) I tried to analyse the results alongside census data for the province of Buenos Aires (BA) and Ciudad de Buenos Aires (CABA). The election results aggregated at different levels (i.e. province, sección, circuito, polling station and ballot box) can be downloaded from the official web and the 2010 census data as well. However, the problems start here. In theory, electoral geography and census geography are two independent entites in Argentina. The census wards boundaries are drawn by INDEC, the national statistical office, while the electoral districts are delimited by judges. Most of the times, judges tend to use the census boundaries, as it is pointed in this tweet:
Son independientes pero mi impresión es que a los jueces electorales les alcanza el sentido común para trazarlos de modo que contengan segmentos relativamente homogéneos y así no agitar el conflicto social el día de la elección
— Andy Tow (@andy_tow) August 14, 2019
So far, two problems here:
- Most of the time does not mean always; sometimes the census and electoral boundaries are not the same
- There is not (or I couldn’t find) a public available lookup file to match census and electoral wards (which use different ID-coding systems just to make this task more fun!)
The task of merging the electoral and census datasets has four steps:
Clean electoral data. Here I clean the data from the 2015 presidential PASO, the 2015 presidential election, and the 2019 presidential PASO. They are in long format in several files so I have to extrat the relevant information from each file and create two versions, long and wide, of the files for each election.
Extract and clean census data. The census data at radio level was downloaded from this web using web scraping and then joint. Thanks to @manuelaristaran for his previous work gathering the radio level files of the 2010 census.
Create a geographical lookup to match census radios to electoral circuitos. This is the most delicate part of the task. The census radios are smaller (measured as total area) than the electoral circuitos so, normally, several census radios will lie within the limits of a electoral circuito. The first objective is to determine the percentage of the area of the census radios that lie within the boundaries of a circuito. Therefore, to estimate the census characteristics of electoral circuitos, we will use the overlapping surface between census radios and electoral circuitos. The second objective is to create a lookup file to match the electoral and the INDEC geographical IDs.
Join eletoral results and census. Finally, the electoral and census data are merged using the lookup file. The final file is stored and documented here.
Tools
To complete the task I will basically use the packages:
tidyverse
: All you need for data wrangling.rvest
: This package for easy web scraping saves time so you don’t need to download the files one by one.sf
: I use this package to del with map objects. Here, I draw two layers of polygons (i.e. census and electoral geography) and find out the level of overlap.
1. Cleaning electoral data
In the first step I prepare the electoral data.
1.1 2019 PASO presidential election
The original files were downloaded from here as “|” separated. I used tidyverse::read_delim()
to read-in the files. There are four data frames:
- Candidate labels
- Region labels
- Electoral results totals (e.g. voters, blanks…)
- Electoral results candidates
The objective is to create a summary file including the share of the valid vote for the two major candidates, the opositor Frente de Todos (FdT) - Alberto Fernández and the incumbement-likely-to-lose Juntos por el Cambio (JxC) - Mauricio Macri, and others. The valid vote needs to be aggregated at electoral circuito level.
## Load the files (source: https://www.resultados2019.gob.ar/)
<- read_delim("data/paso2019/descripcion_postulaciones.dsv",
paso19_cand_id delim = "|") %>%
rename_all(str_to_lower) # cadidate labels
<- read_delim("data/paso2019/descripcion_regiones.dsv",
paso19_reg_id delim = "|") %>%
rename_all(str_to_lower) # region labels
<- read_delim("data/paso2019/mesas_totales.dsv",
paso19_totals delim = "|") %>%
rename_all(str_to_lower) # electoral totals
<- read_delim("data/paso2019/mesas_totales_agrp_politica.dsv",
paso19_cand delim = "|") %>%
rename_all(str_to_lower) # electoral votes to candidates
The first step is to get the number of blank votes in each ballot box, the level at which these data frames are aggregate.
## Get blank votes to compute valid and percentages
<- paso19_totals %>%
paso19_totals_pres_blank filter(codigo_categoria == "000100000000000", # filter presidential election
== "VB") %>% # filter blank vote
contador select(codigo_mesa, valor) %>%
rename(votos_blanco = valor)
Second, prepare the labels for the regions and the parties to match with the results data in the next step.
## Prepare regional identifiers
<- paso19_reg_id %>%
paso19_reg_id mutate(codigo_distrito = codigo_region,
codigo_seccion = codigo_region,
codigo_circuito = codigo_region) %>%
rename(name = nombre_region)
## Select party names
<- paso19_cand_id %>%
paso19_cand_id filter(codigo_categoria == "000100000000000") %>%
group_by(codigo_agrupacion) %>%
summarise(nombre_agrupacion = first(nombre_agrupacion))
Third, I compute the sum of the votes to candidatures and add them to the blank votes to get the sum of valid votes. Note that in the PASO election people can vote for the candiates of one party and there might be more than one candidate per party. In theory, the main objective of this election is to select candiates within parties, however, in practice, it has become an election-dress-rehearsal and the parties just postulate one candidate.
## Votes to candidature to compute total valids
<- paso19_cand %>%
paso19_cand_pres filter(codigo_categoria == "000100000000000") %>%
group_by(codigo_mesa) %>%
mutate(votos_candidatura = sum(votos_agrupacion)) %>%
ungroup()
Fourth, the previous data frames are joint and the result is a semi-long dataset. This means that each row represents a candidature and its votes in a given ballot box, but each row also contains the total number of blank votes and valid votes for each ballot box.
## join files
<- paso19_cand_pres %>%
paso19_mesa left_join(paso19_totals_pres_blank, by = "codigo_mesa") %>%
mutate(votos_validos = votos_candidatura + votos_blanco) %>%
left_join(select(paso19_reg_id, codigo_distrito, name), by = "codigo_distrito") %>%
rename(name_distrito = name) %>%
left_join(select(paso19_reg_id, codigo_seccion, name), by = "codigo_seccion") %>%
rename(name_seccion = name) %>%
left_join(paso19_cand_id, by = "codigo_agrupacion") %>%
select(codigo_distrito, name_distrito, codigo_seccion,
name_seccion, codigo_circuito,
codigo_mesa, votos_blanco, votos_validos,
codigo_agrupacion, nombre_agrupacion, votos_agrupacion)
Fifth, the file is aggregated at circuito level.
# blank and valid aggregate
<- paso19_mesa %>%
paso19_circuito_long_totals group_by(codigo_circuito, codigo_mesa) %>%
summarise(votos_blanco = first(votos_blanco),
votos_validos = first(votos_validos)) %>%
ungroup() %>%
group_by(codigo_circuito) %>%
summarise(votos_blanco = sum(votos_blanco),
votos_validos = sum(votos_validos))
# filter for CABA and BA and match with blank and valid
<- paso19_mesa %>%
paso19_circuito_long select(-votos_blanco, -votos_validos) %>%
filter(codigo_distrito %in% c("01", "02")) %>% # filter CABA and BA
mutate(partido = recode(nombre_agrupacion,
"JUNTOS POR EL CAMBIO" = "Juntos por el Cambio",
"FRENTE DE TODOS" = "Frente de Todos",
.default = "Otros")) %>%
group_by(codigo_circuito, partido) %>%
summarise(votos_candidatura = sum(votos_agrupacion)) %>%
left_join(paso19_circuito_long_totals, by = "codigo_circuito") %>%
rename(id_circuito_elec = codigo_circuito) %>%
mutate(year = 2019) %>%
select(year, id_circuito_elec, votos_blanco, votos_validos, partido, votos_candidatura)
Then, I change the resulting long file to a wide format so parties are in columns and each row represents a circuito. Finally, I save both, the long and the wide files.
## to wide format
<- spread(paso19_circuito_long, key = partido, value = votos_candidatura) %>%
paso19_circuito_wide rename(paso19_cand_FdT = `Frente de Todos`,
paso19_cand_JxC = `Juntos por el Cambio`,
paso19_cand_Otros = Otros,
paso19_blanco = votos_blanco,
paso19_validos = votos_validos) %>%
select(-year)
## save file
write_rds(paso19_circuito_long, "data/PASO_2019_circuito_long.RDS")
write_rds(paso19_circuito_wide, "data/PASO_2019_circuito_wide.RDS")
1.2 2015 PASO presidential election
These datasets come from a different source. Three csv’s are loaded, one for the province of BA, one for the CABA, and one containing the names of the candidatures.
<- read_csv2("data/paso2015/codigosbasicospaso2015provisional/FPARTIDOS.csv") %>%
paso15_cand_id rename_all(str_to_lower)
<- read_csv2("data/paso2015/presidentepaso2015provisional/FMESPR_0101.csv") %>%
paso15_cand_caba rename_all(str_to_lower)
<- read_csv2("data/paso2015/presidentepaso2015provisional/FMESPR_0202.csv") %>%
paso15_cand_ba rename_all(str_to_lower)
<- bind_rows(paso15_cand_ba, paso15_cand_caba) # create a joint file of BA and CABA paso15_cand
First, I compute the number of valid votes of each circuito.
## get blank votes
<- paso15_cand %>%
paso15_blanks filter(`codigo votos` == 9004) %>% # code for blank votes
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`) %>%
summarise(blancos = sum(as.integer(votos)))
## get valid votes
<- paso15_cand %>%
paso15_sum_votes filter(`codigo votos` < 9000) %>%
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`) %>%
summarise(candidaturas = sum(as.integer(votos)))
<- left_join(paso15_sum_votes, paso15_blanks,
paso15_valid by = c("codigo provincia", "codigo departamento", "codigo circuito"))
Second, I get the votes obtained by each of the candidatures. Each row corresponds to a party in a circuito.
## get cand votes
<- paso15_cand_id %>%
paso15_cand_id mutate(`codigo votos` = as.numeric(codigo_partido)) %>%
select(-codigo_partido, -lista_interna, -agrupacion)
<- paso15_cand %>%
paso15_votes filter(`codigo votos` < 9000) %>% # codes of candidatures are lower than 9k
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, `codigo votos`) %>%
summarise(candidatura = sum(as.integer(votos))) %>%
left_join(paso15_cand_id, by = "codigo votos") %>%
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, denominacion) %>%
summarise(candidatura = sum(candidatura)) %>%
mutate(denominacion = recode(denominacion,
"ALIANZA CAMBIEMOS" = "Cambiemos",
"ALIANZA FRENTE PARA LA VICTORIA" = "Frente Para la Victoria" ,
"ALIANZA UNIDOS POR UNA NUEVA ALTERNATIVA (UNA)" = "UNA",
.default = "Otros")) %>%
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, denominacion) %>%
summarise(candidatura = sum(candidatura)) %>%
rename(partido = denominacion,
votos_candidatura = candidatura) %>%
ungroup()
Third, I combine the votes data frame from the previous step and the valid votes to get a semi-long data frame.
## join files semi-long
<- paso15_votes %>%
paso15_circuito_long left_join(paso15_valid, by = c("codigo provincia", "codigo departamento", "codigo circuito")) %>%
mutate(year = 2015,
ln_circuito = str_length(`codigo circuito`),
zeroes = case_when(
== 2 ~ "0000",
ln_circuito == 3 ~ "000",
ln_circuito == 4 ~ "00",
ln_circuito == 5 ~ "0",
ln_circuito == 6 ~ ""
ln_circuito
),id_circuito_elec = paste0(`codigo provincia`, `codigo departamento`, zeroes, `codigo circuito`),
votos_validos = blancos + candidaturas) %>%
rename(votos_blanco = blancos) %>%
select(year, id_circuito_elec, partido, votos_blanco, votos_validos, votos_candidatura)
Fourth, I switch the file to wide format and save the data frames.
## join files wide
<- spread(paso15_circuito_long, key = partido, value = votos_candidatura) %>%
paso15_circuito_wide rename(paso15_cand_FPV = `Frente Para la Victoria`,
paso15_cand_Cam = Cambiemos,
paso15_cand_UNA = UNA,
paso15_cand_Otros = Otros,
paso15_blanco = votos_blanco,
paso15_validos = votos_validos) %>%
select(-year)
## join files wide
write_rds(paso15_circuito_long, "data/PASO_2015_circuito_long.RDS")
write_rds(paso15_circuito_wide, "data/PASO_2015_circuito_wide.RDS")
1.3 2015 presidential election
I repeat the process to generate two data frames of the 2015 first round presidental election.
# A.3. Create a presi 2015 file -----------------------------------------
<- read_csv2("data/paso2015/codigosbasicospaso2015provisional/FPARTIDOS.csv") %>%
pres15_cand_id rename_all(str_to_lower)
<- read_csv2("data/pres2015/FMESPR_0101.csv") %>%
pres15_cand_caba rename_all(str_to_lower)
<- read_csv2("data/pres2015/FMESPR_0202.csv") %>%
pres15_cand_ba rename_all(str_to_lower)
<- bind_rows(pres15_cand_ba, pres15_cand_caba) pres15_cand
First, I compute the number of valid votes of each circuito.
## get blank votes
<- pres15_cand %>%
pres15_blanks filter(`codigo votos` == 9004) %>%
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`) %>%
summarise(blancos = sum(as.integer(votos)))
## get valid votes
<- pres15_cand %>%
pres15_sum_votes filter(`codigo votos` < 9000) %>%
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`) %>%
summarise(candidaturas = sum(as.integer(votos)))
<- left_join(pres15_sum_votes, pres15_blanks, by = c("codigo provincia", "codigo departamento", "codigo circuito")) pres15_valid
Second, I get the votes obtained by each of the candidatures. Each row corresponds to a party in a circuito.
## get cand votes
<- pres15_cand_id %>%
pres15_cand_id mutate(`codigo votos` = as.numeric(codigo_partido)) %>%
select(-codigo_partido, -lista_interna, -agrupacion)
<- pres15_cand %>%
pres15_votes filter(`codigo votos` < 9000) %>%
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, `codigo votos`) %>%
summarise(candidatura = sum(as.integer(votos))) %>%
mutate(`codigo votos` = as.numeric(`codigo votos`)) %>%
left_join(pres15_cand_id, by = "codigo votos") %>%
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, denominacion) %>%
summarise(candidatura = sum(candidatura)) %>%
mutate(denominacion = recode(denominacion,
"ALIANZA CAMBIEMOS" = "Cambiemos",
"ALIANZA FRENTE PARA LA VICTORIA" = "Frente Para la Victoria" ,
"ALIANZA UNIDOS POR UNA NUEVA ALTERNATIVA (UNA)" = "UNA",
.default = "Otros")) %>%
group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, denominacion) %>%
summarise(candidatura = sum(candidatura)) %>%
rename(partido = denominacion,
votos_candidatura = candidatura) %>%
ungroup()
Third, I combine the votes data frame from the previous step with the valid votes to get a semi-long data frame.
## join files semi-long
<- pres15_votes %>%
pres15_circuito_long left_join(pres15_valid, by = c("codigo provincia", "codigo departamento", "codigo circuito")) %>%
mutate(year = 2015,
ln_circuito = str_length(`codigo circuito`),
zeroes = case_when(
== 2 ~ "0000",
ln_circuito == 3 ~ "000",
ln_circuito == 4 ~ "00",
ln_circuito == 5 ~ "0",
ln_circuito == 6 ~ ""
ln_circuito
),id_circuito_elec = paste0(`codigo provincia`, `codigo departamento`, zeroes, `codigo circuito`),
votos_validos = blancos + candidaturas) %>%
rename(votos_blanco = blancos) %>%
select(year, id_circuito_elec, partido, votos_blanco, votos_validos, votos_candidatura)
Fourth, I switch the file to wide format and save the files to use them later.
## transform to wide format
<- spread(pres15_circuito_long, key = partido, value = votos_candidatura) %>%
pres15_circuito_wide rename(pres15_cand_FPV = `Frente Para la Victoria`,
pres15_cand_Cam = Cambiemos,
pres15_cand_UNA = UNA,
pres15_cand_Otros = Otros,
pres15_blanco = votos_blanco,
pres15_validos = votos_validos) %>%
select(-year)
## save files
write_rds(pres15_circuito_long, "data/Pres_2015_circuito_long.RDS")
write_rds(pres15_circuito_wide, "data/Pres_2015_circuito_wide.RDS")
2. Prepare census data
The second part of the task consists of collecting census data at census radio level. There is an official platform to download census data called REDATAM, however, I found easier to use the pre-processed files generated by @jazzido which are in a very convenient format for this task.
- I use the packege
rvest
to identify all different files located in http://dump.jazzido.com/CNPHV2010-RADIO/. Each link on that web corresponds to a variable and within the variable dataset you have the counts for each category. For example, in a given census radio there might be 54 men and 43 women; these two variables will be found in the same table.
# B) Prepare 2010 census information
# B.1 census variables ------------------------------------------------------
rm(list= ls())
library(rvest)
source("source/source.R")
## scrape the urls (source: http://dump.jazzido.com/CNPHV2010-RADIO/)
<- "http://dump.jazzido.com/CNPHV2010-RADIO/"
url <- read_html(url)
url <- url %>%
url_list html_nodes("a") %>%
html_attr("href")
<- url_list[str_ends(url_list, ".csv")]
url_list <- url_list[!str_detect(url_list, "HOGAR.NHOG|VIVIENDA.V00|HOGAR.H15|PERSONA.P03")] url_list
- I create an index file called
prefix_list
to associate each link with a prefix variable name.
# Prepare a list of varnames
<- tibble(url_end = url_list,
prefix_list prefix = c("vi_cal_cons_", "vi_cal_servbas_", "vi_cal_mat_",
"vi_tipo_", "vi_nho_", "vi_urban_",
"vi_tipo_part_", "vi_tipo_ocupa_", "ho_nbi_",
"ho_suelo_", "ho_techo_", "ho_rev_int_",
"ho_agua_", "ho_agua_beber_", "ho_aseo_",
"ho_cadena_", "ho_desague_", "ho_combus_",
"ho_refri_", "ho_compu_",
"ho_celu_", "ho_fijo_", "ho_hacinam_",
"ho_nhogar_", "ho_prop_", "ho_npers_",
"per_siteco_", "per_edad_gru_",
"per_edad_quin_", "per_rela_jefe_",
"per_sexo_", "per_inmig_",
"per_leer_", "per_escuela_", "per_educa_",
"per_educa_completo_", "per_compu_"),
descr = c("Calidad constructiva de la vivienda",
"Calidad de Conexiones a Servicios Básicos",
"Calidad de los materiales",
"Tipo de vivienda agrupado",
"Cantidad de Hogares en la Vivienda",
"Area Urbano - Rural",
"Tipo de vivienda particular",
"Condición de ocupación",
"Al menos un indicador NBI",
"Material predominante de los pisos",
"Material predominante de la cubierta exterior del techo",
"Revestimiento interior o cielorraso del techo",
"Tenencia de agua",
"Procedencia del agua para beber y cocinar",
"Tiene baño / letrina",
"Tiene botón, cadena, mochila para limpieza del inodoro",
"Desagüe del inodoro",
"Baño / letrina de uso exclusivo",
"Combustible usado principalmente para cocinar",
"Heladera",
"Computadora",
"Teléfono celular",
"Teléfono de línea",
"Hacinamiento",
"Régimen de tenencia",
"Total de Personas en el Hogar",
"Condición de actividad",
"Edad en grandes grupos",
"Edades quinquenales",
"Relación o parentesco con el jefe(a) del hogar",
"Sexo",
"En que país nació",
"Sabe leer y escribir",
"Condición de asistencia escolar",
"Nivel educativo que cursa o cursó",
"Completó el nivel",
"Utiliza computadora"))
- I use a
map2()
function so for each pair of prefix and url fromprefix_list
I apply the functionread_arg_census()
which can be found in the source code script. Basically, this function takes the url, downloads the information, and rename the variables using the prefixes. The result is a list which contains the data extractions for each prefix and url. To convert that list in a data frame I use the fucntionpurrr::reduce()
that does a left join based on the radio id code with each element of the list.
## collect data and save
<- map2(prefix_list$url_end, prefix_list$prefix, ~ read_arg_census(end_url = .x, prefix = .y))
data_censo_radio
<- data_censo_radio %>%
data_censo_joint reduce(left_join, by = "radio")
write_rds(data_censo_joint, "data/data_censo_radio.RDS")
- To compute relative measures for each radio we need to extract the census totals of dwelling units, households, and persons. To do this I use another custom function
read_arg_totales()
.
## collect totals to compute proportions and save
<- map2(c("VIVIENDA-INCALCONS.csv", "VIVIENDA-URP.csv", "HOGAR-ALGUNBI.csv", "PERSONA-P02.csv"),
data_totales c("vi_", "vi_urban_", "ho_", "per_"), ~ read_arg_totales(end_url = .x, prefix = .y))
<- data_totales %>%
data_totales reduce(left_join, by = "radio")
write_rds(data_totales, "data/data_censo_totales_radio.RDS")
3. Create a lookups data frames to connect census data and electoral results
Once the electoral data and the census data are ready, it’s time to create a lookup data frame to establish how these two geographies combine.
3.1 Match census radios to electoral circuitos
The first step is to draw both maps to establish the level of overlap. The census geography and the electoral maps are loaded by the custom function intersect_polygons()
. In that function, both maps are compared using sf::st_intersection()
and the overlapping surface percentage is computed using sf::area()
. The full code of this function can be found in the source script.
# C) Create lookups to match census "radios" and electoral "circuitos"
rm(list= ls())
library(sf)
source("source/source.R")
# C.1. Generate map lookup to compute stats for circuitos ------------------
<- list.dirs("data/censo/") # shapes census
files_census <- files_census[2:3]
files_census <- list.dirs("data/circuitos/")[2:3] # shapes electoral
files_circuitos
<- map2(files_census, files_circuitos, intersect_polygons)
intersec_list <- reduce(intersec_list, bind_rows) %>%
censo_elec_lookup filter(por_radio > .001) # clean empty por_radio
write_rds(censo_elec_lookup, "data/censo_elec_lookup.RDS")
3.2 Generate a lookup of muni codes
Another problem is that the geographical IDs are different for the electoral records and for the census geography. In this case, the province codes, as we are only using two, BA and CABA are not problematic. However, the codes of the electoral sections/local authorities need to be matched. Unfortunately, I could not find a lookup so I had to create it myself.
I took a map from the 2015 PASO election which presents the polling stations including the INDEC and the election IDs. I combined both to see where they overlap and find out the correspondence between the INDEC and electoral codes. To do this I built the function gen_geo_lookup()
which can be found in the source script.
# C.2. Generate lookup of electoral sections ----------------------------------
<- map(.x = files_circuitos,
elec_lookup ~gen_geo_lookup(map1 = "data/paso2015/establecimientos.geojson",
map2 = .x)
)
<- reduce(elec_lookup, bind_rows)
lookup_elec_indec
<- lookup_elec_indec %>%
lookup_elec_indec group_by(id_seccion_elec) %>%
mutate(count_id_elec = n(), max_count = max(count)) %>%
filter(count_id_elec == 1 | count_id_elec > 1 & max_count == count) %>%
select(starts_with("id")) %>%
filter(id_seccion_elec != "16001")
write_rds(lookup_elec_indec, "data/elec_indec_lookup.RDS")
3.3 Create a single lookup file
Finally, I merge together the two data frames to build a single lookup dataset. This will be the base to match the electoral results and census data at the next stage.
# C.3. Create lookup file ---------------------------------------------------
# only for the circuito PASO 19
rm(list = ls())
<- read_rds("data/censo_elec_lookup.RDS")
lookup_censo <- read_rds("data/elec_indec_lookup.RDS")
lookup_elec_indec <- read_rds("data/PASO_2019_circuito_wide.RDS")
elec_paso19
<- lookup_censo %>%
lookup_all mutate(id_seccion_indec = str_sub(id_circuito, 1, 5)) %>%
left_join(lookup_elec_indec, by = "id_seccion_indec") %>%
mutate(id_circuito_elec = paste0(id_seccion_elec, str_sub(id_circuito, 6, 11))) %>%
left_join(elec_paso19, by = "id_circuito_elec") %>%
filter(!is.na(paso19_blanco)) %>%
select(-starts_with("paso19"))
write_rds(lookup_all, "data/all_lookup.RDS")
4. Create a joint file with census and electoral data
Once the two data sources, electoral and census data, and the lookup are ready, it is time to create the joint final data frame.
# D) Create final files
rm(list= ls())
# D.1. Load all files and join ----------------------------------------------
<- read_rds("data/data_censo_radio.RDS") %>%
censo rename(id_radio = radio)
<- read_rds("data/data_censo_totales_radio.RDS") %>%
totales rename(id_radio = radio)
<- read_rds("data/all_lookup.RDS") %>%
lkup mutate(por_radio = as.double(por_radio))
<- read_rds("data/PASO_2019_circuito_wide.RDS")
elec_paso19 <- read_rds("data/PASO_2015_circuito_wide.RDS")
elec_paso15 <- read_rds("data/Pres_2015_circuito_wide.RDS") elec_pres15
To merge the files, I start with the lookup, census data, and census totals. These three files are the base to estimate the census counts for each electoral circuito. Then, the census variables are multiplied by the column por_radio
which represents the proportion of overlap between the census radio and the electoral circuito. Following this, the file is grouped by id_circuito_elec
and all other variables are summed up. Finally, relative measures of the census variables are derived using the census totals.
## join files
<- reduce(list(lkup, censo, totales), left_join, by = "id_radio")
joint
## compute circuito stats
<- joint %>%
joint_circuito ungroup() %>%
mutate_at(vars(starts_with("vi"), starts_with("ho"), starts_with("per"), starts_with("TOTAL")), list(~ .*por_radio)) %>%
select(-por_radio, -id_radio, -id_circuito, -id_seccion_indec, -id_seccion_elec) %>%
group_by(id_circuito_elec) %>%
summarise_all(sum, na.rm = T) %>%
ungroup() %>%
mutate_at(vars(starts_with("vi"), starts_with("ho"), starts_with("per"), starts_with("TOTAL")), round, 0) %>%
mutate_at(vars(starts_with("vi_urban_")), list(~ ./vi_urban_TOTAL*100)) %>%
mutate_at(vars(starts_with("vi"), -starts_with("vi_urban_")), list(~ ./vi_TOTAL*100)) %>%
mutate_at(vars(starts_with("ho")), list(~ ./ho_TOTAL*100)) %>%
mutate_at(vars(starts_with("per")), list(~ ./per_TOTAL*100)) %>%
select(-ends_with("TOTAL"))
To finalise the base of the joint file, I merge the census data at circuito level with the election data and I compute the relative number of votes for each circuito.
## join circuito
<- reduce(list(joint_circuito, elec_paso19, elec_pres15, elec_paso15),
joint_circuito by = "id_circuito_elec")
left_join,
<- joint_circuito %>%
joint_circuito mutate_at(vars(starts_with("paso15_cand")), list(~ ./paso15_validos*100)) %>%
mutate_at(vars(starts_with("pres15_cand")), list(~ ./pres15_validos*100)) %>%
mutate_at(vars(starts_with("paso19_cand")), list(~ ./paso19_validos*100))
The last step consists of cleaning the final data frame. I select a set of variables that I specified in a csv. Finally, I save the versions of the wide file.
# D.2. Clean joint file ---------------------------------------------------
## get file with the census variables selected
<- read_csv2("data/varnames.csv")
varsel
## vars to select
<- varsel %>%
vars_keep filter(in_file == 1) %>%
pull(vars)
<- varsel %>%
varnames filter(in_file == 1) %>%
pull(new_name)
## select variables and rename
<- joint_circuito %>%
joint_circuito select(one_of(vars_keep))%>%
rename_all(~ varnames) %>%
mutate(province = case_when(
str_sub(id_circuito_elec, 1, 2) == "01" ~ "CABA",
str_sub(id_circuito_elec, 1, 2) == "02" ~ "Buenos Aires"
%>%
)) select(id_circuito_elec, province, everything())
## save final file
write_rds(joint_circuito, "arg_elec_censo_wide.RDS")
write_csv2(joint_circuito, "arg_elec_censo_wide.csv")
::write.xlsx(joint_circuito, "arg_elec_censo_wide.xlsx") openxlsx