Merging election results and census data from Argentina

Methods
Merging election results and census data from the 2019 presidential election in Argentina
Author

Pablo

Published

September 27, 2019

Full script and data can be found here.

This data comes with no guarantee, please report any bug you may find.

The task

The Argentine general election will take place in October. Last August, after the mandatory simultaneous primaries elections (PASO) I tried to analyse the results alongside census data for the province of Buenos Aires (BA) and Ciudad de Buenos Aires (CABA). The election results aggregated at different levels (i.e. province, sección, circuito, polling station and ballot box) can be downloaded from the official web and the 2010 census data as well. However, the problems start here. In theory, electoral geography and census geography are two independent entites in Argentina. The census wards boundaries are drawn by INDEC, the national statistical office, while the electoral districts are delimited by judges. Most of the times, judges tend to use the census boundaries, as it is pointed in this tweet:

So far, two problems here:

  • Most of the time does not mean always; sometimes the census and electoral boundaries are not the same
  • There is not (or I couldn’t find) a public available lookup file to match census and electoral wards (which use different ID-coding systems just to make this task more fun!)

The task of merging the electoral and census datasets has four steps:

  1. Clean electoral data. Here I clean the data from the 2015 presidential PASO, the 2015 presidential election, and the 2019 presidential PASO. They are in long format in several files so I have to extrat the relevant information from each file and create two versions, long and wide, of the files for each election.

  2. Extract and clean census data. The census data at radio level was downloaded from this web using web scraping and then joint. Thanks to @manuelaristaran for his previous work gathering the radio level files of the 2010 census.

  3. Create a geographical lookup to match census radios to electoral circuitos. This is the most delicate part of the task. The census radios are smaller (measured as total area) than the electoral circuitos so, normally, several census radios will lie within the limits of a electoral circuito. The first objective is to determine the percentage of the area of the census radios that lie within the boundaries of a circuito. Therefore, to estimate the census characteristics of electoral circuitos, we will use the overlapping surface between census radios and electoral circuitos. The second objective is to create a lookup file to match the electoral and the INDEC geographical IDs.

  4. Join eletoral results and census. Finally, the electoral and census data are merged using the lookup file. The final file is stored and documented here.

Tools

To complete the task I will basically use the packages:

  • tidyverse: All you need for data wrangling.
  • rvest: This package for easy web scraping saves time so you don’t need to download the files one by one.
  • sf: I use this package to del with map objects. Here, I draw two layers of polygons (i.e. census and electoral geography) and find out the level of overlap.

1. Cleaning electoral data

In the first step I prepare the electoral data.

1.1 2019 PASO presidential election

The original files were downloaded from here as “|” separated. I used tidyverse::read_delim() to read-in the files. There are four data frames:

  • Candidate labels
  • Region labels
  • Electoral results totals (e.g. voters, blanks…)
  • Electoral results candidates

The objective is to create a summary file including the share of the valid vote for the two major candidates, the opositor Frente de Todos (FdT) - Alberto Fernández and the incumbement-likely-to-lose Juntos por el Cambio (JxC) - Mauricio Macri, and others. The valid vote needs to be aggregated at electoral circuito level.


## Load the files (source: https://www.resultados2019.gob.ar/)
paso19_cand_id <- read_delim("data/paso2019/descripcion_postulaciones.dsv", 
                             delim = "|") %>% 
  rename_all(str_to_lower) # cadidate labels
paso19_reg_id <- read_delim("data/paso2019/descripcion_regiones.dsv", 
                            delim = "|") %>% 
  rename_all(str_to_lower) # region labels
paso19_totals <- read_delim("data/paso2019/mesas_totales.dsv", 
                            delim = "|") %>% 
  rename_all(str_to_lower) # electoral totals
paso19_cand <- read_delim("data/paso2019/mesas_totales_agrp_politica.dsv", 
                          delim = "|") %>% 
  rename_all(str_to_lower) # electoral votes to candidates

The first step is to get the number of blank votes in each ballot box, the level at which these data frames are aggregate.

## Get blank votes to compute valid and percentages
paso19_totals_pres_blank <- paso19_totals %>% 
  filter(codigo_categoria == "000100000000000", # filter presidential election
         contador == "VB") %>% # filter blank vote
  select(codigo_mesa, valor) %>% 
  rename(votos_blanco = valor)

Second, prepare the labels for the regions and the parties to match with the results data in the next step.

## Prepare regional identifiers
paso19_reg_id <- paso19_reg_id %>% 
  mutate(codigo_distrito = codigo_region,
         codigo_seccion = codigo_region,
         codigo_circuito = codigo_region) %>% 
  rename(name = nombre_region)

## Select party names
paso19_cand_id <-  paso19_cand_id %>% 
  filter(codigo_categoria == "000100000000000") %>% 
  group_by(codigo_agrupacion) %>%
  summarise(nombre_agrupacion = first(nombre_agrupacion))

Third, I compute the sum of the votes to candidatures and add them to the blank votes to get the sum of valid votes. Note that in the PASO election people can vote for the candiates of one party and there might be more than one candidate per party. In theory, the main objective of this election is to select candiates within parties, however, in practice, it has become an election-dress-rehearsal and the parties just postulate one candidate.


## Votes to candidature to compute total valids
paso19_cand_pres <- paso19_cand %>% 
  filter(codigo_categoria == "000100000000000") %>% 
  group_by(codigo_mesa) %>% 
  mutate(votos_candidatura = sum(votos_agrupacion)) %>%
  ungroup() 

Fourth, the previous data frames are joint and the result is a semi-long dataset. This means that each row represents a candidature and its votes in a given ballot box, but each row also contains the total number of blank votes and valid votes for each ballot box.

## join files 
paso19_mesa <- paso19_cand_pres %>% 
  left_join(paso19_totals_pres_blank, by = "codigo_mesa") %>% 
  mutate(votos_validos = votos_candidatura + votos_blanco) %>%
  left_join(select(paso19_reg_id, codigo_distrito, name), by = "codigo_distrito") %>% 
  rename(name_distrito = name) %>% 
  left_join(select(paso19_reg_id, codigo_seccion, name), by = "codigo_seccion") %>% 
  rename(name_seccion = name) %>%
  left_join(paso19_cand_id, by = "codigo_agrupacion") %>% 
  select(codigo_distrito, name_distrito, codigo_seccion, 
         name_seccion, codigo_circuito, 
         codigo_mesa, votos_blanco, votos_validos, 
         codigo_agrupacion, nombre_agrupacion, 
         votos_agrupacion)

Fifth, the file is aggregated at circuito level.

# blank and valid aggregate
paso19_circuito_long_totals <- paso19_mesa %>%
group_by(codigo_circuito, codigo_mesa) %>% 
  summarise(votos_blanco = first(votos_blanco), 
            votos_validos = first(votos_validos)) %>% 
  ungroup() %>% 
  group_by(codigo_circuito) %>% 
  summarise(votos_blanco = sum(votos_blanco),
            votos_validos = sum(votos_validos))

# filter for CABA and BA and match with blank and valid
paso19_circuito_long <- paso19_mesa %>%
  select(-votos_blanco, -votos_validos) %>% 
  filter(codigo_distrito %in% c("01", "02")) %>% # filter CABA and BA
  mutate(partido = recode(nombre_agrupacion, 
                          "JUNTOS POR EL CAMBIO" = "Juntos por el Cambio",
                          "FRENTE DE TODOS" = "Frente de Todos",
                          .default = "Otros")) %>% 
  group_by(codigo_circuito, partido) %>% 
  summarise(votos_candidatura = sum(votos_agrupacion)) %>%
  left_join(paso19_circuito_long_totals, by = "codigo_circuito") %>% 
  rename(id_circuito_elec = codigo_circuito) %>% 
  mutate(year = 2019) %>% 
  select(year, id_circuito_elec, votos_blanco, votos_validos, partido, votos_candidatura)

Then, I change the resulting long file to a wide format so parties are in columns and each row represents a circuito. Finally, I save both, the long and the wide files.

## to wide format
paso19_circuito_wide <- spread(paso19_circuito_long, key = partido, value = votos_candidatura) %>% 
  rename(paso19_cand_FdT = `Frente de Todos`,
         paso19_cand_JxC = `Juntos por el Cambio`,
         paso19_cand_Otros = Otros,
         paso19_blanco = votos_blanco,  
         paso19_validos = votos_validos) %>% 
  select(-year)

## save file  
write_rds(paso19_circuito_long, "data/PASO_2019_circuito_long.RDS")
write_rds(paso19_circuito_wide, "data/PASO_2019_circuito_wide.RDS")

1.2 2015 PASO presidential election

These datasets come from a different source. Three csv’s are loaded, one for the province of BA, one for the CABA, and one containing the names of the candidatures.

paso15_cand_id <- read_csv2("data/paso2015/codigosbasicospaso2015provisional/FPARTIDOS.csv") %>% 
    rename_all(str_to_lower)
paso15_cand_caba <- read_csv2("data/paso2015/presidentepaso2015provisional/FMESPR_0101.csv") %>% 
    rename_all(str_to_lower)
paso15_cand_ba <- read_csv2("data/paso2015/presidentepaso2015provisional/FMESPR_0202.csv") %>% 
    rename_all(str_to_lower)

paso15_cand <- bind_rows(paso15_cand_ba, paso15_cand_caba) # create a joint file of BA and CABA

First, I compute the number of valid votes of each circuito.

  ## get blank votes
  paso15_blanks <- paso15_cand %>% 
    filter(`codigo votos` == 9004) %>% # code for blank votes
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`) %>%
    summarise(blancos = sum(as.integer(votos))) 
  
  ## get valid votes
  paso15_sum_votes <- paso15_cand %>% 
    filter(`codigo votos` < 9000) %>% 
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`) %>%
    summarise(candidaturas = sum(as.integer(votos)))
  
  paso15_valid <- left_join(paso15_sum_votes, paso15_blanks, 
                            by = c("codigo provincia", "codigo departamento", "codigo circuito"))

Second, I get the votes obtained by each of the candidatures. Each row corresponds to a party in a circuito.

  ## get cand votes
  paso15_cand_id <- paso15_cand_id %>% 
    mutate(`codigo votos` = as.numeric(codigo_partido)) %>% 
    select(-codigo_partido, -lista_interna, -agrupacion)
  
  paso15_votes <- paso15_cand %>% 
    filter(`codigo votos` < 9000) %>% # codes of candidatures are lower than 9k
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, `codigo votos`) %>%
    summarise(candidatura = sum(as.integer(votos))) %>% 
    left_join(paso15_cand_id, by = "codigo votos") %>% 
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, denominacion) %>%
    summarise(candidatura = sum(candidatura)) %>% 
    mutate(denominacion = recode(denominacion,
                            "ALIANZA CAMBIEMOS" = "Cambiemos",
                            "ALIANZA FRENTE PARA LA VICTORIA" = "Frente Para la Victoria" ,
                            "ALIANZA UNIDOS POR UNA NUEVA ALTERNATIVA (UNA)" =  "UNA",  
                            .default = "Otros")) %>% 
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, denominacion) %>%
    summarise(candidatura = sum(candidatura)) %>% 
    rename(partido = denominacion,
           votos_candidatura = candidatura) %>% 
    ungroup()

Third, I combine the votes data frame from the previous step and the valid votes to get a semi-long data frame.

  ## join files semi-long
  paso15_circuito_long <- paso15_votes %>%
    left_join(paso15_valid, by = c("codigo provincia", "codigo departamento", "codigo circuito")) %>% 
    mutate(year = 2015,
           ln_circuito = str_length(`codigo circuito`),
           zeroes = case_when(
             ln_circuito == 2 ~ "0000",
             ln_circuito == 3 ~ "000",
             ln_circuito == 4 ~ "00",
             ln_circuito == 5 ~ "0",
             ln_circuito == 6 ~ ""
           ),
           id_circuito_elec = paste0(`codigo provincia`, `codigo departamento`, zeroes, `codigo circuito`),
           votos_validos = blancos + candidaturas) %>% 
    rename(votos_blanco = blancos) %>% 
    select(year, id_circuito_elec, partido, votos_blanco, votos_validos, votos_candidatura)

Fourth, I switch the file to wide format and save the data frames.

  ## join files wide
  paso15_circuito_wide <- spread(paso15_circuito_long, key = partido, value = votos_candidatura) %>% 
    rename(paso15_cand_FPV = `Frente Para la Victoria`,
           paso15_cand_Cam = Cambiemos,
           paso15_cand_UNA = UNA,
           paso15_cand_Otros = Otros,
           paso15_blanco = votos_blanco,  
           paso15_validos = votos_validos) %>% 
    select(-year)
  
  ## join files wide
  write_rds(paso15_circuito_long, "data/PASO_2015_circuito_long.RDS")
  write_rds(paso15_circuito_wide, "data/PASO_2015_circuito_wide.RDS")

1.3 2015 presidential election

I repeat the process to generate two data frames of the 2015 first round presidental election.

  # A.3. Create a presi 2015 file -----------------------------------------
  
  pres15_cand_id <- read_csv2("data/paso2015/codigosbasicospaso2015provisional/FPARTIDOS.csv") %>% 
    rename_all(str_to_lower)
  pres15_cand_caba <- read_csv2("data/pres2015/FMESPR_0101.csv") %>% 
    rename_all(str_to_lower)
  pres15_cand_ba <- read_csv2("data/pres2015/FMESPR_0202.csv") %>% 
    rename_all(str_to_lower)
  
  pres15_cand <- bind_rows(pres15_cand_ba, pres15_cand_caba)

First, I compute the number of valid votes of each circuito.

  ## get blank votes
  pres15_blanks <- pres15_cand %>% 
    filter(`codigo votos` == 9004) %>% 
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`) %>%
    summarise(blancos = sum(as.integer(votos))) 
  
  ## get valid votes
  pres15_sum_votes <- pres15_cand %>% 
    filter(`codigo votos` < 9000) %>% 
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`) %>%
    summarise(candidaturas = sum(as.integer(votos)))
  
  pres15_valid <- left_join(pres15_sum_votes, pres15_blanks, by = c("codigo provincia", "codigo departamento", "codigo circuito"))

Second, I get the votes obtained by each of the candidatures. Each row corresponds to a party in a circuito.

  ## get cand votes
  pres15_cand_id <- pres15_cand_id %>% 
    mutate(`codigo votos` = as.numeric(codigo_partido)) %>% 
    select(-codigo_partido, -lista_interna, -agrupacion)
  
  pres15_votes <- pres15_cand %>% 
    filter(`codigo votos` < 9000) %>% 
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, `codigo votos`) %>%
    summarise(candidatura = sum(as.integer(votos))) %>% 
    mutate(`codigo votos` = as.numeric(`codigo votos`)) %>% 
    left_join(pres15_cand_id, by = "codigo votos") %>% 
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, denominacion) %>%
    summarise(candidatura = sum(candidatura)) %>% 
    mutate(denominacion = recode(denominacion,
                                 "ALIANZA CAMBIEMOS" = "Cambiemos",
                                 "ALIANZA FRENTE PARA LA VICTORIA" = "Frente Para la Victoria" ,
                                 "ALIANZA UNIDOS POR UNA NUEVA ALTERNATIVA (UNA)" =  "UNA", 
                                 .default = "Otros")) %>% 
    group_by(`codigo provincia`, `codigo departamento`, `codigo circuito`, denominacion) %>%
    summarise(candidatura = sum(candidatura)) %>% 
    rename(partido = denominacion,
           votos_candidatura = candidatura) %>% 
    ungroup()

Third, I combine the votes data frame from the previous step with the valid votes to get a semi-long data frame.

  ## join files semi-long
  pres15_circuito_long <- pres15_votes %>%
    left_join(pres15_valid, by = c("codigo provincia", "codigo departamento", "codigo circuito")) %>% 
    mutate(year = 2015,
           ln_circuito = str_length(`codigo circuito`),
           zeroes = case_when(
             ln_circuito == 2 ~ "0000",
             ln_circuito == 3 ~ "000",
             ln_circuito == 4 ~ "00",
             ln_circuito == 5 ~ "0",
             ln_circuito == 6 ~ ""
           ),
           id_circuito_elec = paste0(`codigo provincia`, `codigo departamento`, zeroes, `codigo circuito`),
           votos_validos = blancos + candidaturas) %>% 
    rename(votos_blanco = blancos) %>% 
    select(year, id_circuito_elec, partido, votos_blanco, votos_validos, votos_candidatura)

Fourth, I switch the file to wide format and save the files to use them later.

  ## transform to wide format
  pres15_circuito_wide <- spread(pres15_circuito_long, key = partido, value = votos_candidatura) %>% 
    rename(pres15_cand_FPV = `Frente Para la Victoria`,
           pres15_cand_Cam = Cambiemos,
           pres15_cand_UNA = UNA,
           pres15_cand_Otros = Otros,
           pres15_blanco = votos_blanco,  
           pres15_validos = votos_validos) %>% 
    select(-year)
  
  ## save files
  write_rds(pres15_circuito_long, "data/Pres_2015_circuito_long.RDS")
  write_rds(pres15_circuito_wide, "data/Pres_2015_circuito_wide.RDS")

2. Prepare census data

The second part of the task consists of collecting census data at census radio level. There is an official platform to download census data called REDATAM, however, I found easier to use the pre-processed files generated by @jazzido which are in a very convenient format for this task.

  1. I use the packege rvest to identify all different files located in http://dump.jazzido.com/CNPHV2010-RADIO/. Each link on that web corresponds to a variable and within the variable dataset you have the counts for each category. For example, in a given census radio there might be 54 men and 43 women; these two variables will be found in the same table.
  # B) Prepare 2010 census information
  
  # B.1 census variables ------------------------------------------------------
  
  rm(list= ls())
  library(rvest)
  source("source/source.R")
  
  ## scrape the urls (source: http://dump.jazzido.com/CNPHV2010-RADIO/)
  url <- "http://dump.jazzido.com/CNPHV2010-RADIO/"
  url <- read_html(url)
  url_list <- url %>% 
    html_nodes("a") %>% 
    html_attr("href")
  
  url_list <- url_list[str_ends(url_list, ".csv")] 
  url_list <- url_list[!str_detect(url_list, "HOGAR.NHOG|VIVIENDA.V00|HOGAR.H15|PERSONA.P03")]
  1. I create an index file called prefix_list to associate each link with a prefix variable name.
  # Prepare a list of varnames
  prefix_list <- tibble(url_end = url_list,
                        prefix = c("vi_cal_cons_", "vi_cal_servbas_", "vi_cal_mat_", 
                                   "vi_tipo_", "vi_nho_", "vi_urban_", 
                                   "vi_tipo_part_", "vi_tipo_ocupa_", "ho_nbi_", 
                                   "ho_suelo_", "ho_techo_", "ho_rev_int_", 
                                   "ho_agua_", "ho_agua_beber_", "ho_aseo_", 
                                   "ho_cadena_", "ho_desague_", "ho_combus_", 
                                   "ho_refri_", "ho_compu_", 
                                   "ho_celu_", "ho_fijo_", "ho_hacinam_", 
                                   "ho_nhogar_", "ho_prop_", "ho_npers_", 
                                   "per_siteco_", "per_edad_gru_", 
                                   "per_edad_quin_", "per_rela_jefe_", 
                                   "per_sexo_", "per_inmig_", 
                                   "per_leer_", "per_escuela_", "per_educa_", 
                                   "per_educa_completo_", "per_compu_"),
                        descr = c("Calidad constructiva de la vivienda",
                                  "Calidad de Conexiones a Servicios Básicos",
                                  "Calidad de los materiales",
                                  "Tipo de vivienda agrupado",
                                  "Cantidad de Hogares en la Vivienda",
                                  "Area Urbano - Rural",
                                  "Tipo de vivienda particular",
                                  "Condición de ocupación",
                                  "Al menos un indicador NBI",
                                  "Material predominante de los pisos",
                                  "Material predominante de la cubierta exterior del techo",
                                  "Revestimiento interior o cielorraso del techo",
                                  "Tenencia de agua",
                                  "Procedencia del agua para beber y cocinar",
                                  "Tiene baño / letrina",
                                  "Tiene botón, cadena, mochila para limpieza del inodoro",
                                  "Desagüe del inodoro",
                                  "Baño / letrina de uso exclusivo",
                                  "Combustible usado principalmente para cocinar",
                                  "Heladera",
                                  "Computadora",
                                  "Teléfono celular",
                                  "Teléfono de línea",
                                  "Hacinamiento",
                                  "Régimen de tenencia",
                                  "Total de Personas en el Hogar",
                                  "Condición de actividad",
                                  "Edad en grandes grupos",
                                  "Edades quinquenales",
                                  "Relación o parentesco con el jefe(a) del hogar",
                                  "Sexo",
                                  "En que país nació",
                                  "Sabe leer y escribir",
                                  "Condición de asistencia escolar",
                                  "Nivel educativo que cursa o cursó",
                                  "Completó el nivel",
                                  "Utiliza computadora"))
  1. I use a map2() function so for each pair of prefix and url from prefix_list I apply the function read_arg_census() which can be found in the source code script. Basically, this function takes the url, downloads the information, and rename the variables using the prefixes. The result is a list which contains the data extractions for each prefix and url. To convert that list in a data frame I use the fucntion purrr::reduce() that does a left join based on the radio id code with each element of the list.
  ## collect data and save
  data_censo_radio <-  map2(prefix_list$url_end, prefix_list$prefix, ~ read_arg_census(end_url = .x, prefix = .y))
  
  data_censo_joint <- data_censo_radio %>% 
    reduce(left_join, by = "radio")
  
  write_rds(data_censo_joint, "data/data_censo_radio.RDS")
  1. To compute relative measures for each radio we need to extract the census totals of dwelling units, households, and persons. To do this I use another custom function read_arg_totales().
  ## collect totals to compute proportions and save
  data_totales <- map2(c("VIVIENDA-INCALCONS.csv", "VIVIENDA-URP.csv", "HOGAR-ALGUNBI.csv", "PERSONA-P02.csv"), 
                       c("vi_", "vi_urban_", "ho_", "per_"), ~ read_arg_totales(end_url = .x, prefix = .y))
  
  data_totales <- data_totales %>% 
    reduce(left_join, by = "radio")
  
  write_rds(data_totales, "data/data_censo_totales_radio.RDS")

3. Create a lookups data frames to connect census data and electoral results

Once the electoral data and the census data are ready, it’s time to create a lookup data frame to establish how these two geographies combine.

3.1 Match census radios to electoral circuitos

The first step is to draw both maps to establish the level of overlap. The census geography and the electoral maps are loaded by the custom function intersect_polygons(). In that function, both maps are compared using sf::st_intersection() and the overlapping surface percentage is computed using sf::area(). The full code of this function can be found in the source script.

  # C) Create lookups to match census "radios" and electoral "circuitos"
  
  rm(list= ls())
  library(sf)
  source("source/source.R")
  
  # C.1. Generate map lookup to compute stats for circuitos ------------------
  files_census <- list.dirs("data/censo/") # shapes census 
  files_census <- files_census[2:3] 
  files_circuitos <- list.dirs("data/circuitos/")[2:3] # shapes electoral
  
  intersec_list <- map2(files_census, files_circuitos, intersect_polygons) 
  censo_elec_lookup <- reduce(intersec_list, bind_rows) %>% 
    filter(por_radio > .001) # clean empty por_radio
  
  write_rds(censo_elec_lookup, "data/censo_elec_lookup.RDS")

3.2 Generate a lookup of muni codes

Another problem is that the geographical IDs are different for the electoral records and for the census geography. In this case, the province codes, as we are only using two, BA and CABA are not problematic. However, the codes of the electoral sections/local authorities need to be matched. Unfortunately, I could not find a lookup so I had to create it myself.

I took a map from the 2015 PASO election which presents the polling stations including the INDEC and the election IDs. I combined both to see where they overlap and find out the correspondence between the INDEC and electoral codes. To do this I built the function gen_geo_lookup() which can be found in the source script.

  # C.2. Generate lookup of electoral sections ----------------------------------
  elec_lookup <- map(.x = files_circuitos, 
                     ~gen_geo_lookup(map1 = "data/paso2015/establecimientos.geojson", 
                                     map2 = .x)
                     )  
  
  lookup_elec_indec <- reduce(elec_lookup, bind_rows)
  
  lookup_elec_indec <- lookup_elec_indec %>%
  group_by(id_seccion_elec) %>%
  mutate(count_id_elec = n(), max_count = max(count)) %>%
  filter(count_id_elec == 1 | count_id_elec > 1 & max_count == count) %>%
    select(starts_with("id")) %>% 
    filter(id_seccion_elec != "16001") 
  
  write_rds(lookup_elec_indec, "data/elec_indec_lookup.RDS")

3.3 Create a single lookup file

Finally, I merge together the two data frames to build a single lookup dataset. This will be the base to match the electoral results and census data at the next stage.

  # C.3. Create lookup file ---------------------------------------------------
  
  # only for the circuito PASO 19
  rm(list = ls())
  
  lookup_censo <- read_rds("data/censo_elec_lookup.RDS")
  lookup_elec_indec <- read_rds("data/elec_indec_lookup.RDS")
  elec_paso19 <- read_rds("data/PASO_2019_circuito_wide.RDS") 
  
  lookup_all <- lookup_censo %>% 
    mutate(id_seccion_indec = str_sub(id_circuito, 1, 5)) %>% 
    left_join(lookup_elec_indec, by = "id_seccion_indec") %>% 
    mutate(id_circuito_elec = paste0(id_seccion_elec, str_sub(id_circuito, 6, 11))) %>% 
    left_join(elec_paso19, by = "id_circuito_elec") %>% 
    filter(!is.na(paso19_blanco)) %>% 
    select(-starts_with("paso19"))
  
  write_rds(lookup_all, "data/all_lookup.RDS")

4. Create a joint file with census and electoral data

Once the two data sources, electoral and census data, and the lookup are ready, it is time to create the joint final data frame.

  # D) Create final files
  
  rm(list= ls())
  
  # D.1. Load all files and join ----------------------------------------------
  
  censo <- read_rds("data/data_censo_radio.RDS") %>% 
    rename(id_radio = radio)
  totales <- read_rds("data/data_censo_totales_radio.RDS") %>% 
    rename(id_radio = radio)
  lkup <- read_rds("data/all_lookup.RDS") %>% 
    mutate(por_radio = as.double(por_radio))
  elec_paso19 <- read_rds("data/PASO_2019_circuito_wide.RDS")
  elec_paso15 <- read_rds("data/PASO_2015_circuito_wide.RDS")
  elec_pres15 <- read_rds("data/Pres_2015_circuito_wide.RDS")

To merge the files, I start with the lookup, census data, and census totals. These three files are the base to estimate the census counts for each electoral circuito. Then, the census variables are multiplied by the column por_radio which represents the proportion of overlap between the census radio and the electoral circuito. Following this, the file is grouped by id_circuito_elec and all other variables are summed up. Finally, relative measures of the census variables are derived using the census totals.

## join files
  joint <- reduce(list(lkup, censo, totales), left_join, by = "id_radio")  

## compute circuito stats
  joint_circuito <- joint %>%
    ungroup() %>% 
    mutate_at(vars(starts_with("vi"), starts_with("ho"), starts_with("per"), starts_with("TOTAL")), list(~ .*por_radio)) %>% 
    select(-por_radio, -id_radio, -id_circuito, -id_seccion_indec, -id_seccion_elec) %>% 
    group_by(id_circuito_elec) %>% 
    summarise_all(sum, na.rm = T) %>% 
    ungroup() %>% 
    mutate_at(vars(starts_with("vi"), starts_with("ho"), starts_with("per"), starts_with("TOTAL")), round, 0) %>%
    mutate_at(vars(starts_with("vi_urban_")), list(~ ./vi_urban_TOTAL*100)) %>%
    mutate_at(vars(starts_with("vi"), -starts_with("vi_urban_")), list(~ ./vi_TOTAL*100)) %>%
    mutate_at(vars(starts_with("ho")), list(~ ./ho_TOTAL*100)) %>%
    mutate_at(vars(starts_with("per")), list(~ ./per_TOTAL*100)) %>% 
    select(-ends_with("TOTAL"))

To finalise the base of the joint file, I merge the census data at circuito level with the election data and I compute the relative number of votes for each circuito.

  ## join circuito
  joint_circuito <- reduce(list(joint_circuito, elec_paso19, elec_pres15, elec_paso15), 
                           left_join, by = "id_circuito_elec")
    
  joint_circuito <- joint_circuito %>% 
    mutate_at(vars(starts_with("paso15_cand")), list(~ ./paso15_validos*100)) %>% 
    mutate_at(vars(starts_with("pres15_cand")), list(~ ./pres15_validos*100)) %>% 
    mutate_at(vars(starts_with("paso19_cand")), list(~ ./paso19_validos*100)) 

The last step consists of cleaning the final data frame. I select a set of variables that I specified in a csv. Finally, I save the versions of the wide file.

  # D.2. Clean joint file ---------------------------------------------------
  
  ## get file with the census variables selected
  varsel <- read_csv2("data/varnames.csv")
  
  ## vars to select
  vars_keep <- varsel %>% 
    filter(in_file == 1) %>% 
    pull(vars)
  
  varnames <- varsel %>% 
    filter(in_file == 1) %>% 
    pull(new_name)
  
  ## select variables and rename
  joint_circuito <- joint_circuito %>% 
    select(one_of(vars_keep))%>% 
    rename_all(~ varnames) %>% 
    mutate(province = case_when(
      str_sub(id_circuito_elec, 1, 2) == "01" ~ "CABA",
      str_sub(id_circuito_elec, 1, 2) == "02" ~ "Buenos Aires"
    )) %>% 
    select(id_circuito_elec, province, everything())
  
  ## save final file
  write_rds(joint_circuito, "arg_elec_censo_wide.RDS")
  write_csv2(joint_circuito, "arg_elec_censo_wide.csv")
  openxlsx::write.xlsx(joint_circuito, "arg_elec_censo_wide.xlsx")