This document outlines an approach to geocoding Brazilian polling stations that relies heavily on administrative datasets. In addition to detailing our approach, we provide evidence on the error of our method and compare it to the Google Maps Geocoding API.
Our general approach is to generate a set of candidate coordinates from a variety of administrative datasets. We use a machine learning model, trained on a subset of the data with coordinates provided by the Supreme Electoral Tribunal (TSE), to choose among the candidate coordinates. Inputs to this model are mostly measures of the quality of string matches between the polling station address and the administrative data sources, along with other characteristics of the address and municipality of the polling station. For each polling station, we select the candidate coordinates with the smallest predicted error.
To geocode the polling stations, we leverage the following main data sources:

- The CNEFE address databases from the 2010 Census, the 2017 Agricultural Census, and the 2022 Census.
- The INEP school catalog.
The CNEFE (Cadastro Nacional de Endereços para Fins Estatísticos) datasets are national address databases prepared by IBGE (the Brazilian Institute of Geography and Statistics) for the census and include detailed data on streets and addresses. The 2010 and 2022 versions include private addresses, as well as listings of government buildings (such as schools) and the names of local establishments (such as the names of schools or businesses). The 2017 version only includes agricultural properties. Addresses in rural census tracts (setores censitários) in the 2010 CNEFE have longitude and latitude, while all agricultural properties in the 2017 CNEFE are geocoded. The 2022 CNEFE geocodes all addresses.
The 2010 Census data did not include coordinates for addresses in urban census tracts. To partially overcome this issue, we compute the centroid of each urban census tract and assign this coordinate to every property in the tract, as sketched below. Because urban census tracts tend to be compact, the tract centroid should still be fairly close to the true coordinates. Nevertheless, this imputation step will lead to more error for urban addresses than for rural addresses when the chosen coordinate comes from the 2010 CNEFE data.
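Here is a minimal sketch of this imputation step. The object and column names (`tracts`, `cnefe`, `code_tract`, `lon`, `lat`) are hypothetical placeholders, not the project's actual code; it assumes `tracts` is an sf polygon layer of census tracts and `cnefe` is a table of 2010 CNEFE addresses.

```r
library(sf)
library(dplyr)

# Centroid coordinates of each census tract polygon.
coords <- st_coordinates(st_centroid(tracts))

tract_centroids <- tracts |>
  st_drop_geometry() |>
  mutate(tract_lon = coords[, 1], tract_lat = coords[, 2]) |>
  select(code_tract, tract_lon, tract_lat)

# Urban addresses lack coordinates in the 2010 CNEFE, so they
# inherit the centroid of their census tract.
cnefe_imputed <- cnefe |>
  left_join(tract_centroids, by = "code_tract") |>
  mutate(
    lon = coalesce(lon, tract_lon),
    lat = coalesce(lat, tract_lat)
  )
```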
The INEP data is a catalog of private and public schools with addresses, longitude, and latitude.
To geocode polling stations, we use fuzzy string matching to match polling stations to coordinates in the administrative datasets by name, address, street, or neighborhood. This string matching procedure generates several candidate coordinates. To choose among these candidates, we use a gradient boosted tree model trained on a sample of polling stations with coordinates provided by the election authorities.
The general approach is as follows:
1. Normalize the name and address of each polling station.[^1]
2. Normalize the addresses and school names in the administrative datasets.
3. Find the "medoid" (i.e., the point with the median latitude and longitude) of all unique streets and neighborhoods in the CNEFE datasets.
4. Compute the normalized Levenshtein string distance between each polling station name and the names of schools in the INEP and CNEFE data in the same municipality as the polling station (see the sketch after this list).
5. Compute the string distance between the address of each polling station and the addresses of schools in the INEP and CNEFE data.
6. Compute the string distance between the street name and neighborhood name of each polling station and the street and neighborhood names from the CNEFE datasets.
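The snippet below sketches the normalization and the normalized Levenshtein distance used in these steps, using the stringdist and stringr packages. The exact normalization rules and scaling in our pipeline may differ; `normalize_address` and `normalized_lev` are illustrative helpers, not the project's API.

```r
library(stringdist)
library(stringr)

# Illustrative normalization in the spirit of the procedure described in
# the footnote: lowercase, standardize abbreviations, drop uninformative words.
normalize_address <- function(x) {
  x |>
    str_to_lower() |>
    str_replace_all("\\bav\\b", "avenida") |>
    str_remove_all("\\b(povoado|localidade)\\b") |>
    str_squish()
}

# Levenshtein distance scaled by the length of the longer string,
# so 0 means identical strings and 1 means entirely different.
normalized_lev <- function(a, b) {
  stringdist(a, b, method = "lv") / pmax(nchar(a), nchar(b))
}

normalized_lev(
  "avenida oito de agosto centro",
  "avenida oito de agosto sn"
)
#> [1] 0.1724138  # close to the 0.17 shown in the example table below
```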
The string matching procedure above generates 12 different potential matches.
After string matching, we use a boosted tree model to predict the distance between each candidate coordinate and the true coordinates, treating the coordinates provided by the election authorities as the "ground truth". This distance is modeled as a function of covariates that mostly measure the quality of the string matches, along with other characteristics of the address and municipality of the polling station.
We use the implementation of the boosted tree model provided in the lightgbm package and use the tidymodels framework for preprocessing and training. We train our model on half of the polling stations with ground truth coordinates and use the other half for testing. We tune the hyperparameters of the model using adaptive resampling[^2] and 10-fold cross-validation. After tuning, we train the model on all polling stations with ground truth coordinates. We then use this model to predict the distance between the true coordinates and the candidate coordinates. For each polling station, we choose the candidate coordinate with the smallest predicted distance.
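A hedged sketch of this tuning and selection workflow follows. It uses the bonsai package for the lightgbm engine and finetune's racing functions for adaptive resampling; the data frame and column names (`train_df`, `dist_km`, `candidates`, `station_id`) are placeholders rather than the project's actual code.

```r
library(tidymodels)
library(bonsai)    # lightgbm engine for parsnip
library(finetune)  # adaptive resampling via racing

# Boosted tree regression on the candidate-vs-true distance.
spec <- boost_tree(trees = 1000, tree_depth = tune(),
                   learn_rate = tune(), min_n = tune()) |>
  set_engine("lightgbm") |>
  set_mode("regression")

wf <- workflow() |>
  add_formula(dist_km ~ .) |>
  add_model(spec)

set.seed(1)
folds <- vfold_cv(train_df, v = 10)

# Racing discards poor hyperparameter candidates early.
res <- tune_race_anova(wf, resamples = folds, grid = 20)

final_fit <- wf |>
  finalize_workflow(select_best(res, metric = "rmse")) |>
  fit(train_df)

# For each polling station, keep the candidate with the smallest
# predicted distance to the true coordinates.
best <- candidates |>
  bind_cols(predict(final_fit, candidates)) |>
  group_by(station_id) |>
  slice_min(.pred, n = 1, with_ties = FALSE) |>
  ungroup()
```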
To illustrate the string matching procedure, the table below shows the candidate matches for one polling station whose coordinates are known. "String Distance" is the normalized Levenshtein string distance between the address component and its potential match. "Predicted Error (km)" is the distance from the true coordinates predicted by the boosted tree model. The last column, "True Error (km)", is the distance between the known coordinates and the coordinates of the potential match. The selected match is the potential match with the smallest predicted error.
**Example of String Matching.** Polling station name: ESCOLA MUNICIPAL ACRE; polling station address: AVENIDA OITO DE AGOSTO.

Data | Polling Station String | Match | String Distance | Predicted Error (km) | True Error (km) |
---|---|---|---|---|---|
INEP School Name | acre | acre | 0.00 | 0.02 | 0.02 |
INEP School Address | avenida oito de agosto centro | ave 08 de agosto 196 predio escolar centro | 0.52 | 0.21 | 0.02 |
2010 CNEFE School Name | acre | acre | 0.00 | 0.16 | 0.08 |
2010 CNEFE School Address | avenida oito de agosto centro | avenida oito de agosto sn | 0.17 | 0.22 | 0.54 |
2022 CNEFE School Name | acre | acre | 0.00 | 0.02 | 0.01 |
2022 CNEFE School Address | avenida oito de agosto centro | avenida oito de agosto 196 | 0.21 | 0.03 | 0.01 |
2017 CNEFE Street | avenida oito de agosto | estrada de ferro | 0.68 | 14.49 | 31.51 |
2010 CNEFE Street | avenida oito de agosto | avenida oito de agosto | 0.00 | 0.49 | 0.24 |
2022 CNEFE Street | avenida oito de agosto | avenida oito de agosto | 0.00 | 0.35 | 0.28 |
2017 CNEFE Neighborhood | centro | assentamento | 0.67 | 23.63 | 16.22 |
2010 CNEFE Neighborhood | centro | centro | 0.00 | 0.30 | 0.36 |
2022 CNEFE Neighborhood | centro | centro | 0.00 | 0.35 | 0.31 |

*The selected match is the row with the smallest predicted error.*
To estimate the accuracy of our procedure, we use a subset of 21,688 polling stations with known coordinates to check the error of the coordinates generated by our procedure. These "ground truth" coordinates were provided by the TSE, which for 2018 has coordinates for 47,042 polling stations. We used the Google Maps Geocoding API (as of October 2020) to geocode a large subset of these polling stations and compare its error to ours. We report the error at three quantiles: the 25th, 50th (the median), and 75th percentiles.
In addition to reporting the error for the full sample, we also split the sample into polling stations located in rural versus urban census tracts, and we again compare our error to that of the coordinates generated using the Google Maps API. The sketch below illustrates how such error quantiles can be computed.
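This is a minimal sketch, assuming `eval_df` holds one row per polling station with predicted (`pred_lon`, `pred_lat`) and ground truth (`true_lon`, `true_lat`) coordinates; these names are placeholders.

```r
library(geosphere)

# Great-circle distance between predicted and true coordinates, in km.
err_km <- distHaversine(
  cbind(eval_df$pred_lon, eval_df$pred_lat),
  cbind(eval_df$true_lon, eval_df$true_lat)
) / 1000

# Error quantiles of the kind reported in the table below.
quantile(err_km, probs = c(0.25, 0.50, 0.75))
```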
**Geocoding Error**

Quantile | String Matching: Full Sample | String Matching: Urban | String Matching: Rural | Google Maps: Full Sample | Google Maps: Urban | Google Maps: Rural |
---|---|---|---|---|---|---|
25th | 0.01 | 0.01 | 0.02 | 0.05 | 0.03 | 1.22 |
Median | 0.04 | 0.03 | 0.22 | 0.31 | 0.14 | 6.81 |
75th | 0.44 | 0.17 | 2.83 | 3.85 | 0.56 | 20.17 |

*Error in kilometers.*
As the table above shows, the median error for our method is about 0.04 km, lower than the 0.31 km median error of coordinates produced by the Google Maps Geocoding API.
When we separate the sample into rural and urban polling stations, we see that our method is more accurate for both types. The difference is particularly large for rural polling stations, where our median error is 0.22 km and the Google Maps median error is 6.81 km, more than a 30-fold difference.
[^1]: We remove common but uninformative words, such as "povoado" and "localidade". We standardize common street abbreviations, such as replacing "Av" with "Avenida". Finally, for polling station names, we remove words most common in school names, such as "unidade escolar" and "colegio estadual". These are very common yet not used consistently and, as a result, are relatively uninformative. We found that removing them improves matching performance.
[^2]: See the finetune package reference materials for more information on adaptive resampling.