This document outlines an approach to geocoding Brazilian polling stations that relies heavily on administrative datasets. In addition to detailing our approach, we provide evidence on the error of our method and compare it to the Google Maps Geocoding API.
Our general approach is to generate a set of candidate coordinates from a variety of administrative datasets. We use a machine learning model, trained on a subset of the data with coordinates provided by the Supreme Electoral Tribunal (TSE), to choose among the candidate coordinates. Inputs to this model are mostly measures of the quality of string matches between the polling station address and the administrative data sources, along with other characteristics of the address and municipality of the polling station. For each polling station, we select the candidate coordinates with the smallest predicted error.
To geocode the polling stations, we leverage the following main data sources:

- The CNEFE address databases from the 2010 Census, the 2017 Agricultural Census, and the 2022 Census.
- The INEP school catalog.
The CNEFE (Cadastro Nacional de Endereços para Fins Estatísticos) datasets are national address databases prepared by IBGE (the Brazilian Institute of Geography and Statistics) for the census and include detailed data on streets and addresses. The 2010 and 2022 versions include private addresses, as well as listings of government buildings (such as schools) and the names of local establishments (such as the names of schools or businesses). The 2017 version only includes agricultural properties. Addresses in rural census tracts (setores censitários) in the 2010 CNEFE have longitude and latitude, while all agricultural properties in the 2017 CNEFE are geocoded. The 2022 CNEFE geocodes all addresses.
The 2010 Census data did not include coordinates for addresses in urban census tracts. To partially overcome this issue, we compute the centroid of each urban census tract and assign this coordinate to every property in the tract, as sketched below. Because urban census tracts tend to be compact, the tract centroid should still be fairly close to the true coordinates. Nevertheless, this imputation step will lead to more error for urban addresses than for rural addresses when the chosen coordinate comes from the 2010 CNEFE data.
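Here is a minimal sketch of this imputation step. The object and column names (`tracts`, `cnefe`, `code_tract`, `lon`, `lat`) are hypothetical placeholders, not the project's actual code; it assumes `tracts` is an sf polygon layer of census tracts and `cnefe` is a table of 2010 CNEFE addresses.

```r
library(sf)
library(dplyr)

# Centroid coordinates of each census tract polygon.
coords <- st_coordinates(st_centroid(tracts))

tract_centroids <- tracts |>
  st_drop_geometry() |>
  mutate(tract_lon = coords[, 1], tract_lat = coords[, 2]) |>
  select(code_tract, tract_lon, tract_lat)

# Urban addresses lack coordinates in the 2010 CNEFE, so they
# inherit the centroid of their census tract.
cnefe_imputed <- cnefe |>
  left_join(tract_centroids, by = "code_tract") |>
  mutate(
    lon = coalesce(lon, tract_lon),
    lat = coalesce(lat, tract_lat)
  )
```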
The INEP data is a catalog of private and public schools with addresses, longitude, and latitude.
To geocode polling stations, we use fuzzy string matching to match polling stations to coordinates in the administrative datasets by name, address, street, or neighborhood. This string matching procedure generates several candidate coordinates. To choose among these candidates, we use a gradient boosted tree model trained on a sample of polling stations with coordinates provided by the election authorities.
The general approach is as follows:
1. Normalize the name and address of each polling station.[^1]
2. Normalize the addresses and school names in the administrative datasets.
3. Find the "medoid" (i.e., the point with the median latitude and longitude) of all unique streets and neighborhoods in the CNEFE datasets.
4. Compute the normalized Levenshtein string distance between each polling station name and the names of schools in the INEP and CNEFE data in the same municipality as the polling station (see the sketch after this list).
5. Compute the string distance between the address of each polling station and the addresses of schools in the INEP and CNEFE data.
6. Compute the string distance between the street name and neighborhood name of each polling station and the street and neighborhood names from the CNEFE datasets.
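The snippet below sketches the normalization and the normalized Levenshtein distance used in these steps, using the stringdist and stringr packages. The exact normalization rules and scaling in our pipeline may differ; `normalize_address` and `normalized_lev` are illustrative helpers, not the project's API.

```r
library(stringdist)
library(stringr)

# Illustrative normalization in the spirit of the procedure described in
# the footnote: lowercase, standardize abbreviations, drop uninformative words.
normalize_address <- function(x) {
  x |>
    str_to_lower() |>
    str_replace_all("\\bav\\b", "avenida") |>
    str_remove_all("\\b(povoado|localidade)\\b") |>
    str_squish()
}

# Levenshtein distance scaled by the length of the longer string,
# so 0 means identical strings and 1 means entirely different.
normalized_lev <- function(a, b) {
  stringdist(a, b, method = "lv") / pmax(nchar(a), nchar(b))
}

normalized_lev(
  "avenida oito de agosto centro",
  "avenida oito de agosto sn"
)
#> [1] 0.1724138  # close to the 0.17 shown in the example table below
```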
The string matching procedure above generates 12 different potential matches.
After string matching, we use a boosted tree model to predict the distance between each candidate coordinate and the true coordinates, treating the coordinates provided by the election authorities as the "ground truth". This distance is modeled as a function of covariates that mostly measure the quality of the string matches, along with other characteristics of the address and municipality of the polling station.
We use the implementation of the boosted tree model provided in the lightgbm package and use the tidymodels framework for preprocessing and training. We train our model on half of the polling stations with ground truth coordinates and use the other half for testing. We tune the hyperparameters of the model using adaptive resampling[^2] and 10-fold cross-validation. After tuning, we train the model on all polling stations with ground truth coordinates. We then use this model to predict the distance between the true coordinates and the candidate coordinates. For each polling station, we choose the candidate coordinate with the smallest predicted distance.
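A hedged sketch of this tuning and selection workflow follows. It uses the bonsai package for the lightgbm engine and finetune's racing functions for adaptive resampling; the data frame and column names (`train_df`, `dist_km`, `candidates`, `station_id`) are placeholders rather than the project's actual code.

```r
library(tidymodels)
library(bonsai)    # lightgbm engine for parsnip
library(finetune)  # adaptive resampling via racing

# Boosted tree regression on the candidate-vs-true distance.
spec <- boost_tree(trees = 1000, tree_depth = tune(),
                   learn_rate = tune(), min_n = tune()) |>
  set_engine("lightgbm") |>
  set_mode("regression")

wf <- workflow() |>
  add_formula(dist_km ~ .) |>
  add_model(spec)

set.seed(1)
folds <- vfold_cv(train_df, v = 10)

# Racing discards poor hyperparameter candidates early.
res <- tune_race_anova(wf, resamples = folds, grid = 20)

final_fit <- wf |>
  finalize_workflow(select_best(res, metric = "rmse")) |>
  fit(train_df)

# For each polling station, keep the candidate with the smallest
# predicted distance to the true coordinates.
best <- candidates |>
  bind_cols(predict(final_fit, candidates)) |>
  group_by(station_id) |>
  slice_min(.pred, n = 1, with_ties = FALSE) |>
  ungroup()
```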
To illustrate the string matching procedure, the table below shows the candidate matches for one polling station whose coordinates are known. "String Distance" is the normalized Levenshtein string distance between the address component and its potential match. "Predicted Error (km)" is the distance from the true coordinates predicted by the boosted tree model. The last column, "True Error (km)", is the distance between the known coordinates and the coordinates of the potential match. The selected match is the potential match with the smallest predicted error.
**Example of String Matching.** Polling station name: ESCOLA MUNICIPAL ACRE; polling station address: AVENIDA OITO DE AGOSTO.

Data | Polling Station String | Match | String Distance | Predicted Error (km) | True Error (km) |
---|---|---|---|---|---|
INEP School Name | acre | acre | 0.00 | 0.02 | 0.02 |
INEP School Address | avenida oito de agosto centro | ave 08 de agosto 196 predio escolar centro | 0.52 | 0.21 | 0.02 |
2010 CNEFE School Name | acre | acre | 0.00 | 0.16 | 0.08 |
2010 CNEFE School Address | avenida oito de agosto centro | avenida oito de agosto sn | 0.17 | 0.22 | 0.54 |
2022 CNEFE School Name | acre | acre | 0.00 | 0.02 | 0.01 |
2022 CNEFE School Address | avenida oito de agosto centro | avenida oito de agosto 196 | 0.21 | 0.03 | 0.01 |
2017 CNEFE Street | avenida oito de agosto | estrada de ferro | 0.68 | 14.49 | 31.51 |
2010 CNEFE Street | avenida oito de agosto | avenida oito de agosto | 0.00 | 0.49 | 0.24 |
2022 CNEFE Street | avenida oito de agosto | avenida oito de agosto | 0.00 | 0.35 | 0.28 |
2017 CNEFE Neighborhood | centro | assentamento | 0.67 | 23.63 | 16.22 |
2010 CNEFE Neighborhood | centro | centro | 0.00 | 0.30 | 0.36 |
2022 CNEFE Neighborhood | centro | centro | 0.00 | 0.35 | 0.31 |

*The selected match is the row with the smallest predicted error.*
To estimate the accuracy of our procedure, we use a subset of 21,688 polling stations with known coordinates to check the error of the coordinates generated by our procedure. These "ground truth" coordinates were provided by the TSE, which for 2018 has coordinates for 47,042 polling stations. We used the Google Maps Geocoding API (as of October 2020) to geocode a large subset of these polling stations and compare its error to ours. We report the error at three quantiles: the 25th, 50th (the median), and 75th percentiles.
In addition to reporting the error for the full sample, we also split the sample into polling stations located in rural versus urban census tracts, and we again compare our error to that of the coordinates generated using the Google Maps API. The sketch below illustrates how such error quantiles can be computed.
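This is a minimal sketch, assuming `eval_df` holds one row per polling station with predicted (`pred_lon`, `pred_lat`) and ground truth (`true_lon`, `true_lat`) coordinates; these names are placeholders.

```r
library(geosphere)

# Great-circle distance between predicted and true coordinates, in km.
err_km <- distHaversine(
  cbind(eval_df$pred_lon, eval_df$pred_lat),
  cbind(eval_df$true_lon, eval_df$true_lat)
) / 1000

# Error quantiles of the kind reported in the table below.
quantile(err_km, probs = c(0.25, 0.50, 0.75))
```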
**Geocoding Error**

Quantile | String Matching: Full Sample | String Matching: Urban | String Matching: Rural | Google Maps: Full Sample | Google Maps: Urban | Google Maps: Rural |
---|---|---|---|---|---|---|
25th | 0.01 | 0.01 | 0.02 | 0.05 | 0.03 | 1.22 |
Median | 0.04 | 0.03 | 0.22 | 0.31 | 0.14 | 6.81 |
75th | 0.44 | 0.17 | 2.83 | 3.85 | 0.56 | 20.17 |

*Error in kilometers.*
As the table above shows, the median error for our method is about 0.04 km, lower than the 0.31 km median error of coordinates produced by the Google Maps Geocoding API.
When we separate the sample into rural and urban polling stations, we see that our method is more accurate for both types. The difference is particularly large for rural polling stations, where our median error is 0.22 km and the Google Maps median error is 6.81 km, more than a 30-fold difference.
[^1]: We remove common but uninformative words, such as "povoado" and "localidade". We standardize common street abbreviations, such as replacing "Av" with "Avenida". Finally, for polling station names, we remove words most common in school names, such as "unidade escolar" and "colegio estadual". These are very common yet not used consistently and, as a result, are relatively uninformative. We found that removing them improves matching performance.
[^2]: See the finetune package reference materials for more information on adaptive resampling.