
healthbR provides easy access to Brazilian public health survey data directly from R. The package downloads, caches, and processes data from official sources, returning clean, analysis-ready tibbles following tidyverse conventions.
Currently supported data sources:
Planned for future releases:
You can install the development version of healthbR from GitHub:
# install.packages("pak")
pak::pak("SidneyBissoli/healthbR")

library(healthbR)
# list available VIGITEL survey years
vigitel_years()
#> [1] 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
#> [16] 2021 2022 2023

# load data for a single year
df <- vigitel_data(2023)
# load data for multiple years
df <- vigitel_data(2021:2023)

# list variables available in a specific year
vigitel_variables(2023)
# get the data dictionary with variable descriptions
dict <- vigitel_dictionary()
# search for specific variables
dict |>
  dplyr::filter(stringr::str_detect(variable_name, "peso"))

VIGITEL uses complex survey sampling. Use the pesorake weight variable for proper inference:
library(dplyr)
library(srvyr)
# create survey design
vigitel_svy <- df |>
  as_survey_design(weights = pesorake)
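
To see what the pesorake weight changes in practice, the following base-R sketch compares unweighted and weighted proportions on a made-up data frame. The column names mirror those used in this README, but the values are invented and the helper `weighted_prevalence()` is hypothetical, not part of healthbR. It reproduces only the point estimate; standard errors still require the survey design object.

```r
# Toy data: NOT real VIGITEL records. Values are invented for illustration.
toy <- data.frame(
  cidade   = c(1, 1, 1, 2, 2, 2),
  diab     = c(1, 0, 0, 1, 1, 0),
  pesorake = c(0.5, 1.5, 1.0, 2.0, 0.5, 0.5)
)

# Hypothetical helper: weighted share of rows with diab == 1
weighted_prevalence <- function(d) {
  stats::weighted.mean(d$diab == 1, w = d$pesorake, na.rm = TRUE)
}

# One data frame per city, then one estimate per city
by_city <- split(toy, toy$cidade)

prev_weighted <- vapply(by_city, weighted_prevalence, numeric(1))
prev_unweighted <- vapply(
  by_city,
  function(d) mean(d$diab == 1, na.rm = TRUE),
  numeric(1)
)

# Side-by-side comparison of the two estimates
data.frame(
  cidade     = names(by_city),
  unweighted = prev_unweighted,
  weighted   = prev_weighted
)
```

In this toy data the two estimates differ within each city because respondents with diab == 1 carry different weights than the rest, which is the reason the weighted design is used for inference.
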
# calculate weighted prevalence
vigitel_svy |>
  group_by(cidade) |>
  summarize(
    prevalence = survey_mean(diab == 1, na.rm = TRUE),
    n = unweighted(n())
  )

healthbR offers three strategies for handling large datasets efficiently:
Convert Excel files to Parquet format for 10-20x faster loading:
# convert downloaded files to parquet (one-time operation)
vigitel_convert_to_parquet(2020:2023)
# subsequent loads are much faster
df <- vigitel_data(2020:2023)

Download multiple years simultaneously (requires optional packages):
# install optional packages for parallel processing
install.packages(c("furrr", "future"))
# uses furrr for parallel processing (2-4 workers)
df <- vigitel_data(2015:2023)

For very large datasets, use lazy evaluation to process data without loading everything into memory:
# returns an Arrow Dataset (not loaded into RAM)
df_lazy <- vigitel_data(2020:2023, lazy = TRUE)
# filter and select before collecting
result <- df_lazy |>
  dplyr::filter(cidade == 1) |>
  dplyr::select(q6, q8_anos, pesorake, diab, hart) |>
  dplyr::collect()

All data is downloaded from official Brazilian Ministry of Health repositories:
If you use healthbR in your research, please cite it:
citation("healthbR")

Contributions are welcome! Please open an issue to discuss proposed changes or submit a pull request.
Please note that the healthbR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
MIT © Sidney da Silva Pereira Bissoli