healthbR

Lifecycle: experimental

Overview

healthbR provides easy access to Brazilian public health survey data directly from R. The package downloads, caches, and processes data from official sources, returning clean, analysis-ready tibbles following tidyverse conventions.

Currently supported data sources:

VIGITEL (the survey used in all examples below)

Planned for future releases:

Installation

You can install the development version of healthbR from GitHub:

# install.packages("pak")
pak::pak("SidneyBissoli/healthbR")

Usage

Check available years

library(healthbR)

# list available VIGITEL survey years
vigitel_years()
#> [1] 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
#> [16] 2021 2022 2023

Download and load data

# load data for a single year
df <- vigitel_data(2023)

# load data for multiple years
df <- vigitel_data(2021:2023)
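Before requesting several years at once, it can help to compare the request with the years the package reports. The sketch below uses a hypothetical helper, check_years(), which is not part of healthbR. It takes the available years as a plain vector so that it runs without a download; in practice that vector would come from vigitel_years().

```r
# Hypothetical helper (not part of healthbR): split requested years into
# those that are available and those that are not.
check_years <- function(requested, available) {
  requested <- as.integer(requested)
  list(
    ok      = requested[requested %in% available],
    missing = requested[!requested %in% available]
  )
}

# Stand-in for vigitel_years(), copied from the output shown above.
available <- 2006:2023

res <- check_years(2004:2008, available)
res$ok       # years that can be requested
res$missing  # years to drop from the request
```

Only the years in res$ok would then be passed on to vigitel_data().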

Explore variables

# list variables available in a specific year
vigitel_variables(2023)

# get the data dictionary with variable descriptions
dict <- vigitel_dictionary()

# search for specific variables
dict |>
  dplyr::filter(stringr::str_detect(variable_name, "peso"))
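The same kind of search can be written in base R and extended to more than one column. The sketch below runs on a toy dictionary. Only the variable_name column appears in the example above; the description column, the search_dict() helper, and every row value are assumptions made for illustration.

```r
# Toy stand-in for the tibble returned by vigitel_dictionary().
# `description` and all row values are invented for this example.
dict_toy <- data.frame(
  variable_name = c("pesorake", "q6", "diab", "peso_kg"),
  description   = c("example weight", "example age",
                    "example diabetes flag", "example body weight"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive match on either column.
search_dict <- function(dict, pattern) {
  hit <- grepl(pattern, dict$variable_name, ignore.case = TRUE) |
    grepl(pattern, dict$description, ignore.case = TRUE)
  dict[hit, , drop = FALSE]
}

search_dict(dict_toy, "peso")  # matches on variable_name
search_dict(dict_toy, "age")   # matches on description only
```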

Survey analysis with srvyr

VIGITEL uses complex survey sampling. Use the pesorake weight variable for proper inference:

library(dplyr)
library(srvyr)

# create survey design
vigitel_svy <- df |>
  as_survey_design(weights = pesorake)

# calculate weighted prevalence
vigitel_svy |>
  group_by(cidade) |>
  summarize(
    prevalence = survey_mean(diab == 1, na.rm = TRUE),
    n = unweighted(n())
  )
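To make the point estimate concrete: for a weights-only design, the weighted prevalence is the sum of the weights of respondents with the condition divided by the total weight. The base R sketch below reproduces that arithmetic on invented data. The column names mirror those used above, the coding diab == 1 for "condition reported" is taken from the example above, and weighted_prev() is a hypothetical helper.

```r
# Toy data: column names follow the example above; all values are invented.
toy <- data.frame(
  cidade   = c(1, 1, 1, 2, 2, 2),
  diab     = c(1, 2, 2, 1, 1, 2),
  pesorake = c(2, 1, 1, 1, 2, 1)
)

# Weighted proportion: weight carried by diab == 1 over total weight.
weighted_prev <- function(d) {
  sum(d$pesorake * (d$diab == 1)) / sum(d$pesorake)
}

# One estimate per city.
prev_by_city <- sapply(split(toy, toy$cidade), weighted_prev)
prev_by_city
```

In this toy data the weighted estimates are 0.5 and 0.75, while the unweighted proportions would be 1/3 and 2/3. The sketch covers the point estimate only; standard errors depend on the survey design, which is what the srvyr pipeline handles.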

Performance optimization

healthbR offers three strategies for handling large datasets efficiently:

1. Parquet conversion

Convert Excel files to Parquet format for 10-20x faster loading:

# convert downloaded files to parquet (one-time operation)
vigitel_convert_to_parquet(2020:2023)

# subsequent loads are much faster
df <- vigitel_data(2020:2023)
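For readers unfamiliar with Parquet, the sketch below shows the general write-once, read-many idea using the arrow package directly. It is an illustration only and does not describe how vigitel_convert_to_parquet() is implemented; the toy data frame and its ano column are assumptions.

```r
# Illustration only: not the implementation of vigitel_convert_to_parquet().
toy <- data.frame(ano = c(2022L, 2023L), pesorake = c(1.2, 0.8))

if (requireNamespace("arrow", quietly = TRUE)) {
  path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(toy, path)    # one-time conversion
  back <- arrow::read_parquet(path)  # later loads read the Parquet file
  print(back)
}
```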

2. Parallel downloads

Download multiple years simultaneously (requires optional packages):

# install optional packages for parallel processing
install.packages(c("furrr", "future"))

# uses furrr for parallel processing (2-4 workers)
df <- vigitel_data(2015:2023)
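The general pattern behind parallel loading with furrr can be sketched as follows. This is not healthbR's internal code: load_one_year() is a hypothetical stand-in for a single-year download, and the worker count of 2 is chosen only for the example. The sketch falls back to a sequential loop when the optional packages are not installed.

```r
# Hypothetical per-year loader standing in for a single-year download.
load_one_year <- function(year) {
  data.frame(ano = year, n = 1L)
}

years <- 2021:2023

if (requireNamespace("furrr", quietly = TRUE) &&
    requireNamespace("future", quietly = TRUE)) {
  future::plan(future::multisession, workers = 2)  # two background R sessions
  parts <- furrr::future_map(years, load_one_year)
  future::plan(future::sequential)                 # release the workers
} else {
  parts <- lapply(years, load_one_year)            # sequential fallback
}

# Stack the per-year pieces into one data frame.
combined <- do.call(rbind, parts)
combined
```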

3. Lazy evaluation with Arrow

For very large datasets, use lazy evaluation to process data without loading everything into memory:

# returns an Arrow Dataset (not loaded into RAM)
df_lazy <- vigitel_data(2020:2023, lazy = TRUE)

# filter and select before collecting
result <- df_lazy |>
  dplyr::filter(cidade == 1) |>
  dplyr::select(q6, q8_anos, pesorake, diab, hart) |>
  dplyr::collect()
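The filter-then-collect pattern can be tried on a small local dataset with arrow and dplyr, without downloading anything. In the sketch below the toy data, the extra column, and the temporary directory are assumptions; only the rows and columns that survive filter() and select() are materialized when collect() is called.

```r
if (requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  # Toy data: column names follow the example above; values are invented.
  toy <- data.frame(
    cidade   = c(1L, 1L, 2L),
    diab     = c(1L, 2L, 1L),
    pesorake = c(1.5, 0.5, 2.0),
    extra    = c("a", "b", "c")
  )

  # Write the toy data as a Parquet dataset in a temporary directory.
  dir <- tempfile()
  arrow::write_dataset(toy, dir, format = "parquet")

  # Opening the dataset is lazy: no rows are read into memory yet.
  ds <- arrow::open_dataset(dir)

  result <- ds |>
    dplyr::filter(cidade == 1) |>
    dplyr::select(diab, pesorake) |>
    dplyr::collect()  # data is read into R only here

  print(result)
}
```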

Data sources

All data is downloaded from official Brazilian Ministry of Health repositories.

Citation

If you use healthbR in your research, please cite it:

citation("healthbR")

Contributing

Contributions are welcome! Please open an issue to discuss proposed changes or submit a pull request.

Code of Conduct

Please note that the healthbR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

MIT © Sidney da Silva Pereira Bissoli