get_ideb() has a new signature:
get_ideb(level, stage, metric, year, quiet). The old
positional usage get_ideb(year, level, stage) still works
with a deprecation warning, but the year parameter now
filters IDEB editions instead of selecting which file to download.

get_ideb() now returns data in tidy long format instead
of wide format. Output columns depend on the metric
parameter ("indicador", "aprovacao",
"nota", "meta").

get_ideb() now supports 5 geographic levels:
"escola", "municipio", "estado",
"regiao", and "brasil" (previously only escola
and municipio).

get_ideb() always downloads the most recent IDEB file
available, which contains the full historical series. The
year parameter filters editions.

get_ideb_series() is deprecated. Use
get_ideb(level, stage, metric) instead.

list_ideb_available() now returns level,
stage, and metric columns (previously returned
year, level, stage).

uf parameter has been removed from
get_ideb(). Filter the result with
dplyr::filter() instead.

download_inep_file() timeout is now configurable via
options(educabR.download_timeout = N) (seconds; default
600). Raise it when downloading large microdata (e.g. ENEM
participantes at ~1.6 GB) over a slow link (issue #7).

read_inep_file() now warns before reading files larger
than 500 MB entirely into memory, suggesting n_max or UF
filters to reduce memory pressure. Suppressible with
quiet = TRUE; all get_*() callers propagate
their quiet argument (issue #5).

get_ideb() no longer consumes several GB of RAM for
school-level reads (issue #1). The xlsx is now read with column
projection: only the vl_* columns matching the requested
metric (and year, when given) are parsed; the
others are skipped at the readxl C++ layer. INEP’s NA tokens
("", "-", "ND") are also passed
to read_excel(na = ...) so the missing-value strings never
get allocated as R character vectors. For
level = "escola", stage = "anos_iniciais", metric = "indicador",
this cuts the in-memory result from ~133 MB to ~37 MB (4 years) or ~19
MB (1 year), with proportional drops in peak memory during reshape.

Strings returned by the Excel readers (read_ideb_excel(), read_excel_safe()) and the
FUNDEB enrollment OData fetcher (fetch_fundeb_enrollment())
are now normalized to UTF-8 NFC, matching the behavior already in
read_inep_file(). Previously, equality comparisons against
literals such as filter(rede == "Pública") could silently
return zero rows on Windows because the source-file encoding produced
non-canonical strings. The shared helper
normalize_utf8_nfc() is now applied at every read
entrypoint so all four code paths agree. Affects
get_ideb(), get_cpc(), get_igc(),
get_fundeb_enrollment().

read_excel_safe() (used by get_cpc() and
get_igc()) now passes INEP’s missing-value tokens
("", "-", "ND", en/em dashes) to
readxl::read_excel(na = ...) so those cells are loaded as
NA instead of character strings cleaned up post-hoc (issue
#4). Previously, a column whose first rows were all "-"
could be inferred as logical and later numeric values
silently dropped. clean_dash_values() remains as a safety
net but is now largely redundant for CPC/IGC.

download_inep_file() now verifies downloaded files
before caching them (issue #3). Three checks run after the bytes hit
disk: file size against the server’s Content-Length (1%
tolerance, catches truncated downloads), HTML-masquerade detection on
the first 64 bytes (catches INEP maintenance pages served with HTTP
200), and ZIP magic-bytes (PK\x03\x04) for
.zip destinations (catches proxy corruption). On any
failure the corrupt file is deleted and the user gets a clear error
telling them to retry, instead of a cryptic readxl /
readr failure on the next call.

validate_year() now rejects vectors and non-numeric
input with a clear error pointing at purrr::map_dfr() for
multi-year composition (issue #2). Previously, passing
c(2017, 2019) to any of the 13 affected getters
(get_cpc, get_idd, get_igc,
get_capes, get_saeb, get_enem,
get_enem_itens, get_enade,
get_encceja, get_fundeb_distribution,
get_fundeb_enrollment, get_censo_escolar,
get_censo_superior) hit either a cryptic
length > 1 error (R ≥ 4.2) or silently used only the
first element (R < 4.2). get_ideb() is unaffected — it
intentionally accepts year vectors.

extract_zip() cleaned up: removed dead
if (TRUE) branch and an unreachable
cli_abort(); the muffle on extraction warnings was
tightened from the broad erro|error pattern to the two
specific messages that motivated it (issue #6).

RoxygenNote bumped to 8.0.0 and man/*.Rd
regenerated; systemfonts and textshaping
declared in Suggests: to silence the cosmetic
R CMD check NOTE about packages pulled transitively by
pkgdown.

R CMD check warnings cleared: em-dashes in
cli_abort() message strings in
R/utils-download.R are now written with Unicode escapes (R
requires ASCII-only in code strings; comments are exempt).

available_years() now dynamically discovers available
years by querying data sources (HEAD requests for INEP, OData queries
for FNDE). Results are cached per session. Falls back to a hardcoded
list when offline.

available_years() now accepts
"fundeb_enrollment" as a separate dataset name. Previously,
"fundeb" was shared between distribution and
enrollment.

Code columns (CO_*, CD_*) are now read as
character instead of numeric across all datasets. This prevents loss of
leading zeros in codes like municipality codes, course codes, and
institution codes.

Fixed get_enade() failing for 9 of 19 available years.
INEP uses inconsistent URLs for ENADE: _LGPD suffix for
2012-2019, .rar format for 2022. Added hardcoded URL map
(enade_urls) with all 19 correct URLs.

Fixed get_fundeb_enrollment() accepting years with no
data in the FNDE API. The API currently only has data for
2017-2018.

Fixed clear_cache() failing to delete files on Windows
when they were memory-mapped by readr. Now deletes entire directories
and warns about locked files.

Added .rar archive extraction support via 7-Zip.
find_7z() searches common Windows install paths when
7z is not in PATH.

Added strip_diacriticals() internal helper for
encoding-safe text matching.

read_inep_file() now auto-detects code columns
(CO_*, CD_*) from the file header and reads
them as character. No user action required.

read_ideb_excel() and read_excel_safe()
(CPC/IGC) now convert code columns to character after reading.

get_fundeb_distribution(): Download FUNDEB resource
distribution data (years 2007-2026). Reads all sheets from STN Excel
files and returns tidy long-format data with monthly transfer amounts by
state, funding source, destination (states/municipalities), and table
type (fundeb/adjustment).

get_fundeb_enrollment(): Download FUNDEB enrollment
data. Fetches from FNDE OData API with automatic pagination. Results
cached as CSV.

uf, source (FPE, FPM, ICMS, etc.), and destination
(“uf” or “municipio”).

https://www.tesourotransparente.gov.br) and FNDE
(https://www.fnde.gov.br).

get_capes(): Download CAPES graduate education data
(years 2013-2024).

"programas"), students
("discentes"), faculty ("docentes"), courses
("cursos"), and theses/dissertations catalog
("catalogo").

https://dadosabertos.capes.gov.br).

get_cpc(): Download CPC data (years 2007-2019,
2021-2023; no 2020 edition).

readxl package.

get_igc(): Download IGC data (years 2007-2019,
2021-2023; no 2020 edition).

read_excel_safe(): Internal helper to read Excel files
with error handling.

get_enem_escola(): Download ENEM results aggregated by
school (2005-2015).

get_idd(): Download IDD microdata (years 2014-2019,
2021-2023; no 2020 edition).

extract_archive() utility.

get_encceja(): Download ENCCEJA microdata (years
2014-2024).

get_enade(): Download ENADE microdata.

get_censo_superior(): Download Higher Education Census
microdata (years 2009-2024).

"ies"),
courses ("cursos"), students ("alunos"), and
faculty ("docentes").

list_censo_superior_files(): List available files in a
downloaded census.

uf parameter.

get_saeb(): Download SAEB microdata (years 2011, 2013,
2015, 2017, 2019, 2021, 2023).

"aluno"), school ("escola"), principal
("diretor"), and teacher ("professor")
questionnaires.

level parameter.

iconv() instead of validEnc().

Changed "Latin-1" encoding name to "latin1"
for Windows codepage compatibility.

type parameter for split
files ("participantes", "resultados").

dt_*).

Converts vl_* columns from character to numeric,
handling "-", "ND", and comma decimals.

get_ideb_series() now shows per-year progress
indication (e.g., “processing IDEB 2017 (1/4)”) and propagates the
quiet parameter to inner get_ideb()
calls.

get_enem_itens() now has a keep_zip
parameter for consistency with get_enem() and
get_censo_escolar().

README.md) as default; Portuguese
version renamed to README.pt-br.md with cross-links between
both.

Updated @param year ranges in documentation to match
available_years():
  get_enem() / get_enem_itens(): 2009-2023 -> 1998-2024
  get_censo_escolar(): 2007-2024 -> 1995-2024

Added @family tags to group related functions in help
pages (ENEM, IDEB, School Census, cache).

getting-started.Rmd).

README.pt-br.md.

enem_summary(): statistics calculation,
NA handling, grouping by variable, and error on missing score
columns.

validate_data(): empty data, few
columns, missing expected columns per dataset.

Replaced \donttest with \dontrun in all
examples per CRAN request.

Fixed set_cache_dir() example that created a directory
in the user’s home (~/educabR_cache) during CRAN checks.
Now uses tempdir() in examples.

First public release.

get_ideb(): Download IDEB data (years 2017, 2019, 2021,
2023).

get_ideb_series(): Download IDEB historical series
across multiple years.

list_ideb_available(): List available year/stage/level
combinations.

get_enem(): Download ENEM microdata (years
1998-2024).

get_enem_itens(): Download ENEM item response
data.

enem_summary(): Calculate summary statistics for ENEM
scores.

get_censo_escolar(): Download School Census microdata
(years 1995-2024).

list_censo_files(): List available files in a
downloaded census.

set_cache_dir(): Set custom cache directory.

get_cache_dir(): Get current cache directory.

clear_cache(): Clear cached files.

list_cache(): List cached files with metadata.

available_years(): Get available years for each
dataset.
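
For readers curious about the three post-download checks listed in the download_inep_file() entry (issue #3), here is a minimal base R sketch of the same idea. It is an illustration under assumptions, not the package's implementation: check_download() and its arguments are hypothetical names, and the expected size is passed in by the caller instead of being read from a Content-Length header.

```r
# Hypothetical helper (not part of educabR): returns NULL when the file
# passes every check, or a short description of the first failed check.
check_download <- function(path, expected_bytes = NA_real_, tolerance = 0.01) {
  size <- file.size(path)

  # 1. Size check: compare against the server-reported length, if known.
  if (!is.na(expected_bytes) && expected_bytes > 0 &&
      abs(size - expected_bytes) / expected_bytes > tolerance) {
    return("size mismatch")
  }

  # Read at most the first 64 bytes of the file.
  head_raw <- readBin(path, what = "raw", n = 64L)

  # 2. HTML masquerade: keep printable ASCII only, then look for an HTML preamble.
  codes <- as.integer(head_raw)
  head_txt <- tolower(rawToChar(as.raw(codes[codes >= 32L & codes < 127L])))
  if (grepl("<!doctype html|<html", head_txt)) {
    return("HTML page instead of data")
  }

  # 3. ZIP magic bytes (PK\x03\x04) for .zip destinations.
  zip_magic <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
  if (grepl("\\.zip$", path, ignore.case = TRUE) &&
      !(length(head_raw) >= 4L && identical(head_raw[1:4], zip_magic))) {
    return("missing ZIP signature")
  }

  NULL
}
```

In a downloader built along these lines, a non-NULL result would be the point at which the corrupt file is deleted (for example with unlink(path)) and an error asks the user to retry, as that entry describes.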