1 Introduction

The Zarr specification defines a format for chunked, compressed, N-dimensional arrays. Its design allows efficient access to subsets of the stored array, and supports both local and cloud storage systems. Zarr is seeing increasing adoption in a number of scientific fields where multi-dimensional data are prevalent, in particular as a back-end to the Open Microscopy Environment’s OME-NGFF format for storing bioimaging data in the cloud.

Rarr is intended to be a simple interface for reading and writing individual Zarr arrays. It is developed in R and C with no reliance on external libraries or APIs for interfacing with the Zarr arrays. Additional compression libraries (e.g. blosc) are bundled with Rarr to provide support for datasets compressed using these tools.

1.1 Limitations with Rarr

If you know about Zarr arrays already, you’ll probably be aware they can be stored in hierarchical groups, where additional metadata can explain the relationship between the arrays. Currently, Rarr is not designed to be aware of these hierarchical Zarr array collections. However, the component arrays can be read individually by providing the path to them directly.
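As a rough illustration of locating component arrays by path, the sketch below searches a directory tree for the `.zarray` metadata file that each array carries in version 2 of the Zarr specification. The helper `find_zarr_arrays()` and the toy directory are hypothetical and are not part of Rarr; the placeholder files stand in for real arrays only so that the example is self-contained.

```r
# Hypothetical helper (not part of Rarr): return the directories below `root`
# that contain a ".zarray" file, i.e. that look like Zarr v2 arrays.
find_zarr_arrays <- function(root) {
  meta <- list.files(
    root,
    pattern = "^\\.zarray$",
    recursive = TRUE,
    all.files = TRUE, # ".zarray" is a hidden file, so include hidden files
    full.names = TRUE
  )
  sort(dirname(meta))
}

# Build a toy "group" on disk using empty placeholder files (not real arrays)
group <- file.path(tempdir(), "toy_group.zarr")
for (arr in c("0", "1")) {
  dir.create(file.path(group, arr), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(group, arr, ".zarray"))
}

array_paths <- find_zarr_arrays(group)
basename(array_paths)
```

With a real group, each path found this way could be handed to the reading functions described later in this document.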

Currently, there are also limitations on the Zarr datatypes that can be accessed using Rarr. For now most numeric types can be read into R, although in some instances, e.g. 64-bit integers, there is potential for loss of information. Writing is more limited, with support only for datatypes that are supported natively in R and only using the column-first representation.
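The 64-bit integer caveat follows from R’s native numeric types. The sketch below uses base R only and does not show how Rarr itself performs the conversion (which is not described here); it simply illustrates the limits that any conversion into R integers or doubles has to work within.

```r
# R's integer type is 32-bit, so the largest representable value is 2^31 - 1
int_max <- .Machine$integer.max

# Doubles have a 53-bit significand: every integer up to 2^53 is exact,
# but 2^53 + 1 cannot be represented and is rounded back to 2^53
exact_limit <- 2^53
collides <- (exact_limit + 1) == exact_limit

# The largest signed 64-bit integer, written as an R numeric literal,
# is stored as the nearest double, which is 2^63
int64_max_as_double <- 9223372036854775807
rounded_up <- int64_max_as_double == 2^63
```

Values within the range of a 32-bit integer, or below 2^53 in magnitude when held as doubles, are not affected by this rounding.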

1.2 Example data

There are some example Zarr arrays included with the package. These were created using the Zarr Python implementation and are primarily intended for testing the functionality of Rarr. You can use the code below to list the complete set on your system; it’s a long list, so we don’t show the output here.

list.dirs(
  system.file("extdata", "zarr_examples", package = "Rarr"),
  recursive = TRUE
) |>
  grep(pattern = "zarr$", value = TRUE)

2 Quick start guide

2.1 Installation and setup

If you want to quickly get started reading an existing Zarr array with the package, this section should have the essentials covered. First, we need to install Rarr with the commands below (you only need to do the installation step once).

## we need BiocManager to perform the installation
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## install Rarr
BiocManager::install("Rarr")

Once Rarr is installed, we have to load it into our R session:

library(Rarr)

Rarr can be used to read files either on local disk or on remote S3 storage systems. First, let’s take a look at reading from a local file.

2.2 Reading from a local Zarr array

To demonstrate reading a local file, we’ll pick the example file containing 32-bit integers arranged in the “column first” ordering.

zarr_example <- system.file(
  "extdata", "zarr_examples", "column-first", "int32.zarr",
  package = "Rarr"
)

2.2.1 Exploring the data

We can get a summary of the array properties, such as its shape and datatype, using the function zarr_overview(). This essentially reads and formats the array metadata that accompanies any Zarr array.

zarr_overview(zarr_example)
## Type: Array
## Path: /tmp/RtmpHaxDDq/Rinst31e1e6703dbd16/Rarr/extdata/zarr_examples/column-first/int32.zarr
## Shape: 30 x 20 x 10
## Chunk Shape: 10 x 10 x 5
## No. of Chunks: 12 (3 x 2 x 2)
## Data Type: int32
## Endianness: little
## Compressor: blosc

You can use this to check that the location is a valid Zarr array, and that the shape and datatype of the array content are what you are expecting. For example, we can see in the output above that the data type (int32) corresponds to what we expect.
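The chunk count reported in the overview can be reproduced from the shape and chunk shape. This is a small arithmetic sketch in base R, with the two vectors copied by hand from the output above rather than read from the array.

```r
# Values copied from the zarr_overview() output above
array_shape <- c(30, 20, 10)
chunk_shape <- c(10, 10, 5)

# Number of chunks along each dimension; ceiling() accounts for a partial
# chunk at the edge when the shape is not an exact multiple of the chunk size
chunks_per_dim <- ceiling(array_shape / chunk_shape)

# Total number of chunks in the array
n_chunks <- prod(chunks_per_dim)
```

Here `chunks_per_dim` is 3, 2, 2 and `n_chunks` is 12, matching the "No. of Chunks" line.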

2.2.2 Reading the Zarr array

The summary information retrieved above is useful because, unless you want to read the entire array, you need to know its shape in order to select a subset. From the previous output we can see our example array has three dimensions of size 30 x 20 x 10. We can select the subset we want to extract using a list. The list must have the same length as the number of dimensions in our array, with each element of the list corresponding to the indices you want to extract in that dimension.

index <- list(1:4, 1:2, 1)

We then extract the subset using read_zarr_array():

read_zarr_array(zarr_example, index = index)
## , , 1
## 
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    0
## [3,]    1    0
## [4,]    1    0

2.3 Reading from S3 storage

Reading files in S3 storage works in a very similar fashion to local disk. This time the path needs to be a URL to the Zarr array. We can again use zarr_overview() to quickly retrieve the array metadata.

s3_address <- "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0076A/10501752.zarr/0"
zarr_overview(s3_address)
## Type: Array
## Path: https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0076A/10501752.zarr/0/
## Shape: 50 x 494 x 464
## Chunk Shape: 1 x 494 x 464
## No. of Chunks: 50 (50 x 1 x 1)
## Data Type: float64
## Endianness: little
## Compressor: blosc

The output above indicates that the array is stored in 50 chunks, each containing a slice of the overall data. In the example below we use the index argument to extract the first and tenth slices from the array; the NULL entries select the whole of the second and third dimensions. Choosing to read only 2 of the 50 slices is much faster than if we opted to download the entire array before accessing the data.

z2 <- read_zarr_array(s3_address, index = list(c(1, 10), NULL, NULL))
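To see why this request is cheap, we can work out which chunks the selection touches. The following base R sketch is an illustration only, not Rarr’s internal logic; the shapes are copied by hand from the overview output, and a NULL index entry is treated as selecting the full dimension.

```r
# Values copied from the zarr_overview() output for the S3 array
array_shape <- c(50, 494, 464)
chunk_shape <- c(1, 494, 464)
index <- list(c(1, 10), NULL, NULL)

# Expand NULL to the full range of indices in that dimension
full_index <- Map(
  function(i, n) if (is.null(i)) seq_len(n) else i,
  index, array_shape
)

# 1-based chunk number along each dimension for every requested index
chunks_hit <- Map(
  function(i, size) unique((i - 1) %/% size + 1),
  full_index, chunk_shape
)

# Chunks that must be fetched versus chunks in the whole array
n_needed <- prod(lengths(chunks_hit))
n_total <- prod(ceiling(array_shape / chunk_shape))
```

Under these assumptions only chunks 1 and 10 along the first dimension are needed, so `n_needed` is 2 while `n_total` is 50.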

We then plot our two slices on top of one another using the image() function.

## plot the first slice in blue
image(log2(z2[1, , ]),
  col = hsv(h = 0.6, v = 1, s = 1, alpha = 0:100 / 100),
  asp = dim(z2)[2] / dim(z2)[3], axes = FALSE
)
## overlay the tenth slice in green
image(log2(z2[2, , ]),
  col = hsv(h = 0.3, v = 1, s = 1, alpha = 0:100 / 100),
  asp = dim(z2)[2] / dim(z2)[3], axes = FALSE, add = TRUE
)
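The overlay works because each palette is a ramp of a single hue that runs from fully transparent to fully opaque, and image() assigns the first colours in the palette to the lowest values. The short sketch below inspects the blue palette used above with base grDevices functions; the variable names are ours.

```r
# The same palette as used for the first slice: one hue, 101 alpha levels
pal <- hsv(h = 0.6, s = 1, v = 1, alpha = 0:100 / 100)

# Extract the alpha channel (0 = fully transparent, 255 = fully opaque)
alpha_channel <- col2rgb(pal, alpha = TRUE)["alpha", ]

length(pal)
range(alpha_channel)
```

Low-intensity pixels in the second image are therefore drawn almost transparently, leaving the first image visible underneath.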