Introduction

When performing statistical analysis on any set of genomic ranges it is often important to compare focal sets to null sets that are carefully matched for possible covariates that may influence the analysis. To address this need, the nullranges package implements matchRanges(), an efficient and convenient tool for selecting a covariate-matched set of null hypothesis ranges from a pool of background ranges within the Bioconductor framework.

In this vignette, we provide an overview of matchRanges() and its associated functions. We start with a simulated example generated with the utility function makeExampleMatchedDataSet(). We also provide an overview of the class structure and a guide for choosing among the supported matching methods. To see matchRanges() used in real biological examples, visit the Case study I: CTCF occupancy and Case study II: CTCF orientation vignettes.

Using matchRanges()

We will use a simulated data set to demonstrate matching across covarying features:
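The code that produced the printed object is not shown in this rendering. A minimal sketch of how such a data set can be generated follows. The seed value is an assumption made for reproducibility of this example, so individual feature values may differ from the output shown here.

```r
library(nullranges)

# Seed chosen for illustration only; the seed used for the printed output is not shown
set.seed(123)

# Simulated ranges carrying three metadata columns: feature1, feature2, feature3
x <- makeExampleMatchedDataSet(type = "GRanges")
x
```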

## GRanges object with 10500 ranges and 3 metadata columns:
##           seqnames      ranges strand |  feature1  feature2    feature3
##              <Rle>   <IRanges>  <Rle> | <logical> <numeric> <character>
##       [1]     chr1       1-100      * |      TRUE   2.87905           c
##       [2]     chr1       2-101      * |      TRUE   3.53965           c
##       [3]     chr1       3-102      * |      TRUE   7.11742           c
##       [4]     chr1       4-103      * |      TRUE   4.14102           a
##       [5]     chr1       5-104      * |      TRUE   4.25858           c
##       ...      ...         ...    ... .       ...       ...         ...
##   [10496]     chr1 10496-10595      * |     FALSE   1.23578           b
##   [10497]     chr1 10497-10596      * |     FALSE   1.69671           a
##   [10498]     chr1 10498-10597      * |     FALSE   6.11140           a
##   [10499]     chr1 10499-10598      * |     FALSE   2.21657           d
##   [10500]     chr1 10500-10599      * |     FALSE   5.33003           b
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Our simulated dataset has 3 features: logical feature1, numeric feature2, and character/factor feature3. We can use matchRanges() to compare ranges where feature1 is TRUE to ranges where feature1 is FALSE, matched by feature2 and/or feature3:
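A call along the following lines would produce such a matched set. This is a sketch, assuming the default matching method and an arbitrary seed; the variable names x and mgr are chosen for this example.

```r
library(nullranges)

set.seed(123)  # arbitrary seed, assumed for this example
x <- makeExampleMatchedDataSet(type = "GRanges")

# Focal set: ranges where feature1 is TRUE; pool: ranges where feature1 is FALSE.
# covar is a one-sided formula naming the covariates to match on.
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3)
mgr
```

Dropping a term from the formula (for example, covar = ~feature2) matches on the remaining covariate only.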

## MatchedGRanges object with 500 ranges and 3 metadata columns:
##         seqnames      ranges strand |  feature1  feature2    feature3
##            <Rle>   <IRanges>  <Rle> | <logical> <numeric> <character>
##     [1]     chr1   4373-4472      * |     FALSE  8.959578           d
##     [2]     chr1   9740-9839      * |     FALSE  0.959336           e
##     [3]     chr1   7755-7854      * |     FALSE  2.107003           c
##     [4]     chr1   8266-8365      * |     FALSE  6.231860           d
##     [5]     chr1   4298-4397      * |     FALSE  6.955316           c
##     ...      ...         ...    ... .       ...       ...         ...
##   [496]     chr1   2443-2542      * |     FALSE   1.12276           b
##   [497]     chr1   2455-2554      * |     FALSE   3.38518           c
##   [498]     chr1   1285-1384      * |     FALSE   1.58546           c
##   [499]     chr1 10137-10236      * |     FALSE   9.39272           c
##   [500]     chr1   6119-6218      * |     FALSE  10.22412           c
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The resulting MatchedGRanges object is a set of null hypothesis ranges selected from our pool of options. It is the same length as our input focal ranges and is matched for the covariates feature2 and feature3. These matched ranges print and behave just as normal GRanges would; for example, they can be sorted by position:
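As one illustration of GRanges-style behavior, the matched set can be passed to sort() from GenomicRanges. The sketch below rebuilds the matched object so that it runs on its own; the seed and variable names are assumptions.

```r
library(nullranges)
library(GenomicRanges)

set.seed(123)  # arbitrary seed, assumed for this example
x <- makeExampleMatchedDataSet(type = "GRanges")
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3)

# Standard GRanges operations apply; sort() orders the ranges by position
sorted_mgr <- sort(mgr)
sorted_mgr
```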

## MatchedGRanges object with 500 ranges and 3 metadata columns:
##         seqnames      ranges strand |  feature1  feature2    feature3
##            <Rle>   <IRanges>  <Rle> | <logical> <numeric> <character>
##     [1]     chr1     511-610      * |     FALSE  5.545186           c
##     [2]     chr1     513-612      * |     FALSE  2.221684           b
##     [3]     chr1     534-633      * |     FALSE  1.563458           b
##     [4]     chr1     565-664      * |     FALSE  0.932659           c
##     [5]     chr1     577-676      * |     FALSE  3.256908           c
##     ...      ...         ...    ... .       ...       ...         ...
##   [496]     chr1 10377-10476      * |     FALSE  0.795032           c
##   [497]     chr1 10380-10479      * |     FALSE  0.977984           b
##   [498]     chr1 10409-10508      * |     FALSE  3.662119           c
##   [499]     chr1 10455-10554      * |     FALSE  6.815473           c
##   [500]     chr1 10483-10582      * |     FALSE  3.724147           c
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can change the type argument of makeExampleMatchedDataSet() to generate data.frames, data.tables, DataFrames, GRanges, and GInteractions objects, all of which work as inputs for matchRanges(). These produce either MatchedDataFrame, MatchedGRanges, or MatchedGInteractions objects. For more information about the Matched class structure and available methods, see the Class structure section below or the help documentation for each class: ?MatchedDataFrame, ?MatchedGRanges, or ?MatchedGInteractions.
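For instance, the same workflow can be sketched with a data.frame input. The seed and the names df and mdf are assumptions made for this example.

```r
library(nullranges)

set.seed(123)  # arbitrary seed, assumed for this example
df <- makeExampleMatchedDataSet(type = "data.frame")

# Rows are subset with the usual data.frame syntax
mdf <- matchRanges(focal = df[df$feature1, ],
                   pool = df[!df$feature1, ],
                   covar = ~feature2 + feature3)

# A data.frame input is expected to yield a MatchedDataFrame
class(mdf)
```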

matchRanges() uses propensity scores to select matches using one of three available matching methods: “nearest”, “rejection”, or “stratified”, with or without replacement. For more information, see the section Choosing the method parameter below.
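To make the idea of propensity-score matching concrete, here is a minimal base R sketch of nearest-neighbor matching with replacement. It is an illustration only, not the package's implementation: the logistic-regression model, the simulated data, and every variable name are assumptions made for this example.

```r
set.seed(1)
n_focal <- 50
n_pool <- 1000

# Toy data: the focal and pool sets differ in the distribution of one covariate
dat <- data.frame(
  is_focal = rep(c(TRUE, FALSE), c(n_focal, n_pool)),
  feature2 = c(rnorm(n_focal, mean = 4, sd = 2),
               rnorm(n_pool, mean = 6, sd = 3))
)

# Propensity score: modeled probability of belonging to the focal set
# given the covariate (logistic regression is assumed here)
fit <- glm(is_focal ~ feature2, data = dat, family = binomial())
ps <- fitted(fit)

focal_ps <- ps[dat$is_focal]
pool_ps <- ps[!dat$is_focal]

# For each focal element, pick the pool element with the closest
# propensity score; a pool element may be picked more than once
matched_idx <- vapply(focal_ps,
                      function(p) which.min(abs(pool_ps - p)),
                      integer(1))
matched <- dat[!dat$is_focal, ][matched_idx, ]

# The matched mean should sit much closer to the focal mean than the pool mean does
c(focal = mean(dat$feature2[dat$is_focal]),
  matched = mean(matched$feature2),
  pool = mean(dat$feature2[!dat$is_focal]))
```

Matching on a single score rather than on each covariate separately is what allows several covariates of mixed type to be balanced at once.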

Assessing quality of matching

We can assess the quality of Matched classes with overview(), plotCovariate(), and plotPropensity(). overview() provides a quick assessment of overall matching quality by reporting the mean and standard deviation for covariates and propensity scores of the focal, pool, matched, and unmatched sets. For factor, character, or logical covariates (i.e., categorical covariates), the N per set (frequency) is returned instead. It also reports the differences between the focal and matched sets:
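A minimal sketch of the call follows, rebuilding the matched object so that it runs on its own. The seed and variable names are assumptions, so the exact numbers may differ from the output shown here.

```r
library(nullranges)

set.seed(123)  # arbitrary seed, assumed for this example
x <- makeExampleMatchedDataSet(type = "GRanges")
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3)

# Summary statistics per set, followed by the focal - matched differences
overview(mgr)
```

Differences close to zero in the focal - matched rows indicate that the matched set resembles the focal set for that covariate.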

## MatchedGRanges object: 
##        set     N feature2.mean feature2.sd feature3.a feature3.b feature3.c
##      focal   500           4.1         1.9         66        157        206
##    matched   500           4.5         2.7         34        160        234
##       pool 10000           6.0         3.4       4248       3121       1117
##  unmatched  9500           6.1         3.5       4214       2961        883
##  feature3.d feature3.e ps.mean ps.sd
##          49         22   0.100 0.076
##          53         19   0.110 0.078
##         992        522   0.045 0.051
##         939        503   0.041 0.047
## --------
## focal - matched: 
##  feature2.mean feature2.sd feature3.a feature3.b feature3.c feature3.d
##          -0.42       -0.84         32         -3        -28         -4
##  feature3.e ps.mean   ps.sd
##           3 -0.0057 -0.0019

Visualizing propensity scores with plotPropensity() can show how well the sets were matched overall.
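The figure itself is not reproduced in this rendering. A sketch of a call that draws the propensity score distributions for the focal, pool, and matched sets might look like the following; the choice of sets and of the ridges plot type is an assumption made for this example, as are the seed and variable names.

```r
library(nullranges)

set.seed(123)  # arbitrary seed, assumed for this example
x <- makeExampleMatchedDataSet(type = "GRanges")
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3)

# 'f', 'p', and 'm' select the focal, pool, and matched sets
p <- plotPropensity(mgr, sets = c("f", "p", "m"), type = "ridges")
p
```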

The distribution of each covariate can be visualized in each set with plotCovariate().
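As a sketch, one plot per covariate can be built by looping over the covariate names returned by covariates(). The choice of sets, the seed, and the variable names are assumptions made for this example.

```r
library(nullranges)

set.seed(123)  # arbitrary seed, assumed for this example
x <- makeExampleMatchedDataSet(type = "GRanges")
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3)

# One plot per covariate, each comparing the focal, matched, and pool sets
plots <- lapply(covariates(mgr), function(cv) {
  plotCovariate(mgr, covar = cv, sets = c("f", "m", "p"))
})
plots[[1]]
```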