plyranges 1.4.4
Ranges
revisitedIn Bioconductor there are two classes, IRanges
and GRanges
,
that are standard data structures for representing genomics data.
Throughout this document I refer to either of these classes as Ranges
if an
operation can be performed on either class, otherwise I explicitly mention if
a function is appropriate for an IRanges
or GRanges
.
Ranges
objects can either represent sets of integers as IRanges
(which have start, end and width attributes)
or represent genomic intervals (which have additional attributes, sequence name,
and strand) as GRanges
. In addition, both types of Ranges
can store information
about their intervals as metadata columns (for example GC content
over a genomic interval).
Ranges
objects follow the tidy data principle: each row
of a Ranges
object corresponds to an interval, while each column will represent
a variable about that interval, and generally each object will represent a
single unit of observation (like gene annotations).
Consequently, Ranges
objects provide a powerful representation for reasoning
about genomic data. In this vignette, you will learn more about Ranges
objects
and how via grouping, restriction and summarisation you can perform common data
tasks.
Ranges
To construct an IRanges
we require that there are at least two columns that represent at either a
starting coordinate, finishing coordinate or the width of the interval.
suppressPackageStartupMessages(library(plyranges))
set.seed(100)
df <- data.frame(start=c(2:-1, 13:15),
width=c(0:3, 2:0))
# produces IRanges
rng <- df %>% as_iranges()
rng
## IRanges object with 7 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 2 1 0
## [2] 1 1 1
## [3] 0 1 2
## [4] -1 1 3
## [5] 13 14 2
## [6] 14 14 1
## [7] 15 14 0
To construct a GRanges
we require a column that represents that sequence name (
contig or chromosome id), and an optional column to represent the strandedness
of an interval.
# seqname is required for GRanges, metadata is automatically kept
grng <- df %>%
transform(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
strand = sample(c("+", "-"), 7, replace = TRUE),
gc = runif(7)) %>%
as_granges()
grng
## GRanges object with 7 ranges and 1 metadata column:
## seqnames ranges strand | gc
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr2 2-1 - | 0.76255108229816
## [2] chr1 1 - | 0.669021712383255
## [3] chr2 0-1 + | 0.204612161964178
## [4] chr2 -1-1 - | 0.357524853432551
## [5] chr1 13-14 - | 0.359475114848465
## [6] chr1 14 - | 0.690290528349578
## [7] chr2 15-14 - | 0.535811153938994
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
Sometimes you want to modify a genomic interval by altering the width of the
interval while leaving the start, end or midpoint of the coordinates unaltered.
This is achieved with the mutate
verb along with anchor_*
adverbs.
The act of anchoring fixes either the start, end, center coordinates of the
Range
object, as shown in the figure and code below and anchors are used in
combination with either mutate
or stretch
.
rng <- as_iranges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8)))
grng <- as_granges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8),
seqnames = "seq1",
strand = c("+", "*", "-")))
mutate(rng, width = 10)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 10 10
## [2] 2 11 10
## [3] 3 12 10
mutate(anchor_start(rng), width = 10)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 10 10
## [2] 2 11 10
## [3] 3 12 10
mutate(anchor_end(rng), width = 10)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] -4 5 10
## [2] -7 2 10
## [3] -1 8 10
mutate(anchor_center(rng), width = 10)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] -2 7 10
## [2] -3 6 10
## [3] 1 10 10
mutate(anchor_3p(grng), width = 10) # leave negative strand fixed
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] seq1 -4-5 +
## [2] seq1 -7-2 *
## [3] seq1 3-12 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mutate(anchor_5p(grng), width = 10) # leave positive strand fixed
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] seq1 1-10 +
## [2] seq1 2-11 *
## [3] seq1 -1-8 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Similarly, you can modify the width of an interval using the stretch
verb.
Without anchoring, this function will extend the interval in either direction
by an integer amount. With anchoring, either the start, end or midpoint are
preserved.
rng2 <- stretch(anchor_center(rng), 10)
rng2
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] -4 10 15
## [2] -3 7 11
## [3] -2 13 16
stretch(anchor_end(rng2), 10)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] -14 10 25
## [2] -13 7 21
## [3] -12 13 26
stretch(anchor_start(rng2), 10)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] -4 20 25
## [2] -3 17 21
## [3] -2 23 26
stretch(anchor_3p(grng), 10)
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] seq1 -9-5 +
## [2] seq1 -8-2 *
## [3] seq1 3-18 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
stretch(anchor_5p(grng), 10)
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] seq1 1-15 +
## [2] seq1 2-12 *
## [3] seq1 -7-8 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Ranges
can be shifted left or right. If strand information is available
we can also shift upstream or downstream.
shift_left(rng, 10)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] -9 -5 5
## [2] -8 -8 1
## [3] -7 -2 6
shift_right(rng, 10)
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 11 15 5
## [2] 12 12 1
## [3] 13 18 6
shift_upstream(grng, 10)
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] seq1 -9--5 +
## [2] seq1 2 *
## [3] seq1 13-18 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
shift_downstream(grng, 10)
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] seq1 11-15 +
## [2] seq1 2 *
## [3] seq1 -7--2 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Ranges
plyranges
introduces a new class of Ranges
called RangesGrouped
,
this is a similar idea to the grouped data.frame\tibble
in dplyr
.
Grouping can act on either the core components or the metadata columns of
a Ranges
object.
It is most effective when combined with other verbs such as mutate()
,
summarise()
, filter()
, reduce_ranges()
or disjoin_ranges()
.
grng <- data.frame(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
strand = sample(c("+", "-"), 7, replace = TRUE),
gc = runif(7),
start = 1:7,
width = 10) %>%
as_granges()
grng_by_strand <- grng %>%
group_by(strand)
## Warning: `as_quosure()` requires an explicit environment as of rlang 0.3.0.
## Please supply `env`.
## This warning is displayed once per session.
grng_by_strand
## GRanges object with 7 ranges and 1 metadata column:
## Groups: strand [2]
## seqnames ranges strand | gc
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr2 1-10 - | 0.889453538926318
## [2] chr2 2-11 + | 0.180407245177776
## [3] chr1 3-12 - | 0.629390850430354
## [4] chr2 4-13 + | 0.989564136601985
## [5] chr1 5-14 - | 0.130288870073855
## [6] chr1 6-15 - | 0.330660525709391
## [7] chr2 7-16 - | 0.865120546659455
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
Ranges
The verb filter
can be used to restrict rows in the Ranges
.
Note that grouping will cause the filter
to act within each group
of the data.
grng %>% filter(gc < 0.3)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gc
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr2 2-11 + | 0.180407245177776
## [2] chr1 5-14 - | 0.130288870073855
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
# filtering by group
grng_by_strand %>% filter(gc == max(gc))
## GRanges object with 2 ranges and 1 metadata column:
## Groups: strand [2]
## seqnames ranges strand | gc
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr2 1-10 - | 0.889453538926318
## [2] chr2 4-13 + | 0.989564136601985
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
We also provide the convenience methods filter_by_overlaps
and
filter_by_non_overlaps
for restricting by any overlapping Ranges
.
ir0 <- data.frame(start = c(5,10, 15,20), width = 5) %>%
as_iranges()
ir1 <- data.frame(start = 2:6, width = 3:7) %>%
as_iranges()
ir0
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 5 9 5
## [2] 10 14 5
## [3] 15 19 5
## [4] 20 24 5
ir1
## IRanges object with 5 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 2 4 3
## [2] 3 6 4
## [3] 4 8 5
## [4] 5 10 6
## [5] 6 12 7
ir0 %>% filter_by_overlaps(ir1)
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 5 9 5
## [2] 10 14 5
ir0 %>% filter_by_non_overlaps(ir1)
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 15 19 5
## [2] 20 24 5
Ranges
The summarise
function will return a DataFrame
because the information required
to return a Ranges
object is lost. It is often most useful to use summarise()
in combination with the group_by()
family of functions.
ir1 <- ir1 %>%
mutate(gc = runif(length(.)))
ir0 %>%
group_by_overlaps(ir1) %>%
summarise(gc = mean(gc))
## DataFrame with 2 rows and 2 columns
## query gc
## <integer> <numeric>
## 1 1 0.675554534769617
## 2 2 0.635795166483149
Ranges
A join acts on two GRanges objects, a query and a subject.
query <- data.frame(seqnames = "chr1",
strand = c("+", "-"),
start = c(1, 9),
end = c(7, 10),
key.a = letters[1:2]) %>%
as_granges()
subject <- data.frame(seqnames = "chr1",
strand = c("-", "+"),
start = c(2, 6),
end = c(4, 8),
key.b = LETTERS[1:2]) %>%
as_granges()
The join operator is relational in the sense that metadata
from the query and subject ranges is retained in the joined range. All join operators in the plyranges
DSL generate a set of hits based on overlap or proximity of ranges and use those hits to merge the two
datasets in different ways.
There are four supported matching algorithms: overlap,
nearest, precede, and follow. We can further restrict the
matching by whether the query is completely within the subject, and
adding the directed suffix ensures that matching ranges have the
same direction (strand).