1 Abstract

The coupling of chromosome conformation capture (3C)-based methods with next-generation sequencing (NGS) enables high-throughput detection of long-range genomic interactions via the generation of novel ligation products between DNA sequences that are closely juxtaposed in vivo. These interactions may involve promoter regions, enhancers and other regulatory and structural elements of chromosomes, and can reveal key details of the regulation of gene expression. 3C-seq is a variant of the method for the detection of interactions between one chosen genomic element (the viewpoint) and the rest of the genome. We present an R/Bioconductor package designed to perform 3C-seq data analysis for a number of different experimental designs, with or without a control experiment. The package can also be used to analyze experiments with replicates. It provides functions for 3C-seq data normalization, statistical analysis of cis/trans interactions, and visualization, to facilitate the identification of genomic regions that physically interact with the given viewpoints of interest. The r3Cseq package greatly facilitates hypothesis generation and the interpretation of experimental results.

2 Introduction

This vignette describes how to use the r3Cseq package. r3Cseq (S. Thongjuea et al. 2013) is a Bioconductor-compliant R package designed to facilitate the identification of interaction regions generated by chromosome conformation capture and next-generation sequencing (3C-seq). The fundamental principles of 3C-seq are briefly as follows (E. Soler et al. 2010). Isolated cells are treated with a cross-linking agent to preserve in vivo nuclear proximity between DNA sequences. The DNA isolated from these cells is then digested using a primary restriction enzyme, typically a 6-base-pair cutter such as HindIII, EcoRI or BamHI. The digested product is then ligated under dilute conditions to favor intra-molecular over inter-molecular ligation events. This digested and ligated chromatin yields composite sequences representing (distal) genomic regions that are in close physical proximity in the cell nucleus. The digested and ligated chromatin is then de-crosslinked and subjected to a second restriction digest using either NlaIII or DpnII (4-base-pair cutters) as a secondary restriction enzyme to decrease the fragment sizes. The digested DNA is then ligated again under dilute conditions, creating small circular fragments. These fragments are inverse PCR-amplified using primers specific for a genomic region of interest (e.g. a promoter, an enhancer, or any other element potentially involved in long-range interactions), termed the “viewpoint”. The amplified fragments are then sequenced using massively parallel high-throughput sequencing. The 3C-seq procedure produces hybrid DNA molecules that combine the viewpoint-specific primers with sequences derived from the ligated interacting fragments. As such, these composite sequences are unmappable and need to be trimmed to remove the viewpoint sequences, leaving only the captured sequence fragments for mapping.
After trimming, reads are mapped against a reference genome using alignment software such as Bowtie. The mapped read file generated by the alignment software is then converted to a BAM file and analyzed using the r3Cseq package.

3 Preparation of input files

The required input is a BAM file, obtained as output from the mapping software. The BAM file name must not contain any special symbols such as “*”, “%”, “$”, “#”, “!”, “@”, and “-”. Any “-” or “.” in the name (other than the dot before the “bam” extension) must be replaced by “_”. The file name should be short and simple. For example, the original file name “%RIKO-R1-T5_S15_L001.trimmed_experiment.sorted.bam” must be changed to “RIKO_R1_T5_S15_L001.bam”.
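
The character rules above can also be applied programmatically. The following is a minimal sketch in base R; the helper name sanitize_bam_name is hypothetical and not part of r3Cseq. It only enforces the character rules, so shortening the name (as in the example above) remains a manual step.

```r
# Hypothetical helper (not part of r3Cseq): make a BAM file name comply
# with the naming rules described above.
sanitize_bam_name <- function(path) {
  # Work on the file name only and set the ".bam" extension aside.
  stem <- sub("\\.bam$", "", basename(path), ignore.case = TRUE)
  # Replace "-" and "." with "_".
  stem <- gsub("[-.]", "_", stem)
  # Drop every remaining character that is not a letter, a digit or "_".
  stem <- gsub("[^A-Za-z0-9_]", "", stem)
  paste0(stem, ".bam")
}

new_name <- sanitize_bam_name("%RIKO-R1-T5_S15_L001.trimmed_experiment.sorted.bam")
print(new_name)
# "RIKO_R1_T5_S15_L001_trimmed_experiment_sorted.bam"
```

The file itself could then be renamed with file.rename() from base R once the new name has been checked.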

The chromosome identifiers used in each input BAM file are important for running r3Cseq properly. Each chromosome identifier must be in “chr[1..19XYM]” format for the mouse reference genome and “chr[1..22XYM]” format for the human reference genome. Therefore, before using the package, check the chromosome identifiers in the mapped file. If they are not in the proper format, for example ‘mm9_ref_chr01.fa’, a Unix command such as ‘sed’ can be used to replace ‘mm9_ref_chr01.fa’ with ‘chr1’.
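
The same substitution can be written as a regular expression in R, for instance to preview the mapping before the mapped file is rewritten. This is a sketch only: the helper name normalize_chr_name is hypothetical, and it assumes that every identifier contains “chr” followed by a (possibly zero-padded) number or one of X, Y, M, optionally ending in “.fa”.

```r
# Hypothetical helper: map identifiers such as "mm9_ref_chr01.fa" to "chr1".
normalize_chr_name <- function(ids) {
  # Drop any prefix before "chr", leading zeros, and a trailing ".fa".
  sub("^.*chr0*([0-9]+|[XYM])(\\.fa)?$", "chr\\1", ids, perl = TRUE)
}

fixed <- normalize_chr_name(c("mm9_ref_chr01.fa", "mm9_ref_chr10.fa", "mm9_ref_chrX.fa"))
print(fixed)
# "chr1" "chr10" "chrX"
```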

4 Getting started

3C-seq data generated by Stadhouders et al. (2012) will be used for the demonstration. The current version of r3Cseq supports mouse, human, and rat genomes. Therefore, the package requires one of the following BSgenome packages to be installed: BSgenome.Mmusculus.UCSC.mm9.masked, BSgenome.Mmusculus.UCSC.mm10.masked, BSgenome.Hsapiens.UCSC.hg18.masked, BSgenome.Hsapiens.UCSC.hg19.masked, or BSgenome.Rnorvegicus.UCSC.rn5.masked.
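
Whether any of these genome packages is already present can be checked from R without loading them. The sketch below uses only base R; the variable names are illustrative, and the commented installation line assumes that the BiocManager package is available.

```r
# BSgenome packages named in this vignette.
genome_pkgs <- c(
  "BSgenome.Mmusculus.UCSC.mm9.masked",
  "BSgenome.Mmusculus.UCSC.mm10.masked",
  "BSgenome.Hsapiens.UCSC.hg18.masked",
  "BSgenome.Hsapiens.UCSC.hg19.masked",
  "BSgenome.Rnorvegicus.UCSC.rn5.masked"
)

# TRUE for each package found in the library; nothing is loaded.
is_installed <- vapply(
  genome_pkgs,
  function(pkg) nzchar(system.file(package = pkg)),
  logical(1)
)
print(is_installed)

# To install a missing package (needs network access), e.g. for mm9:
# BiocManager::install("BSgenome.Mmusculus.UCSC.mm9.masked")
```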

Load the r3Cseq package into R.

library(r3Cseq)

The package contains two example data sets.

data(Myb_prom_FL)
data(Myb_prom_FB)

Myb_prom_FL contains the aligned reads of the Myb promoter interaction signal in fetal liver. It is stored as a ‘GRanges’ object processed by the ‘Rsamtools’ package.

Myb_prom_FB contains the aligned reads of the Myb promoter interaction signal in fetal brain.

We will next use r3Cseq to discover regions that potentially interact with the promoter region of the Myb gene in both fetal liver and fetal brain (Stadhouders et al. 2012).

5 r3Cseq object initialization

In this section, we will analyze 3C-seq data derived from fetal liver (expressing high levels of the Myb gene) and fetal brain (expressing low levels of the Myb gene). The latter will be used as a negative control. More examples of r3Cseq data analysis can be found on the r3Cseq website, http://r3cseq.genereg.net. We first initialize the r3Cseq object.

my3Cseq.obj <- new("r3Cseq", organismName = 'mm9', isControlInvolved = TRUE,
                   viewpoint_chromosome = 'chr10',
                   viewpoint_primer_forward = 'TCTTTGTTTGATGGCATCTGTT',
                   viewpoint_primer_reverse = 'AAAGGGGAGGAGAAGGAGGT',
                   expLabel = "Myb_prom_FL", contrLabel = "MYb_prom_FB",
                   restrictionEnzyme = 'HindIII')

The input parameters are described in the r3Cseq help page. We next add the raw reads from “Myb_prom_FL” and “Myb_prom_FB” to the existing my3Cseq.obj.

expRawData(my3Cseq.obj)<-exp.GRanges
contrRawData(my3Cseq.obj)<-contr.GRanges
my3Cseq.obj
## An object of class "r3Cseq"
## Slot "alignedReadsBamExpFile":
## [1] ""
## 
## Slot "alignedReadsBamContrFile":
## [1] ""
## 
## Slot "expLabel":
## [1] "Myb_prom_FL"
## 
## Slot "contrLabel":
## [1] "MYb_prom_FB"
## 
## Slot "expLibrarySize":
## integer(0)
## 
## Slot "contrLibrarySize":
## integer(0)
## 
## Slot "expReadLength":
## integer(0)
## 
## Slot "contrReadLength":
## integer(0)
## 
## Slot "expRawData":
## GRanges object with 2478476 ranges and 0 metadata columns:
##             seqnames          ranges strand
##                <Rle>       <IRanges>  <Rle>
##         [1]     chr1 3005887-3005930      +
##         [2]     chr1 3005887-3005930      +
##         [3]     chr1 3005887-3005930      +
##         [4]     chr1 3005887-3005930      +
##         [5]     chr1 3005887-3005930      +
##         ...      ...             ...    ...
##   [2478472]     chrM     12341-12384      +
##   [2478473]     chrM     12341-12384      +
##   [2478474]     chrM     12341-12384      +
##   [2478475]     chrM     12341-12384      +
##   [2478476]     chrM     12341-12384      +
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## Slot "contrRawData":
## GRanges object with 2838018 ranges and 0 metadata columns:
##             seqnames          ranges strand
##                <Rle>       <IRanges>  <Rle>
##         [1]     chr1 3046711-3046754      -
##         [2]     chr1 3046711-3046754      -
##         [3]     chr1 3046711-3046754      -
##         [4]     chr1 3046711-3046754      -
##         [5]     chr1 3402788-3402831      +
##         ...      ...             ...    ...
##   [2838014]     chrM     11987-12030      +
##   [2838015]     chrM     11987-12030      +
##   [2838016]     chrM     11987-12030      +
##   [2838017]     chrM     11987-12030      +
##   [2838018]     chrM     11990-12033      +
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## Slot "organismName":
## [1] "mm9"
## 
## Slot "restrictionEnzyme":
## [1] "HindIII"
## 
## Slot "viewpoint_chromosome":
## [1] "chr10"
## 
## Slot "viewpoint_primer_forward":
## [1] "TCTTTGTTTGATGGCATCTGTT"
## 
## Slot "viewpoint_primer_reverse":
## [1] "AAAGGGGAGGAGAAGGAGGT"
## 
## Slot "expReadCount":
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## Slot "contrReadCount":
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## Slot "expRPM":
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## Slot "contrRPM":
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## Slot "expInteractionRegions":
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## Slot "contrInteractionRegions":
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## Slot "isControlInvolved":
## [1] TRUE

6 Getting reads per restriction fragment/user-defined window size

To get the number of reads per restriction fragment, call the getReadCountPerRestrictionFragment function.

getReadCountPerRestrictionFragment(my3Cseq.obj)
## [1] "Fragmenting genome by....HindIII...."
## [1] "Count processing in the experiment is done."
## [1] "Count processing in the control is done."

The package also provides the function getReadCountPerWindow to count the number of reads per non-overlapping window of a user-defined size.
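
To make the idea of non-overlapping window counting concrete, here is a conceptual sketch in base R. It is not the implementation of getReadCountPerWindow, whose arguments should be taken from its help page; the helper name count_reads_per_window and the read positions are hypothetical, and windows without reads are simply omitted.

```r
# Hypothetical helper: count reads in non-overlapping windows on one
# chromosome; `starts` are 1-based read start positions.
count_reads_per_window <- function(starts, window_size) {
  # Window index 0 covers positions 1..window_size, index 1 the next, etc.
  bin <- (starts - 1) %/% window_size
  counts <- table(bin)
  idx <- as.numeric(names(counts))
  data.frame(
    start  = idx * window_size + 1,
    end    = (idx + 1) * window_size,
    nReads = as.integer(counts)
  )
}

# Toy positions (not real data), counted in 5 kb windows.
window_counts <- count_reads_per_window(c(1, 4999, 5000, 5001, 12000),
                                        window_size = 5000)
print(window_counts)
```

In this toy input the first window (1 to 5000) receives three reads and the next two windows one read each.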

7 3C-seq data normalization

We next perform data normalization.

calculateRPM(my3Cseq.obj)   
## [1] "Normal RPM calculation is done."
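
For orientation, the conventional reads-per-million scaling is sketched below in base R. This reflects what “RPM” usually means and is not a description of calculateRPM: the RPMs printed in the next section are not a fixed multiple of nReads, so the package evidently applies a more involved normalization. The function name simple_rpm and the counts are hypothetical.

```r
# Conventional reads-per-million scaling (illustration only).
simple_rpm <- function(n_reads, library_size = sum(n_reads)) {
  n_reads / library_size * 1e6
}

# Toy read counts for three fragments (not real data).
toy_counts <- c(fragA = 150, fragB = 50, fragC = 800)
toy_rpm <- simple_rpm(toy_counts)
print(toy_rpm)
```

With this definition the values always sum to one million within a library, which makes libraries of different sequencing depth comparable.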

8 Getting interaction regions

After normalization, call the getInteractions function to identify interaction regions.

getInteractions(my3Cseq.obj,fdr=0.05)   
## [1] "Calculation is done. Use function 'expInteractionRegions' or 'contrInteractionRegions' to get the result."

Two functions, expInteractionRegions and contrInteractionRegions, give access to the interaction regions stored in the r3Cseq object. To get the interaction regions for the experiment, use expInteractionRegions.

fetal.liver.interactions<-expInteractionRegions(my3Cseq.obj)
fetal.liver.interactions
## GRanges object with 19502 ranges and 5 metadata columns:
##           seqnames          ranges strand |    nReads               RPMs
##              <Rle>       <IRanges>  <Rle> | <integer>          <numeric>
##       [1]    chr10 3009255-3014345      * |         3 0.0605737657161655
##       [2]    chr10 3067271-3068193      * |       122   11.8504107319424
##       [3]    chr10 3127452-3132355      * |       101   9.05557622062603
##       [4]    chr10 3132360-3134033      * |        70   5.37270060859088
##       [5]    chr10 3141247-3145770      * |       113   10.6253420244925
##       ...      ...             ...    ... .       ...                ...
##   [19498]     chrY 1229367-1231265      * |         2 0.0340049639221147
##   [19499]     chrY 1279671-1287427      * |         2 0.0340049639221147
##   [19500]     chrY 1293808-1295589      * |       137   13.9779584889841
##   [19501]     chrY 1686797-1687184      * |         2 0.0340049639221147
##   [19502]     chrY 2641919-2642337      * |         1 0.0126734717809562
##                             z           p.value   q.value
##                     <numeric>         <numeric> <numeric>
##       [1]  -0.173110045707998                 1         1
##       [2]   0.106335303551167 0.915316322182506         1
##       [3]  0.0407624174864911 0.967485300942551         1
##       [4] -0.0368364336893386                 1         1
##       [5]  0.0673961352730922 0.946266345800746         1
##       ...                 ...               ...       ...
##   [19498]  -0.480154868476184                 1         1
##   [19499]  -0.480154868476184                 1         1
##   [19500]   0.247537676244485 0.804492137104243         1
##   [19501]  -0.480154868476184                 1         1
##   [19502]  -0.485545183622263                 1         1
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths

To get the interaction regions for the control, use contrInteractionRegions.

fetal.brain.interactions<-contrInteractionRegions(my3Cseq.obj)
fetal.brain.interactions
## GRanges object with 9684 ranges and 5 metadata columns:
##          seqnames          ranges strand |    nReads              RPMs
##             <Rle>       <IRanges>  <Rle> | <integer>         <numeric>
##      [1]    chr10 3019822-3022796      * |       425  44.8518378170537
##      [2]    chr10 3063618-3067025      * |       525  59.1090414506868
##      [3]    chr10 3087994-3096004      * |       309  29.5772330684001
##      [4]    chr10 3138895-3141226      * |       750  94.1872849616863
##      [5]    chr10 3148290-3157485      * |         4 0.101138955526621
##      ...      ...             ...    ... .       ...               ...
##   [9680]     chrY 1686797-1687184      * |         1  0.01653813879084
##   [9681]     chrY 1690876-1691262      * |         1  0.01653813879084
##   [9682]     chrY 2371312-2391988      * |         1  0.01653813879084
##   [9683]     chrY 2884699-2887174      * |         1  0.01653813879084
##   [9684]     chrY 2890136-2899310      * |         1  0.01653813879084
##                           z           p.value   q.value
##                   <numeric>         <numeric> <numeric>
##      [1]   0.22836348764507  0.81936367216419         1
##      [2]  0.381850355068183 0.702572365787877         1
##      [3] 0.0709858197887684 0.943409041188086         1
##      [4]   0.72291877311053 0.469729789135378         1
##      [5] -0.363637905503913                 1         1
##      ...                ...               ...       ...
##   [9680] -0.474103528881472                 1         1
##   [9681] -0.474103528881472                 1         1
##   [9682] -0.474103528881472                 1         1
##   [9683] -0.474103528881472                 1         1
##   [9684] -0.474103528881472                 1         1
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths
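
The relationship between the z and p.value columns can be checked by hand. The sketch below is an inference from the printed values, not taken from the package source: the p-values shown above are numerically consistent with a two-sided normal tail probability for positive z and with 1 otherwise. The helper name z_to_p is hypothetical, and the Benjamini-Hochberg adjustment from base R is used as a stand-in for the multiple-testing step, since the estimator behind the q.value column is not stated here.

```r
# Inferred from the printed columns, not from the package source.
z_to_p <- function(z) ifelse(z > 0, 2 * pnorm(z, lower.tail = FALSE), 1)

# z-scores copied from the first five fetal liver rows shown above.
z <- c(-0.173110045707998, 0.106335303551167, 0.0407624174864911,
       -0.0368364336893386, 0.0673961352730922)
p <- z_to_p(z)
print(round(p, 6))
# 1.000000 0.915316 0.967485 1.000000 0.946266

# Stand-in multiple-testing correction (Benjamini-Hochberg, base R).
q <- p.adjust(p, method = "BH")

# Indices of regions passing a 5% threshold; none in this small subset.
significant <- which(q < 0.05)
print(significant)
```

On a full result the same kind of threshold on the q.value column would select the candidate interaction regions.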

9 Getting the viewpoint information

To see the viewpoint information, use the getViewpoint function. getViewpoint returns a GRanges object describing the viewpoint.

viewpoint<-getViewpoint(my3Cseq.obj)
viewpoint
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr10 20880662-20880775      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

10 Visualization of 3C-seq data

The r3Cseq package provides visualization functions. These functions are plotOverviewInteractions, plotInteractionsNearViewpoint, plotInteractionsPerChromosome, and plotDomainogramNearViewpoint.

The plotOverviewInteractions function shows an overview of the interaction regions distributed across the genome.

plotOverviewInteractions(my3Cseq.obj)   

The plotInteractionsNearViewpoint function shows a zoomed-in view of the interaction regions located close to the viewpoint.

plotInteractionsNearViewpoint(my3Cseq.obj)

The plotInteractionsPerChromosome function shows the interaction regions found on a given chromosome, here chromosome 10.

plotInteractionsPerChromosome(my3Cseq.obj,"chr10")

The plotDomainogramNearViewpoint function shows a domainogram of the interactions found in cis. This function may take several minutes to produce the domainogram plot.

plotDomainogramNearViewpoint(my3Cseq.obj,view="both")
## [1] "Analyzing interaction regions for each window...."
## [1] "Analyzing interaction regions for each window...."

11 Associating interaction signals with RefSeq genes

The getExpInteractionsInRefseq and getContrInteractionsInRefseq functions can be used to list the genes that have significant interaction signals in their proximity.

detected_genes<-getExpInteractionsInRefseq(my3Cseq.obj)
head(detected_genes)
##     chromosome gene_name total_nReads total_RPMs
## 170      chr10       Myb        43217  20539.955
## 142      chr10     Hbs1l        34310  16031.869
## 28       chr10      Ahi1        29738  11912.850
## 134       chr4    Gm5506        15792   3939.331
## 32       chr10   Aldh8a1         9422   3633.383
## 136       chr4    Gm5801        13482   3350.087
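
The shape of this table (one row per gene with summed reads and RPMs, strongest signal first) can be illustrated in base R. The sketch assumes that each interaction region has already been assigned to a nearby gene; how r3Cseq performs that assignment is not covered here, and all values in the toy table are hypothetical.

```r
# Toy table (hypothetical values): one row per interaction region that has
# already been assigned to a nearby gene.
regions <- data.frame(
  chromosome = c("chr10", "chr10", "chr10", "chr4"),
  gene_name  = c("Myb", "Myb", "Ahi1", "Gm5506"),
  nReads     = c(120, 80, 60, 30),
  RPMs       = c(12.5, 7.5, 5.0, 2.0)
)

# Sum reads and RPMs per gene.
totals <- aggregate(regions[c("nReads", "RPMs")],
                    by = regions[c("chromosome", "gene_name")],
                    FUN = sum)
names(totals)[3:4] <- c("total_nReads", "total_RPMs")

# Rank genes by total signal, strongest first.
totals <- totals[order(totals$total_RPMs, decreasing = TRUE), ]
print(totals)
```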

12 Export interactions to the bedGraph format

The export3Cseq2bedGraph function exports all interactions from the GRanges object to the bedGraph format, which can be uploaded to the UCSC genome browser.

export3Cseq2bedGraph(my3Cseq.obj)   
## [1] "File Myb_prom_FL.RPMs.bedGraph.gz ' is created."
## [1] "File MYb_prom_FB.RPMs.bedGraph.gz ' is created."
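
For readers unfamiliar with the format, the sketch below builds bedGraph lines by hand in base R. It illustrates the UCSC convention (a track line followed by tab-separated chromosome, 0-based start, end, and value) and is not the implementation of export3Cseq2bedGraph; whether that function writes a track line or shifts coordinates is not stated in this vignette. The helper name to_bedgraph_lines is hypothetical; the two regions are copied from the fetal liver output above.

```r
# Hypothetical helper: format regions as bedGraph lines.
to_bedgraph_lines <- function(regions, track_name) {
  header <- sprintf('track type=bedGraph name="%s"', track_name)
  body <- sprintf(
    "%s\t%d\t%d\t%s",
    regions$chr,
    as.integer(regions$start - 1),  # 1-based inclusive -> 0-based start
    as.integer(regions$end),
    as.character(signif(regions$RPMs, 6))
  )
  c(header, body)
}

# Two regions copied from the fetal liver result shown earlier.
liver_regions <- data.frame(
  chr   = c("chr10", "chr10"),
  start = c(3067271, 3127452),
  end   = c(3068193, 3132355),
  RPMs  = c(11.8504107319424, 9.05557622062603)
)

bedgraph_lines <- to_bedgraph_lines(liver_regions, "Myb_prom_FL")
writeLines(bedgraph_lines)
```

The lines could be written to a compressed file by passing a gzfile() connection to writeLines().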

13 Summary report

The generate3CseqReport function generates a summary report from the r3Cseq analysis results. The report contains a PDF file with all plots and text files of the interaction regions. This function may take several minutes to produce the report.

generate3CseqReport(my3Cseq.obj)    
## [1] "Analyzing interaction regions for each window...."
## [1] "Analyzing interaction regions for each window...."
## [1] "File Myb_prom_FL.interaction.txt ' is created."
## [1] "File MYb_prom_FB.interaction.txt ' is created."
## [1] "File Myb_prom_FL.RPMs.bedGraph.gz ' is created."
## [1] "File MYb_prom_FB.RPMs.bedGraph.gz ' is created."
## [1] "Three files are generated : a pdf file of plots, a text file of interaction regions, and a bedGraph file."

14 Working with replicates

An example of how to work with replicates is shown on the r3Cseq website, http://r3cseq.genereg.net/. The website also provides more details on the r3Cseq analysis pipeline, and the example data sets can be downloaded from it.

15 Session info

Here is the output of sessionInfo on the system on which this document was compiled:

## R Under development (unstable) (2018-11-30 r75722)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] BSgenome.Mmusculus.UCSC.mm9.masked_1.3.99
##  [2] RSQLite_2.1.1                            
##  [3] BSgenome.Mmusculus.UCSC.mm9_1.4.0        
##  [4] BSgenome_1.51.0                          
##  [5] r3Cseq_1.29.0                            
##  [6] qvalue_2.15.0                            
##  [7] VGAM_1.0-6                               
##  [8] rtracklayer_1.43.1                       
##  [9] Rsamtools_1.35.0                         
## [10] Biostrings_2.51.1                        
## [11] XVector_0.23.0                           
## [12] GenomicRanges_1.35.1                     
## [13] GenomeInfoDb_1.19.1                      
## [14] IRanges_2.17.1                           
## [15] S4Vectors_0.21.6                         
## [16] BiocGenerics_0.29.1                      
## [17] BiocStyle_2.11.0                         
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                matrixStats_0.54.0         
##  [3] bit64_0.9-7                 RColorBrewer_1.1-2         
##  [5] tools_3.6.0                 R6_2.3.0                   
##  [7] DBI_1.0.0                   lazyeval_0.2.1             
##  [9] colorspace_1.3-2            tidyselect_0.2.5           
## [11] bit_1.1-14                  compiler_3.6.0             
## [13] chron_2.3-53                Biobase_2.43.0             
## [15] DelayedArray_0.9.0          bookdown_0.8               
## [17] scales_1.0.0                stringr_1.3.1              
## [19] digest_0.6.18               rmarkdown_1.11             
## [21] pkgconfig_2.0.2             htmltools_0.3.6            
## [23] rlang_0.3.0.1               bindr_0.1.1                
## [25] BiocParallel_1.17.3         dplyr_0.7.8                
## [27] RCurl_1.95-4.11             magrittr_1.5               
## [29] GenomeInfoDbData_1.2.0      Matrix_1.2-15              
## [31] Rcpp_1.0.0                  munsell_0.5.0              
## [33] proto_1.0.0                 sqldf_0.4-11               
## [35] stringi_1.2.4               yaml_2.2.0                 
## [37] SummarizedExperiment_1.13.0 zlibbioc_1.29.0            
## [39] plyr_1.8.4                  grid_3.6.0                 
## [41] blob_1.1.1                  crayon_1.3.4               
## [43] lattice_0.20-38             knitr_1.20                 
## [45] pillar_1.3.0                tcltk_3.6.0                
## [47] reshape2_1.4.3              XML_3.98-1.16              
## [49] glue_1.3.0                  evaluate_0.12              
## [51] data.table_1.11.8           BiocManager_1.30.4         
## [53] gtable_0.2.0                purrr_0.2.5                
## [55] assertthat_0.2.0            gsubfn_0.7                 
## [57] ggplot2_3.1.0               xfun_0.4                   
## [59] tibble_1.4.2                GenomicAlignments_1.19.0   
## [61] memoise_1.1.0               bindrcpp_0.2.2

References

Soler, Eric, Charlotte Andrieu-Soler, Ernie de Boer, Jan Christian Bryne, Supat Thongjuea, Ralph Stadhouders, Robert-Jan Palstra, et al. 2010. “The genome-wide dynamics of the binding of Ldb1 complexes during erythroid differentiation.” Genes & Development 24 (3): 277–89.

Stadhouders, R, S Thongjuea, C Andrieu-Soler, RJ Palstra, JC Bryne, E De Boer, C Kockx, et al. 2012. “Dynamic long-range chromatin interactions control Myb proto-oncogene transcription during erythroid development.” The EMBO Journal 31: 986–99.

Thongjuea, S, R Stadhouders, F Grosveld, E Soler, and B Lenhard. 2013. “r3Cseq-an R package for the discovery of long-range genomic interactions with chromosome conformation capture and next-generation sequencing data.” Nucleic Acids Research 41.