1 Basics

1.1 Install brainflowprobes

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. brainflowprobes is an R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install brainflowprobes by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("brainflowprobes")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

## If you want to force the installation of the development version, you can
## do so by running. However, we suggest that you wait for Bioconductor to
## run checks and build the latest release.
BiocManager::install("LieberInstitute/brainflowprobes")

1.2 Required knowledge

brainflowprobes is based on many other packages, particularly GenomicRanges, derfinder, and derfinderPlot. A brainflowprobes user is not expected to deal with those packages directly, but may find their manuals useful.

If you are asking yourself the question “How do I start using Bioconductor?” you might be interested in this blog post.

1.3 Asking for help

The blog post quoted above mentions some options, but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the brainflowprobes tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

1.4 Citing brainflowprobes

We hope that brainflowprobes will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation('brainflowprobes')
## 
## Price AJ (2019). _Plots and annotation for choosing BrainFlow target
## probe sequence_. doi: 10.18129/B9.bioc.brainflowprobes (URL:
## https://doi.org/10.18129/B9.bioc.brainflowprobes),
## https://github.com/LieberInstitute/brainflowprobes - R package
## version 1.0.0, <URL:
## http://www.bioconductor.org/packages/brainflowprobes>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {Plots and annotation for choosing BrainFlow target probe sequence},
##     author = {Amanda J Price},
##     year = {2019},
##     url = {http://www.bioconductor.org/packages/brainflowprobes},
##     note = {https://github.com/LieberInstitute/brainflowprobes - R package version 1.0.0},
##     doi = {10.18129/B9.bioc.brainflowprobes},
##   }

2 Introduction

brainflowprobes is an R package that contains four functions to aid in designing probes to target RNA sequences in nuclei isolated from human postmortem brain using flow cytometry. brainflowprobes was made to support the method described in the BrainFlow publication, which is based on the Invitrogen PrimeFlow&#174 RNA kit.

2.1 Notes before starting

This package is currently only compatible with hg19 sequences. Also, because of the type of data used, the plotting functions plot_coverage() and four_panels() do not work on Windows machines. To visualize the data, please use R installed on either Mac or Linux operating systems. The region_info() function can still be used on Windows machines for creating an annotated .csv file to get the information required for custom probe synthesis.

3 Choosing candidate RNA sequences

Which genes you choose to target will ultimately depend on the purpose of your experiment. BrainFlow can be used to isolate specific cell populations for downstream sequencing, for instance, or to assess the coexpression of up to four transcripts at a time at single-nucleus resolution. For the highest probability of success, however, several parameters should be considered when choosing a sequence to target, no matter the purpose. Target sequences are more likely to be successful if they:

  • Are at least 1kb of contiguously expressed sequence (mandatory)
  • Are as highly expressed a target as possible (within the cell population of interest)
  • Are as highly expressed in nuclear RNA as possible
  • Are selectively expressed in the cell population of interest
  • Are robust to degradation due to postmortem factors
  • Avoid splice junctions if it can be helped (given that we’re profiling nuclear RNA)

More details about probe design and each of these considerations can be found in the BrainFlow manuscript. One strategy for choosing a sequence could be to choose the 3’UTR of a transcript of interest. Another strategy (and how many of the probes already validated in the BrainFlow manuscript were designed) is to identify expressed regions using the derfinder package. However you choose which sequences to test, the brainflowprobes package will help narrow down the best sequences for which to synthesize target probes.

4 Annotating a candidate sequence

Let’s say you want to design a probe to target deep layer pyramidal neurons in the prefrontal cortex. You choose TBR1 because of its role in neuronal identity specification in these cells, and you want to see if the last exon of this gene, a ~2.5 Kb sequence, would make for a good probe target. You find using the UCSC Genome browser (for instance) that the hg19 coordinates for this exon are chr2:162279880-162282378:+, where chr2 is the chromosome, 162279880-162282378 is the start and end of the exon, and + means that it is on the plus strand.

Before any of the brainflowprobes functions can be used, the package must be loaded and attached:

## Load brainflowprobes R package
library('brainflowprobes')

The next step is to use the region_info() function to annotate the sequence in a .csv file that can be used to custom synthesize a target probe:

region_info("chr2:162279880-162282378:+", CSV = FALSE, SEQ = TRUE, OUTDIR = ".")
## snapshotDate(): 2019-10-29
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## Completed! If CSV=TRUE, check for region_info.csv in the temporary
## directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
##   seqnames     start       end width strand name          annotation
## 1     chr2 162279880 162282378  2499      + TBR1 NM_006593 NP_006584
##   description region distance   subregion insideDistance exonnumber nexons
## 1 inside exon inside     7072 inside exon              0          3      3
##              UTR geneL codingL          Geneid
## 1 overlaps 3'UTR  9573    7816 ENSG00000136535
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Sequence
## 1 GATCTACACCGGCTGTGACATGGACCGCCTGACCCCCTCGCCCAACGACTCGCCGCGCTCGCAGATCGTGCCCGGGGCCCGCTACGCCATGGCCGGCTCTTTCCTGCAGGACCAGTTCGTGAGCAACTACGCCAAGGCCCGCTTCCACCCGGGCGCGGGCGCGGGCCCCGGGCCGGGTACGGACCGCAGCGTGCCGCACACCAACGGGCTGCTGTCGCCGCAGCAGGCCGAGGACCCGGGCGCGCCCTCGCCGCAACGCTGGTTTGTGACGCCGGCCAACAACCGGCTGGACTTCGCGGCCTCGGCCTATGACACGGCCACGGACTTCGCGGGCAACGCGGCCACGCTGCTCTCTTACGCGGCGGCGGGCGTGAAGGCGCTGCCGCTGCAGGCTGCAGGCTGCACTGGCCGCCCGCTCGGCTACTACGCCGACCCGTCGGGCTGGGGCGCCCGCAGTCCCCCGCAGTACTGCGGCACCAAGTCGGGCTCGGTGCTGCCCTGCTGGCCCAACAGCGCCGCGGCCGCCGCGCGCATGGCCGGCGCCAATCCCTACCTGGGCGAGGAGGCCGAGGGCCTGGCCGCCGAGCGCTCGCCGCTGCCGCCCGGCGCCGCCGAGGACGCCAAGCCCAAGGACCTGTCCGATTCCAGCTGGATCGAGACGCCCTCCTCGATCAAGTCCATCGACTCCAGCGACTCGGGGATTTACGAGCAGGCCAAGCGGAGGCGGATCTCGCCGGCCGACACGCCCGTGTCCGAGAGTTCGTCCCCGCTCAAGAGCGAGGTGCTGGCCCAGCGGGACTGCGAGAAGAACTGCGCCAAGGACATTAGCGGCTACTATGGCTTCTACTCGCACAGCTAGGCCGCCCCTGCCCGCCCGGCCCCGCCGCGGCCCGGACCCCCAGCCAGCCCCTCACAGCTCTTCCCCAGCTCCGCCTCCCCACACTCCTCCTTGCGCACCCACTCATTTTATTTGACCCTCGATGGCCGTCTGCAGCGAATAAGTGCAGGTCTCCGAGCGTGATTTTAACCTTTTTTGCACAGCAGTCTCTGCAATTAGCTCACCGACCTTCAACTTTGCTGTAAACCTTTTGGTTTTCCTACTTACTCTTCTTCTGTGGAGTTATCCTCCTACAATTCCCCTCCCCCTCGTCTTTCTCTTACCTCCTACTTCTCTTTCTTGTAATGAAACTCTTCACCTTTAGGAGACCTGGGCAGTCCTGTCAGGCAGCAGCGATTCCGACCCGCCAAGTCTCGGCCTCCACATTAACCATAGGATGTTGACTCTAGAACCTGGACCCACCCAGCGCGTCCTTTCTTATCCCCGAGTGGATGGATGGATGGATGGATGGTAGGGATGTTAATAATTTTAGTGGAACAAAGCCTGTGAAATGATTGTACATAGTGTTAATTTATTGTAACGAATGGCTAGTTTTTATTCTCGTCAAGGCACAAAACCAGTTCATGCTTAACCTTTTTTTCCTTTCCTTTCTTTGCTTTTCTTTCTCTCCTCTCATACTTTCTCTTCTCTCTCTTTTAATTTTCTTGTGAGATAATATTCTAAGAGGCTCTAGAAACATGAAATACTCAGTAGTGATGGGTTTCCCACTTCTCCTCAATCCGTTGCATGAAATAATTACTATGTGCCCTAATGCACACAAATAGCTAAGGAGAATCCACCCAAACACCTTTAAAGGATAGGTGTCTGTTCATAGGCAAGTCGATTAAGTGGCATGATGCCTGCAAAGCAAAGTCAACTGGAGTTGTATGTTCCCCCCACCTTCTAAATAGAATAGCTCGACATCAGCAATATTATTTTGCCTTATTTGTTTTTCCCCAAAGTGCCAAATCCATTACTGGTCTGTGCAGGTGCCAAATATGCTGACAAACTGTTTCTGAATATCTTTCAGTACCCCTTCACCTTTATATGCTGTAAATCTTTGTAATGAATACTCTATTAATGATATAGATGACTGAATTGTTGGTAACTATAGTGTAGTCTAGTGAAGATGAATTGTGTGAGTTGTATATTTTACTGCATTTTAGTTTTGAAAATGACTTCCCCACCACCTAGAAACAGCTGAAATTTGACTTCCTTGGGAGAACACTAGCATTAATGCAAGTAAGACTGATTTTCCCCTAAGTCTTGTTATATTTGATAAGGAGCATTAATCCCCCTGGAAATAGATTAGTAGGATTTCTAATGTTGTGTAGCAAACCTATACTTTTTTGTATTTAAAAATTAATGTGAAATATGCATCATACACAATATTCAATCTAGATTCCAGTCCATGGGGGGATTTTTCCTAATAGGAATTCAGGGTCTAAACGTGTGTATATTTTGGCTCTTCTGTAAATCTAATGTTGTGATTTTTATATTTGTTTCGTTTTGTCTGTGAACTGAATAATTTATACAAGAACACACTCCATTGAGAAACGTTTTGTTTTTTGCTCGTTTGTATCGTCTGTGTATAACAAGTAAAATAAACCTGGTAAAAACGC

This function will annotate the specified genomic regions (sequences) by calling the matchGenes() function from the bumphunter package. The output is a table where the first six columns include the coordinates (chromosome, start, end, and strand), region width, and the name of the nearest gene. The proceeding columns list further information about the nearest gene and where the region is in relation to that gene. These columns are described in the documentation for matchGenes(), reprinted below:

Column Description
annotation RefSeq ID
description a factor with levels c(“upstream”, “promoter”, “overlaps 5’”, “inside intron”, “inside exon”, “covers exon(s)”, “overlaps exon upstream”, “overlaps exon downstream”, “overlaps two exons”, “overlaps 3’”, “close to 3’”, “downstream”, “covers”)
region a factor with levels c(“upstream”, “promoter”, “overlaps 5’”, “inside”, “overlaps 3’”, “close to 3’”, “downstream”, “covers”)
distance distance before 5’ end of gene
subregion a factor with levels c(“inside intron”, “inside exon”, “covers exon(s)”, “overlaps exon upstream”, “overlaps exon downstream”, “overlaps two exons”)
insideDistance distance past 5’ end of gene
exonnumber which exon
nexons number of exons
UTR a factor with levels c(“inside transcription region”, “5’ UTR”, “overlaps 5’ UTR”, “3’UTR”, “overlaps 3’UTR”, “covers transcription region”)
geneL the gene length
codingL the coding length
Entrez Entrez ID of closest gene

If CSV = TRUE, a file called region_info.csv will be printed in your working directory, unless another location and file name are listed in the OUTDIR option. It is important that chr preceeds the chromosome number, that the sequence is hg19, and that the candidate coordinates are encased in quotation marks (this tells R that it should read the coordinates as a character class). The output of this function is what can be sent to Invitrogen to specify the sequences you would like to be targeted.

5 Plotting expression coverage across a candidate region

brainflowprobes includes two functions that visualize expression information about candidate target regions based on four external datasets. Coverage data for these external datasets are stored online as BigWig tracks that unfortunately cannot be called using R on a Windows operating system. For users with Mac or Linux machines, expression of candidate regions in these datasets can be loaded and visualized using plot_coverage() and four_panels().

plot_coverage() uses the getRegionCoverage() function from the derfinder package to cut the coverage values for region(s) of interest (specified in the REGION option) in a set of nuclear (N) and cytoplasmic (C) RNA-seq samples derived from adult (A) and fetal (F) human cortex. In this dataset, two different RNA-seq libraries based on either polyA-enrichment (P) or rRNA depletion (R) were generated and sequenced for each fraction-age pair, resulting in eight groups of samples. Optimal candidate target probe sequences will be highly and uniformly expressed across the region.

plot_coverage("chr2:162279880-162282378:+", 
    PDF = "regionCoverage_fractionedData.pdf",
    OUTDIR = '.',
    COVERAGE = NULL, VERBOSE = FALSE)