cellCounts {Rsubread} | R Documentation |
Process raw 10X scRNA-seq data and generate UMI counts for each gene in each cell.
cellCounts(
    # input data
    index,
    sample,
    input.mode = "BCL",
    cell.barcode = NULL,

    # specify the aligner used for read mapping
    aligner = "align",

    # parameters used by featureCounts for assigning and counting UMIs
    annot.inbuilt = "mm10",
    annot.ext = NULL,
    isGTFAnnotationFile = FALSE,
    GTF.featureType = "exon",
    GTF.attrType = "gene_id",
    useMetaFeatures = TRUE,

    # user provided UMI cutoff for cell calling
    umi.cutoff = NULL,

    # number of threads
    nthreads = 10,

    # dealing with multi-mapping reads in the alignment step
    nBestLocations = 1,
    unique.mapping = FALSE,

    # other parameters passed to align, subjunc and featureCounts functions
    ...)
index |
A character string providing the base name of index files created for a reference genome by the |
sample |
A data frame or a character string providing sample-related information, including location of the data, sample names and index set names. See the Details section below for more details. |
input.mode |
A character string specifying the input mode. The supported input modes include |
cell.barcode |
A character string giving the name of a text file (can be gzipped) that contains the set of cell barcodes used in sample preparation. If |
aligner |
Specify the name of the aligner used for read mapping. Currently only the |
annot.inbuilt |
Specify an inbuilt annotation for UMI counting. See |
annot.ext |
Specify an external annotation for UMI counting. See |
isGTFAnnotationFile |
See |
GTF.featureType |
See |
GTF.attrType |
See |
useMetaFeatures |
Specify whether UMI counting should be carried out at the meta-feature level (e.g. gene level). See |
umi.cutoff |
Specify a UMI count cutoff for cell calling. All the cells with a total UMI count greater than this cutoff will be called. If |
nthreads |
A numeric value giving the number of threads used for read mapping and counting. |
nBestLocations |
A numeric value giving the maximum number of reported alignments for each multi-mapping read. |
unique.mapping |
A logical value specifying if the multi-mapping reads should not be reported as mapped (i.e. reporting uniquely mapped reads only). |
... |
other parameters passed to |
This function takes as input scRNA-seq reads generated by the 10X platform, maps them to the reference genome and then produces UMI (Unique Molecular Identifier) counts for each gene in each cell.
The align read mapping function and the featureCounts quantification function, both included in this package, are utilised by this function.
Sample demultiplexing, cell barcode demultiplexing and read deduplication are carried out before generating the UMI counts.
cellCounts can process multiple datasets at the same time.
The sample information should be provided to cellCounts via the sample parameter.
If the input format is BCL (i.e. input.mode="BCL"), the provided sample information should include the location where the read data are stored, the flowcell lanes used for sequencing, the sample names and the names of the index sets used for indexing samples.
This information should be saved to a data.frame object and then provided to the sample parameter.
An example of this data frame is shown below:
InputDirectory      Lane  SampleName  IndexSetName
/path/to/dataset1   1     Sample1     SI-GA-E1
/path/to/dataset1   1     Sample2     SI-GA-E2
/path/to/dataset1   2     Sample1     SI-GA-E1
/path/to/dataset1   2     Sample2     SI-GA-E2
/path/to/dataset2   1     Sample3     SI-GA-E3
/path/to/dataset2   1     Sample4     SI-GA-E4
/path/to/dataset2   2     Sample3     SI-GA-E3
/path/to/dataset2   2     Sample4     SI-GA-E4
...
It is compulsory to have the four column headers shown in the example above when generating this data frame for a 10X dataset.
If more than one dataset is provided for analysis, the InputDirectory column should include more than one distinct directory.
Note that this data frame is different from the Sample Sheet generated by the Illumina sequencer.
The cellCounts function uses the index set names included in this data frame to generate an Illumina Sample Sheet and then uses it to demultiplex all the samples.
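As a minimal sketch, the data frame for BCL input could be built in base R as follows. The paths, sample names and index set names are the placeholder values from the example table, and the index base name in the commented-out call is hypothetical.

```r
# Sample information for input.mode = "BCL".
# All values are placeholders copied from the example table, not real data.
sample.sheet <- data.frame(
  InputDirectory = rep(c("/path/to/dataset1", "/path/to/dataset2"), each = 4),
  Lane           = rep(c(1, 1, 2, 2), times = 2),
  SampleName     = c("Sample1", "Sample2", "Sample1", "Sample2",
                     "Sample3", "Sample4", "Sample3", "Sample4"),
  IndexSetName   = c("SI-GA-E1", "SI-GA-E2", "SI-GA-E1", "SI-GA-E2",
                     "SI-GA-E3", "SI-GA-E4", "SI-GA-E3", "SI-GA-E4"),
  stringsAsFactors = FALSE
)

# Check that the four compulsory column headers are present.
required <- c("InputDirectory", "Lane", "SampleName", "IndexSetName")
stopifnot(all(required %in% colnames(sample.sheet)))

# Hypothetical call, not run here: "my_index" is an assumed index base name.
# result <- Rsubread::cellCounts(index = "my_index",
#                                sample = sample.sheet,
#                                input.mode = "BCL")
```

Two distinct values in the InputDirectory column correspond to two datasets being processed in a single call.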
If the input format is FASTQ, a data.frame object containing the following three columns, BarcodeUMIFile, ReadFile and SampleName, should be provided to the sample parameter.
Each row in the data frame represents a sample.
The ReadFile column includes names of FASTQ files that contain read data for the samples.
Each FASTQ file corresponds to a sample.
The read data included in these FASTQ files only contain genomic sequences of the reads.
The cell barcode and UMI sequences of these reads can be found in the corresponding FASTQ files included in the BarcodeUMIFile column.
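A sketch of the three-column data frame for FASTQ input, one row per sample, is given below. The file names and sample names are hypothetical, and the input.mode value in the commented-out call is assumed by analogy with input.mode="BCL".

```r
# Sample information for FASTQ input: one row per sample.
# File names and sample names are hypothetical placeholders.
fastq.sheet <- data.frame(
  BarcodeUMIFile = c("sampleA_R1.fastq.gz", "sampleB_R1.fastq.gz"),  # cell barcode + UMI
  ReadFile       = c("sampleA_R2.fastq.gz", "sampleB_R2.fastq.gz"),  # genomic sequences
  SampleName     = c("SampleA", "SampleB"),
  stringsAsFactors = FALSE
)

# Each row should describe a distinct sample with its own pair of files.
stopifnot(anyDuplicated(fastq.sheet$SampleName) == 0)
stopifnot(anyDuplicated(fastq.sheet$ReadFile) == 0)

# Hypothetical call, not run here: "my_index" is an assumed index base name
# and input.mode = "FASTQ" is assumed by analogy with the BCL mode.
# result <- Rsubread::cellCounts(index = "my_index",
#                                sample = fastq.sheet,
#                                input.mode = "FASTQ")
```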
Finally, if the input format is FASTQ-dir, a character string giving the path to the directory where the FASTQ-format read data are stored should be provided to the sample parameter.
The data in this directory are expected to be generated by the bcl2fastq program or the bamtofastq program (a program developed by 10X).
The cellCounts function returns a List object to R.
It also outputs three gzipped FASTQ files and one BAM file for each sample.
The three gzipped FASTQ files include cell barcode and UMI sequences (R1), sample index sequences (I1) and the actual genomic sequences of the reads (R2), respectively.
The BAM file includes location-sorted read mapping results.
The returned List object contains the following components:
counts |
a |
annotation |
a |
sample.info |
a |
cell.confidence |
a |
Yang Liao and Wei Shi
buildindex, align, featureCounts