Simulating Single-Cell Multi-Omics Data with MOSim

Carolina Monzó, Ángeles Arzalluz-Luque, Arianna Febbo, Sonia Tarazona

2024-05-01

Introduction

Welcome to the MOSim package, a versatile tool for simulating bulk and single-cell multi-omics data. In this vignette, we will explore how to create synthetic single-cell data, focusing on single-cell RNA-seq (scRNA-seq) and single-cell ATAC-seq (scATAC-seq) data. Using the MOSim package, you can generate custom multi-omics datasets for various experimental conditions, making it an essential resource for testing and validating analysis methods, or creating benchmark datasets.

Installation

Before simulating any data, you need to install the MOSim package. The release version is available from Bioconductor, and the development version from GitHub:

if (!requireNamespace("BiocManager", quietly = TRUE)) 
    install.packages("BiocManager")

BiocManager::install("MOSim")

# For the latest development version
install.packages("devtools")
devtools::install_github("ConesaLab/MOSim")
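
After installation, a quick check can confirm that the package is visible to the current R session. The sketch below uses only base R; `mosim_available` is an illustrative variable name, not part of MOSim.

```r
# Check whether MOSim can be loaded, without attaching it to the search path
mosim_available <- requireNamespace("MOSim", quietly = TRUE)

if (mosim_available) {
  message("MOSim version: ", as.character(utils::packageVersion("MOSim")))
} else {
  message("MOSim was not found in this library; rerun the installation commands.")
}
```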

Simulating Single-Cell Multi-Omics Data

The core of data simulation lies in the scMOSim function, which allows you to create synthetic single-cell multi-omics data. Let’s explore a typical example of its usage, using the default dataset loaded in the package:

library(MOSim)

# Create a list of omics data types (e.g., scRNA-seq and scATAC-seq)
omicsList <- sc_omicData(list("scRNA-seq", "scATAC-seq"), 
                         data = NULL)

# Define cell types as a named list of cell (column) indices
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
      'Memory_B' = c(31:40))

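# Load the association list linking peak IDs to gene names.
# data() creates the object in the workspace and returns only its name,
# so its return value must not be assigned.
data("associationList", package = "MOSim")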

# Simulate multi-omics data with specific parameters
testing_groups <- scMOSim(
  omicsList,
  cell_types,
  numberReps = 2,
  numberGroups = 3,
  diffGenes = list(c(0.2, 0.3), c(0.2, 0.3)),
  minFC = 0.25,
  maxFC = 4,
  numberCells = NULL,
  mean = NULL,
  sd = NULL,
  regulatorEffect = list(c(0.1, 0.2), c(0.1, 0.2), c(0.1, 0.2)),
  associationList = associationList
)

In the example above, we load the default omics data, define the cell types, and load an association list. The scMOSim function then simulates multi-omics data according to the parameters we pass, such as the number of replicates and groups, the proportions of differentially expressed genes, and the regulatory effects.
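
In the calls shown in this vignette, diffGenes holds one vector fewer than numberGroups, while regulatorEffect holds exactly numberGroups vectors. Assuming that pattern is the rule (it is inferred from the examples here, not quoted from the function documentation), a small base R check can catch a mismatch before a long simulation starts. `check_group_lists` is a hypothetical helper, not a MOSim function.

```r
check_group_lists <- function(numberGroups, diffGenes, regulatorEffect) {
  # Assumed rule: one diffGenes vector per group compared against group 1
  ok_diff <- length(diffGenes) == numberGroups - 1
  # Assumed rule: one regulatorEffect vector per group
  ok_reg <- length(regulatorEffect) == numberGroups
  # Every vector in both lists is expected to hold exactly two values
  ok_pairs <- all(lengths(c(diffGenes, regulatorEffect)) == 2)
  c(diffGenes = ok_diff, regulatorEffect = ok_reg, pairs = ok_pairs)
}

# Values copied from the scMOSim call above
check_group_lists(
  numberGroups = 3,
  diffGenes = list(c(0.2, 0.3), c(0.2, 0.3)),
  regulatorEffect = list(c(0.1, 0.2), c(0.1, 0.2), c(0.1, 0.2))
)
```

Any FALSE entry in the returned vector points at the argument to review.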

Data Preparation

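Before running a simulation, your data must be in the format that scMOSim expects, and the sc_omicData function prepares it. It takes the omics types to simulate (“scRNA-seq”, “scATAC-seq”, or both) and an optional data argument. With data = NULL, as in the example above, it loads the default dataset included in the package.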

Providing Custom Data

scMOSim can also simulate data that resemble a dataset of your choice. To do so, format your data with the sc_omicData function by passing it through the data argument. The example below wraps a count matrix in a Seurat assay object before handing it to sc_omicData:

# This is done to get a dataset to extract a matrix from (for example purposes)
scRNA <- MOSim::sc_omicData("scRNA-seq", data = NULL)
count <- scRNA[["scRNA-seq"]]
options(Seurat.object.assay.version = "v3")
Seurat_obj <- Seurat::CreateAssayObject(counts = count, assay = 'RNA')
omic_list_user <- sc_omicData(c("scRNA-seq"), data = c(Seurat_obj))

The resulting omic_list_user is a named list with “scRNA-seq” as the name and your count matrix as the value.
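
If your starting point is a plain count matrix, it helps to confirm its shape before wrapping it. The sketch below builds a toy matrix in base R, assuming the usual single-cell layout of features as rows and cells as columns. The gene and cell names are invented, and the orientation should be checked against the sc_omicData documentation before real data are passed in.

```r
set.seed(1)
n_genes <- 6
n_cells <- 8

# Toy count matrix: Poisson counts, genes as rows, cells as columns (assumed layout)
toy_counts <- matrix(
  rpois(n_genes * n_cells, lambda = 3),
  nrow = n_genes,
  dimnames = list(
    paste0("gene", seq_len(n_genes)),
    paste0("cell", seq_len(n_cells))
  )
)

# Basic sanity checks: numeric, non-negative, unique cell names
stopifnot(
  is.numeric(toy_counts),
  all(toy_counts >= 0),
  !anyDuplicated(colnames(toy_counts))
)

dim(toy_counts)
```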

Running the Simulation: scMOSim

Default scMOSim Simulation

scMOSim can simulate scRNA-seq and scATAC-seq count matrices without any additional arguments. For a basic simulation, you only need to supply the omics list and the cell types:

omic_list <- sc_omicData(list("scRNA-seq"))
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
      'Memory_B' = c(31:40))

sim <- scMOSim(omic_list, cell_types)

This produces simulated raw count matrices for scRNA-seq.
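
The cell_types list maps each cell-type name to a set of integer indices. If you hold one label per cell instead, base R's split can build the same structure. This sketch assumes the indices refer to the cell columns of the count matrix; the `labels` vector is made up so that it reproduces the list used above.

```r
# One label per cell, in the same order as the cells of the count matrix (assumed)
labels <- rep(c("Treg", "cDC", "CD4_TEM", "Memory_B"), each = 10)

# Fixing the factor levels keeps the order of first appearance;
# split() would otherwise order the groups alphabetically
cell_types_from_labels <- split(
  seq_along(labels),
  factor(labels, levels = unique(labels))
)

str(cell_types_from_labels)
```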

Customizing the scMOSim Simulation

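The scMOSim function offers a range of parameters to fine-tune your simulation. The example below sets the number of replicates (numberReps) and experimental groups (numberGroups), the proportions of differentially expressed genes (diffGenes), the number of features (feature_no) and clusters (clusters), one mean and one sd value per cell type, and the regulator effects (regulatorEffect):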

omic_list <- sc_omicData(c("scRNA-seq", "scATAC-seq"))
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
      'Memory_B' = c(31:40))
sim <- scMOSim(
  omic_list,
  cell_types,
  numberReps = 2,
  numberGroups = 2,
  diffGenes = list(c(0.2, 0.3)),
  feature_no = 8000,
  clusters = 3,
  mean = c(2 * 10^6, 1 * 10^6, 2 * 10^6, 1 * 10^6),
  sd = c(5 * 10^5, 2 * 10^5, 5 * 10^5, 2 * 10^5),
  regulatorEffect = list(c(0.1, 0.2), c(0.1, 0.2))
)
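
The mean and sd vectors in the call above each have four entries, matching the four cell types. Assuming the values are read in the same order as cell_types, laying them out side by side makes a mismatch easy to spot. This is a base R sketch; `mean_by_type`, `sd_by_type`, and `param_table` are illustrative names.

```r
cell_types <- list(Treg = 1:10, cDC = 11:20, CD4_TEM = 21:30, Memory_B = 31:40)
mean_by_type <- c(2e6, 1e6, 2e6, 1e6)
sd_by_type <- c(5e5, 2e5, 5e5, 2e5)

# Both vectors need one entry per cell type
stopifnot(
  length(mean_by_type) == length(cell_types),
  length(sd_by_type) == length(cell_types)
)

# One row per cell type, in the order of cell_types (assumed pairing)
param_table <- data.frame(
  cell_type = names(cell_types),
  n_cells = lengths(cell_types),
  mean = mean_by_type,
  sd = sd_by_type,
  row.names = NULL
)

print(param_table)
```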

Working with Simulation Results

The scMOSim Simulation Object

The result of your simulation is a named list whose names follow the pattern ‘sim_sc + omic name’ and whose values are Seurat objects. Each Seurat object contains the synthetic count matrices for your experiment, together with other relevant information about the simulation.

Retrieving Simulation Settings

To access simulation settings and other constraints for simulation, you can use the scOmicSettings function. This provides information about the relationship between genes and peaks, differentially expressed genes, regulator types, expression patterns, and fold changes for each gene and peak compared to group 1.

settings <- scOmicSettings(sim)

Accessing the Count Data Matrices

You can extract the simulated matrices for all experimental conditions and biological replicates using the scOmicResults function. This provides you with the synthetic data for further analysis and visualization.

res <- scOmicResults(sim)
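
The exact layout of the returned object is not described here, so a generic way to inspect it is useful. The sketch below is a base R helper that walks any nested list and reports the dimensions (or, failing that, the class) of each leaf. `summarise_leaves` is a hypothetical name, and the toy list only stands in for a real result.

```r
summarise_leaves <- function(x) {
  # rapply() visits every non-list element of a nested list
  rapply(
    x,
    function(leaf) {
      d <- dim(leaf)
      if (is.null(d)) class(leaf)[1] else paste(d, collapse = " x ")
    },
    how = "unlist"
  )
}

# Toy stand-in for a simulation result: groups containing replicate matrices
toy_result <- list(
  Group_1 = list(Rep_1 = matrix(0, nrow = 5, ncol = 3),
                 Rep_2 = matrix(0, nrow = 5, ncol = 4))
)

summarise_leaves(toy_result)
```

With a real simulation, the same call would be `summarise_leaves(res)`. Note that rapply descends into data frames as well, since they are lists.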