sffdr Package Vignette

Andrew J. Bass and Chris Wallace

2026-02-11

Introduction

The sffdr package implements the surrogate functional false discovery rate (sfFDR) procedure. This methodology integrates GWAS summary statistics from related traits (i.e., leveraging pleiotropy) to increase statistical power within the functional FDR framework. The inputs to sffdr are a set of p-values from the GWAS of interest and a set of p-values from one or more informative GWAS.

The significance quantities estimated by sffdr (functional p-values, q-values, and local FDRs) can be used for a variety of downstream analyses; the quick start guide below walks through a worked example.

Citing this package

The methods implemented in this package are described in:

Bass AJ, Wallace C. Exploiting pleiotropy to enhance variant discovery with functional false discovery rates. Nature Computational Science; 2025.

Note that this work is an extension of the functional FDR methodology and the software builds on some of the functions in the fFDR package found at https://github.com/StoreyLab/fFDR.

Getting help

Please report any bugs or usage issues on GitHub at https://github.com/ajbass/sffdr.

Installation

You can install the development version of sffdr from GitHub:

# install development version of package
install.packages("devtools")
devtools::install_github("ajbass/sffdr")

Quick start guide

To demonstrate the package, we will use a sample dataset containing 10,000 SNPs for body mass index (BMI) as the primary trait of interest, with body fat percentage (BFP), cholesterol, and triglycerides as informative traits.

library(sffdr)
data(bmi)

# Define primary p-values and informative p-values
p <- sumstats$bmi
z <- as.matrix(sumstats[,-1])

head(sumstats)
#>         bmi      bfp cholesterol triglycerides
#> 1 0.4959680 0.313475   0.0394766      0.148379
#> 2 0.6937320 0.565400   0.8986590      0.298349
#> 3 0.0843257 0.447650   0.7301960      0.715846
#> 4 0.2623540 0.911920   0.1753410      0.775170
#> 5 0.2150660 0.668330   0.0526082      0.603160
#> 6 0.9501450 0.607349   0.2759850      0.867493

1. Modeling the functional proportion of null tests \(\pi_{0}(z)\)

The first step is to model the relationship between the functional proportion of null tests and the informative traits. We use pi0_model to create a model and fpi0est to estimate \(\pi_{0}(z)\).

# Create model
mpi0 <- pi0_model(z)
#> ==================================================
#> Building Pi0 Model (Rank: Signal -> 0.0)
#>   Variables        : 3
#>   Min Discoveries  : 2500 (All SNPs)
#> --------------------------------------------------
#>   [Auto-Scale]: N = 10000. Bandwidth set to 100 independent SNPs per knot.
#>   [bfp]: Fallback -> 1 Knot at 0.1000
#>   [cholesterol]: Fallback -> 1 Knot at 0.1000
#>   [triglycerides]: Fallback -> 1 Knot at 0.1000
#> --------------------------------------------------
#> ~splines::ns(bfp, knots = c(0.1), Boundary.knots = c(0, 1)) + splines::ns(cholesterol, knots = c(0.1), Boundary.knots = c(0, 1)) + splines::ns(triglycerides, knots = c(0.1), Boundary.knots = c(0, 1))
#> ==================================================

# Estimation of functional pi0
fpi0 <- fpi0est(p, mpi0)
#> ==================================================
#> Estimating Functional Pi0
#> ==================================================
#>   Fitting models across 19 lambda values...
#>   Selecting optimal lambda via MISE...
#>   Selected lambda: 0.25
#>   Generating predictions...
#>   Done.
#> ==================================================
#> 

Note on Knots: Ideally, knots should cover the distribution of non-null p-values in your informative studies. Our algorithm uses an FDR threshold to choose the location and number of knots. Depending on the study, you may want to adjust these values (see ?pi0_model). Note that this toy example contains only 10,000 p-values, so the knot locations are not optimal since the function was designed for GWAS-scale data.
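If you want to override the automatic choice, a minimal sketch is shown below. It assumes pi0_model exposes a knots argument for manually placing knots; check ?pi0_model for the exact interface in your installed version.

# Sketch: manually place more knots near small p-values
# (assumes pi0_model accepts a `knots` argument; see ?pi0_model)
mpi0_manual <- pi0_model(z, knots = c(0.005, 0.01, 0.025, 0.05, 0.1))
fpi0_manual <- fpi0est(p, mpi0_manual)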

2. Estimating the functional significance quantities

With the functional \(\pi_{0}\) estimated, we can now calculate the FDR quantities and functional p-values.

# Apply sfFDR
sffdr_out <- sffdr(p, fpi0 = fpi0$fpi0)   
#> ==================================================
#> Running sfFDR Analysis (v1.1.0)
#> ==================================================
#>   Transforming surrogate variable...
#>   Fitting kernel density...
#>     High-Res Tcub (nn=0.1)... ✓
#>   Computing functional local FDRs...
#>   Computing functional p-values and q-values...
#>   Done.
#> ==================================================

# Extract results
results <- data.frame(
  p_value = p,
  fp_value = sffdr_out$fpvalues,
  fq_value = sffdr_out$fqvalues,
  flfdr = sffdr_out$flfdr
)
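As an example of how these quantities can be used downstream, the functional q-values can be thresholded to call discoveries at a target FDR. The comparison below with Benjamini-Hochberg adjusted p-values uses only base R and the results table defined above.

# Call discoveries at a 5% FDR using the functional q-values
alpha <- 0.05
table(results$fq_value <= alpha)

# For comparison: discoveries at the same threshold using standard BH q-values
table(p.adjust(results$p_value, method = "BH") <= alpha)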

# Plot significance results (focusing on high-significance SNPs)
plot(sffdr_out, rng = c(0, 1e-6))
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.

3. Incorporating linkage disequilibrium (LD)

In standard GWAS, SNPs are often correlated due to LD. To account for this, you should provide a logical vector indicating which SNPs are LD-independent (e.g., obtained via pruning). sffdr uses these independent SNPs to fit the model and then computes the significance quantities for the entire dataset.

# logical vector: TRUE if SNP is LD-independent
# (In this example, all SNPs are already independent)
indep_snps <- rep(TRUE, length(p))

# Fit model using LD-independent subset
mpi0 <- pi0_model(z = z, indep_snps = indep_snps)
#> ==================================================
#> Building Pi0 Model (Rank: Signal -> 0.0)
#>   Variables        : 3
#>   Min Discoveries  : 2500 (All SNPs)
#> --------------------------------------------------
#>   [Auto-Scale]: N = 10000. Bandwidth set to 100 independent SNPs per knot.
#>   [bfp]: Fallback -> 1 Knot at 0.1000
#>   [cholesterol]: Fallback -> 1 Knot at 0.1000
#>   [triglycerides]: Fallback -> 1 Knot at 0.1000
#> --------------------------------------------------
#> ~splines::ns(bfp, knots = c(0.1), Boundary.knots = c(0, 1)) + splines::ns(cholesterol, knots = c(0.1), Boundary.knots = c(0, 1)) + splines::ns(triglycerides, knots = c(0.1), Boundary.knots = c(0, 1))
#> ==================================================

fpi0 <- fpi0est(p, mpi0, indep_snps = indep_snps)
#> ==================================================
#> Estimating Functional Pi0
#> ==================================================
#>   Fitting models across 19 lambda values...
#>   Selecting optimal lambda via MISE...
#>   Selected lambda: 0.25
#>   Generating predictions...
#>   Done.
#> ==================================================
#> 

# Estimate quantities for all SNPs
sffdr_out <- sffdr(p, fpi0 = fpi0$fpi0)
#> ==================================================
#> Running sfFDR Analysis (v1.1.0)
#> ==================================================
#>   Transforming surrogate variable...
#>   Fitting kernel density...
#>     High-Res Tcub (nn=0.1)... ✓
#>   Computing functional local FDRs...
#>   Computing functional p-values and q-values...
#>   Done.
#> ==================================================

Note on choosing independent SNPs: There are two strategies: (1) prune with plink (e.g., --indep-pairwise 200kb 1 0.3) and use the pruned set as the independent SNPs, or (2) use informed selection based on the informative traits (e.g., clumping, or choosing the SNP with the smallest informative p-value as the representative of each LD block). The latter is likely better when the primary study has low power. A sketch of strategy (1) is shown below.
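The following sketch assumes your summary statistics include SNP identifiers (a hypothetical snp_id column, not present in the toy bmi dataset) and that plink has written the retained SNP IDs to plink.prune.in.

# Sketch: build the LD-independent indicator from a plink pruning run
# (hypothetical `snp_id` column and "plink.prune.in" file are assumed)
pruned_ids <- scan("plink.prune.in", what = character())
indep_snps <- sumstats$snp_id %in% pruned_ids

# Then fit the model on the independent subset as shown above
mpi0 <- pi0_model(z = z, indep_snps = indep_snps)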

Advanced usage

Manual knot selection and other informative variables

The selection of knots in the B-spline basis is an important tuning parameter in sffdr. If your surrogate GWAS has extremely high power (very small p-values), we recommend placing more knots at the lower end of the p-value distribution to capture the local density of signals.

# Specify a custom pi0 model; other variables (e.g., MAF) can also be included
fmod <- "~splines::ns(bfp, knots = c(0.01, 0.025, 0.05, 0.1))"
fpi0_mod <- fpi0est(p = p,
                    z = mpi0$zt,
                    pi0_model = fmod)

Note on incorporating additional variables: Other informative covariates can be included in the model (e.g., MAF, LD score, etc.). While the LD score annotations can be used (e.g., the baselineLD model), we have found that using the expected \(\chi^{2}\) given the annotations performs better in practice and simplifies the model fitting.
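As an illustration, the sketch below adds MAF to the custom model from the previous section. It assumes a numeric per-SNP maf vector is available and that the covariates named in the formula are supplied as columns of z; the exact interface may differ, so check ?fpi0est.

# Sketch: include MAF alongside a surrogate p-value in the pi0 model
# (assumes `maf` is a numeric vector of per-SNP minor allele frequencies)
zt <- cbind(mpi0$zt, maf = maf)
fmod <- "~splines::ns(bfp, knots = c(0.01, 0.025, 0.05, 0.1)) + splines::ns(maf, df = 3)"
fpi0_maf <- fpi0est(p = p, z = zt, pi0_model = fmod)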