Welcome to MiDAS. This tutorial is supposed to help you get started with your analyses of immunogenetic associations. We will work with a simulated data set of 500 patients and 500 controls with a binary disease diagnosis.
We also have high resolution HLA alleles (4 - 6 digit), and presence/absence calls for KIR genes.
First, let’s import the phenotype data and HLA calls using MiDAS import functions. MiDAS will check for correct nomenclature of HLA. We can also define 4-digit resolution for HLA alleles as import format, which means that alleles with higher resolution will be reduced.
Let’s take a look at the imported HLA data tables:
Next, we want to check our HLA allele frequencies, and compare them to known frequencies from major populations. Here, we only include alleles with an allele frequency of 5% or higher in our study cohort. By default, MiDAS will output comparisons including the following populations, based on published data from allelefrequencies.net:
MiDAS comes with some pre-defined reference populations, but it is possible to customize these comparisons (see documentation).
|allele||Counts||Freq||USA NMDP African American pop 2||USA NMDP Chinese||USA NMDP European Caucasian||USA NMDP Hispanic South or Central American||USA NMDP Japanese||USA NMDP North American Amerindian||USA NMDP South Asian Indian|
Let’s assume our cohort is of predominantly European ancestry. We can plot the following comparison to see if allele frequencies in our data are distributed as expected, for example by visualizing common HLA-A allele frequencies in comparison with European, Chinese, and African American reference populations:
freq_HLA_long <- tidyr::gather( data = freq_HLA, key = "population", value = "freq", "Freq", "USA NMDP European Caucasian", "USA NMDP Chinese", "USA NMDP African American pop 2", factor_key = TRUE ) %>% filter(grepl("^A", allele)) plot_HLAfreq <- ggplot2::ggplot(data = freq_HLA_long, ggplot2::aes(x = allele, y = freq, fill = population)) + ggplot2::geom_bar( stat = "identity", position = ggplot2::position_dodge(0.7), width = 0.7, colour = "black" ) + ggplot2::coord_flip() + ggplot2::scale_y_continuous(labels = formattable::percent) plot_HLAfreq
The following function prepares our data for analysis, combining HLA and phenotypic data into one object. Here, we are interested in analyzing our data on the level of classical HLA alleles.
We can now test our HLA data for deviations from Hardy-Weinberg-Equilibrium (HWE) to filter out alleles that strongly deviate from HWE expectations (imputation or genotyping errors, …). Here, let’s remove alleles with a HWE p-value below 0.05 divided by the number of alleles tested / present in our data (N=447). We can create a filtered MiDAS object right away (
as.MiDAS = TRUE), as done in this example, or output actual HWE test results.
Now, we define our statistical model and run the analysis. Since we want to test for association with disease status, we use a logistic regression approach. The
term is necessary as placeholder for the tested HLA alleles and needs to be included. It will become handy when for example defining more complex interaction models.
runMiDAS function, we then refer to this model, choose our analysis type and define a inheritance model. Here we use dominant model, meaning that individuals will be defined as non-carriers (0) vs. carriers (1) for a given allele. Alternatively, it is also possible to choose recessive (0 = non-carrier or heterozygous carrier, 1 = homozygous carrier), overdominant (assuming heterozygous (dis)advantage: 0 = non-carrier or homozygous carrier, 1 = heterozygous carrier), or additive (N of alleles) inheritance models. Moreover, we define allele inclusion criteria, such that we only consider alleles frequencies above 2% and below 98%. We also apply the Bonferroni method to not only get nominal P values, but also such adjusted for multiple testing. For alternative multiple testing correction methods, as well as the option to choose a custom number of tests, please refer to the documentation.
exponentiate = TRUE means that the effect estimate will already be shown as odds ratio, since we use a logistic regression model.
HLA_model <- glm(disease ~ term, data = HLA, family = binomial()) HLA_results <- runMiDAS( object = HLA_model, experiment = "hla_alleles", inheritance_model = "dominant", lower_frequency_cutoff = 0.02, upper_frequency_cutoff = 0.98, correction = "bonferroni", exponentiate = TRUE ) kableResults(HLA_results)
Three HLA alleles show significant association with the disease after multiple testing adjustment. Due to the complex linkage disequilibrium structure in the MHC, it is likely that associations are not statistically independent. The two alleles HLA-DRB1*15:01 and HLA-DQB1*06:02 are a common class II haplotype. We can therefore test if there are associations that are statistically independent of our top-associated allele, by setting the
conditional flag to
TRUE. MiDAS will now perform stepwise conditional testing until the top associated allele does not reach a defined significance threshold (here
th = 0.05, based on adjusted p value).
HLA_results_cond <- runMiDAS( object = HLA_model, experiment = "hla_alleles", inheritance_model = "dominant", conditional = TRUE, lower_frequency_cutoff = 0.02, upper_frequency_cutoff = 0.98, correction = "bonferroni", exponentiate = TRUE ) kableResults(HLA_results_cond, scroll_box_height = "200px")
The results for conditional testing are displayed in a way that for each step the top associated allele is shown, along with a list of alleles conditioned on.
As we can see, HLA-DRB1*15:01 was not independently associated with the disease when correcting for our top-associated allele HLA-DQB1*06:02. However, HLA-B*57:01 can be considered an independent association signal.
Next, we want to find out what are the strongest associated amino acid positions, corresponding to our allele-level associations. This can help fine-mapping the associated variants to e.g. the peptide binding region or other functionally distinct parts of the protein. We thus prepare a MiDAS object with experiment type “hla_aa”, which includes the inference of amino acid variation from allele calls.
Amino acid data will be stored in a MiDAS object, but we can extract it to a data frame and select a couple of variables to display how this looks like:
Now, we run the association test based on amino acid variation. To first identify the most relevant associated amino acid positions, we run a likelihood ratio (omnibus) test, which groups all residues at each amino acid position.
HLA_AA_model <- glm(disease ~ term, data = HLA_AA, family = binomial()) HLA_AA_omnibus_results <- runMiDAS( HLA_AA_model, experiment = "hla_aa", inheritance_model = "dominant", conditional = FALSE, omnibus = TRUE, lower_frequency_cutoff = 0.02, upper_frequency_cutoff = 0.98, correction = "bonferroni" ) kableResults(HLA_AA_omnibus_results)
Next, we can investigate how effect estimates are distributed for a given associated amino acid position, e.g. DQB1_9:
HLA_AA_DQB1_9_results <- runMiDAS( HLA_AA_model, experiment = "hla_aa", inheritance_model = "dominant", omnibus_groups_filter = "DQB1_9", lower_frequency_cutoff = 0.02, upper_frequency_cutoff = 0.98, correction = "bonferroni", exponentiate = TRUE ) kableResults(HLA_AA_DQB1_9_results, scroll_box_height = "250px")
This shows us that individuals carrying a Phenylalanine (F) at position 9 of DQB1 have a significantly increased risk, whereas individuals carrying a Tyrosine (Y) at the same amino acid position have a decreased risk.
It is logical to hypothesize that the risk residue is found on HLA-DQB1*06:02, the previously associated HLA risk allele. MiDAS thus provides the function
getAllelesforAA to map amino acid residues back to the respective HLA alleles.
|HLA-DQB1 (9)||HLA-DQB1 alleles||count||frequency|
|F||04:01, 04:02, 04:23, 06:02||302||15.10%|
|Y||02:01, 02:02, 02:10, 03:01, 03:02, 03:03, 03:04, 03:05, 03:19, 03:22, 03:251, 03:96, 05:01, 05:02, 05:04, 05:107, 06:03, 06:04, 06:07, 06:09||1601||80.05%|
Finally, it is also interesting to note that there are several amino acid positions coming up that determine the Bw4 binding motif (e.g. B_81), which is a determinant for interactions of HLA class I alleles with KIR on Natural Killer cells.
Now that we have performed KIR genotyping, or e.g. inferred KIR types from available whole-genome sequencing data, we can import this information, and check the gene frequencies. In our example, we could successfully infer KIR gene presence status for 935 out of the 1,000 individuals in our data set.