1 Introduction

This vignette describes the use of the seqCAT package for authentication, characterisation and evaluation of two or more High Throughput Sequencing samples (HTS; RNA-seq or whole genome sequencing). The principle of the method is built upon previous work, where it was demonstrated that analysing the entirety of the variants found in HTS data provides unprecedented statistical power and great opportunities for functional evaluation of genetic similarities and differences between biological samples (Fasterius et al. 2017).

seqCAT work by creating Single Nucelotide Variant (SNV) profiles of every sample of interest, followed by comparisons between each set to find overall genetic similarity, in addition to detailed analyses of the differences. By analysing your data with this workflow you will not only be able to authenticate your samples to a high degree of confidence, but you will also be able to investigate what genes and transcripts are affected by SNVs differing between your samples, what biological effect they will have, and more. seqCAT’s workflow consists of three separate steps:

1.  Creation of SNV profiles
2.  Comparisons of SNV profiles
3.  Authentication, characterisation and evaluation of profile comparisons

Each step has its own section(s) below demonstrating how to perform the analyses. Input data should be in the form of VCF files, i.e output from variant callers such as the Genome Analysis ToolKit and annotated with software such as SnpEff.

1.1 Installation

The latest stable release of this package can be found on Bioconductor and installed using the biocLite function:

source("https://bioconductor.org/biocLite.R")
biocLite("seqCAT")

This will also install any missing packages requires for full functionality, should they not already exist in your system. If you haven’t installed Bioconductor, you can do so by simply calling biocLite() without specifying a package, and it will be installed for you. You can read more about this at Bioconductor’s installation page. You can find the development version of seqCAT on GitHub.

2 Creation of SNV profiles

The first step of the workflow is to create the SNV profile of each sample, which can then be compared to each other. In order to decrease the computation time for large comparison sets and to facilitate re-analyses with different parameters each SNV profile is saved on disc as a normal .txt file. While computation time is usually not an issue for simple binary comparisons (i.e. comparisons with only two samples), this can quickly become a concern for analyses where samples are compared to several others (A vs B, A vs C, …, and so on). The creation of a SNV profile includes filtering of low-confidence variants and removal of variants below a sequencing depth threshold (10 by default). Only records with the highest SNV impact (i.e. impact on protein function) for each variant is kept, as they are most likely to affect the biology of the cells. Creation of SNV profiles is also implemented in Python (section 2.2), which is faster than the standard implementation in R (section 2.1).

2.1 Create profiles with R

Throughout this vignette we will be using some example data, example.vcf.gz, which comes from the initial publication of the general process of this method (Fasterius et al. 2017). It is a simplified multi-sample VCF file on a subset of chromosome 12 (containing all variants up to position 25400000, in order to keep the file size low) for three different colorectal cancer cell lines: HCT116, HKE3 and RKO.

# Load the package
library("seqCAT")

# List the example VCF file
vcf <- system.file("extdata", "example.vcf.gz",
                   package = "seqCAT")

# Create two SNV profiles
create_profile(vcf, "HCT116", "hct116_profile.txt")
create_profile(vcf, "RKO", "rko_profile.txt", filter_depth = 15)

This creates SNV profiles for the two samples found in the example data (HCT116 and RKO) and saves them as hct116.profile.txt and rko_profile.txt in the current directory, respectively. The profile of the second sample was created with a non-standard filter for sequencing depth (15), which should only be done if you want a stricter criteria for your profile (such as when you’re only interested in higher-than-standard confidence variants).

2.2 Create profiles faster with Python

SNV profiles can also be created with Python, another scripting language, if you have installed it. You will also need to install the PyVCF module, in order for it to run. The Python version can create SNV profiles approximately five to ten times quicker than its R equivalent. This is not important for most users, but is nevertheless included for cases where extra speed is desired.

create_profile(vcf, "RKO", "RKO_profile.txt", python = TRUE)

2.3 Create COSMIC profiles

It is also possible to to compare your samples’ variants to some external source. Such a source is the Catalogue of somatic mutations in cancer, or COSMIC. COSMIC has over a thousand cell line-specific mutational profiles, and is thus a very useful resource if you are working with cell lines.

In order to use the COSMIC cell line database, you need to sign up for an account at their website and get permission to download their files (which is given free of charge to academia and non-profit organisation, but requires a commersial license for for-profit organisations). The required file is the one named CosmicCLP_MutantExport.tsv.gz, listed under complete mutational data here. As redistributing this file is not allowed, this package includes an extremely minimal subset of the original file, only useful for examples in this vignette and unit testing. Do not use this file for your own analyses, as your results will neither be complete nor accurate!

The first thing to check is to see if your specific cell line is available in the database, which can be accomplished using the list_cosmic function:

file <- system.file("extdata", "subset_CosmicCLP_MutantExport.tsv.gz",
                    package = "seqCAT")
cell_lines <- list_cosmic(file)
head(cell_lines)
## [1] "639V"  "A427"  "A549"  "AGS"   "AMO1"  "AN3CA"

This gives us a simple vector containing all the available cell lines in the COSMIC database (this version of the file is for the GRCh37 assembly). You can search it for a cell line of your choice:

any(grepl("HCT116", cell_lines))
## [1] TRUE

All COSMIC-related functions perform some simplification of cell line names (as there is variation in the usage of dashes, dots and other symbols), and are case-insensitive. When you have asserted that your cell line of interest is available, you can then read the profile for that cell line using the read_cosmic function:

cosmic <- read_cosmic(file, "HCT116")
head(cosmic)
## GRanges object with 1 range and 12 metadata columns:
##      seqnames               ranges strand |        gene        sample
##         <Rle>            <IRanges>  <Rle> | <character>   <character>
##   43       12 [25398281, 25398281]      * |        KRAS COSMIC.HCT116
##               ID         CDS          AA             description
##      <character> <character> <character>             <character>
##   43     COSM532     c.38G>A      p.G13D Substitution - Missense
##                                    somatic_status verification_status
##                                       <character>         <character>
##   43 Reported in another cancer sample as somatic            Verified
##              REF         ALT          A1          A2
##      <character> <character> <character> <character>
##   43           C           T           C           T
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

You now have a small, COSMIC SNV profile for your cell line, which you can compare to any other profile you may have data for (more on this below). You can also check how many variants are listed in COSMIC for your particular cell:

length(cosmic)
## [1] 1

Here we only see a single variant for the HCT116 cell line, which is only because of the extreme small subset of the COSMIC databse being used here. HCT116 has, in fact, over 2000 listed COSMIC SNVs, making it one of the more abundantly characterised cell lines available (as most cell lines has only a few hundred SNVs listed in COSMIC). A COSMIC profile of a couple of hundred variants is more common, though, and any analysis based only on COSMIC variants is thus inherently limited.

3 Comparing SNV profiles

3.1 Comparing full profiles

Once each relevant sample has its own SNV profile the comparisons can be performed. First, each profile is read using the read_profile function, which outputs GRanges objects for fast and efficient comparisons.

hct116 <- read_profile("hct116_profile.txt", "HCT116")
rko <- read_profile("rko_profile.txt", "RKO")
head(hct116)
## GRanges object with 6 ranges and 17 metadata columns:
##     seqnames         ranges strand |        rsID               gene
##        <Rle>      <IRanges>  <Rle> | <character>        <character>
##   1       12 [80385, 80385]      * | rs370087224 ABC7-42389800N19.1
##   2       12 [80399, 80399]      * |        None ABC7-42389800N19.1
##   3       12 [80422, 80422]      * | rs373297723 ABC7-42389800N19.1
##   4       12 [80729, 80729]      * | rs375960073 ABC7-42389800N19.1
##   5       12 [83011, 83011]      * | rs370570891 ABC7-42389800N19.1
##   6       12 [83012, 83012]      * | rs374646339 ABC7-42389800N19.1
##              ENSGID          ENSTID         REF         ALT      impact
##         <character>     <character> <character> <character> <character>
##   1 ENSG00000226210 ENST00000400706           C           T    MODIFIER
##   2 ENSG00000226210 ENST00000400706           G           A    MODIFIER
##   3 ENSG00000226210 ENST00000400706           G           A    MODIFIER
##   4 ENSG00000226210 ENST00000400706           A           G    MODIFIER
##   5 ENSG00000226210 ENST00000400706           T           C    MODIFIER
##   6 ENSG00000226210 ENST00000400706           C           G    MODIFIER
##             effect     feature                biotype        DP       AD1
##        <character> <character>            <character> <integer> <integer>
##   1 intron_variant  transcript unprocessed_pseudogene        10         8
##   2 intron_variant  transcript unprocessed_pseudogene        10         4
##   3 intron_variant  transcript unprocessed_pseudogene        15        11
##   4 intron_variant  transcript unprocessed_pseudogene        18        13
##   5 intron_variant  transcript unprocessed_pseudogene        10         3
##   6 intron_variant  transcript unprocessed_pseudogene        10         3
##           AD2          A1          A2    warnings      sample
##     <integer> <character> <character> <character> <character>
##   1         2           C           T                  HCT116
##   2         6           G           A                  HCT116
##   3         4           G           A                  HCT116
##   4         5           A           G                  HCT116
##   5         7           T           C                  HCT116
##   6         7           C           G                  HCT116
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

SNV profiles contain most of the relevant annotation data from the original VCF file, including SNV impacts, gene/transcript IDs and mutational (rs) ID. The DP (depth) field lists the total sequencing depth of this variant, while the specific allelic depths can be found in AD1 and AD2. The alleles of each variant can be found in A1 and A2.

Once each profile has been read, the genotypes of the overlapping variants between them can be compared using the compare_profiles function. Only variants found in both profiles are considered to overlap, as similarity calculations between profiles where some variants only have confident calls in one of the samples are inappropriate. An SNV is considered a match if it has an identical genotype in both profiles.

hct116_rko <- compare_profiles(hct116, rko)
head(hct116_rko)
##   chr   pos sample_1 sample_2 match        rsID               gene
## 1  12 80729   HCT116      RKO match rs375960073 ABC7-42389800N19.1
## 2  12 83508   HCT116      RKO match rs374142069 ABC7-42389800N19.1
## 3  12 83560   HCT116      RKO match rs368663404 ABC7-42389800N19.1
## 4  12 83979   HCT116      RKO match rs369733672 ABC7-42389800N19.1
## 5  12 84000   HCT116      RKO match rs374158904 ABC7-42389800N19.1
## 6  12 84096   HCT116      RKO match rs376990822 ABC7-42389800N19.1
##            ENSGID          ENSTID REF ALT   impact         effect    feature
## 1 ENSG00000226210 ENST00000400706   A   G MODIFIER intron_variant transcript
## 2 ENSG00000226210 ENST00000400706   T   G MODIFIER intron_variant transcript
## 3 ENSG00000226210 ENST00000400706   G   T MODIFIER intron_variant transcript
## 4 ENSG00000226210 ENST00000400706   T   C MODIFIER intron_variant transcript
## 5 ENSG00000226210 ENST00000400706   C   G MODIFIER intron_variant transcript
## 6 ENSG00000226210 ENST00000400706   C   G MODIFIER intron_variant transcript
##                  biotype DP.HCT116 AD1.HCT116 AD2.HCT116 A1.HCT116 A2.HCT116
## 1 unprocessed_pseudogene        18         13          5         A         G
## 2 unprocessed_pseudogene        26         21          5         T         G
## 3 unprocessed_pseudogene        21         14          7         G         T
## 4 unprocessed_pseudogene        51         42          9         T         C
## 5 unprocessed_pseudogene        52         42         10         C         G
## 6 unprocessed_pseudogene        65         49         16         C         G
##   warnings.HCT116 DP.RKO AD1.RKO AD2.RKO A1.RKO A2.RKO warnings.RKO
## 1                     15       8       7      A      G             
## 2                     17       6      11      T      G             
## 3                     16       4      12      G      T             
## 4                     23      14       9      T      C             
## 5                     18       9       9      C      G             
## 6                     18      10       8      C      G

The resulting dataframe retains all the information from each input profile (including any differing annotation, should they exist), and lists the depths and alleles by adding the sample names as suffixes to the relevant column names.

3.2 Comparing to COSMIC profiles

If you are working with cell lines and only want to analyse a subset of your data or as a orthogonal method complementary to others, you could compare your profile to a COSMIC profile. This works in the same way as comparing to another full profile, but gives slightly different output:

hct116_cosmic <- compare_profiles(hct116, cosmic)
head(hct116_cosmic)
##   chr      pos sample_1      sample_2 match        rsID          ENSGID
## 1  12 25398281   HCT116 COSMIC.HCT116 match rs112445441 ENSG00000133703
##            ENSTID   impact           effect    feature        biotype gene
## 1 ENST00000256078 MODERATE missense_variant transcript protein_coding KRAS
##   REF ALT DP.HCT116 AD1.HCT116 AD2.HCT116 A1.HCT116 A2.HCT116
## 1   C   T       180         96         84         C         T
##   warnings.HCT116 ID.COSMIC.HCT116 CDS.COSMIC.HCT116 AA.COSMIC.HCT116
## 1                          COSM532           c.38G>A           p.G13D
##   description.COSMIC.HCT116                 somatic_status.COSMIC.HCT116
## 1   Substitution - Missense Reported in another cancer sample as somatic
##   verification_status.COSMIC.HCT116 A1.COSMIC.HCT116 A2.COSMIC.HCT116
## 1                          Verified                C                T

You can use all the functions for downstream analyses for comparisons with COSMIC data, but your options for functional analyses will be limited, given that the COSMIC database is biased towards well-known and characterised mutations. It is, however, an excellent way to authenticate your cell lines and to assert the status of the mutations that exist in the analysed cells.

4 Evaluating binary comparisons

4.1 Similarity and global statistics

When you have your matched, overlapping SNVs, it’s time to analyse and characterise them. The first thing you might want to check are the global similarities and summary statistics, which can be done with the calculate_similarity function. The concordance is simply the number of matching genotypes divided by the total number of overlapping variants, while the similarity score is a weighted measure of the concordance in the form of a binomial experiment, taking into account the number of overlapping variants available:

\[Similarity = \frac{s + a}{n + a + b}\]

… where s is the number of matching genotypes, n is the total number of overlapping SNVs, a and b being the parameters used to weigh the concordance in favour of comparisons with more overlaps. The default parameters of 1 and 5 were selected to yield an equivalent cutoff to one suggested by Yu et al. (2015), which results in a lower limit 44 of perfectly matching overlapping variants with a similarity score of 90. The similarity score is thus a better measure of biological equivalency than just the concordance.

similarity <- calculate_similarity(hct116_rko)
similarity
##   sample_1 sample_2 overlaps matches concordance similarity_score
## 1   HCT116      RKO      259     181        69.9             68.7

Here, you can see a summary of the relevant statistics for your particular comparison: the number of overlaps between your two samples, the number of matching genotypes, their concordance as well as their similarity score. The cutoff used by Yu et al. for cell line authenticity was 90 % for their 48 SNP panel, something that could be considered the baseline for this method as well. The score, 68.7, is well below that cutoff, and we can thus be certain that these two cells are indeed not the same (as expected). While hard thresholds for similarity are inadvisable, a general guideline is that comparisons with scores above 90 can be considered similar while those below can be considered dissimilar. While a score just below 90 does not mean that the cells definitely are different, it does mean that more rigorous evaluation needs to be performed in order to ensure their biological equivalency. Are there specific genes or regions that are of special interest, for example? If so, it might be informative to specifically investigate the similarity there (more on this below).

You may additionally change the parameters of the score (if you, for example, want a stricter calculation). You may also supply the calculate_similarity function with an existing dataframe with summary data produced previously, in order to aggregate scores and statistics for an arbitrary number of comparisons.

# Create and read HKE3 profile
create_profile(vcf, "HKE3", "hke3_profile.txt")
hke3 <- read_profile("hke3_profile.txt", "HKE3")

# Compare HCT116 and HKE3
hct116_hke3 <- compare_profiles(hct116, hke3)

# Add HCT116/HKE3 similarities to HCT116/RKO similarities
similarities <- calculate_similarity(hct116_hke3,
                                     similarity, a = 1, b = 10)
similarities
##   sample_1 sample_2 overlaps matches concordance similarity_score
## 1   HCT116      RKO      259     181        69.9             68.7
## 2   HCT116     HKE3      493     475        96.3             94.4

Notice that the new similarities dataframe contains both the comparisons of HCT116/RKO and HCT116/HKE3, and we can clearly see that HCT116 and HKE3 are indeed very similar, as expected (HKE3 was derived from HCT116). This is true even when using a higher value for the b parameter. Any number of samples can be added using the calculate_similarity function, for use in further downstream analyses.

4.2 Evaluation of SNV impacts

An SNV’s impact represent the putative effect that variant may have on the function of the resulting protein, and ranges from HIGH through MODERATE, LOW and MODIFIER, in decreasing order of magnitude. HIGH impact variants may, for example, lead to truncated proteins due to the introduction of a stop codon, while MODIFIER variants have little to no effect on the protein at all. While there is no guarantee that a specific phenotype arises from a HIGH rather than a MODERATE impact variant (for example), it may be informative to look at the impact distribution of the overlapping SNVs between two profiles. This can easily be performed by the plot_impacts function:

impacts <- plot_impacts(hct116_rko)
impacts

This function takes a comparison dataframe as input and simply plots the impact distribution of the overlapping variants. It has a number of arguments with defaults, such as if you want to add text with the actual numbers to the plot (annotate = TRUE by default), if you want to show the legend (legend = TRUE by default) and what colours you want to plot the match-categories with (palette = c("#0D2D59", "#1954A6") by default, two shades of blue). We can see that most of the SNVs are present in the MODIFIER impact category, and that there is not a single mismatched HIGH impact SNV. (You can also visualise the impact distribution between your sample and the COSMIC database in exactly the same way.)

You might also want to look at only a subset of variants, e.g. only the variants with HIGH or MODERATE impacts, which is easily achieved with some data manipulation:

hct116_rko_hm <- hct116_rko[hct116_rko$impact == "HIGH" |
                            hct116_rko$impact == "MODERATE", ]
nrow(hct116_rko_hm)
## [1] 19

4.3 Evaluation of specific chromosomes, regions, genes and transcripts

You might be interested in a specific chromosome or a region on a chromosome, and it might be useful to work with data for only that subset. This operation is easily performed on a comparison dataframe:

hct116_rko_region <- hct116_rko[hct116_rko$chr == 12 &
                                hct116_rko$pos >= 25000000 &
                                hct116_rko$pos <= 30000000, ]
head(hct116_rko_region)
##     chr      pos sample_1 sample_2 match      rsID  gene          ENSGID
## 247  12 25358650   HCT116      RKO match   rs12245 LYRM5 ENSG00000205707
## 248  12 25358828   HCT116      RKO match   rs12587 LYRM5 ENSG00000205707
## 249  12 25358943   HCT116      RKO match    rs8720 LYRM5 ENSG00000205707
## 250  12 25358969   HCT116      RKO match rs1137196 LYRM5 ENSG00000205707
## 251  12 25359328   HCT116      RKO match rs1137189 LYRM5 ENSG00000205707
## 252  12 25359352   HCT116      RKO match rs1137188 LYRM5 ENSG00000205707
##              ENSTID REF ALT   impact                  effect    feature
## 247 ENST00000381356   A   T MODIFIER downstream_gene_variant transcript
## 248 ENST00000381356   T   G MODIFIER downstream_gene_variant transcript
## 249 ENST00000381356   T   C MODIFIER downstream_gene_variant transcript
## 250 ENST00000381356   T   G MODIFIER downstream_gene_variant transcript
## 251 ENST00000381356   A   T MODIFIER downstream_gene_variant transcript
## 252 ENST00000381356   G   A MODIFIER downstream_gene_variant transcript
##            biotype DP.HCT116 AD1.HCT116 AD2.HCT116 A1.HCT116 A2.HCT116
## 247 protein_coding       351        196        155         A         T
## 248 protein_coding       382        224        158         T         G
## 249 protein_coding       380        223        157         T         C
## 250 protein_coding       306        184        122         T         G
## 251 protein_coding       436        282        154         A         T
## 252 protein_coding       407        242        165         G         A
##     warnings.HCT116 DP.RKO AD1.RKO AD2.RKO A1.RKO A2.RKO warnings.RKO
## 247                    414     217     197      A      T             
## 248                    422     244     178      T      G             
## 249                    420     238     182      T      C             
## 250                    349     200     149      T      G             
## 251                    508     297     211      A      T             
## 252                    507     270     237      G      A

You might also be interested in a specific gene or transcript, of special importance to your study:

hct116_rko_eps8_t <- hct116_rko[hct116_rko$ENSTID == "ENST00000281172", ]
hct116_rko_vamp1 <- hct116_rko[hct116_rko$ENSGID == "ENSG00000139190", ]
hct116_rko_ldhb <- hct116_rko[hct116_rko$gene == "LDHB", ]
head(hct116_rko_ldhb)
##     chr      pos sample_1 sample_2    match      rsID gene          ENSGID
## 243  12 21788465   HCT116      RKO mismatch      None LDHB ENSG00000111716
## 244  12 21797029   HCT116      RKO    match rs1650294 LDHB ENSG00000111716
##              ENSTID REF ALT   impact              effect    feature
## 243 ENST00000350669   G   T MODIFIER 3_prime_UTR_variant transcript
## 244 ENST00000350669   A   G      LOW    sequence_feature      helix
##            biotype DP.HCT116 AD1.HCT116 AD2.HCT116 A1.HCT116 A2.HCT116
## 243 protein_coding      1353        754        599         G         T
## 244 protein_coding      5157          2       5155         G         G
##     warnings.HCT116 DP.RKO AD1.RKO AD2.RKO A1.RKO A2.RKO warnings.RKO
## 243                   1347    1347      NA      G      G             
## 244                   4253       2    4251      G      G

Here we see two mutations in the LDHB gene, one mismatching MODIFIER variant and one matching LOW variant. This is a good approach to check for known mutations in your dataset. For example, the HCT116 cell line is supposed to have a KRASG13D mutation. We might look for this using its known rsID or position:

hct116_rko_kras <- hct116_rko[hct116_rko$rsID == "rs112445441", ]
hct116_rko_kras <- hct116_rko[hct116_rko$chr == 12 &
                              hct116_rko$pos == 25398281, ]
nrow(hct116_rko_kras)
## [1] 0

The reason that we don’t find this particular variant in the HCT116 vs. RKO comparison is that it is not present in the RKO profile, either because it isn’t a mutation in RKO or because there was no confident variant call for that particular position. The compare_profiles function only looks at overlapping positions, so we will have to look at the individual profiles instead:

hct116[seqnames(hct116) == 12 & start(hct116) == 25398281, ]
## GRanges object with 1 range and 17 metadata columns:
##        seqnames               ranges strand |        rsID        gene
##           <Rle>            <IRanges>  <Rle> | <character> <character>
##   3149       12 [25398281, 25398281]      * | rs112445441        KRAS
##                 ENSGID          ENSTID         REF         ALT      impact
##            <character>     <character> <character> <character> <character>
##   3149 ENSG00000133703 ENST00000256078           C           T    MODERATE
##                  effect     feature        biotype        DP       AD1
##             <character> <character>    <character> <integer> <integer>
##   3149 missense_variant  transcript protein_coding       180        96
##              AD2          A1          A2    warnings      sample
##        <integer> <character> <character> <character> <character>
##   3149        84           C           T                  HCT116
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
rko[seqnames(rko) == 12 & start(rko) == 25398281, ]
## GRanges object with 0 ranges and 17 metadata columns:
##    seqnames    ranges strand |        rsID        gene      ENSGID
##       <Rle> <IRanges>  <Rle> | <character> <character> <character>
##         ENSTID         REF         ALT      impact      effect     feature
##    <character> <character> <character> <character> <character> <character>
##        biotype        DP       AD1       AD2          A1          A2
##    <character> <integer> <integer> <integer> <character> <character>
##       warnings      sample
##    <character> <character>
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Here we see that the KRASG13D mutation is indeed present in the HCT116 profile, but not in RKO. (Please note that this data was aligned and analysed using the GRCh37 / hg19 assembly and that listed positions might not be accurate for other assemblies.)

5 Evaluating multiple comparisons

Many scientific studies compare more than just two datasets, not to mention meta-studies and large-scale comparisons. It is therefore important to be able to characterise and evaluate many-to-one or many-to-many cases as well - the seqCAT package provides a number of functions and procedures for doing so.

5.1 Performing multiple profile comparisons

The first step of such an analysis is to create and read SNV profiles for each sample that is to be evaluated (please see section 2). The example data used here has three different samples: HCT116, HKE3 and RKO. The compare_many function is a helper function for creating either one-to-many or many-to-many SNV profile comparisons, and returns a list of the global similarities for all combinations of profiles and their respective data (for downstream analyses):

# Create list of SNV profiles
profiles <- list(hct116, hke3, rko)

# Perform many-to-many comparisons
many <- compare_many(profiles)
many[[1]]
##   sample_1 sample_2 overlaps matches concordance similarity_score
## 1   HCT116   HCT116      523     523       100.0             99.1
## 2   HCT116     HKE3      493     475        96.3             95.4
## 3   HCT116      RKO      259     181        69.9             68.7
## 4     HKE3     HKE3     1604    1604       100.0             99.7
## 5     HKE3      RKO      299     204        68.2             67.2
## 6      RKO      RKO      583     583       100.0             99.2

We can here see the summary statistics of all three combinations of the cell lines in the example data. Notice that compare_many will only perform a comparison that has not already been performed, i.e. it will not perform the RKO vs. HCT116 comparison if it has already performed HCT116 vs. RKO. Also notice that it does perform self-comparisons (i.e. HCT116 vs. HCT116), which is useful for downstream visualisations.

The similarities are stored in the first element of the results (many[[1]]), while the data for each comparison is stored in the second (many[[2]]). The second element is itself also a list, whose indices correspond to the row names of the similarity object. If we, for example, are interested in the HKE3 self-comparison, we can see that its row name is 4. We can then access its data like this:

hke3_hke3 <- many[[2]][[4]]
head(hke3_hke3)
##   chr   pos sample_1 sample_2 match        rsID         gene          ENSGID
## 1  12 73805     HKE3     HKE3 match rs375835195 RP11-598F7.1 ENSG00000249054
## 2  12 75190     HKE3     HKE3 match rs374099059   AC215219.1 ENSG00000238823
## 3  12 75308     HKE3     HKE3 match rs370314061   AC215219.1 ENSG00000238823
## 4  12 75337     HKE3     HKE3 match rs147539459   AC215219.1 ENSG00000238823
## 5  12 76316     HKE3     HKE3 match rs370768066   AC215219.1 ENSG00000238823
## 6  12 76349     HKE3     HKE3 match  rs71412503   AC215219.1 ENSG00000238823
##            ENSTID REF ALT   impact                  effect    feature
## 1 ENST00000504074   G   C MODIFIER downstream_gene_variant transcript
## 2 ENST00000458783   A   G MODIFIER   upstream_gene_variant transcript
## 3 ENST00000458783   C   T MODIFIER   upstream_gene_variant transcript
## 4 ENST00000458783   A   G MODIFIER   upstream_gene_variant transcript
## 5 ENST00000458783   T   C MODIFIER   upstream_gene_variant transcript
## 6 ENST00000458783   A   G MODIFIER   upstream_gene_variant transcript
##   biotype DP.HKE3 AD1.HKE3 AD2.HKE3 A1.HKE3 A2.HKE3 warnings.HKE3
## 1 lincRNA      15        8        7       G       C              
## 2   miRNA      45       39        6       A       G              
## 3   miRNA      38       38       NA       C       C              
## 4   miRNA      38       27       11       A       G              
## 5   miRNA      63       46       17       T       C              
## 6   miRNA      44       35        9       A       G

You may also specify the a and b similarity score parameters, as above. If you are interested in only a one-to-many comparison (for cases when you have a “true” baseline profile to compare against), you can do this by also specifying the one = <profile> parameter in the function call. This is useful if you have a COSMIC profile to compare against, for example:

many_cosmic <- compare_many(profiles, one = cosmic)
many_cosmic[[1]]
##        sample_1      sample_2 overlaps matches concordance similarity_score
## 1 COSMIC.HCT116 COSMIC.HCT116        1       1         100             28.6
## 2 COSMIC.HCT116        HCT116        1       1         100             28.6
## 3 COSMIC.HCT116          HKE3        1       1         100             28.6
## 4 COSMIC.HCT116           RKO        1       0           0             14.3

It is important to note that performing many comparisons like this may take quite some time, depending on the number of profiles and how much data each profile has. By returning all the data in a list you may then save each comparison to a file, for later re-analysis without having to re-do the comparisons.

5.2 Visualising multiple comparisons

A useful and straightforward way of visualising multiple profile comparisons is to use a heatmap. We can use the summary statistics listed in the similarity object from above as input to the function plot_heatmap, which gives you a simple overview of all your comparisons:

heatmap <- plot_heatmap(many[[1]])
heatmap

Here we see a blue colour gradient for the similarity score of the three cell lines, which are clustered according to their similarity (using cluster = TRUE, as default). You may change the size of the text annotations using annotation_size = 5 (default) or suppress them entirely (annotate = FALSE). You may also suppress the legend (legend = FALSE), change the main colour of the gradient (colour = "#1954A6" by default) or change the limits of the gradient (limits = c(0, 50, 90, 100) by default). The choice of gradient limits are based on clarity (comparisons with a similarity score less than 50, i.e. those that likely have too few overlapping variants to begin with, are suppressed) and the previously mentioned 90 % concordance threshold (Yu et al. 2015).

This heatmap makes it clear that HCT116 and HKE3 are, indeed, very similar to each other, while RKO differs from them both. These types of heatmaps can be created for an arbitrary number of samples, which will then give a great overview of the global similarities of all the samples studied. This can be used to evaluate the quality of the datasets (e.g. to see which comparisons have very few overlaps), find similarity clusters and potential unexpected outliers. If a sample stands out in a heatmap such as this, that is grounds for further investigation, using both the methods described above and more classical evaluations of sequencing data (read quality, adapter contamination, alignments, variant calling, and so on).

Citation

If you are using seqCAT to analyse your samples, please cite the article in which the general methodology was first published.

A novel RNA sequencing data analysis method for cell line authentication
Fasterius, E., Raso, C., Kennedy, S., Kolch, W., Al-Khalili C. et al.
PloS One, 12(2), e0171435. (2017)
doi: http://doi.org/10.1371/journal.pone.0171435

Session info

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2               seqCAT_1.0.0              
##  [3] VariantAnnotation_1.24.0   Rsamtools_1.30.0          
##  [5] Biostrings_2.46.0          XVector_0.18.0            
##  [7] SummarizedExperiment_1.8.0 DelayedArray_0.4.0        
##  [9] matrixStats_0.52.2         Biobase_2.38.0            
## [11] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0       
## [13] IRanges_2.12.0             S4Vectors_0.16.0          
## [15] BiocGenerics_0.24.0        BiocStyle_2.6.0           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13             lattice_0.20-35         
##  [3] tidyr_0.7.2              prettyunits_1.0.2       
##  [5] assertthat_0.2.0         rprojroot_1.2           
##  [7] digest_0.6.12            plyr_1.8.4              
##  [9] R6_2.2.2                 backports_1.1.1         
## [11] RSQLite_2.0              evaluate_0.10.1         
## [13] ggplot2_2.2.1            zlibbioc_1.24.0         
## [15] rlang_0.1.2              GenomicFeatures_1.30.0  
## [17] progress_1.1.2           lazyeval_0.2.1          
## [19] blob_1.1.0               Matrix_1.2-11           
## [21] rmarkdown_1.6            labeling_0.3            
## [23] RMySQL_0.10.13           BiocParallel_1.12.0     
## [25] stringr_1.2.0            RCurl_1.95-4.8          
## [27] bit_1.1-12               biomaRt_2.34.0          
## [29] munsell_0.4.3            compiler_3.4.2          
## [31] rtracklayer_1.38.0       pkgconfig_2.0.1         
## [33] htmltools_0.3.6          tidyselect_0.2.2        
## [35] tibble_1.3.4             GenomeInfoDbData_0.99.1 
## [37] bookdown_0.5             XML_3.98-1.9            
## [39] dplyr_0.7.4              GenomicAlignments_1.14.0
## [41] bitops_1.0-6             grid_3.4.2              
## [43] gtable_0.2.0             DBI_0.7                 
## [45] magrittr_1.5             scales_0.5.0            
## [47] stringi_1.1.5            tools_3.4.2             
## [49] bit64_0.9-7              BSgenome_1.46.0         
## [51] glue_1.2.0               purrr_0.2.4             
## [53] yaml_2.1.14              colorspace_1.3-2        
## [55] AnnotationDbi_1.40.0     memoise_1.1.0           
## [57] knitr_1.17               bindr_0.1

References

Fasterius, Erik, Cinzia Raso, Susan Kennedy, Nora Rauch, Pär Lundin, Walter Kolch, Mathias Uhlén, and Cristina Al-Khalili Szigyarto. 2017. “A novel RNA sequencing data analysis method for cell line authentication.” PloS One 12 (2): e0171435.

Yu, Mamie, Suresh K Selvaraj, May M Y Liang-Chu, Sahar Aghajani, Matthew Busse, Jean Yuan, Genee Lee, et al. 2015. “A resource for cell line authentication, annotation and quality control.” Nature 520 (7547): 307–11.