Building a minimal genome browser with h5vc and shiny

In this vignette we will have a look at another powerful use-case for h5vc which is interactive data exploration. This can be a great tool to communicate your results to cour colaborators.

If you want to see the end result first, head over to the rstudio shiny server hosting service, where we have put an example shiny application demonstrating this.

Prerequisites

For this vignette you will have to know about:

The Data

We will use the yeast tally file that we created in the h5vc.creating.tallies vignette as our single data input.

The Shiny App

The shiny app that we write is the most basic genome broser imaginable, it will have controls for the chromosome, genomic position and the windowsize (i.e. zoom level) of the current view and it will generate a mismatchPlot from the tally file and the inputs that shows us the respective region of the genome in all samples present in the tally file.

The shiny app consists of two files: ui.R defining the user interface and server.R defining the calculations / plots that the UI will be populated with.

ui.R

In the user interface file we define a sliderInput for the windowsize and two other inputs chromSelect and genomicPositionSelect that will be generated by the server. We do this since the available chromosomes depend on the tally file and the available positions depend on the selected chromosome.

library(shiny)
shinyUI(pageWithSidebar(
  # Application title
  headerPanel("Simple Genome Browser"),
  # Sidebar with a slider inputs and selection boxes
  sidebarPanel(
    sliderInput("windowsize", 
                "Windowsize:", 
                min = 10,
                max = 200,
                value = 50,
                step = 5),
    uiOutput("chromSelect"),
    uiOutput("genomicPositionSelect")
  ),
  # Show a plot of the region
  mainPanel(
    plotOutput("mismatchPlot", height=800)
  )
))

server.R

In server.R the logic of our application is stored, basically we load the required packages, have a look into the tally file to check for chromosomes and chromosome lengths and then generate the controls for the user to select chromosome and position.

The main calculations happen in output$mismatchPlot, were we take the chromosome and position and windowsize selected by the user and get the respective data from the tallyFile. That data is then handed to the mismatchPlot function and creates the plot we send back to the user (i.e. the web browser).

library(shiny)
library(h5vc)
library(rhdf5)
tallyFile <- "yeast.hfs5"
study <- "/yeast"
h5ls(tallyFile)
chromosomes <- h5ls(tallyFile)
chromlengths <- as.numeric(subset(chromosomes, otype == "H5I_DATASET" & name == 
    "Reference")$dim)
chromosomes <- subset(chromosomes, otype == "H5I_GROUP" & name != "yeast")$name
names(chromlengths) = chromosomes

# Define server logic required to generate and plot a random distribution
shinyServer(function(input, output) {

    output$chromSelect <- renderUI({
        selectInput("chrom", "Chromosome", choices = c("I", "II", "III", "IV", 
            "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII", "XIII", "XIV", 
            "XV", "XVI", "Mito"))
    })

    output$genomicPositionSelect <- renderUI({
        sliderInput("gpos", "Genomic Position:", min = 10, max = chromlengths[input$chrom] - 
            10, value = 200)
    })

    group <- reactive({
        paste(study, input$chrom, sep = "/")
    })

    sampleData <- reactive({
        sd = getSampleData(tallyFile, group())
        sd$Sample = c("YB210", "s288c")
        sd
    })

    pos <- reactive({
        min(max(input$windowsize + 1, input$gpos), chromlengths[input$chrom] - 
            input$windowsize - 1)
    })

    data <- reactive({
        h5dapply(tallyFile, group(), blocksize = input$windowsize * 3, names = c("Coverages", 
            "Counts", "Deletions"), range = c(max(pos() - input$windowsize, 
            0), min(pos() + input$windowsize, chromlengths[input$chrom])))[[1]]

    })

    output$mismatchPlot <- renderPlot({
        p = mismatchPlot(data(), sampleData(), samples = c("s288c", "YB210"), 
            input$windowsize, pos())
        print(p)
    })
})

Including Variant Calls

Obviously this genome browser has limited use in its current state but by adding a file of variants as an input and making them selectabe we can turn it into a browser for potential variant calls and can then easily go through those and inspect the region to get an idea of the quality of our calls. There is a lot of potential for creating tremendously usefull and visually impressive ways of communicating your reserach to your collaborators and peers.

Calling the variants

We use the following code to call variants using the build-in functionality of h5vc. We call variants from the perspective of the YB210 strain, taking the s288c strain as our control.

library(rhdf5)
library(h5vc)
setwd("~/ShinyApps/Yeast")
tallyFile <- "yeast.hfs5"
study <- "/yeast"
h5ls(tallyFile)
chromosomes <- h5ls(tallyFile)
chromosomes <- subset(chromosomes, otype == "H5I_GROUP" & name != "yeast")$name
variantCalls <- list()
for (chrom in chromosomes) {
    group <- paste(study, chrom, sep = "/")
    sdat <- getSampleData(tallyFile, group)
    sdat$Group <- "yeast"
    variantCalls[[chrom]] <- h5dapply(filename = tallyFile, group = group, blocksize = 1e+05, 
        FUN = callVariantsPairedFancy, sampledata = sdat, cl = vcConfParams(returnDataPoints = TRUE, 
            minStrandAltSupport = 4), names = c("Counts", "Coverages", "Reference"), 
        verbose = TRUE)
}
for (chrom in chromosomes) {
    variantCalls[[chrom]] <- do.call(rbind, variantCalls[[chrom]])
}
variantCalls <- do.call(rbind, variantCalls)
rownames(variantCalls) = NULL
variantCalls$Support = variantCalls$caseCountFwd + variantCalls$caseCountRev
variantCalls$Coverage = variantCalls$caseCoverageFwd + variantCalls$caseCoverageRev
variantCalls$AF = variantCalls$Support/variantCalls$Coverage
save(variantCalls, file = "yeast.variants.RDa")

Extending the functionality of the genome browser

Now we have to make some changes to the genome browser shiny app so that it loads the list of variants and allows us to browse and filter them. The result of the efforts described below is linked here:

Changes to the interface

We will make three changes to the interface: 1. We move the mismatchPlot to a separate tab and add another tab which will contain a table of all the variants fulfilling current filtering criteria as well as a tab showing summary pots of the filtered variants (a histogram of the allele frequency for now). 2. We add filter selectors to the sidebar, we will allow filtering on allele freqeuncy (AF column), Support and Coverage 3. We replace the genomic position selector with a variant selector

The new ui.R looks like this:

library(shiny)
shinyUI(pageWithSidebar(

  # Application title
  headerPanel("Simple Variant Browser - Yeast Strains Example"),

  # Sidebar with a slider inputs and selection boxes
  sidebarPanel(
    sliderInput("windowsize", 
                "Windowsize:", 
                min = 10,
                max = 200,
                value = 50,
                step = 5),
    uiOutput("chromSelect"),
    sliderInput("af", 
                "Allele Frequency:", 
                min = 0,
                max = 1,
                value = c(0.1,1.0),
                step = 0.01),
    sliderInput( "minSupport",
                 "Minimum Support",
                 min = 2,
                 max = 200,
                 value = 2),
    sliderInput( "minCoverage",
                 "Minimum Coverage",
                 min = 10,
                 max = 500,
                 value = 50,
                 step = 5),
    uiOutput("variantSelect"),
    textOutput("diag")
  ),

  # Show a plot of the region
  mainPanel(
    tabsetPanel(
      tabPanel( title = "Region Mismatch Plot", plotOutput("mismatchPlot", height=800) ),
      tabPanel( title = "Variant Table", tableOutput("variantTable") ),
      tabPanel( title = "Variant Summary Plots", plotOutput("afHist") )
    )
  )
))

Changes to the server-side computations

Here we have to add the code to load the variant calls, dynamically generate the variant selector and update some methods for generating the plots / calculating the current position to plot (since this now depends on the sleected variant.)

The new server.R looks like this:

library(shiny)
library(h5vc)
library(rhdf5)
library(ggplot2)
library(grid)

tallyFile <- "yeast.hfs5"
study <- "/yeast"
h5ls(tallyFile)
chromosomes = h5ls(tallyFile)
chromlengths = as.numeric(subset(chromosomes, otype == "H5I_DATASET" & name == 
    "Reference")$dim)
chromosomes = subset(chromosomes, otype == "H5I_GROUP" & name != "yeast")$name
names(chromlengths) = chromosomes

load(file = "yeast.variants.RDa")
variantCalls$start <- variantCalls$start + 1  #fixing difference in counting 0-based vs. 1-based
variantCalls$end <- variantCalls$end + 1
# Define server logic required to generate and plot a random distribution
shinyServer(function(input, output) {

    output$chromSelect <- renderUI({
        selectInput("chrom", "Chromosome", choices = c("I", "II", "III", "IV", 
            "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII", "XIII", "XIV", 
            "XV", "XVI", "Mito"))
    })

    group <- reactive({
        paste(study, input$chrom, sep = "/")
    })

    sampleData <- reactive({
        sd = getSampleData(tallyFile, group())
        sd$Sample = c("YB210", "s288c")
        sd
    })

    variants <- reactive({
        subset(variantCalls, seqnames == input$chrom & AF >= input$af[1] & AF <= 
            input$af[2] & Support >= input$minSupport & Coverage >= input$minCoverage)
    })

    output$variantSelect <- renderUI({
        tmp = seq(nrow(variants()))
        names(tmp) = paste(variants()$start, " - ", variants()$refAllele, "/", 
            variants()$altAllele, sep = "")
        selectInput("var", "Variant:", choices = tmp)
    })

    pos <- reactive({
        variants()$start[as.numeric(input$var)]
    })

    data <- reactive({
        if (nrow(variants()) > 0) {
            h5dapply(filename = tallyFile, group = group(), blocksize = input$windowsize * 
                3, names = c("Coverages", "Counts", "Deletions"), range = c(max(pos() - 
                input$windowsize, 0), min(pos() + input$windowsize, chromlengths[input$chrom])))[[1]]
        } else {
            NULL
        }
    })

    output$mismatchPlot <- renderPlot({
        if (nrow(variants()) > 0) {
            p = mismatchPlot(data(), sampleData(), samples = c("s288c", "YB210"), 
                input$windowsize, pos())
            print(p)
        }
    })

    output$variantTable <- renderTable({
        variants()[, c("seqnames", "start", "refAllele", "altAllele", "AF", 
            "Support", "Coverage")]
    })

    output$afHist = renderPlot({
        hist(variants()$AF, breaks = seq(0, 1, 0.01))
    })

})

Conclusions

So there you have it, your own variant browser ready to be deployed on the web for your collaborators or the readers of your paper to access, browse and filter as they like. And we did all of this in a bit over 100 lines of code.

There is a whole bunch of other useful plots, filters and functionalities that could be added to this tool, downloading filtered variants or generated plots would be one of the more obvious ones.