TCGAbiolinks provides a few functions to search the GDC database. This section starts by explaining the different GDC sources (Harmonized and Legacy Archive), followed by some examples of how to access them.
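There are two available sources to download GDC data using TCGAbiolinks: the GDC harmonized database (legacy = FALSE), whose data were harmonized against GRCh38 (hg38), and the GDC Legacy Archive (legacy = TRUE), which keeps the older data aligned against GRCh37 (hg19) and GRCh36 (hg18).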
A TCGA barcode is composed of a collection of identifiers. Each specifically identifies a TCGA data element. Refer to the following figure for an illustration of how metadata identifiers comprise a barcode. An aliquot barcode contains the highest number of identifiers.
Example:
For more information, check the TCGA wiki.
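As a rough illustration of how the identifiers make up a barcode, the sketch below splits an aliquot barcode into its fields using base R only. The helper `parse_tcga_barcode()` is hypothetical (it is not a TCGAbiolinks function), and the field layout (project, tissue source site, participant, sample and vial, portion and analyte, plate, center) is assumed from the usual TCGA barcode convention.

```r
# Split a TCGA aliquot barcode into its component identifiers.
# parse_tcga_barcode() is a hypothetical helper, not part of TCGAbiolinks.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7)  # a full aliquot barcode has seven fields
  list(
    project     = parts[1],
    tss         = parts[2],                  # tissue source site
    participant = parts[3],
    sample      = substr(parts[4], 1, 2),    # sample type code
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3),
    plate       = parts[6],
    center      = parts[7],
    patient     = paste(parts[1:3], collapse = "-")
  )
}

fields <- parse_tcga_barcode("TCGA-14-0736-02A-01R-2005-01")
fields$patient  # "TCGA-14-0736"
fields$sample   # "02"
```

The patient identifier is simply the first three fields joined together, which is why the later examples can recover it with `substr(barcode, 1, 12)`.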
You can easily search GDC data using the GDCquery function.
Using a summary of filters as used in the TCGA portal, the function works with the following arguments:
Argument | Description |
---|---|
project | A list of valid projects (see table below) |
data.category | A valid data category for the project (see list with TCGAbiolinks:::getProjectSummary(project)) |
data.type | A data type to filter the files to download |
workflow.type | GDC workflow type |
legacy | Search in the legacy repository |
access | Filter by access type. Possible values: controlled, open |
platform | Platform name. Examples: CGH-1x1M_G4447A, IlluminaGA_RNASeqV2, AgilentG4502A_07, IlluminaGA_mRNA_DGE, Human1MDuo, HumanMethylation450, HG-CGH-415K_G4124A, IlluminaGA_miRNASeq, HumanHap550, IlluminaHiSeq_miRNASeq, ABI, H-miRNA_8x15K, HG-CGH-244A, SOLiD_DNASeq, IlluminaDNAMethylation_OMA003_CPI, IlluminaGA_DNASeq_automated, IlluminaDNAMethylation_OMA002_CPI, HG-U133_Plus_2, HuEx-1_0-st-v2, Mixed_DNASeq, H-miRNA_8x15Kv2, IlluminaGA_DNASeq_curated, MDA_RPPA_Core, IlluminaHiSeq_TotalRNASeqV2, HT_HG-U133A, IlluminaHiSeq_DNASeq_automated, diagnostic_images, microsat_i, IlluminaHiSeq_RNASeq, SOLiD_DNASeq_curated, IlluminaHiSeq_DNASeqC, Mixed_DNASeq_curated, IlluminaGA_RNASeq, IlluminaGA_DNASeq_Cont_automated, IlluminaGA_DNASeq, IlluminaHiSeq_WGBS, pathology_reports, IlluminaHiSeq_DNASeq_Cont_automated, Genome_Wide_SNP_6, bio, tissue_images, Mixed_DNASeq_automated, HumanMethylation27, Mixed_DNASeq_Cont_curated, IlluminaHiSeq_RNASeqV2, Mixed_DNASeq_Cont |
file.type | Used in the legacy database for some platforms to define which file types to retrieve |
barcode | A list of barcodes to filter the files to download |
experimental.strategy | Filter by experimental strategy. Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array. Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, CGH array, VALIDATION, Gene expression array, WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq |
sample.type | A sample type to filter the files to download |
The options for the field project are below:
The options for the field sample.type are below:
The other fields (data.category, data.type, workflow.type, platform, file.type) can be found below. Please note that these tables are still incomplete.
Harmonized data options (legacy = FALSE)
library(DT)  # provides datatable()
datatable(readr::read_csv("https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=2046985454"),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 40),
          rownames = FALSE)
## Parsed with column specification:
## cols(
## Data.category = col_character(),
## Data.type = col_character(),
## `Workflow Type` = col_character(),
## Platform = col_character()
## )
Legacy archive data options (legacy = TRUE)
datatable(readr::read_csv("https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=1817673686"),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 40),
          rownames = FALSE)
## Parsed with column specification:
## cols(
## Data.category = col_character(),
## Data.type = col_character(),
## Platform = col_character(),
## file.type = col_character()
## )
In this example we will access the harmonized database (legacy = FALSE) and search for all DNA methylation data for recurrent glioblastoma multiforme (GBM) and low-grade glioma (LGG) samples.
library(TCGAbiolinks)  # provides GDCquery() and getResults()
query <- GDCquery(project = c("TCGA-GBM", "TCGA-LGG"),
                  data.category = "DNA Methylation",
                  legacy = FALSE,
                  platform = c("Illumina Human Methylation 450"),
                  sample.type = "Recurrent Solid Tumor")
datatable(getResults(query),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)
In this example we will access the harmonized database (legacy = FALSE) and search for all patients with both DNA methylation (platform HumanMethylation450k) and gene expression data for colon adenocarcinoma (TCGA-COAD).
query.met <- GDCquery(project = "TCGA-COAD",
                      data.category = "DNA Methylation",
                      legacy = FALSE,
                      platform = c("Illumina Human Methylation 450"))
query.exp <- GDCquery(project = "TCGA-COAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - FPKM-UQ")

# Get all patients that have DNA methylation and gene expression.
# The first 12 characters of a barcode identify the patient.
common.patients <- intersect(substr(getResults(query.met, cols = "cases"), 1, 12),
                             substr(getResults(query.exp, cols = "cases"), 1, 12))

# Only select the first 5 patients
query.met <- GDCquery(project = "TCGA-COAD",
                      data.category = "DNA Methylation",
                      legacy = FALSE,
                      platform = c("Illumina Human Methylation 450"),
                      barcode = common.patients[1:5])
query.exp <- GDCquery(project = "TCGA-COAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - FPKM-UQ",
                      barcode = common.patients[1:5])
datatable(getResults(query.met, cols = c("data_type","cases")),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)
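The patient-matching step above can be tried offline with a small sketch. The barcodes below are invented placeholders standing in for the vectors returned by getResults(query, cols = "cases"); only base R is used.

```r
# Invented aliquot barcodes standing in for getResults(query, cols = "cases").
met.cases <- c("TCGA-AA-0001-01A-01D-0001-05",
               "TCGA-AA-0002-01A-01D-0001-05",
               "TCGA-AA-0003-01A-01D-0001-05")
exp.cases <- c("TCGA-AA-0002-01A-01R-0002-07",
               "TCGA-AA-0003-01A-01R-0002-07",
               "TCGA-AA-0004-01A-01R-0002-07")

# The first 12 characters (project-TSS-participant) identify the patient,
# so truncating both vectors lets aliquots from different assays be matched.
common.patients <- intersect(substr(met.cases, 1, 12),
                             substr(exp.cases, 1, 12))
common.patients
```

Matching on the truncated barcode rather than the full one is what makes this work: the two assays never share an aliquot barcode, since the analyte, plate and center fields differ.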
This example shows how the user can search for breast cancer Raw Sequencing Data (“Controlled”) and verify the names of the files and the barcodes associated with them.
query <- GDCquery(project = c("TCGA-BRCA"),
                  data.category = "Raw Sequencing Data",
                  sample.type = "Primary solid Tumor")
# Only the first 100 rows, to make rendering faster
datatable(getResults(query, rows = 1:100, cols = c("file_name","cases")),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)
This example shows how the user can search the legacy archive for glioblastoma multiforme (GBM) and low-grade glioma (LGG) DNA methylation data for the platforms Illumina Human Methylation 450 and Illumina Human Methylation 27.
query <- GDCquery(project = c("TCGA-GBM","TCGA-LGG"),
                  legacy = TRUE,
                  data.category = "DNA methylation",
                  platform = c("Illumina Human Methylation 450", "Illumina Human Methylation 27"))
datatable(getResults(query, rows = 1:100),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)
This example shows how the user can search for glioblastoma multiforme (GBM) gene expression data with the normalized results for expression of a gene. For more information, check the rnaseqV2 TCGA wiki.
# Gene expression aligned against hg19.
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                           data.category = "Gene expression",
                           data.type = "Gene expression quantification",
                           platform = "Illumina HiSeq",
                           file.type = "normalized_results",
                           experimental.strategy = "RNA-Seq",
                           barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                           legacy = TRUE)
datatable(getResults(query.exp.hg19),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)
If you want to get the manifest file from the query object, you can use the function getManifest, for example getManifest(query.exp.hg19, save = FALSE). If you set save to TRUE, a txt file will be created that can be used with the GDC-client Data Transfer Tool (DTT) or with its GUI version, DTT-UI.
## id
## 97 217d72e9-4d6f-409d-911c-0a70b17a0adc
## 204 973ce0ac-f613-4b99-b2ab-3e2d5548f05f
## filename
## 97 unc.edu.b469eb7c-723f-4870-b4e4-ebfaae7a118b.1536566.rsem.genes.normalized_results
## 204 unc.edu.152afe8c-f67c-4d7c-93ac-e1b7edd56c54.1544649.rsem.genes.normalized_results
## md5 size state
## 97 beda9f89f08fc6a892a72e8b704fdbd9 437283 live
## 204 84478e78d95e1155019ccb7e0e0fea2f 436272 live
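To make the structure of such a manifest concrete, the sketch below rebuilds the two rows printed above as a data frame and writes them to a tab-delimited text file with base R. This is only an approximation under the assumption that the manifest is a tab-separated table with the five columns shown (id, filename, md5, size, state); the file that getManifest actually writes may differ, and the output file name here is a placeholder.

```r
# Rebuild the two manifest rows shown above as a data frame.
manifest <- data.frame(
  id = c("217d72e9-4d6f-409d-911c-0a70b17a0adc",
         "973ce0ac-f613-4b99-b2ab-3e2d5548f05f"),
  filename = c("unc.edu.b469eb7c-723f-4870-b4e4-ebfaae7a118b.1536566.rsem.genes.normalized_results",
               "unc.edu.152afe8c-f67c-4d7c-93ac-e1b7edd56c54.1544649.rsem.genes.normalized_results"),
  md5 = c("beda9f89f08fc6a892a72e8b704fdbd9",
          "84478e78d95e1155019ccb7e0e0fea2f"),
  size = c(437283L, 436272L),
  state = c("live", "live"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited text file (assumed layout; placeholder file name).
manifest.file <- file.path(tempdir(), "gdc_manifest_example.txt")
write.table(manifest, manifest.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm the round trip.
check <- read.delim(manifest.file, stringsAsFactors = FALSE)
```

Writing to tempdir() keeps the example from leaving files in the working directory; in practice the manifest would be saved wherever the transfer tool is run.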