rfaRm
Lara Selles Vidal, Rafael Ayala, Guy-Bart Stan, Rodrigo Ledesma-Amaro
April 7, 2020
The Rfam database is a collection of families of non-coding RNA and other structured RNA elements. Each family is defined by a multiple sequence alignment of family members, a consensus secondary structure and a covariance model, which integrates both multiple sequence alignment information and secondary structure information, and are related to the hidden Markov models employed by Pfam. “rfaRm” is a R package providing a client-side interface to the Rfam database. The package allows the search of the Rfam database by keywords or sequences, as well as the retrieval of all available information on specific Rfam families, such as member sequences, multiple sequence alignments, secondary structures and covariance models.
The Rfam database (Kalvari et al. 2017) is a collection of RNA sequences classified into different families of non-coding RNA, cis-regulatory elements and other structured RNA. It is designed to be analogous to the Pfam database of protein families (El-Gebali et al. 2018). The initial step for identifying an Rfam family is to generate a manually curated seed alignment of a set of known RNA sequences, ideally including secondary structure information. The Infernal software package (Nawrocki & Eddy 2013) is then used to build a covariance model for the family (a probabilistic profile of the sequence and secondary structure of the family), and to search for new homologs to add from a sequence database. When new homologs are added, the covariance model is updated, and the process is repeated until no new homologs are found.
Much of the information stored in Rfam could be useful if integrated within existing genomics and transcriptomics workflows and with functionalities of packages already available in BioConductor. rfaRm aims to facilitate the integration of the information provided by Rfam in existing workflows by providing an R client-side interface to the Rfam database.
rfaRm allows to search the Rfam database by keywords and sequence. In the search by keywords, the supplied keywodrs are matched against the fields of each entry, and accessions for the matching families are returned. In the search by sequence, the supplied sequence is used by Infernal to be searched against the Rfam library of covariance models, returning high-scoring hits which allow the identification of regions in the query sequence that belong to an Rfam family. Searches can be used to find RNA families of interest.
rfaRm also allows the retrieval of the information available in Rfam for each RNA family by providing a valid Rfam family identifier. The types of retrievable information for each Rfam family include:
The results of each query are returned as R objects that can be used to perform further analysis on the retrieved data with other R packages. Additionally, when appropriate they can also be saved to standard file formats, which can be read by commonly used Bioinformatics software.
Rfam families can be searched by keywords with the “rfamTextSearchFamilyAccession” function. The function returns a character vector where each element is a string representing the Rfam accession for an Rfam family that matched the searched keyword. Rfam accessions are always of the “RFXXXXX” structure, where X is any digit from 0 to 9. For example, in order to search for Rfam families corresponding to riboswitches, the keyword “riboswitch” can be passed as the query to the search function. The function will output a vector with the Rfam accessions of the matching families:
rfamTextSearchFamilyAccession("riboswitch")
## [1] "RF00174" "RF00059" "RF00162" "RF01051" "RF00504" "RF00050" "RF00379"
## [8] "RF00167" "RF00168" "RF01734" "RF01750" "RF00634" "RF01055" "RF00380"
## [15] "RF01068"
It is also possible to perform a sequence-based search with the “rfamSequenceSearch” function. The supplied query sequence is matched against the covariance models of the Rfam families by the Infernal software in order to identify regions in the query sequence that belong to one of the Rfam families. It should be noted that sequence-based searches can occasionally take long times to finish, sometimes up to several hours. The supplied string should be an RNA sequence, containing only the standard RNA nucleotides (A, U, G and C). The function will return a nested list where each top-level element represents a high-scoring hit, and is in itself a list including information such as the Rfam family with which the hit was found, and an alignment between the query sequence and the consensus sequence of the Rfam family. In the following example, only one high-scoring hit is identified in the provided sequence: an FMN riboswitch. The alignment with the consensus sequence of the Rfam family is saved to a text file, together with its associated secondary structure annotation.
sequenceSearch <- rfamSequenceSearch("GGAUCUUCGGGGCAGGGUGAAAUUCCCGACCGGUGGUAUAGUCCACGAAAGUAUUUGCUUUGAUUUGGUGAAAUUCCAAAACCGACAGUAGAGUCUGGAUGAGAGAAGAUUC")
length(sequenceSearch)
## [1] 1
names(sequenceSearch)
## [1] "FMN"
## Save the aligned consensus sequence and query sequence to a text file, together
## with secondary structure annotation
writeLines(c(sequenceSearch$FMN$alignmentQuerySequence, sequenceSearch$FMN$alignmentMatch,
sequenceSearch$FMN$alignmentHitSequence, sequenceSearch$FMN$alignmentSecondaryStructure),
con = "searchAlignment.txt")
Each Rfam family has two main identifiers: its accession (a string with the RFXXXX format), and its ID (a keyword related to the Rfam family functionality or RNA type). Both can be used interchangeably with all of the query functions available in rfaRm. It is possible to obtain the ID corresponding to an accession with the “rfamFamilyAccessionToID” function, and the accession corresponding to an ID with the “rfamFamilyIDToAccession” function.
rfamFamilyAccessionToID("RF00174")
## [1] "Cobalamin"
rfamFamilyIDToAccession("FMN")
## [1] "RF00050"
When the accession or ID of the Rfam family of interest is known, it is possible to query the database to retrieve different types of data about the family.
A summary with a short functional description of an Rfam family can be obtained with the “rfamFamilySummary” function.
rfamFamilySummary("RF00174")
## $rfamReleaseNumber
## [1] "14.2"
##
## $numberSequencesSeedAlignment
## [1] "430"
##
## $sourceSeedAlignment
## [1] "Barrick JE, Breaker RR"
##
## $numberSpecies
## [1] "5174"
##
## $RNAType
## [1] "Cis-reg; riboswitch;"
##
## $numberSequencesAll
## [1] "13781"
##
## $structureSource
## [1] "Predicted; PFOLD"
##
## $description
## [1] "Cobalamin riboswitch"
##
## $rfamAccession
## [1] "RF00174"
##
## $rfamID
## [1] "Cobalamin"
##
## $comment
## [1] "Riboswitches are metabolite binding domains within certain messenger RNAs that serve as precision sensors for their corresponding targets. Allosteric rearrangement of mRNA structure is mediated by ligand binding, and this results in modulation of gene expression [1]. This family represents a cobalamin riboswitch (B12-element) which is widely distributed in 5' UTRs of vitamin B(12)-related genes in bacteria. Cobalamin in the form of adenosylcobalamin (Ado-CBL) is known to repress expression of genes for vitamin B(12) biosynthesis and be transported by a post-transcriptional regulatory mechanism, which involves direct binding of Ado-CBL to 5' UTRs [2]."
# The summary includes, amongst other data, a brief paragraph describing the
# family
rfamFamilySummary("RF00174")$comment
## [1] "Riboswitches are metabolite binding domains within certain messenger RNAs that serve as precision sensors for their corresponding targets. Allosteric rearrangement of mRNA structure is mediated by ligand binding, and this results in modulation of gene expression [1]. This family represents a cobalamin riboswitch (B12-element) which is widely distributed in 5' UTRs of vitamin B(12)-related genes in bacteria. Cobalamin in the form of adenosylcobalamin (Ado-CBL) is known to repress expression of genes for vitamin B(12) biosynthesis and be transported by a post-transcriptional regulatory mechanism, which involves direct binding of Ado-CBL to 5' UTRs [2]."
The consensus secondary structure for an Rfam family can be retrieved, together with its consensus sequence, with the “rfamConsensusSecondaryStructure” function. RNA Secondary structure notation can be output in the WUSS or extended Dot-Bracket notations. In the example below, the consensus secondary structure of the Rfam family with accession RF00174 is retrieved in both notations, and saved to a file with the extended Dot-Bracket notation.
rfamConsensusSecondaryStructure("RF00174", format = "WUSS")
## [1] "uuaaauugaaacgaugauGGUu..ccccuuu......................aaagu....gaaggguu......AAaa..GGGAA.c.cc.G.G.UGa...............................aAaUCCgg.gGCuGcCC.CC.gCaACuGUAA..gc.Gg..........agagcaccccccAauAa............................GCCACUggcccgcaa..........................................................ggg.ccGGGAAGGC...ggg..g.gg..aaggaaugac..................................................................cCgc.gAGcCaGGAGACCuGCCaucaguuuuugaaucucc"
## [2] ":::::::::::::::[[[[[[,..<<<____......................_____....___>>>,,......,,,(..((,,,.<.<<.<.<.___...............................____>>>>.>,,<<<__.__.>>>,<<<---..<<.<<..........------<<<<<<-----............................<<<-<<<<<<_____.........................................................._>>.>>>>--->>>...>>>..>.>>..----------..................................................................>>>>.---->>>,,,,)))]]]]]]:::::::::::::::"
rfamConsensusSecondaryStructure("RF00174", format = "DB")
## [1] "uuaaauugaaacgaugauGGUu..ccccuuu......................aaagu....gaaggguu......AAaa..GGGAA.c.cc.G.G.UGa...............................aAaUCCgg.gGCuGcCC.CC.gCaACuGUAA..gc.Gg..........agagcaccccccAauAa............................GCCACUggcccgcaa..........................................................ggg.ccGGGAAGGC...ggg..g.gg..aaggaaugac..................................................................cCgc.gAGcCaGGAGACCuGCCaucaguuuuugaaucucc"
## [2] "...............[[[[[[...<<<......................................>>>...........(..((....<.<<.<.<.......................................>>>>.>..<<<......>>>.<<<.....<<.<<................<<<<<<.................................<<<.<<<<<<................................................................>>.>>>>...>>>...>>>..>.>>..............................................................................>>>>.....>>>....)))]]]]]]..............."
rfamConsensusSecondaryStructure("RF00174", filename = "RF00174secondaryStructure.txt",
format = "DB")
## [1] "uuaaauugaaacgaugauGGUu..ccccuuu......................aaagu....gaaggguu......AAaa..GGGAA.c.cc.G.G.UGa...............................aAaUCCgg.gGCuGcCC.CC.gCaACuGUAA..gc.Gg..........agagcaccccccAauAa............................GCCACUggcccgcaa..........................................................ggg.ccGGGAAGGC...ggg..g.gg..aaggaaugac..................................................................cCgc.gAGcCaGGAGACCuGCCaucaguuuuugaaucucc"
## [2] "...............[[[[[[...<<<......................................>>>...........(..((....<.<<.<.<.......................................>>>>.>..<<<......>>>.<<<.....<<.<<................<<<<<<.................................<<<.<<<<<<................................................................>>.>>>>...>>>...>>>..>.>>..............................................................................>>>>.....>>>....)))]]]]]]..............."
The generated files with secondary structure in the extended Dot-Bracket notation can be read by the R4RNA package (Lai et al. 2012), and used for further analysis and plotting of the corresponding RNA.
secondaryStructureTable <- readVienna("RF00174secondaryStructure.txt")
plotHelix(secondaryStructureTable)
It is also possible to retrieve an SVG file containing a representation of the consensus secondary structure of an Rfam family. Different types of secondary structure representations can be generated, including simple base pairing plots, plots with sequence conservation and plots with basepair conservation, amongst other types. The SVG file can be stored as a character string directly readable by the rsvg and magick packages, or saved to a file which can be read by external software.
## Retrieve an SVG with a normal representation of the secondary structure of Rfam
## family RF00174 and store it into a character string, which can be read by rsvg
## after conversion from SVG string to raw data
normalSecondaryStructureSVG = rfamSecondaryStructureXMLSVG("RF00174", plotType = "norm")
rsvg(charToRaw(normalSecondaryStructureSVG))
## Retrieve an SVG with a sequence conservation representation added to the
## secondary structure of Rfam family RF00174, and save it to an SVG file
rfamSecondaryStructureXMLSVG("RF00174", filename = "RF00174SSsequenceConservation.svg",
plotType = "cons")
It is also possible to directly plot the retrieved graphics to R’s graphics display, and save it to an image file such as PNG, JPEG or GIF.
## Plot a normal representation of the secondary structure of Rfam family RF00174
rfamSecondaryStructurePlot("RF00174", plotType = "norm")
## Save to a PNG file a plot of a representation of the secondary structure of
## Rfam family with ID FMN with sequence conservation annotation
rfamSecondaryStructurePlot("RF00050", filename = "RF00050SSsequenceConservation.png",
plotType = "cons")
The seed multiple sequence alignment that was manually curated and used to define the Rfam family can be retrieved with the “rfamSeedAlignment” function. The function returns the seed alignment as a Biostrings MultipleAlignment object with the aligned RNA sequences. The alignment can also be written to a file in Stockholm or FASTA format. In the example below, the seed alignment for Rfam family RF00174 is written to two files, in Stockholm and FASTA format, respectively.
rfamSeedAlignment("RF00174", filename = "RF00174seedAlignment.stk", format = "stockholm")
## RNAMultipleAlignment with 430 rows and 441 columns
## aln names
## [1] GGCACCUUCGCGGCAGAUGGUU--C...CCUGCCAUCAGCGUCAUCAACCGC AF010496.1/39869-...
## [2] CCACUCAGGGCGGGCGCUGGUU--U...CCUGCCAGCGCAUGGAUUUCGGGC AF010496.1/105318...
## [3] GCUACUCCAACAGGCGAUGGU---U...CCUGCCAUCGCUCUGGCGUCGCAA AF010496.1/116971...
## [4] GCAAUGAGGAAGGAUUAAGGUU--C...CCUGCCUUAAACCAAGUCAUCCAC AF193754.1/4343-4142
## [5] GGAAAUUUUUUUGCAUAGGGU---U...CCGACCCUAUGUAAUCGUUCCACG AF193754.1/24966-...
## [6] AGGCUGACCGGUGCAGCUGGUU--C...CCUGCCCGCUGCCCGCACGCGACC AF263012.1/9229-9015
## [7] GAUAAUCCAAGUCGUCGAGGUU--C...CCGGCCCCGACAAUAUAUUGGUCC AF306632.1/382-181
## [8] UCAAACAGCAACAGUAAAGGU---G...CCUGUCUUUAUUGUGAAGUUUCUA AJ000758.1/1191-1370
## [9] CAUAUCGUGCAAAAAAAAGGUG--C...CCAGCCUUUUUCUAAUGCUAAGCU AJ000758.1/1489-1668
## ... ...
## [422] ACUAUAUGUGGUGUUCAAGGUU--C...CCUGCCUUGAGCGCAAAUGUCCAC AE007869.2/70421-...
## [423] GGCCAUAUGCCGCCGUCAGGUG--C...CCGGCCUGGCGAGAUAGACCGGCC AL591688.1/199976...
## [424] GUAGCCUUGCGCUUUCGAGGUU--C...CCUGCCUCGUCACAGAUUCUCACU CP000094.2/355264...
## [425] UUACUGAAAUGGUUAUGUCGU---G...CCAGCCAAAUCGGUAAAAUAUAUU AALF02000008.1/20...
## [426] GCCAUUCAGCGAAGAGAUGGUU--C...CCUGCCAUCGCGUCGUUCGUCCGG CP002156.1/230652...
## [427] UUUUUUAUAUAUCGAAAUGGUC--A...CCAGCCAGAUCGGUAAAAGAUAUU AALC02000021.1/24...
## [428] AUAUAGUAUGCGAGUAUUGGU---G...CCUGCCAAUACUUGAAGUUUCUAU CP003056.1/112605...
## [429] CUUCAAGAAUAAAAGGCAUGUU--A...CCAGCCCGCUCAGUAAAAGAUUUU AALE02000017.1/13...
## [430] AUUUGUAGUUACUGAUAGGGUC--A...CCAGCCAGAUCGGUAAAAGAUAUU AALD02000002.1/89...
rfamSeedAlignment("RF00174", filename = "RF00174seedAlignment.stk", format = "fasta")
## RNAMultipleAlignment with 430 rows and 441 columns
## aln names
## [1] GGCACCUUCGCGGCAGAUGGUU--C...CCUGCCAUCAGCGUCAUCAACCGC AF010496.1/39869-...
## [2] CCACUCAGGGCGGGCGCUGGUU--U...CCUGCCAGCGCAUGGAUUUCGGGC AF010496.1/105318...
## [3] GCUACUCCAACAGGCGAUGGU---U...CCUGCCAUCGCUCUGGCGUCGCAA AF010496.1/116971...
## [4] GCAAUGAGGAAGGAUUAAGGUU--C...CCUGCCUUAAACCAAGUCAUCCAC AF193754.1/4343-4142
## [5] GGAAAUUUUUUUGCAUAGGGU---U...CCGACCCUAUGUAAUCGUUCCACG AF193754.1/24966-...
## [6] AGGCUGACCGGUGCAGCUGGUU--C...CCUGCCCGCUGCCCGCACGCGACC AF263012.1/9229-9015
## [7] GAUAAUCCAAGUCGUCGAGGUU--C...CCGGCCCCGACAAUAUAUUGGUCC AF306632.1/382-181
## [8] UCAAACAGCAACAGUAAAGGU---G...CCUGUCUUUAUUGUGAAGUUUCUA AJ000758.1/1191-1370
## [9] CAUAUCGUGCAAAAAAAAGGUG--C...CCAGCCUUUUUCUAAUGCUAAGCU AJ000758.1/1489-1668
## ... ...
## [422] ACUAUAUGUGGUGUUCAAGGUU--C...CCUGCCUUGAGCGCAAAUGUCCAC AE007869.2/70421-...
## [423] GGCCAUAUGCCGCCGUCAGGUG--C...CCGGCCUGGCGAGAUAGACCGGCC AL591688.1/199976...
## [424] GUAGCCUUGCGCUUUCGAGGUU--C...CCUGCCUCGUCACAGAUUCUCACU CP000094.2/355264...
## [425] UUACUGAAAUGGUUAUGUCGU---G...CCAGCCAAAUCGGUAAAAUAUAUU AALF02000008.1/20...
## [426] GCCAUUCAGCGAAGAGAUGGUU--C...CCUGCCAUCGCGUCGUUCGUCCGG CP002156.1/230652...
## [427] UUUUUUAUAUAUCGAAAUGGUC--A...CCAGCCAGAUCGGUAAAAGAUAUU AALC02000021.1/24...
## [428] AUAUAGUAUGCGAGUAUUGGU---G...CCUGCCAAUACUUGAAGUUUCUAU CP003056.1/112605...
## [429] CUUCAAGAAUAAAAGGCAUGUU--A...CCAGCCCGCUCAGUAAAAGAUUUU AALE02000017.1/13...
## [430] AUUUGUAGUUACUGAUAGGGUC--A...CCAGCCAGAUCGGUAAAAGAUAUU AALD02000002.1/89...
The phylogenetic tree associated to the seed alignment can be retrieved with the “rfamSeedTree” function. The tree is ouput in the New Hampshire extended format (NHX). The resulting tree is returned as a string and saved to a file, which can be directly read by the treeio package (Wang et al. 2019) or other external software.
seedTreeNHXString <- rfamSeedTree("RF00050", filename = "RF00050tree.nhx")
treeioTree <- read.nhx("RF00050tree.nhx")
A plot of the phylogenetic tree can also be directly retrieved with the “rfamSeedTreeImage” function, labeled with either species names or sequence accessions.
rfamSeedTreeImage("RF00164", filename = "RF00164seedTreePlot.png", label = "species")
The covariance model describing each Rfam family can be obtained and saved to a file with the “rfamCovarianceModel” function. The file is provided in the Infernal software format.
rfamCovarianceModel("RF00050", filename = "RF00050covarianceModel.cm")
A list of the sequence regions assigned to be members of an Rfam family can be retrieved with the “rfamSequenceRegions” function. The resulting dataframe contains the GenBank accessions of the containing sequences, as well as the start and end points of each sequence region belonging to the Rfam family. The sequence regions can also be saved to a tab-delimited file.
## By providing a filename, the sequence regions will be written to a
## tab-delimited file
sequenceRegions <- rfamSequenceRegions("RF00050", "RF00050sequenceRegions.txt")
head(sequenceRegions)
## Sequence GenBank accession Bit score Region start position
## 1 BA000032.2 135.0 1323634
## 2 CP009654.1 134.8 389437
## 3 JXMS01000004.1 134.4 105773
## 4 JXOK01000049.1 134.1 210476
## 5 CP003242.1 134.1 356056
## 6 ABCH01000022.1 133.3 43466
## Region end position
## 1 1323774
## 2 389292
## 3 105633
## 4 210336
## 5 355916
## 6 43613
## Sequence description
## 1 Vibrio parahaemolyticus RIMD 2210633 DNA, chromosome 2, complete sequence.
## 2 Francisella sp. CA97-1460, complete genome.
## 3 Desulfovibrio sp. JC271 contig4, whole genome shotgun sequence.
## 4 Vibrio mytili strain CAIM 528 contig00059, whole genome shotgun sequence.
## 5 Vibrio sp. EJY3 chromosome 2, complete sequence.
## 6 Vibrio shilonii AK1 1103207002024, whole genome shotgun sequence.
## Species NCBI tax ID
## 1 Vibrio parahaemolyticus RIMD 2210633 223926
## 2 Francisella sp. CA97-1460 1542390
## 3 Halodesulfovibrio spirochaetisodalis 1560234
## 4 Vibrio mytili 50718
## 5 Vibrio sp. EJY3 1116375
## 6 Vibrio shilonii AK1 391591
It should be noted that some Rfam families have too many associated sequence regions. In such cases, the list cannot be retrieved from the server.
A list of entries in the PDB database of 3D macromolecular structures (Berman et al. 2000) with experimentally solved structures for members of an Rfam family can be retrieved by using the “rfamPDBMapping” function. If available, the resulting set of entries will be returned as a dataframe, which can also be saved as a tab-delimited file if a filename is provided.
## By providing a filename, the matching PDB entries will be written to a
## tab-delimited file
PDBentries <- rfamPDBMapping("RF00050", "RF00050PDBentries.txt")
PDBentries
## Rfam accession PDB ID Chain PDB start residue PDB end residue
## 1: RF00050 3f2q X 1 112
## 2: RF00050 3f2t X 1 112
## 3: RF00050 3f2w X 1 112
## 4: RF00050 3f2x X 1 112
## 5: RF00050 3f2y X 1 112
## 6: RF00050 3f30 X 1 112
## CM start position CM end position eValue Bit score
## 1: 3 138 2.5e-29 104.20
## 2: 3 138 2.5e-29 104.20
## 3: 3 138 2.5e-29 104.20
## 4: 3 138 2.5e-29 104.20
## 5: 3 138 2.5e-29 104.20
## 6: 3 138 2.5e-29 104.20
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000). The protein data bank. Nucleic Acids Research, 28, 235–242. Retrieved from https://doi.org/10.1093/nar/28.1.235
El-Gebali, S., Mistry, J., Bateman, A., Eddy, S.R., Luciani, A., Potter, S.C., Qureshi, M., Richardson, L.J., Salazar, G.A., Smart, A., Sonnhammer, E.L.L., Hirsh, L., Paladin, L., Piovesan, D., Tosatto, S.C.E. & Finn, R.D. (2018). Rfam 13.0: Shifting to a genome-centric resource for non-coding rna families. Nucleic Acids Research, 47, D427–D432. Retrieved from https://doi.org/10.1093/nar/gky995
Kalvari, I., Argasinska, J., Quinones-Olvera, N., Nawrocki, E.P., Rivas, E., Eddy, S.R., Bateman, A., Finn, R.D. & Petrov, A.I. (2017). Rfam 13.0: Shifting to a genome-centric resource for non-coding rna families. Nucleic Acids Research, 46, D335–D342. Retrieved from https://doi.org/10.1093/nar/gkx1038
Lai, D., Proctor, J.R., Zhu, J.Y.A. & Meyer, I.M. (2012). R- chie : A web server and r package for visualizing rna secondary structures. Nucleic Acids Research, 40, e95–e95. Retrieved from https://doi.org/10.1093/nar/gks241
Nawrocki, E.P. & Eddy, S.R. (2013). Infernal 1.1: 100-fold faster rna homology searches. Bioinformatics, 29, 2933–2935. Retrieved from https://doi.org/10.1093/bioinformatics/btt509
Wang, L.-G., Lam, T.T.-Y., Xu, S., Dai, Z., Zhou, L., Feng, T., Guo, P., Dunn, C.W., Jones, B.R., Bradley, T., Zhu, H., Guan, Y., Jiang, Y. & Yu, G. (2019). Treeio: An r package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution, 37, 599–603. Retrieved from https://doi.org/10.1093/molbev/msz240