The present invention relates to a method for profiling nucleotide-resolution structure of an RNA molecule at single-molecule level.
The secondary structure of RNA molecules is often critical to performing their biological function (for example, translation, RNA degradation and RNA maturation). Determining secondary structures for RNAs of interest such as viral RNAs or mRNAs is crucial for regulating their corresponding RNA biological process. For instance, understanding the RNA structure of viral RNAs can enable the design of artificial siRNAs or antisense oligos to efficiently target and deactivate viral RNAs via RNA degradation, or the design of small molecules for targeting and inhibiting viral RNA activities such as replication. In addition, RNA structure is known to affect the RNA stability and translation efficiency. Optimizing the RNA structure could increase the translation efficiency and RNA stability, thereby enhancing protein production. Therefore, determining the accurate RNA structure in living cells is important in developing RNA-based gene engineering and therapeutics, e.g., anti-viral siRNAs/antisense oligonucleotides (ASO) and mRNA vaccines.
Recent methods for studying in vivo RNA structure are based on chemical probing. Two main types of chemical reagent can be used. One modifies the Watson-Crick base-pairing face on the nucleobase, as a direct measure of single-strandedness. For instance, Dimethyl sulfate (DMS) is one of the most commonly used nucleobase probing reagents as it easily penetrates the cell, a pre-requisite for in vivo chemical probing1. The other type of chemical reagent modifies the ribose by selective 2′-hydroxyl acylation and which can be analysed by primer extension (SHAPE)2,3. For instance, NAI (2-methylnicotinic acid imidazolide) is a commonly used SHAPE reagent. Both chemical modifications can be measured by both reverse transcription stopping and mutational profiling4-6. Since 2014, a variety of high-throughput (mostly Illumina-based short-read sequencing) RNA structure chemical probing methods have transformed the scope of RNA structure studies, enabling genome-wide RNA structure analyses7. However, there are still three main challenges to obtain accurate RNA structure in vivo:
We have developed a novel single-molecule structure sequencing method (smStructure-seq) which addresses these challenges by taking advantage of high-accuracy single-molecule sequencing, together with our new analysis pipeline that directly clusters the structural information derived from the mutation profile of each single molecule. This pipeline method (referred to as Determination of the Variation of RNA structure conformation (DaVinci)) incorporates the individual mutation profiles and derives the most-likely RNA structure conformation via a stochastic context-free grammar (SCFG) algorithm independent of thermodynamic parameters. Then the whole conformation space is identified and visualized via PCA analysis. Using the DaVinci method, we could accurately estimate RNA structure conformations at the single-molecule resolution. Therefore, our smStructure-seq allied with DaVinci analysis pipeline can address the challenges of heterogeneities of both isoforms and structural conformations simultaneously and is capable of generating single-molecule RNA structure conformations for each RNA transcript (e.g., isoform). Notably, our smStructure-seq together with our DaVinci analysis pipeline are the first methods which permit resolution of RNA structure information at a true single-molecule level.
According to a first aspect of the present invention, there is provided a method for determining structure of an RNA molecule, the method comprising:
This method allows determination of nucleotide-resolution structure of an RNA molecule at a single-molecule level. An individual cDNA molecule is sequenced with multiple reads to determine the consensus sequence representing the chemical mutation profile. This allows accurate determination of the profile for an individual molecule, as variations in the reads will be caused by structure-specific chemical modification in the sequencing process; whereas other methods which combine multiple reads of different molecules—even if having the same nucleotide sequence—cannot control for the possibility of different pairing and/or folding patterns of the original RNA molecules.
By “structure-specific chemical modification” is meant a modification process which is capable of modifying RNA nucleotides based on the single-strandedness of the RNA molecule at that nucleotide. In embodiments the modification process preferentially modifies single-stranded (ie, unpaired) nucleotides. The modification process preferably has no bias towards any particular nucleotides other than paired or unpaired (that is, any unpaired nucleotide is equally likely to be modified regardless of whether it is A, C, G, or U). The modification process may comprise contacting the RNA molecule with a chemical reagent. Two main types of chemical reagent can be used. One modifies the Watson-Crick base-pairing face on the nucleobase, as a direct measure of single-strandedness, e.g, DMS (dimethyl sulfate), kethoxal, glyoxal, EDC (1-ethyl-3-(3-dimethylaminopropyl) carbodiimide), DEPC (diethylpyrocarbonate) and CMCT (1-cyclohexyl-3-(2-morpholinoethyl) carbodiimide metho-p-toluene sulfonate). The other type of chemical reagent is a hydroxyl-selective electrophile, which modifies the ribose by selective 2′-hydroxyl acylation. The hydroxyl-selective electrophile may be selected from 1M7 (1-methyl-7-nitroisatoic anhydride), 1M6 (1-methyl-6-nitroisatoic anhydride), NMIA (N-methylisatoic anhydride), NAI (2-methylnicotinic acid imidazolide), FAI (2-methyl-3-furoic acid imidazolide), and 2A3 (2-aminopyridine-3-carboxylic acid imidazolide). This second category of reagent lends itself to analysis of RNA by primer extension (SHAPE).
During the reverse transcription step, bases which have been chemically modified are more likely to result in mutations than unmodified bases. Suitable reverse transcriptases for use in this step include e.g. murine leukemia virus (M-MLV) RT including Superscript II, Superscript Ill and Superscript IV reverse transcriptase (Invitrogen); avian myeloblastosis virus (AMV); human immunodeficiency virus type 1 (HIV-1) RTs; thermostable Geobacillus stearothermophilus group II intron RT (TGIRT-III); Marathon RT; Maxima H minus; Rocketscript (Bioneer); Thermoscript (Life Technologies); Monsterscript (Illumina) or fidelity (AccuScript; Stratagene); SMARTScribe reverse transcriptases; Induro RT, etc. For example, the reverse transcriptase may be more likely to skip the modified base, or to introduce a mutation into the cDNA strand at that position. When sequenced, an individual cDNA molecule will therefore yield a sequence which may differ from the original RNA sequence in a number of mutated bases. Which bases are mutated will depend on how accessible the original molecule was to the chemical modification, which in turn will depend on the conformation of the RNA. In this way, the consensus sequence generated by the single molecule sequencing can be said to accurately represent a chemical mutation profile for a given individual molecule by identifying which bases are mutated. Comparison of multiple chemical mutation profiles from individual molecules can highlight regions of molecules having the same or similar sequence but differing conformations.
The population of RNA molecules can be subjected to chemical modification in vitro or in vivo. In vivo modification can take place in a living cell, organ, or organism. The cell, organ, or organism may be a plant, fungus, protozoan, prokaryote including bacteria and archea, or animal cell, organ, or organism.
Preferably multiple RNA molecules are subjected to structure-specific chemical modifications; in preferred embodiments of the invention, step a) comprises subjecting a first population of RNA molecules to structure-specific chemical modifications. The RNA molecules of the first population may have identical sequences to one another. This can allow generation of multiple individual chemical mutation profiles, information from which can optionally be combined at or after step d) to determine common regions of likely chemical modification across multiple individual molecules having the same sequence. Alternatively, the RNA molecules of the first population may have similar (but not identical) or different sequences to one another, as discussed below.
In some embodiments, step a) of the method may also comprise subjecting a second population of RNA molecules to structure-specific chemical modifications. Additional further populations (third, fourth, fifth, etc) of RNA molecules may also be included. The RNA molecules of the second population preferably have identical sequences to one another, and a similar (but not identical) or different sequence to those of the first population.
By “having a similar sequence” is meant that the RNA molecules share one or more common regions. In some embodiments the RNA molecules may be variants of one another—for example, having regions of sequence identity and other regions where the sequence differs. In some embodiments, the variant RNA molecules may originate from a common genomic region. Such RNAs may be, for example, isoforms (eg transcription variants or splice variants) from the same gene or coding region; or subgenomic RNAs (eg subgenomic viral RNA). In other embodiments, the variant RNA molecules may originate from duplicated coding regions or members of gene families. Where a first variant RNA is shorter than a second variant RNA, the sequence of the first may be wholly contained within the sequence of the second—that is, there is preferably 100% identity across the length of the shorter RNA. Alternatively, there may be less than 100% identity (eg, 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 75%, 70% or less identity) over the length of the shorter RNA molecule.
By “having a different sequence” is meant that the RNA molecules do not share common sequences. In some embodiments the RNA molecules are completely dissimilar.
The method is particularly intended to be useful for analysing multiple variant RNA molecules sharing similar sequences which cannot be determined by short read sequencing platforms, as well as for analysing RNA molecules of identical sequence but which may have different structural conformations. Short read sequencing typically cannot sequence fragments of longer than 300 bp, and so when RNA variants share more than this amount of sequence, short read sequencing cannot determine which RNA variant a particular read comes from. Thus, in embodiments, similar RNA molecules share more than 300 nt, preferably more than 400 nt, most preferably more than 600 nt, 750 nt, 1000 nt, 1500 nt, 2000 nt, 2500 nt, 3000 nt, 3500 nt, 4000 nt, 4500 nt, 5000 nt, 10,000 nt, or more than 10,000 nt of common sequence. It is important to note that a population of RNA molecules with the same sequences may contain diverse RNA structural conformations due to the highly dynamic property of RNA molecules. Known methods which use short read sequencing or nanopore sequencing platforms—and which rely on combining information from more than one molecule in order to achieve suitable resolution—cannot resolve the RNA structure information for a single molecule. Therefore, in embodiments, the population of RNA molecules with the same sequences contains diverse RNA structural conformations.
The present method allows multiple RNA molecules of similar sequence to be modified, as well as multiple RNA molecules of the same sequence but which may have different conformations; since the modifications depend on structure of the RNA, different conformations of the RNA molecule will be modified in different ways. These modifications result in incorporation of mutations into the reverse transcribed complementary DNA sequence, and these mutations can be accurately identified by the consensus sequence (eg, a circular consensus sequence, CCS). Multiple consensus sequences can therefore be used to determine whether that site is likely to be single or double stranded in a given conformation—that is, variant RNA molecules with the same sequence can have multiple conformations, each of which will generate a different mutation profile. Combination of information from those unique mutation profiles may therefore indicate the range of different conformations, and the likelihood of a given nucleotide of each single RNA molecule being in a single or double stranded conformation. This can help to elucidate the potential conformation space.
In embodiments, the plurality of RNA molecules may be maintained in different conditions—for example, pH, temperature, light, nutrients, starvation stress, oxidative stress, presence or absence of ions, presence or absence of ions, presence or absence of proteins, presence or absence of metabolites, presence or absence of ligands, presence or absence of chemicals, presence or absence of compounds, presence or absence of pathogens, etc—and the effect of these conditions on single-molecule RNA structure investigated.
In embodiments, the plurality of RNA molecules may be maintained in different cell types—for example, plant cell types (e.g. meristem cell, vascular cell, endodermis cell, etc), animal cell types (e.g. bone cell, sperm cell, fat cell, etc), etc—and the effect of these cell types on single-molecule RNA structure investigated.
In embodiments, the plurality of RNA molecules may be maintained in different developmental stages—for example, plant developmental stages (e.g. vegetative developmental stages and flowering developmental stage, etc) and animal developmental stages (zygote, somitogenesis, organogenesis, etc), etc—and the effect of these developmental stages on single-molecule RNA structure investigated.
In embodiments, the plurality of RNA molecules may be maintained in different species and natural variants—for example, plant species (e.g. Arabidopsis, rice, wheat, barley, soybean, etc), animal species (e.g. zebrafish, Drosophila, mouse, human, worm, etc), etc—and the effect of these organisms and natural variants on single-molecule RNA structure investigated.
In embodiments, the method may further comprise subjecting a control RNA molecule (for example, a molecule which is not subjected to chemical modification) to one or more (and preferably all) of steps b)-d) of the method. The control molecules are used to estimate the experimental noise. In embodiments the control RNA molecule may be part of a control library of molecules; for example, an RNA library obtained from the same source (and hence representing the same mixture of RNA molecules) as the test RNA molecule and/or test RNA library.
The single molecule sequencing method is a method which is capable of obtaining sequence information for single molecules, and is preferably long-read sequencing. By long-read sequencing is meant a method which is capable of sequencing a single molecule of more than 300 nt, preferably more than 400 nt, most preferably more than 600 nt, 750 nt, 1000 nt, 1500 nt, 2000 nt, 2500 nt, 3000 nt, 3500 nt, 4000 nt, 4500 nt, 5000 nt, 10,000 nt, or more than 10,000 nt. In preferred embodiments, the method is designed for use with the PacBio platform for single-molecule real time sequencing. The PacBio platform is generally described in Eid et al, “Real-Time DNA Sequencing from Single Polymerase Molecules”, SCIENCE, 2 Jan. 2009, Vol 323, Issue 5910, pp. 133-138. In brief, double stranded DNAs containing the chemical-induced mutations are ligated to dumbbell adapters to create a circular molecule. A primer and strand-displacement polymerase are introduced, together with labelled nucleotides. Since the molecule is circular, repeated rounds of polymerase replication are undertaken, with incorporation of nucleotides being detected at each step. Multiple reads from a single molecule can then be combined to give one highly accurate consensus sequence.
As will be apparent from the disclosures herein, although the determination of RNA structure by DMS/SHAPE-method is known, the resolution of RNA structure information at single-molecule level has not previously been known.
In embodiments of the invention, the method may be carried out on a library of RNA molecules to determine structures of multiple RNA molecules (whether a sub-population of the whole RNA population, or potentially a transcriptome-wide structurome). For example, the library may be obtained from the RNA population of a cell, tissue, or organ (for instance, an RNA structurome library for a particular tissue). In other embodiments, the library can also include multiple RNA transcripts (e.g., different isoforms) for one or more particular genes; this would allow determination of single-molecule RNA structure conformations for each of the isoforms.
The method may further comprise performing sequence-specific amplification of the cDNA obtained in step b) prior to carrying out step c). This may be of particular benefit where the method is carried out on a library of individual RNAs and allows amplification of a desired target sequence from the whole RNA population. In this way, a given library may be used for determination of different RNA structures for individual genes of interest, or for groups of genes sharing common sequence regions (eg, those arising from gene duplication events, or from differential splicing of RNA), or transcriptomes. The amplification may be thermal amplification, for example, PCR; or may be isothermal amplification, for example, loop-mediated isothermal amplification (LAMP), Helicase-dependent isothermal amplification (HDA), Rolling cycle amplification (RCA), (Recombinase polymerase amplification) RPA, and so on. The skilled person will be aware how to perform such techniques.
In embodiments, the method may further comprise determining RNA structure information at single-molecule level for multiple RNA molecules with the same or different sequences; for example, up to the transcriptome-wide scale.
In embodiments, the step of combining information from multiple mutation profiles may comprise converting each mutation profile to a bit vector to indicate whether a modification occurs at each individual base; and combining multiple bit vectors. The step of combining multiple bit vectors may comprise transforming the bit vectors into single-stranded constraint information—that is, determining for each nucleotide whether it is constrained to be single stranded, or whether base pairing may take place.
In embodiments, the method may further comprise the step of using the structure determined at step d) to generate multiple possible RNA structures for a given molecule, and optionally clustering said multiple possible structures into two or more groups. That is, the determined structure may include probabilities of each base in a given RNA molecule being paired or unpaired, and those probabilities used to generate possible specific structures within the available conformation space. In embodiments, the multiple possible RNA structures are generated independent of thermodynamic parameters; that is, they are based primarily or solely on the A-U and G-C or G-U base-pairing rules. A stochastic context-free grammar (SCFG) may be used to derive the set of possible RNA structures based on the bit-vectors as constraint information. Once generated, the set of possible structures may be clustered by any appropriate algorithm. Where appropriate, clustering may also be carried out directly on the structures determined at step d), without previously generating multiple possible RNA structures for a given molecule. This may be of particular interest when the method is carried out on multiple RNA populations; the determined structures of the populations may be clustered in order to identify particular groups of RNAs (for instance, isoforms) sharing similar regions of conformation space; or when the method is carried out on a population of RNA molecules with the same sequences but diverse RNA structural conformations.
The methods described herein have a number of potential applications based on obtaining accurate RNA structure information. In embodiments, providing the RNA structure information of viral RNAs may enable the design of artificial siRNAs or antisense oligos to efficiently target viral RNAs and deactivate the viral RNAs via RNA degradation. Additionally, it also enables the design of small molecules for targeting the viral RNAs and subsequently inhibiting the viral RNA activities such as replication. Furthermore, this method may further guide the optimization of the RNA structure allowing the enhancements of protein production. The skilled person will be aware of how the structure of the molecule (and accessibility of nucleotides) can impact on these uses.
Described herein are three workflows for obtaining RNA structures; each is designed for addressing different aspects of RNA analysis, and hence the skilled person may adopt different workflows for different purposes. The first workflow (
Methods described herein may also include a new structure analysis pipeline, Determination of the Variation of RNA structure conformation (DaVinci) that incorporates the individual mutation profiles and derives the most-likely RNA structure conformation via a stochastic context-free grammar (SCFG) algorithm independent of thermodynamic parameters. Then the whole conformation space is identified and visualized via PCA analysis. Using the DaVinci method, we could accurately estimate RNA structure conformations at the single-molecule resolution.
Solving In Vivo RNA Structure Conformation at Single Molecule Level (smStructure-Seq)
The RNA structure is an intrinsic property of RNAs which serves an important genetic and functional role of RNA beyond its nucleotide sequence itself carrying coding information. Since 2014, a variety of high-throughput (mostly Illumina-based short-read sequencing) RNA structure profiling methods have transformed the scope of RNA structure studies, enabling genome-wide RNA structure analyses7. However, these methods could not resolve RNA structure at the single-molecule level. Our smStructure-seq together with our DaVinci analysis pipeline are the first method that resolve the RNA structure information at the single molecule level, solving three main challenges to decipher the RNA structure in vivo.
The isoform heterogeneity is a major challenge to accurately assign RNA structural information for individual gene-linked isoforms. 90% of human genes12 and 60% of Arabidopsis genes12 have alternative splicing events. The RNA structure information within the shared regions between isoforms cannot be distinguished by short read sequencing platforms (e.g., Illumina). Our smStructure-seq addresses this challenge by using the PacBio sequencing method. The sequencing principle of PacBio platform allows the dissection of different transcript isoforms accurately9,13-15, since it requires no assembly step that attenuates the accuracy of isoform assignment.
The second challenge is to determine the RNA structure information for single molecules. RNA structure adopts multiple conformations. Single-molecule structure information as described herein can not only discriminate RNAs with very similar sequence (e.g, isoform or RNA sub-genome in viruses), but can facilitate the identification of RNA structure diversity (the third challenge of RNA structure analysis). Recently, a Nanopore-based method, PORE-cupine16 was developed to address these challenges. The long-read Nanopore sequencing captures structures along the whole length of each isoform16. However, the macromolecules in the Nanopore channel can be occupied by multiple bases at one time, increasing uncertainty in signal assignment of the nucleotides16,17. Besides, Nanopore has an averaged error rate of 14% for both direct RNA and cDNA sequencing8, which cannot achieve the single-molecular accuracy. In contrast, the long-read, single molecule real time sequencing (eg, PacBio platform) used by smStructure-seq can achieve 99.9% accuracy at the nucleotide level2, facilitating the single-molecular accuracy. The accurate single-molecule read is the foundation to deciphering the conformation diversity at the single-molecule level.
The RNA structure can dynamically change in vivo by adopting different conformations. Directly dissecting the heterogeneity of different RNA structural conformations remains challenging. Two computational approaches, DREEM10 and DRACO11 have been developed. DREEM10 used an expectation-maximization regime to detect the RNA structure conformation while DRACO11 used an alternative method based on a new clustering regime. These two computational methods were developed to estimate structural conformations based on the Illumina-based platform. Due to the limitation of short read sequencing, the direct dissection of RNA structural conformations has only been achieved for short RNA fragments (200-300 nt)10 in examples to date, although in theory these methods could be improved for long transcripts. Despite this possibility, these two computational approaches deduce the RNA structure conformation by clustering the chemical reactivity profiles. The chemical reactivity-based clustering methods tend to generate two mutation profiles: one with extreme high chemical modifications (more single-stranded RNA structure) and one with extreme low chemical modifications (more double-stranded RNA structure). These clusters directly reflect the similarity of chemical modification efficiencies rather than directly represent the clusters of RNA structure conformations per se.
Our methods described herein (and referred to as “smStructure-seq”) can solve the challenge by taking advantage of high-accuracy single-molecule sequencing, together with our new analysis pipeline that directly clusters the structural information derived from the mutation profile of each single molecule. This method (named “Determination of the Variation of RNA structure conformation” (DaVinci)), incorporates the individual mutation profiles and derives the most-likely RNA structure conformation via a stochastic context-free grammar (SCFG) algorithm independent of thermodynamic parameters. Then the whole conformation space is identified and visualized via PCA analysis. Using the DaVinci method, we could accurately estimate RNA structure conformations at the single-molecule resolution.
Single-Molecule RNA Structure Profiling (smStructure-Seq)
Our methods described herein (and referred to as “smStructure-seq”) adopt the advantage of long read length and the high accuracy of HiFi reads of the PacBio platform7, which is capable of the direct determination of both RNA isoform-specific structures and structural conformations. Our smStructure-seq can address the challenges of heterogeneities of both isoforms and structural conformations simultaneously.
This method is suited to diverse RNA structure probing chemicals such as Dimethyl sulfate (DMS) and 2′-hydroxyl acylation-based chemicals (SHAPE reagent). Extracted RNAs are exposed to the relevant probing chemicals, and subjected to reverse transcription, where the modified sites lead to mutations in the complementary DNA (cDNA). We adapted the resulting cDNAs into the PacBio platform for single-molecule real time sequencing, hence the overall method is termed as single molecule-based RNA structure sequencing (smStructure-seq). The derived raw reads were processed to obtain high-accuracy circular consensus sequences or HiFi reads for generating the RNA structure probing chemical reactivities based on the chemical-adduct mutation profiles.
We have described below three smStructure-seq library construction pipelines, which may be used depending on the properties of different reverse transcriptases. For all pipelines, the initial RNA treatment, and generation of a (+)SHAPE and (−)SHAPE RNA library, is carried out as follows. The example below uses RNA from Arabidopsis thaliana seedlings. Our smStructure-seq is able to be used in other organisms.
(+)SHAPE and (−)SHAPE Single-Molecule Structure-Seq (smStructure-Seq) Library Construction
We used the SHAPE reagent, 2-methylnicotinic acid (NAI) to do the in vivo RNA secondary structure chemical probing. NAI was prepared as reported previously3. Briefly, Arabidopsis thaliana seedlings were completely covered in 20 mL 1×SHAPE reaction buffer (100 mM KCl, 40 mM HEPES (pH7.5) and 0.5 mM MgCl2) in a 50 mL Falcon tube. NAI was added to a final concentration of 1 M and the tube swirled on a shaker (1000 rpm) for 15 min at room temperature (22° C.) or 30 min at 4° C. This high NAI concentration allows NAI to penetrate plant cells and modify the RNA in vivo. After quenching the reaction with freshly prepared dithiothreitol (DTT), the seedlings were washed with deionized water and immediately frozen with liquid nitrogen and ground into powder. Total RNA was extracted using hot phenol method8, followed by DNaseI treatment in accordance with the manufacturer's protocol. The control group was prepared using DMSO (dimethyl sulfoxide, labeled as (−)SHAPE), following the same procedure as described above.
2 μg (+)SHAPE or (−)SHAPE RNA samples was added into 19 μL buffer system containing with 2 μL 0.5 μM RNA-DNA hybrid adaptors (SEQ ID NO 1: 5′rArGrA rUrCrG rGrArA rGrArG rCrArC rArCrG rUrCrU rGrArA rCrUrC rCrArG rUrCrA rC/3SpC3/and SEQ ID NO 2: 5′GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC TN (N=equimolar A, T, G, C)), 4 μL 5× reaction buffer (2.25 M NaCl, 25 mM MgCl2, 100 mM Tris-HCl, pH 7.5), 2 μL 10×DTT (50 mM; made fresh or from frozen stock), and 1 μL TGIRT®-III enzyme (10 μM; InGex). The reaction system was pre-incubated at room temperature for 30 minutes, then 1 μL of 25 mM dNTPs (an equimolar mixture of dATP, dCTP, dGTP, and dTTP at 25 mM each; RNA grade) added. The whole reaction system in the tube was incubated at 60° C. for 120 minutes. To remove the TGIRT®-III enzyme from the template, 1 μL of 5 M NaOH was added and incubated at 95° C. for 3 minutes. The sample was cooled down to room temperature and neutralized with 1 μL of 5 M HCl before the clean-up of the cDNAs with a MinElute Reaction Cleanup Kit (QIAGEN, Cat. No. 28204). To capture specific target RNAs, PCR reactions with 10 cycles were conducted with primers specific for the desired target using KOD Xtreme™ Hot Start DNA Polymerase (Novagen). Alternatively, non-specific amplification may be used to enrich the library. The amplified DNA fragments from the eight replicates of PCR reactions were merged for achieving sufficient DNAs. The resulting DNAs were size-selected by Solid Phase Reversible Immobilization (SPRI) size selection system (BECKMAN COULTER). Two independent biological replicates were generated for both (+)SHAPE and (−)SHAPE smStructure-seq libraries. The purified DNA samples were subjected to PacBio library construction by BGI company using PacBio Sequel 3.0.
smStructure-seq Data Analysis
The raw reads from (+)SHAPE and (−)SHAPE libraries were converted into HiFi reads (circular consensus sequences, i.e., CCS ) by the utility “ccs” (https://github.com/PacificBiosciences/ccs) with parameters ‘--minPasses=3’ in order to achieve ˜99.8% predicted accuracy (Q30)9. The HiFi reads were demultiplexed utilizing demultiplex barcoding algorithm Lima version 1.11.0 (https://github.com/pacificbiosciences/barcoding). The derived HiFi reads were mapped to target reference sequences by using BLASR (version 5.3.3)19 with parameters “--minMatch 10 -m 5 --hitPolicy leftmost”. Each read was converted into a ‘bit vector’. Briefly, each bit vector corresponds to a single read and consists of series of ‘0’s (representing matches) and ‘1’s (mutations representing mismatches and unambiguously aligned deletions)10. To generate the overall SHAPE reactivity profiles, the mutation rate (mutr) at a given nucleotide is simply the total number of “1”s divided by the total “0”s and “1”s at that location. Raw SHAPE reactivities were then generated for each nucleotide using the following expression,
where (+)SHAPE corresponds to a NAI treated sample and (−)SHAPE refers to a DMSO treated sample. 1−mutr((−)SHAPE) was the true negative rate representing the specificity at the specific location. The raw SHAPE reactivity (R) mathematically estimates the positive likelihood ratio of SHAPE modification. The raw SHAPE reactivity was normalized to a standard scale that spanned 0 (no reactivity) to ˜1 (high SHAPE reactivity)20 for showing the mutation profiles.
The whole pipeline of Davinci (Determination of the Variation of RNA structure conformation via stochastic context-free grammar) is illustrated in
The representative structure for each cluster was identified by calculating the most common RNA structure type at each position (i.e., maximum expected accuracy, MEA)25 and then being determined by the RNA structure that is at the center of the whole conformational space and most similar to the most common RNA structure. The base-pair probability (bpp) was calculated by counting the frequency of all present base-pairs in the whole conformation space. The positional base-pair probability was derived by the Pi=ΣjjPij, where Pij is the probability of base i of being base-paired to base j, over all its potential J pairing partners. The likelihood of single-strandedness was calculated by the expression of “1−Pi”. Besides, Shannon entropy is calculated as, Ei=Σjj−Pij*log 10(Pij), where Pij is the probability of base i of being base-paired to base j, over all its potential J pairing partners.
Three smStructure-Seq Pipeline Workflows for Different Applications:
This workflow is for obtaining the single-molecule RNA structure information of individual RNAs of interests. 2 μg DMS or SHAPE treated RNA samples are added into 19 μL buffer system containing 2 μL 0.5 μM gene specific reverse transcription primer, 4 μL 5×reaction buffer (2.25 M NaCl, 25 mM MgCl2, 100 mM Tris-HCl, pH 7.5), 2 μL 10×DTT (50 mM; made fresh or from frozen stock), and 1 μL SuperScript II or SuperScript Ill or SuperScript IV or TGIRT®-III enzyme or MarathonRT enzyme (additional 1 μL 20 mM MnCl2). Pre-incubate the reaction system at room temperature for 30 minutes, then add 1 μL of 25 mM dNTPs (an equimolar mixture of dATP, dCTP, dGTP, and dTTP at 25 mM each; RNA grade). The whole reaction system in the tube is incubated at 60° C. for 120 minutes. To remove the reverse transcriptase from the template, 1 μL of 5 M NaOH is added and incubate at 95° C. for 3 minutes. The sample is cooled down to room temperature and neutralized with 1 μL of 5 M HCl before the clean-up of the cDNAs with a MinElute Reaction Cleanup Kit (QIAGEN, Cat. No. 28204). PCR reactions with 10 cycles are conducted with forward and reverse primers containing gene specific primers along with sequences of barcode indexes using KOD Xtreme™ Hot Start DNA Polymerase (Novagen). The amplified DNA fragments from the eight replicates of PCR reactions are merged for achieving sufficient DNAs. The resulting DNAs are size-selected with 1.5% agarose gel, and obtain size fractions (1-2 kb, 2-3 kb, and 3-6 kb) to maximize long reads for capturing all the full-length transcript variants. For individual transcripts with known lengths, the resulting DNAs are size-selected for the specific size range by Solid Phase Reversible Immobilization (SPRI) size selection system (BECKMAN COULTER) following SageELF 2% agarose gel cassette. The purified DNA samples were subjected to PACBIO dumbbell library construction and sequenced using PacBio Sequel I V3.0 or Sequel II.
SmStructure-Seq Pipeline Workflow II (for Long RNAs Up to ˜20 kb)—
This workflow is for obtaining the single-molecule RNA structure information of individual long RNAs up to ˜20 kb. For instance, viral RNAs are usually very long RNAs. This workflow is designed for solving both full-length/subgenome viral RNAs and long RNAs. 2 μg DMS or SHAPE treated RNA samples are added to 10 μL buffer system containing 1 μl 10×T4 PNK buffer, 1 μl T4 PNK enzyme (NEB, M0201L) with no ATP and incubated at 37° C. for 30 min.
1 μl of 50 μM 3′RNA adapter (SEQ ID NO 3: /5rApp/AGAUCGGAAGAGCACACGUCUG/3SpC3/), 1 μl 10×T4 RNA ligase buffer, 6 μl PEG8000, and 2 μl T4 RNA ligase 2 K227Q (NEB, M0351L) were added into the 10 μL sample and incubated at 25° C. for an hour. 1 μL 20 μM DNA primer (SEQ ID NO 4: 5′ GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT 3′), 1 μL 20 mM MnCl2, 1 μL 10 mM dNTPs, 4 μL 5×RT reaction buffer, 0.5 μL RiboLock RNase Inhibitor and 1 μL Maxima H Minus enzyme (200 U/μL) are added into ligated RNAs and incubated at 50-65° C. for 180 minutes. Since both MMLV and Maxima H Minus enzyme are capable of adding CCC at the 3′end of cDNA after the first strand cDNA synthesis, then afterwards, the 2nd strand of cDNA is synthesized by SMARTScribe Reverse Transcriptase using template switching (TS) oligo (e.g., SMARTer IIA oligonucleotide) which is a DNA oligo sequence that carries 3 ribo-guanosines (rGrGrG) at its 3′ end. PCR reactions with 10 cycles are conducted with forward and reverse primers using KOD Xtreme™ Hot Start DNA Polymerase (Novagen). The amplified DNA fragments from the eight replicates of PCR reactions are merged for achieving sufficient DNAs. The resulting DNAs are size-selected with 1.5% agarose gel, and obtain three size fractions (1-2 kb, 2-3 kb, and 3-6 kb) to maximize long reads for capturing the full-length transcripts. For individual known transcripts, the resulting DNAs are size-selected for the specific size range by Solid Phase Reversible Immobilization (SPRI) size selection system (BECKMAN COULTER) following SageELF 2% agarose gel cassette. The purified DNA samples were subjected to PACBIO dumbell library construction and sequenced using PacBio Sequel I V3.0 or Sequel II.
This workflow is for obtaining the single-molecule RNA structure information of multiple RNAs up to the transcriptome-wide scale. 2 μg DMS or SHAPE treated RNA samples are added into 19 μL buffer system containing 2 μL 0.5 μM RNA-DNA hybrid adaptors (SEQ ID NO 1: 5′rArGrA rUrCrG rGrArA rGrArG rCrArC rArCrG rUrCrU rGrArA rCrUrC rCrArG rUrCrA rC/3SpC3/and SEQ ID NO 5: 5′GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC TNNNNNN (N=equimolar A, T, G, C)), 4 μL 5×reaction buffer (2.25 M NaCl, 25 mM MgCl2, 100 mM Tris-HCl, pH 7.5), 2 μL 10×DTT (50 mM; made fresh or from frozen stock), and 1 μL TGIRT®-III enzyme or MarathonRT enzyme (additional 1 μL 20 mM MnCl2). Pre-incubate the reaction system at room temperature for 30 minutes, then add 1 μL of 25 mM dNTPs (an equimolar mixture of dATP, dCTP, dGTP, and dTTP at 25 mM each; RNA grade). The whole reaction system in the tube is incubated at 60° C. for 120 minutes. To remove the TGIRT®-III enzyme or MarathonRT enzyme from the template, 1 μL of 5 M NaOH is added and incubate at 95° C. for 3 minutes. The sample is cooled down to room temperature and neutralized with 1 μL of 5 M HCl before the clean-up of the cDNAs with a MinElute Reaction Cleanup Kit (QIAGEN, Cat. No. 28204). ssDNA ligation is performed to attach the ssDNA linker to the 3′end of the cDNA products. There are two ssDNA ligation methods: 1) CircLigase I enzyme ligation with ssDNA linker SEQ ID NO 6: /5Phos/NNNAGATCGGAAGAGCGTCGTGTAG/3SpC3/.
The ssDNA ligation conditions use the CircLigase I enzyme, but they have been modified from the literature and manufacturer's (Epicentre) protocols to optimize ligation. Ligation is performed at 65° C. instead of 60° C., and for 12 h instead of 1 h to facilitate higher ligation yield and to allow time for slower ligating sequences to react. Ligation is followed by heating at 85° C. for 15 min to deactivate the CircLigase I. 2) T4 DNA ligase reaction with ssDNA linker SEQ ID NO 7: /5Phos/AGATCGGAAGAGCGTCGTGTAGCTCTTCCGATCTNNNNNN/3SpC3. In the 20 μL reaction, 10 μl of 2×Quick T4 ligase buffer, 1 μl Quick T4 DNA ligase (NEB, M2200L), and 1 μl 50 M ssDNA linker are added together and incubate at 20° C. overnight. PCR reactions with 10 cycles are conducted with forward and reverse primers (SEQ ID NO 8: 5′ GATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT 3′ and SEQ ID NO 4: 5′ GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT 3′) using KOD Xtreme™ Hot Start DNA Polymerase (Novagen). The amplified DNA fragments from the eight replicates of PCR reactions are merged for achieving sufficient DNAs. The resulting DNAs are size-selected with 1.5% agarose gel, and obtain three size fractions (1-2 kb, 2-3 kb, and 3-6 kb) to maximize long reads for capturing the full-length transcripts. For individual known transcripts, the resulting DNAs are size-selected for the specific size range by Solid Phase Reversible Immobilization (SPRI) size selection system (BECKMAN COULTER) following SageELF 2% agarose gel cassette. The purified DNA samples were subjected to PACBIO dumbell library construction and sequenced using PacBio Sequel I V3.0 or Sequel II.
In parallel, another pipeline is designed for other reverse transcriptases, e.g., SuperScript II or SuperScript Ill or SuperScript IV. 2 μg DMS or SHAPE treated RNA samples are added to 10 μL buffer system containing with 1 μl 10×T4 PNK buffer, 1 μl T4 PNK enzyme (NEB, M0201L) with no ATP and incubated at 37° C. for 30 min. 1 μl of 50 μM 3′RNA adapter (SEQ ID NO 3: /5rApp/AGAUCGGAAGAGCACACGUCUG/3SpC3/), 1 μl 10×T4 RNA ligase buffer, 6 μl PEG8000, and 2 μl T4 RNA ligase 2 K227Q (NEB, M0351L) were added into the 10 μL sample and incubated at 25° C. for an hour. 1 μL 20 μM DNA primer (SEQ ID NO 4: 5′ GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT 3′), 1 μL 20 mM MnCl2, 1 μL 10 mM dNTPs, 4 μL 5×RT reaction buffer, 0.5 μL RiboLock RNase Inhibitor and 1 μL SuperScript II or SuperScript Ill or SuperScript IV enzyme (200 U/μL) are added into ligated RNAs and incubated at 42° C. for 180 minutes. To remove the SuperScript RT enzyme from the template, 1 μL of 5 M NaOH is added and incubate at 95° C. for 3 minutes. The sample is cooled down to room temperature and neutralized with 1 μL of 5 M HCl before the clean-up of the cDNAs with a MinElute Reaction Cleanup Kit (QIAGEN, Cat. No. 28204). ssDNA ligation is performed to attach the ssDNA linker to the 3′end of the cDNA products. There are two ssDNA ligation methods: 1) CircLigase I enzyme ligation with ssDNA linker SEQ ID NO 6: /5Phos/NNNAGATCGGAAGAGCGTCGTGTAG/3SpC3/. The ssDNA ligation conditions use the CircLigase I enzyme, but they have been modified from the literature and manufacturer's (Epicentre) protocols to optimize ligation. Ligation is performed at 65° C. instead of 60° C., and for 12 h instead of 1 h to facilitate higher ligation yield and to allow time for slower ligating sequences to react. Ligation is followed by heating at 85° C. for 15 min to deactivate the CircLigase I. 2) T4 DNA ligase reaction with ssDNA linker SEQ ID NO 7: /5Phos/AGATCGGAAGAGCGTCGTGTAGCTCTTCCGATCTNNNNNN/3SpC3. In the 20 μL reaction, 10 μl of 2×Quick T4 ligase buffer, 1 μl Quick T4 DNA ligase (NEB, M2200L), and 1 μl 50 M ssDNA linker are added together and incubate at 20° C. overnight. PCR reactions with 10 cycles are conducted with forward and reverse primers (SEQ ID NO 8: 5′ GATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT 3′ and SEQ ID NO 4: 5′ GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT 3′) using KOD Xtreme™ Hot Start DNA Polymerase (Novagen). The amplified DNA fragments from the eight replicates of PCR reactions are merged for achieving sufficient DNAs. The resulting DNAs are size-selected with 1.5% agarose gel, and obtain three size fractions (1-2 kb, 2-3 kb, and 3-6 kb) to maximize long reads for capturing the full-length transcripts. For individual known transcripts, the resulting DNAs are size-selected for the specific size range by Solid Phase Reversible Immobilization (SPRI) size selection system (BECKMAN COULTER) following SageELF 2% agarose gel cassette. The purified DNA samples were subjected to PACBIO dumbell library construction and sequenced using PacBio Sequel I V3.0 or Sequel II.
Once the samples have been prepared and sequenced, the data is analysed in much the same way regardless of pipeline, as described above.
To benchmark the reproducibility and accuracy of our smStructure-seq, we calculated the SHAPE reactivities of 18S rRNA. We found our smStructure-seq libraries were highly reproducible with very high Pearson correlations of 0.95 (P-value <2.2e-16). By comparing our SHAPE reactivities with the 18S rRNA phylogenetic secondary structure26, we found that our smStructure-seq can accurately probe the full-length RNA structure in vivo. We obtained around 8.58 billion total bases (1,317,882 raw reads) of 18S rRNA which serves as the internal control for our smStructure-seq libraries.
To demonstrate the power of DaVinci, we performed our analysis pipeline on the HIV Rev response element (RRE) that has been reported to be able to adopt alternative conformations that promote different rates of virus replication28. DaVinci was used to analyse the DMS-MaP data of HIV RRE region10. DREEM used chemical reactivity-based clustering methods and identified two extreme conformations The conformation 1′ and conformation 2′ were the conformation identified by DREEM10 which are close to the conformations identified by DaVinci (conformation 1 and conformation 2). However, DaVinci could identify an extra cryptic HIV Rev response element (RRE)29 with extended stem IV and short stem V whose conformation was not identified by chemical reactivity-based clustering methods. This conformation (named as RRE61, shown as conformation 3) has been identified to have the ability to confer RevM10 resistance29. See
A further example comes from analysis of primary-miRNAs RNA secondary structure, important as it affects the efficiency of Microprocessor and Dicing complexes30-32 Previous work33 has shown that a SWI2/SNF2 ATPase CHR2 can change pri-miRNA structures based on the ensemble structure model derived from average chemical modifications. In the chr2-1 mutant, pri-miR319b had a lesser folding or more unpaired nucleotides upper stem, potentially able to affect pri-miRNA processing from the terminal loop to the lower33. Using our DaVinci analysis pipeline, we revealed at least four conformations in the conformation space of pri-miRNA319b. Rather than changing the ensemble structures, CHR2 changes the populations of individual conformations towards to three conformations with lesser folded upper stem region. These results showed that DaVinci can identify the dynamic nature of in vivo RNA structure conformations, facilitating the investigation of the RNA structural conformation functionality. See
Therefore, our smStructure-seq allied with DaVinci analysis pipeline can address the challenges of both heterogeneities of isoforms and structural conformations simultaneously and is capable to generate single-molecule RNA structure conformations for each RNA transcript (e.g., isoform).
The following section provides a more detailed walkthrough of the smStructure-Seq analysis pipeline, DaVinci. Reference is made to various publicly-available tools, as well as to additional python scripts written for this purpose. The specific additional python scripts implement the general principles of the method described herein, and are not considered essential to put the present methods and invention into practice, as the skilled person is able to write alternative suitable computer code to implement the invention. The particular additional scripts are mapped-m5-to-bitvectors.py; fold2dotbracketFasta.py; fold-contrafold-uniq-bits-vectors.py; draw-kmeans-clusters.py; run-pca-on-forgi-vectors.py; README-steps.md.
The section also refers to example test data files; these are not given here, but again the skilled person would understand what such example test data files may contain. Similarly, final output files are listed by name. The output file COOLAIR3_R1_R2-clusters.png from this test data is shown in
smStructure-Seq Analysis Pipeline, DaVinci.
Long reads sequencing used the highly accurate sequencing mode to generate single RNA molecule sequence information and the output from the following three steps will be further used for the section (B).
Generate Highly Accurate Single-Molecule Consensus Reads (HiFi Reads) using ccs version 4.2.0 (from https://github.com/PacificBiosciences/ccs). We start with the raw pacbio subreads as the bam and corresponding pbi index files to generate the HiFi Reads. We have run the consensus generation with minimum 10 passes (Note we have tried different number of passes varying from 3 to 10).
In this step, we generate a highly accurate single-molecule consensus reads (HiFi Reads). The input to the ccs tool is the raw bam file ‘raw.subreads.bam’ and corresponding output file is the ‘consensus_reads.ccs2.bam’. The parameter ‘minPasses’ should be set to desired rounds of polishing steps; here it is set to a high standard, i.e, 10 rounds. We have computed this on high-performance computing clusters with 64 cores indicated by the parameter (-j 64).
In this step, we demultiplexed the reads HiFi reads by the barcodes. This step is optional if the reads were not multiplexed before the sequencing. We specify the direction primers in a FASTA file by denoting as 5p and 3p for the forward and reverse direction, respectively. Demultiplexed the reads using lima version 1.11.0 (https://github.com/PacificBiosciences/barcoding), a PacBio Barcode Demultiplexer and Primer Remover tool with the following command:
In the command above, the primers sequence are stored in the FASTA format file named ‘primers.fa’ and the demultiplexed output bams are given with prefix ‘output.bam’. We used the flag ‘--split-bam-named’ to the names corresponding to the names specified in the primer sequence file. For this case, we set the desired number of bams to ‘--bam-handles 11’, which should be a set number of primers. This step also outputs the corresponding ‘subreadset.xml’ files which allows us the loading of the bam files in the next step.
We converted the BAM (and the associated xml files) format to the FASTA format for the downstream processes, as conversion allowed us to manually inspect the output. We used bam2fasta (version 1.3.1) from bam2fastx package (https://github.com/PacificBiosciences/bam2fastx) with the following command:
Here the ‘--output’ is the output name prefix and the input is the xml files which in turns load the bam files for conversion. Note this is an optional step as the bam files can be directly used in the following steps.
Step 1: Mapping of HiFi Reads to the Transcriptome Reference with BLASR Aligner.
We have used the PacBio long read aligner BLASR tool (version 5.1) to map the clean reads to the transcriptome reference (https://github.com/pacificbiosciences/blasr). We set the hitPolicy as leftmost to pick the best hits. We used the m5 format as an output (as this allows for manual inspection of the mapping results).
In this step, the input reads for replicate ‘R1_5p.fasta’ and the target reference ‘reference.fasta’ and output file is the R1_5p.m5 which is m5 formatted alignment file. We ran this with a minimum seed length match threshold of 10 with the parameter ‘--minMatch’.
Step 2: Generate Bit Vectors from the Alignment.
The observed mutation profiles for a transcript were converted to a binary vector representation using the bespoke script ‘m5_to_bitvectors.py’. The bit vectors represent sites of mutation and marked as 1 for a mutation(mismatch) or 0 otherwise denoting a wild type (no mutation) state. The region of reference not covered by the reads were marked as ‘NA’.
The parameter ‘--transcript’ denotes the name of the target transcript and the ‘-reference_file’ the transcriptome reference in the FASTA format that we used in the mapping step. The input files are a list of m5 formatted files, ‘R1_5p.m5’ and ‘R2_5p.m5’. We are generating a single output bitvector file ‘COOLAIR3_5p.bit’.
We use context free RNA folding with contrafold version 2.02 (http://contra.stanford.edu/contrafold/contrafold_v2_02.tar.gz) to get the folded RNA structure files. Then these structure files were further processed through the forgi (https://github.com/ViennaRNA/forgi) tool to get the a numerical representation as Forgi vectors. The script fold-contrafold-uniq-bits-vectors.py folds and generates the Forgi secondary structure representation for the given transcripts. Forgi tool and the associated utility rnaConvert were used (https://github.com/ViennaRNA/forgi).
Corresponding script for the conversion of folded files to the dotbracket format is done via the fold2dotbracketFasta.py script
We used the dimension reduction analysis, for example, principal component analysis (PCA) on the Forgi vectors from Scikit-learn: Machine Learning in Python (Pedregosa et al., JMLR 12, pp. 2825-2830, 2011). The projection helps us to view the groupings and separate out the latent classes or to view the conformational space of structures. The following bespoke script gives an output as a png and pdf format along with a csv file for a further manual inspection.
Given a desired number of clusters we perform K-means clustering on the PCA projection using Scikit-learn. We can specify the number of clusters; we have set the default value to three, after trying out different numbers. We visually found three clusters is an optimal number of clusters from the plot.
The parameter ‘--tag’ is the prefix for the output files and the ‘--input_file’ is the csv output file we have from the PCA step. The final outputs will be figures (png and pdf) and a csv file, for this example COOLAIR3-clusters.png, COOLAIR3-clusters.pdf and COOLAIR3-clusters.csv. Clusters 1, 2, and 3 are coloured in the plot as green, orange and pink, respectively.
| Number | Date | Country | Kind |
|---|---|---|---|
| 501541 | Feb 2022 | LU | national |
| 2208604.5 | Jun 2022 | GB | national |
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/EP2023/054582 | 2/23/2023 | WO |