The present disclosure relates generally to novel methods for nucleic acid sequencing. Specifically, the invention relates to a liquid chromatography-mass-spectrometry (LC-MS) based technique for direct sequencing of RNA without prior complementary DNA (cDNA) synthesis. The technique allows one to simultaneously read target RNA sequences with single nucleotide resolution while detecting the presence, type, location and quantity of a wide spectrum of target RNA modifications.
Mass spectrometry (MS) is an essential tool for studying protein modifications (1), where peptide fragmentation produces “ladders” that reveal the identity and position of various amino acid modifications. A similar approach is not yet feasible for nucleic acids, because in situ fragmentation techniques providing satisfactory sequence coverage do not exist. A number of major challenges are associated with such nucleic acid sequencing methods. One is that the process of preparing the mass ladders needed for RNA sequencing also generates non-ladder fragments and mass adducts, in which impurities, other molecules, or metal ions unrelated to RNA sequencing accompany the RNA mass ladder fragments and obscure the true masses of the ladder fragments.
Ideally, ladder cleavage should be highly uniform with one random cut on each RNA strand, without sequence preference/specificity. However, the ladder sequences generated by the prerequisite RNA degradation are often mixed with undesired fragments resulting from multiple cuts on each RNA strand (internal fragments), complicating downstream data analysis. The presence of both internal fragments and mass adducts results in “noise” in the data that can interfere with data analysis for sequencing, because it is very challenging to single out the desired ladder fragments needed for sequencing from the entire mass data, even for a single-stranded RNA. Thus, methods to date do not permit the efficient sequencing of mixtures of RNA molecules such as those derived from a biological sample.
Aberrant nucleic acid modifications, especially methylations and pseudouridylations in RNA, have been correlated to the development of major diseases like breast cancer, type-2 diabetes, and obesity (2,3), each of which affects millions of people around the world. Despite their significance, the available tools to reliably identify, locate, and quantify modifications in RNA are very limited. As a result, the functions of most such modifications remain largely unknown.
Accordingly, methods are needed to facilitate the efficient sequencing of RNA molecules, including, for example, tRNAs, siRNAs, therapeutic synthetic oligoribonucleotides having pharmacokinetic properties, mixtures of RNA molecules, as well as detection of modifications of such RNA molecules.
The current disclosure is related to a direct, liquid-chromatography-mass spectrometry (herein referred to as LC-MS) based RNA sequencing method which can be used to directly sequence RNA without the need for prior cDNA synthesis, to simultaneously determine the nucleotide sequence of an RNA molecule with single nucleotide resolution, and to reveal the presence, type, location and quantity of RNA modifications. The disclosed method can be used to determine the type, location and quantity of each modification within the RNA sample. Such techniques can be used advantageously to correlate the biological functions of any given RNA molecule with its associated modifications and for quality control of RNA-based therapeutics.
The LC-MS-based RNA sequencing methods disclosed herein advantageously enable sequencing of purified RNA samples, as well as samples containing multiple RNA species, including mixtures of RNA derived from a biological sample. This strategy can be applied to the de novo sequencing of RNA sequences carrying both canonical and structurally atypical nucleosides. The methods provide a simplified means for analyzing LC-MS-based data through efficient labeling of RNA at its 3′ and/or 5′ ends, thus enabling separation of 3′ ladder and 5′ ladder RNA pools for MS-based analysis.
In an embodiment, an RNA sequencing method, for determining the primary RNA sequence and the presence/identification of RNA modifications, is provided comprising the steps of: (i) labeling of the 5′ and/or 3′ end of the RNA; (ii) random degradation of the RNA; (iii) optionally, physical separation of resultant RNA fragments based on 5′ and 3′ end labeling; (iv) separation and detection of the resultant RNA fragment properties; and (v) data analysis resulting in sequence/modification identification.
In an embodiment, an RNA sequencing method, for determining the primary RNA sequence and the presence/identification of RNA modifications, is provided comprising the steps of: (i) treatment of RNA to be sequenced with N-cyclohexyl-N′-(2-morpholinoethyl)-carbodiimide metho-p-toluenesulfonate (CMC); (ii) affinity labeling of the 5′ and/or 3′ end of the RNA; (iii) random degradation of the RNA into mass ladders; (iv) optionally, physical separation of resultant RNA fragments based on an affinity interaction; (v) measurement of resultant RNA fragments using reverse-phase high performance liquid chromatography (HPLC) or capillary electrophoresis (CE) or other separation methods coupled with mass spectrometry; and (vi) MS data analysis resulting in sequence/modification identification.
In specific aspects, the 5′ and 3′ end of the RNA are labeled with affinity-based moieties and/or size shifting moieties. In another aspect, the fragment properties are detected through the use of one or more separation methods including, for example, high performance liquid chromatography or capillary electrophoresis, coupled with mass spectrometry.
A hydrophobic end-labelling strategy was used via introducing 2-D mass-retention time (RT) shifts for ladder identification. Specifically, mass-RT labels were added to the 5′ and/or 3′ end of the RNA to be sequenced, and at least one of these moieties results in a retention time shift to longer times, causing all of the 5′ and/or 3′ ladder fragments to have a markedly delayed RT, which clearly distinguished the 5′ ladder from the 3′ ladder. The hydrophobic label tags not only result in mass-RT shifts of labelled ladders, making it much easier to identify each of the 2-D mass ladders needed for LC-MS sequencing of RNA and thus simplifying base-calling procedures, but labelled tags also inherently increase the masses of the RNA ladder fragments so that the terminal bases can even be identified, thus allowing the complete reading of a sequence from one single ladder, rather than requiring paired-end reads.
In certain aspects of the invention, the RNA sequencing method is based on the formation and sequential physical separation of two ladder pools of degraded RNA fragments, referred to herein as 5′ and 3′ ladder pools, which are then subjected to LC/MS for HPLC and MS determination of the RNA sequence as well as the presence of RNA modifications. The physical separation of the 5′ and 3′ ladder pools can be accomplished through the use of a variety of different molecular affinity interactions, such as for example, the affinity of biotin for streptavidin.
In one aspect, the RNA sequencing method disclosed herein comprises the steps of: (i) affinity labeling of the 5′ and/or 3′ end of the RNA molecules; (ii) random degradation of the labeled RNA; (iii) 5′ and/or 3′ end labeled fragment separation based on the affinity labeling; and (iv) sequential high performance liquid chromatography (HPLC) and high-resolution mass spectrometry (MS) for sequence/modification identification.
In a specific aspect, the method consists of (i) chemical labeling of 5′ and/or 3′ RNA ends for physical separation of ladder fragments based on a biotin/streptavidin affinity, (ii) formic acid-mediated RNA degradation, (iii) physical separation of 5′ and/or 3′ labeled RNA, (iv) high-performance liquid chromatography (HPLC)-mediated separation of fragments, (v) sequential ESI-Quadrupole-Time-of-Flight (Q-TOF)-MS-based mass detection, and (vi) data analysis based on a simple computational algorithm that extracts, aligns and processes relevant mass peaks from the mass spectrum.
In another specific example, the method consists of (i) 5′ end chemical labeling of RNA with a bulky hydrophobic tag, like Cy3, which is designed to increase the size of the RNA fragment to increase retention time, and 3′ end labeling with an affinity tag like biotin, or vice versa, thus permitting sequence identification without the need for physical separation, (ii) formic acid-mediated RNA degradation, (iii) high-performance liquid chromatography (HPLC)-mediated separation of fragments, and sequential ESI-Quadrupole-Time-of-Flight (Q-TOF)-MS-based mass detection, and (iv) data analysis based on a simple computational algorithm that extracts, aligns and processes relevant mass peaks from the mass spectrum.
Further details and aspects of exemplary embodiments of the disclosure are described in more detail below with reference to the appended figures. Any of the above aspects and embodiments of the disclosure may be combined without departing from the scope of the disclosure.
Various embodiments of the present methods for RNA sequencing and modification identification are described herein with reference to the drawings wherein:
Although the present disclosure will be described in terms of specific embodiments, it will be readily apparent to those skilled in this art that various modifications, rearrangements, and substitutions may be made without departing from the spirit of the present disclosure. The scope of the present disclosure is defined by the claims appended hereto.
For purposes of promoting an understanding of the principles of the present disclosure, reference will now be made to exemplary embodiments illustrated in the drawings, and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the present disclosure is thereby intended. Any alterations and further modifications of the inventive features illustrated herein, and any additional applications of the principles of the present disclosure as illustrated herein, which would occur to one skilled in the relevant art and having possession of this disclosure, are to be considered within the scope of the present disclosure.
The current disclosure is related to a direct, liquid-chromatography-mass spectrometry (herein referred to as LC-MS) based RNA sequencing method which can be used to directly sequence RNA without cDNA synthesis, to simultaneously determine the nucleotide sequence of RNA molecules with single nucleotide resolution, and to detect the presence of target RNA modifications. The disclosed method can be used to determine the type, location and quantity of modifications within the RNA sample. The RNA to be sequenced may be a purified RNA sample of limited diversity, or a sample containing a complex mixture of RNA, such as RNA derived from a biological sample. Such techniques can be used to determine the nucleotide sequence of an RNA molecule and to advantageously correlate the biological functions of any given RNA molecule with its associated modifications.
As used herein, ribonucleic acid (RNA) refers to oligoribonucleotides or polyribonucleotides as well as analogs of RNA, for example, made from nucleotide analogs. The RNA will typically have a base moiety of adenine (A), guanine (G), cytosine (C) and uracil (U), a sugar moiety of a ribose and a phosphate moiety of phosphate bonds. RNA molecules include both natural RNA and artificial RNA analogs. The RNA can be synthetic or can be isolated from a particular biological sample using any number of procedures which are well known in the art, wherein the particular chosen procedure is appropriate for the particular biological sample. RNA samples include for example, mRNA, tRNA, antisense-RNA, and siRNA, to name a few. No limitations are imposed on the base length of RNA. The LC-MS-based sequencing methods disclosed herein enable the sequencing of not only purified RNA samples, but also more complicated RNA samples containing mixtures of different RNAs.
In a specific embodiment, the structure of synthetic oligoribonucleotides of therapeutic value can be determined using the sequencing methods disclosed herein. Such methods will be of special value to those engaged in research, manufacture, and quality control of RNA-based therapeutics, as well as to regulatory entities. Incorporation of structural modifications into synthetic oligoribonucleotides has been a proven strategy for improving the polymer's physical properties and pharmacokinetic parameters. However, the characterization and the structure elucidation of synthetic and highly-modified oligonucleotides remain a significant hurdle.
In addition to sequencing of RNA, the methods disclosed herein may be used to determine the sequence of DNA. As used herein, deoxyribonucleic acid (DNA) refers to oligonucleotides or polynucleotides as well as analogs of DNA, for example, made from nucleotide analogs. The DNA will typically have a base moiety of adenine (A), guanine (G), cytosine (C) and thymine (T), a sugar moiety of a deoxyribose and a phosphate moiety of phosphate bonds. DNA molecules include both natural DNA and artificial DNA analogs. The DNA can be synthetic or can be isolated from a particular biological sample using any number of procedures which are well known in the art, wherein the particular chosen procedure is appropriate for the particular biological sample. DNA samples include, for example, genomic DNA and mitochondrial DNA, to name a few. No limitations are imposed on the base length of DNA. With proper enzymatic and/or chemical degradation, the LC-MS-based sequencing methods disclosed herein enable the sequencing of not only purified DNA samples, but also more complicated DNA samples containing mixtures of different DNAs. In non-limiting embodiments of the invention, enzymatic degradation of the DNA can be achieved using DNA restriction endonucleases.
In one aspect, the sequencing method of the invention comprises the steps of: (i) affinity labeling of the 5′ and 3′ end of the RNA sample to facilitate subsequent separation of the 5′ and 3′ end labeled RNA pools; (ii) random non-specific cleavage of the RNA; (iii) physical separation of resultant target RNA fragments using affinity based interactions; (iv) LC/MS measurement of resultant mass ladders with liquid chromatography (LC) and high resolution mass spectrometry (MS); and (v) sequence generation and modification analysis.
In an embodiment, an RNA sequencing method, for determining the primary RNA sequence and the presence/identification of RNA modifications, is provided comprising the steps of: (i) labeling of the 5′ and/or 3′ end of the RNA; (ii) random degradation of the RNA; (iii) optionally, physical separation of resultant RNA fragments based on 5′ and 3′ end labeling; (iv) separation and detection of the resultant RNA fragment properties; and (v) data analysis resulting in sequence/modification identification.
In an embodiment, an RNA sequencing method, for determining the primary RNA sequence and the presence/identification of RNA modifications, is provided comprising the steps of: (i) treatment of RNA to be sequenced with N-cyclohexyl-N′-(2-morpholinoethyl)-carbodiimide metho-p-toluenesulfonate (CMC); (ii) affinity labeling of the 5′ and 3′ end of the RNA; (iii) random degradation of the RNA; (iv) optionally, physical separation of resultant RNA fragments based on an affinity interaction; (v) measurement of resultant RNA fragments using reverse-phase high performance liquid chromatography (HPLC) or capillary electrophoresis (CE) or other separation methods coupled with mass spectrometry; and (vi) MS data analysis resulting in sequence/modification identification.
In a specific aspect, the method consists of (i) chemical labeling of 5′ and 3′ RNA ends for physical separation of ladder fragments based on a biotin/streptavidin affinity, (ii) formic acid-mediated RNA degradation, (iii) physical separation of 5′ and 3′ labeled RNA, (iv) high-performance liquid chromatography (HPLC)-mediated separation of fragments, (v) sequential ESI-Quadrupole-Time-of-Flight (Q-TOF)-MS-based mass detection, and (vi) data analysis based on a simple computational algorithm that extracts, aligns and processes relevant mass peaks from the mass spectrum.
In another specific example, the method consists of (i) 5′ end chemical labeling of RNA with a bulky hydrophobic tag, like Cy3, which is designed to increase the size of the RNA fragment to increase retention time, and 3′ end labeling with an affinity tag like biotin, or vice versa, thus permitting sequence identification without the need for physical separation, (ii) formic acid-mediated RNA degradation, (iii) high-performance liquid chromatography (HPLC)-mediated separation of fragments, and sequential ESI-Quadrupole-Time-of-Flight (Q-TOF)-MS-based mass detection, and (iv) data analysis based on a simple computational algorithm that extracts, aligns and processes relevant mass peaks from the mass spectrum.
Non-limiting computational algorithms that may be used in the practice of the invention include, for example, those disclosed in PCT/US19/33895 filed May 24, 2019, which is incorporated herein by reference in its entirety.
Although the sequencing method disclosed herein is generally based on the formation and sequential physical separation of the two 5′ and 3′ ladder pools of degraded target RNA fragments for MS analysis, the physical separation of ladder pools is not a required step, because the labeled RNA degraded fragments will have a retention time shift as compared to unlabeled RNA degraded fragments, and the two can therefore be differentiated in a 2-dimensional mass-retention time plot after the LC/MS step.
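As a non-limiting illustration of how labeled and unlabeled fragments might be told apart in a 2-dimensional mass-retention time plot, the following Python sketch splits deconvoluted data points into an early-eluting group and a late-eluting group at the largest gap in retention time. The function name `split_by_rt_gap`, the largest-gap heuristic, and the example values are assumptions made for illustration and are not taken from the disclosure.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (mass in Da, retention time in min)


def split_by_rt_gap(points: List[Point]) -> Tuple[List[Point], List[Point]]:
    """Split points into (early, late) groups at the largest retention-time gap.

    Assumption: the hydrophobic label delays every labeled fragment enough
    that one clear gap separates labeled from unlabeled fragments.
    """
    if len(points) < 2:
        raise ValueError("need at least two data points")
    ordered = sorted(points, key=lambda p: p[1])
    # Gap between each pair of neighbours in retention-time order.
    gaps = [ordered[i + 1][1] - ordered[i][1] for i in range(len(ordered) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    return ordered[: widest + 1], ordered[widest + 1:]


# Hypothetical data points, for illustration only.
example = [(1500.2, 6.1), (2300.4, 12.9), (2100.3, 6.8),
           (2900.5, 13.5), (2700.4, 7.4), (3500.6, 14.0)]
unlabeled, labeled = split_by_rt_gap(example)
```

In practice the two ladders may overlap in retention time for very short or very long fragments, so a single cutoff of this kind would be only a first pass.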
As one step in the sequencing method disclosed herein, the RNA to be sequenced is subjected to random controlled degradation. As used herein, the terms degradation and cleavage may be used interchangeably. It is understood that the degradation, or cleavage, of RNA refers to breaks in the RNA strand resulting in fragmentation of the RNA into two or more fragments. In general, such fragmentation for purposes of the present disclosure is random. However, site specific fragmentation may also be employed. RNA's natural tendency to be degraded can be advantageously used to generate a sequence ladder, i.e., a mass ladder, for subsequent sequence determination via liquid chromatography-mass spectrometry (LC-MS). By controlling the timing of exposure to a degradation reagent, single but randomized cleavage along the target RNA molecule backbone may be achieved, thus simplifying downstream MS data analysis.
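By way of a non-limiting illustration of why exposure time matters, the following Python sketch estimates the fractions of strands that are intact, cut exactly once, or cut more than once. It assumes that every phosphodiester bond is cleaved independently and with the same probability `p_cut`; this simple binomial model, the function name `cut_distribution`, and the scanned values are assumptions for illustration and are not part of the disclosure.

```python
from typing import Tuple


def cut_distribution(n_bonds: int, p_cut: float) -> Tuple[float, float, float]:
    """Return (P(no cut), P(exactly one cut), P(two or more cuts)).

    Assumes each of the n_bonds phosphodiester bonds is cleaved
    independently with the same probability p_cut.
    """
    if n_bonds < 1 or not 0.0 <= p_cut <= 1.0:
        raise ValueError("n_bonds must be >= 1 and p_cut must be in [0, 1]")
    p_none = (1.0 - p_cut) ** n_bonds
    p_one = n_bonds * p_cut * (1.0 - p_cut) ** (n_bonds - 1)
    return p_none, p_one, 1.0 - p_none - p_one


# A 20-nt RNA has 19 phosphodiester bonds; scan a few per-bond probabilities.
for p in (0.01, 1 / 19, 0.2):
    none, one, multi = cut_distribution(19, p)
    print(f"p={p:.3f}  intact={none:.2f}  single={one:.2f}  multiple={multi:.2f}")
```

Under this model the single-cut fraction peaks when the per-bond probability is 1/n for n bonds. Shorter exposure leaves most strands intact, and longer exposure shifts material into internal fragments.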
In one aspect, the target RNA molecule is exposed to random chemical cleavage to form ladder pools of degraded target RNA fragments. In a preferred embodiment, chemical cleavage is accomplished through use of formic acid. Formic acid degradation is preferred because the boiling point of formic acid is approximately 100° C., like that of water, so the formic acid can be easily removed, e.g., by lyophilizer or speedvac. Such cleavage is designed to cleave the RNA molecule at its 5′-ribose positions throughout the molecule. In addition to formic acid degradation, alkaline degradation may also be used. For example, the following alkaline buffers may be used to degrade the RNA sample: 1× Alkaline Hydrolysis Buffer (e.g., 50 mM Sodium Carbonate [NaHCO3/Na2CO3] pH 9.2, 1 mM EDTA; or the Alkaline Hydrolysis Buffer supplied with Ambion's RNA Grade Ribonucleases). In addition to chemical cleavage, RNAs may be subjected to enzymatic degradation. Enzymes that may be used to degrade the RNA include, for example, Crotalus phosphodiesterase I, bovine spleen phosphodiesterase II and XRN-1 exoribonuclease. Such RNA degradation treatment is carried out under conditions where a desired single cleavage event occurs on the RNA molecule, resulting in a pool of differently sized RNA fragments that forms a complete ladder.
As a further step in the sequencing method disclosed herein, the ends of the RNA fragments are labeled to provide affinity interactions that can be utilized to separate the fragmented 5′ or 3′ labeled fragment pools within the cleavage mixture. Such affinity interactions are well known to those skilled in the art and include, for example, those interactions based on affinities such as those between antigen and antibody, enzyme and substrate, receptor and ligand, or protein and nucleic acid, to name a few. Labeling of the 5′ and 3′ ends of the fragmented RNA for use in affinity separation may be achieved using a variety of different methods well known to those skilled in the art. Such labeling is designed to achieve separation of fragmented RNA for subsequent MS analysis. RNA end-labeling may be performed before or after the chemical cleavage of the RNA.
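As a non-limiting illustration of what a complete ladder looks like, the following Python sketch enumerates the fragments produced when each strand is cut once, and, for comparison, the internal fragments produced when a strand is cut twice. The sequence is hypothetical, the function names are illustrative, and terminal chemistry (for example, which fragment retains the phosphate) is deliberately ignored.

```python
from typing import List, Tuple


def single_cut_ladders(seq: str) -> Tuple[List[str], List[str]]:
    """Return (5' ladder, 3' ladder) for one cut per strand, plus the intact strand."""
    n = len(seq)
    five_prime = [seq[:i] for i in range(1, n + 1)]  # fragments keeping the 5' end
    three_prime = [seq[i:] for i in range(n)]        # fragments keeping the 3' end
    return five_prime, three_prime


def internal_fragments(seq: str) -> List[str]:
    """Return fragments that lost both ends because the strand was cut twice."""
    n = len(seq)
    return [seq[i:j] for i in range(1, n - 1) for j in range(i + 1, n)]


five, three = single_cut_ladders("GAUC")  # hypothetical 4-nt sequence
internal = internal_fragments("GAUC")
```

Each ladder contains one fragment per position, whereas the number of possible internal fragments grows roughly with the square of the strand length, which is why multiple cuts add disproportionate noise.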
In a preferred embodiment, the biotin/streptavidin interaction may be utilized to enrich for the ladder RNA fragments. In yet another preferred embodiment, the poly (A) oligonucleotide/dT interaction may be used to separate fragmented RNA. In instances where the end of the RNA is labeled with a biotin moiety, streptavidin beads may be used to purify the desired RNA ladder fragments. Alternatively, where the RNA has been labeled with a poly (A) DNA oligonucleotide, oligo (dT) immobilized beads such as (dT) 25-cellulose beads (New England Biolabs) may be used to enrich for the RNA fragments. The choice of chromatography material will be dependent on the 5′ and 3′ RNA labeling used, and selection of such chromatography/separation material is well known to those skilled in the art.
As one example, the 3′ and 5′ RNA ends may be labeled with biotin for subsequent separation of RNA fragments based on the biotin/streptavidin interaction through use of streptavidin beads. In yet another aspect, short DNA adapters may be ligated to each end of the RNA sample. The 3′ end of the RNA may be ligated to a 5′ phosphate-terminated, pentamer-capped photocleavable poly(A) DNA oligonucleotide with T4 RNA ligase to form a phosphodiester-linked RNA-DNA hybrid. The 5′ end of the RNA-DNA hybrid may then be ligated to 5′ biotinylated DNA after phosphorylation via T4 polynucleotide kinase using T4 RNA ligase.
In a specific embodiment, two short DNA adapters are ligated, one to each end of the RNA sample, to physically select the desired fragments into either the 5′ or 3′ ladder pool from the undesired fragments with more than one phosphodiester bond cleavage in the crude degraded product mixture. The ligation is followed by a lengthened formic acid degradation time, resulting in most of the RNA sample being degraded, most of which turns into the desired fragments needed to obtain a complete sequence ladder. The 3′ end of the RNA sample is ligated to a 5′-phosphate-terminated, pentamer-capped photocleavable poly (A) DNA oligonucleotide with T4 RNA ligase 1 (New England Biolabs) to form a phosphodiester-linked RNA-DNA hybrid. Likewise, the 5′ end of the RNA-DNA hybrid is ligated to 5′-biotinylated DNA after phosphorylation via T4 polynucleotide kinase with the same ligase. The resulting 5′ DNA-RNA-DNA-3′ hybrid is treated with formic acid for approximately 5-15 min. Following formic acid treatment, streptavidin-coupled beads (ThermoFisher Scientific) can be used to isolate the 5′ ladder fragment pool, followed by oligomer release for subsequent LC/MS analysis. Similarly, oligo (dT) immobilized beads such as (dT) 25-Cellulose beads (New England Biolabs) can be used to enrich the 3′ ladder, which can then be eluted for LC/MS analysis after photocleavage by UV light (300-350 nm). Only the RNA section of the hybrid will be hydrolyzed, while the DNA section will remain intact, as DNA lacks the 2′-OH group. In a specific embodiment, a biotin tag is added via a two-step reaction, at each end of the RNA sample.
As a first step, a thiol-containing phosphate is introduced at the 5′-end by reacting T4 polynucleotide kinase with adenosine 5′[γ-thio]triphosphate (ATP-γ-S) to add a thiophosphate to the 5′ hydroxyl group of the to-be-sequenced RNA, and then a conjugate addition is made between the resultant thiophosphorylated RNA and the biotin (Long Arm) Maleimide (Vector Laboratories, USA), which is designed for biotinylating proteins, nucleic acids, or other molecules containing one or more thiol groups. The resulting 5′-biotinylated-RNA is then treated with formic acid, similar to the previous procedure (13). After acid degradation, streptavidin-coupled beads (Thermo Fisher Scientific, USA) are used to single out the 5′ ladder pool, which will be released for subsequent LC/MS analysis after breaking the biotin-streptavidin interaction. Although the sequencing methods disclosed herein are generally based on the formation and sequential physical separation of 5′ and 3′ ladder pools of degraded target RNA fragments for MS analysis, the physical separation of ladder pools is not a required step. The labeled RNA degraded fragments will have a retention time shift as compared to unlabeled RNA degraded fragments, which can be differentiated via the LC/MS step. In a specific embodiment, to increase the retention time shift, the RNA may be labeled with bulky moieties such as, for example, a hydrophobic Cy3 or Cy5 tag or other fluorescent tag. Such a tag is added via a two-step reaction, at the 5′-end of the RNA sample.
As a first step, a thiol-containing phosphate is introduced at the 5′-end by reacting T4 polynucleotide kinase with adenosine 5′[γ-thio]triphosphate (ATP-γ-S) to add a thiophosphate to the 5′ hydroxyl group of the to-be-sequenced RNA, and then a conjugate addition is made between the resultant thiophosphorylated RNA and the Cy3 or Cy5 Maleimide (Tenova Pharmaceuticals, USA), which is designed for labeling proteins, nucleic acids, or other molecules containing one or more thiol groups. After 3′ end biotin labeling and acid degradation, the resultant two-end-labeled RNA is subjected directly to LC/MS without any affinity-based physical separation.
For 3′ end labeling, after isolating the 5′ ladder pool (which will be analyzed by LC/MS) in case affinity tags were used, the remaining residue, which contains the 3′ ladder pool with all of the original 3′-hydroxyl groups, will be subjected to 3′ end labeling. For this purpose, biotinylated cytidine bisphosphate (pCp-biotin) is activated by adenylation using ATP and Mth RNA ligase to produce AppCp-biotin. The members of the 3′ ladder pool with a free 3′ terminal hydroxyl are then ligated to the activated 5′-biotinylated AppCp via T4 RNA ligase, thus resulting in the 3′ end of each sequence in the 3′ ladder pool becoming biotin-labeled. Similarly, streptavidin-coupled beads are used to isolate the 3′ ladder pool, which will be released for subsequent LC/MS analysis (separate from the 5′ ladder pool) after breaking the biotin-streptavidin interaction.
Once separation of RNA fragment pools is performed, the RNA fragments can be analyzed by any of a variety of means including liquid chromatography coupled with mass spectrometry, capillary electrophoresis coupled with mass spectrometry, or other methods known in the art. Preferred mass spectrometer formats include continuous or pulsed electrospray (ESI) and related methods, or other mass spectrometry formats that can detect RNA fragments, such as MALDI-MS. HPLC-MS measurements can be performed using high resolution time-of-flight or Orbitrap mass spectrometers that have a mass accuracy of less than 5 ppm. The use of such mass spectrometers facilitates accurate discernment between cytosine and uracil bases in the RNA sequence. In one aspect of the invention, the instrumentation is an Agilent 6550 mass spectrometer coupled to a 1200 series HPLC with a Waters XBridge C18 column (3.5 μm, 1×100 mm). Mobile phase A may be aqueous 200 mM HFIP (1,1,1,3,3,3-Hexafluoro-2-propanol) and 1-3 mM TEA (Triethylamine) at pH 7.0, and mobile phase B may be methanol. In a specific non-limiting embodiment, the HPLC method for 20 μL of a 10 μM sample solution was a linear increase from 2%-5% B to 20%-40% B over 20-40 min at 0.1 mL/min, with the column heated to 50 or 60° C. Sample elution was monitored by absorbance at 260 nm, and the eluate was passed directly to an ESI source with nitrogen drying gas at 325° C. flowing at 8.0 L/min, a nebulizer pressure of 35 psig and a capillary voltage of 3500 V in negative mode.
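As a non-limiting worked example of why mass accuracy at the ppm level matters, the following Python sketch compares the worst-case error in the mass difference between two adjacent ladder fragments with the roughly 0.984 Da difference between a cytidine and a uridine residue. The residue masses are standard monoisotopic values rather than values from this disclosure, and the function names and the decision rule (error smaller than half the C/U difference) are assumptions made for illustration.

```python
# Monoisotopic masses (Da) of cytidine and uridine monophosphate residues
# (nucleoside monophosphate minus water); standard values, not from the disclosure.
C_RESIDUE = 305.04129
U_RESIDUE = 306.02530


def ppm_error_da(mass_da: float, ppm: float = 5.0) -> float:
    """Absolute mass error in Da for a given mass and ppm accuracy."""
    return mass_da * ppm * 1e-6


def can_resolve_c_from_u(lower_da: float, upper_da: float, ppm: float = 5.0) -> bool:
    """True if the worst-case error of (upper - lower) is below half the C/U gap."""
    worst_case = ppm_error_da(lower_da, ppm) + ppm_error_da(upper_da, ppm)
    return worst_case < (U_RESIDUE - C_RESIDUE) / 2.0


# Hypothetical adjacent fragments near the size of a 20-nt RNA.
print(can_resolve_c_from_u(6200.0, 6505.0, ppm=5.0))   # worst-case error about 0.06 Da
print(can_resolve_c_from_u(6200.0, 6505.0, ppm=50.0))  # worst-case error about 0.64 Da
```

At 5 ppm the error for fragments of this size is well below the C/U difference, whereas at 50 ppm it would exceed half of that difference under the stated rule.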
LC-MS data is converted into RNA sequence information. The unique mass tag of each canonical ribonucleotide and its associated modifications on the RNA molecule allows one not only to determine the primary nucleotide sequence of the RNA but also to determine the presence, type and location of RNA modifications.
In the case of DNA, LC-MS data is converted into DNA sequence information. The unique mass tag of each canonical deoxynucleotide and its associated modifications on the DNA molecule allows one not only to determine the primary nucleotide sequence of the DNA but also to determine the presence, type and location of DNA modifications. In a specific embodiment, the raw data derived from LC-MS, which contains the LC/MS data of the desired fragments and/or the undesired fragments, is subsequently used for sequence alignment and detection of base modification. In addition to a two-dimensional data analysis which relies on mass and retention times, it is understood that additional types of two- or even three-dimensional data analysis may be performed based on other unique properties of RNA fragments, such as, for example, unique electronic or optical signature signals that can be used together with mass for sequence determination.
Mass adducts can be removed from the deconvoluted data and the sequences will be predicted/generated using both mass and retention time data. The retention time-coupled mass data for the fragments is analyzed to determine which data points are “valid” and to be used for subsequent sequence determination and which data points are to be filtered out. After the data reduction step, the mass difference (m) between two adjacent RNA fragments is calculated [m=m(i)−m(i−1), 1<i<n, n=RNA length], where m(i) is the mass of any ladder fragment and m(i−1) is the mass of the preceding lower-mass ladder fragment. Such mass differences are then matched with the exact masses of known nucleotide fragments to determine the RNA sequence and its modifications. As long as the structural modification on an RNA nucleoside is mass-altering, the disclosed sequencing method will permit identification of the RNA sequence and its modification. The mass of all the known modified ribonucleosides can be conveniently retrieved from known RNA modification databases (12) or through use of the attached
It should be understood that the examples and embodiments provided herein are exemplary embodiments. Those skilled in the art will envision various modifications of the examples and embodiments that are consistent with the scope of the disclosure herein. Such modifications are intended to be encompassed by the claims. The examples provided herein are included solely for augmenting the disclosure herein and should not be considered to be limiting in any respect.
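As a non-limiting illustration of the mass-difference matching described above, the following Python sketch sorts the masses of one ladder, takes the difference between each pair of adjacent fragments, and assigns the nucleotide whose residue mass is closest within a tolerance. It is a minimal sketch and not the algorithm of the incorporated reference. The residue masses are standard monoisotopic values, and the function name `call_bases`, the 0.05 Da tolerance, the starting mass, and the methyladenosine entry labeled `mA` are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Monoisotopic residue masses in Da (nucleoside monophosphate minus water).
RESIDUE_MASS: Dict[str, float] = {
    "A": 329.05252, "C": 305.04129, "G": 345.04744, "U": 306.02530,
}


def call_bases(ladder_masses: List[float], tol_da: float = 0.05,
               extra_residues: Optional[Dict[str, float]] = None) -> List[str]:
    """Assign a residue to each adjacent mass difference in one ladder.

    Differences that match no known residue within tol_da are reported as "?".
    """
    table = dict(RESIDUE_MASS)
    if extra_residues:
        table.update(extra_residues)  # e.g. mass-altering modifications
    masses = sorted(ladder_masses)
    calls: List[str] = []
    for lower, upper in zip(masses, masses[1:]):
        delta = upper - lower
        best = min(table, key=lambda name: abs(table[name] - delta))
        calls.append(best if abs(table[best] - delta) <= tol_da else "?")
    return calls


# Hypothetical ladder: an arbitrary starting mass extended by G, A, U, C.
ladder = [1000.0]
for base in "GAUC":
    ladder.append(ladder[-1] + RESIDUE_MASS[base])
print(call_bases(ladder))
```

A mass-altering modification can be handled by adding its residue mass to the lookup table, for example `call_bases(masses, extra_residues={"mA": 343.06817})` for a singly methylated adenosine (an added CH2, about 14.01565 Da).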
RNA oligonucleotides listed below were obtained from Integrated DNA Technologies (Coralville, Iowa, USA). RNA strand sequences were as follows:
Biotinylated cytidine bisphosphate (pCp-biotin), {Phos (H)}C {BioBB}, was obtained from TriLink BioTechnologies (San Diego, Calif., USA). T4 DNA ligase 1, T4 DNA ligase buffer (10×), the adenylation kit including reaction buffer (10×), 1 mM ATP, and Mth RNA ligase were obtained from New England Biolabs (Ipswich, Mass., USA). The 5′ end tag nucleic acid labeling system kit and biotin maleimide were purchased from Vector Laboratories (Burlingame, Calif., USA). The streptavidin magnetic beads were obtained from Thermo Fisher Scientific (Waltham, Mass., USA).
Adenylation: The following reaction was set up with a total reaction volume of 10 μL in an RNase-free, thin walled 0.5 mL PCR tube: 1× adenylation reaction buffer, 100 μM of ATP, 5.0 μM of Mth RNA ligase, 10.0 μM pCp-biotin, and nuclease-free, deionized water (Thermo Fisher Scientific, USA). The reaction was incubated in a GeneAmp™ PCR System 9700 (Thermo Fisher Scientific, USA) at 65° C. for 1 hour followed by the inactivation of the enzyme Mth RNA ligase at 85° C. for 5 minutes.
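The component volumes for the 10 μL adenylation reaction follow from the dilution relation C1·V1 = C2·V2. The following is a minimal sketch of that arithmetic. The 10× buffer and 1 mM ATP stocks are taken from the materials listed above; the stock concentrations for Mth RNA ligase and pCp-biotin are not stated in this disclosure, so the values used for them are hypothetical placeholders, as is the function name.

```python
def stock_volume_ul(final_conc, stock_conc, total_volume_ul):
    """Volume of stock (uL) needed to reach final_conc in total_volume_ul.

    final_conc and stock_conc must use the same unit (both uM, or both 'x').
    """
    if stock_conc <= 0 or final_conc > stock_conc:
        raise ValueError("stock must be more concentrated than the target")
    return final_conc * total_volume_ul / stock_conc


TOTAL_UL = 10.0

# (final concentration, stock concentration), same unit within each pair
components = {
    "adenylation buffer": (1.0, 10.0),   # 1x from 10x stock (from the text)
    "ATP": (100.0, 1000.0),              # 100 uM from 1 mM stock (from the text)
    "Mth RNA ligase": (5.0, 50.0),       # HYPOTHETICAL 50 uM stock
    "pCp-biotin": (10.0, 100.0),         # HYPOTHETICAL 100 uM stock
}

volumes = {
    name: stock_volume_ul(final, stock, TOTAL_UL)
    for name, (final, stock) in components.items()
}
# Water makes up the remainder of the reaction volume.
volumes["nuclease-free water"] = TOTAL_UL - sum(volumes.values())

for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} uL")
```

With actual stock concentrations substituted for the placeholders, the same calculation applies to the ligation and 5′ labeling reactions.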
Ligation: A 30 μL reaction solution contained 10 μL of reaction solution from the adenylation step, 10× reaction buffer, 5 μM RNA (19-nt, 20-nt or 21-nt, respectively), 10% (v/v) DMSO (anhydrous dimethyl sulfoxide, 99.9%, Sigma-Aldrich, USA), T4 RNA ligase (10 units), and nuclease-free, deionized water. The reaction was incubated overnight at 16° C., followed by column purification as follows.
Column Purification: Oligo Clean & Concentrator (Zymo Research, Irvine, Calif., U.S.A.) was used to remove enzymes, free biotin, and short oligos. 100 μL Oligo Binding Buffer was added to a 50 μL sample (20 μL nuclease-free water was added to bring the total sample volume to 50 μL). 400 μL ethanol (200 proof, 100%, Decon Labs, USA) was added, the solution was mixed briefly by pipetting, and the mixture was transferred to a provided column in a collection tube. The sample was then centrifuged at 10,000 rcf for 30 seconds, the flow-through was discarded, and 750 μL DNA Wash Buffer was added to the column. The sample was then centrifuged again at 10,000 rcf for 30 seconds and the flow-through was discarded, followed by centrifugation at maximum speed for 1 minute. The column was transferred to a microcentrifuge tube, 15 μL nuclease-free water was added directly to the column matrix (with 1 minute of incubation time), and the sample was centrifuged at 10,000 rcf for 30 seconds to elute the oligonucleotide.
The concentration of the purified RNA, reported in ng/μL, was measured with a NanoDrop 1000 Spectrophotometer (Thermo Fisher Scientific, Waltham, Mass., USA).
The efficiency of biotin labeling to the 3′ or 5′ end of RNA oligo expressed in % was measured by Matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) by a Voyager-DE Biospectrometry Workstation (Jet Propulsion Laboratory, USA), based on the calculation of peak intensity at mass (m/z) of starting material and mass (m/z) of labeled product.
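The efficiency calculation described above can be sketched as a simple intensity ratio. The exact formula is not given in this disclosure; the sketch below assumes that efficiency is the labeled-product peak intensity divided by the sum of the labeled and unlabeled peak intensities, and that both species ionize comparably in MALDI-TOF MS. The function name and the example intensities are hypothetical.

```python
def labeling_efficiency_percent(intensity_unlabeled, intensity_labeled):
    """Estimate labeling efficiency (%) from two MALDI-TOF peak intensities.

    ASSUMPTION: efficiency = labeled / (labeled + unlabeled), i.e. both
    species are taken to ionize with similar efficiency.
    """
    total = intensity_unlabeled + intensity_labeled
    if total <= 0:
        raise ValueError("peak intensities must sum to a positive value")
    return 100.0 * intensity_labeled / total


# Hypothetical peak intensities (arbitrary units) for illustration only.
efficiency = labeling_efficiency_percent(intensity_unlabeled=1000.0,
                                         intensity_labeled=9000.0)
print(f"labeling efficiency: {efficiency:.1f}%")
```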
Biotin labeling at the 5′ end of RNA requires two steps: first, a thiophosphate is transferred from ATPγS to the 5′ hydroxyl group of the target RNA by T4 polynucleotide kinase (NEB, USA); second, after addition of biotin maleimide, the thiol-reactive label is chemically coupled to the 5′ end of the target RNA. The experimental protocol is as follows. The following were combined in an RNase-free, thin walled 0.5 mL PCR tube: 10× reaction buffer, 30 μM of RNA (19-nt, 20-nt, or 21-nt, respectively), 0.1 mM of ATPγS, and 10 units of T4 polynucleotide kinase, with the total reaction volume brought to 10 μL with nuclease-free, deionized water. The sample was mixed and incubated for 30 minutes at 37° C. Then 5 μL of biotin maleimide or Cy3 maleimide (dissolved in 312 μL anhydrous DMF (anhydrous dimethyl sulfoxide, 99.9%, Sigma-Aldrich, USA)) was added and mixed, and the sample was incubated for 30 minutes at 65° C. Column purification was then performed according to the above-mentioned procedures.
Direct RNA sequencing relies on generating degradative products, and RNA fragments produced by single scission events can be directly sequenced by observing mass differences between compound masses. Acid hydrolysis can rapidly generate internal fragments by multiple scission events from any starting material. Formic acid, in particular, is a mild and volatile organic acid used extensively in MS because it has a low boiling point and can therefore be easily removed by lyophilization. RNA samples were either processed at a single time point or each RNA sample solution was divided into three equal aliquots. The aliquots were degraded using 50% (v/v) formic acid at 40° C., one for 2 min, one for 5 min, and one for 15 min, and then combined for one LC/MS measurement. The reaction mixture was immediately frozen on dry ice followed by lyophilization to dryness, which was typically completed within 1 h. The dried samples were immediately suspended in 20 μL nuclease-free, deionized water for the subsequent biotin/streptavidin capture/release step or stored at −20° C.
Biotin/Streptavidin capture uses streptavidin-coated magnetic beads to bind biotin-labeled RNAs, which are immobilized onto streptavidin coated magnetic beads and drawn to a magnet. Bound RNAs should, therefore, be isolated from non-biotin labeled RNAs and impurities and can be later eluted from the beads for LC-MS sequencing analysis.
200 μL of Dynabeads™ MyOne™ Streptavidin C1 beads (Thermo Fisher Scientific, USA) were prepared by first adding an equal volume of 1× B&W buffer. This solution was vortexed and placed on the magnet for 2 min, followed by discarding of the supernatant. The beads were washed twice with 200 μL of Solution A (DEPC-treated 0.1 M NaOH & DEPC-treated 0.05 M NaCl) and once in Solution B (DEPC-treated 0.1 M NaCl). A final addition of 100 μL of 2× B&W buffer brought the concentration of the beads to 20 mg/mL. An equal volume of biotinylated RNA in 1× B&W buffer was added, the sample was incubated for 15 min at room temperature with gentle rotation, the tube was placed on a magnet for 2 min, and the supernatant was discarded. The coated beads were washed 3 times in 1× B&W buffer, and the final concentration of each wash-step supernatant was measured by NanoDrop for recovery analysis. To release the immobilized biotinylated RNAs, the beads were incubated in 10 mM EDTA (Thermo Fisher Scientific, USA), pH 8.2, with 95% formamide (Thermo Fisher Scientific, USA) at 65° C. for 5 min. Finally, the sample tube was placed on a magnet for 2 min and the supernatant was collected by pipetting.
Samples were separated and analyzed on an iFunnel Agilent 6550 Q-TOF coupled to an Agilent 1290 Infinity LC system (Agilent Technologies, Santa Clara, Calif., USA) equipped with a MicroAS autosampler and Surveyor MS Pump Plus HPLC system. All separations were performed using an aqueous mobile phase (A) of 25 mM hexafluoro-2-propanol (HFIP) (Thermo Fisher Scientific, USA) with 10 mM diisopropylamine (DIPA) (Thermo Fisher Scientific, USA) at pH 7.0 and an organic mobile phase (B) of methanol across a 50 mm×2.1 mm Xbridge C18 column with a particle size of 1.7 μm (Waters, Milford, Mass., USA). The flow rate was 0.3 mL/min, and all separations were performed with the column temperature maintained at 60° C. Injection volumes were 20 μL, and sample amounts were 15-400 pmol of RNA. Data were recorded in negative polarity. The sample data were acquired using the Agilent Technologies MassHunter LC/MS Acquisition software. To extract relevant spectral and chromatographic information from the LC-MS experiments, the Molecular Feature Extraction workflow in MassHunter Qualitative Analysis (Agilent Technologies) was used. This molecular feature extractor algorithm performs untargeted feature finding in the mass and retention time dimensions. In principle, any software capable of compound identification could be used. The software settings were varied depending on the amount of RNA used in the experiment. In general, as many identified compounds as possible were included. For samples with low concentrations, profile spectral peaks were filtered using a signal-to-noise ratio (SNR) threshold of 5 and, for more concentrated samples, an SNR threshold of up to 20. The other algorithm settings were as follows: “Small Molecules (chromatographic)” extraction algorithm, charge states from −1 to −15, only loss of hydrogen (−H) ions, “Common Organic Molecules” isotope model, minimum quality score 70 (range 0-100), and minimum ion count 500.
A method is provided for determining the sequence of RNA molecules which is based on the physical separation of two ladders of RNA fragments. The method is designed to prevent any confusion as to which fragment belongs to which ladder by physical separation of two ladders, and the output is expected to contain only one sigmoidal curve rather than two sigmoid curves (which is much more difficult to analyze) in the first-generation method. Another benefit of the sequential separation of two ladders is simplification of the base-calling procedures because after ladder separation, each resultant LC/MS dataset size becomes less than half of the size of the un-separated precursor's dataset. With the help of these two favorable factors, one can sequence more complicated RNA samples with more than one strand while being able to simultaneously analyze their associated modifications. Experiments were designed as shown in
After isolating the 5′ ladder pool (which will be analyzed by LC/MS), the remaining residue, which contains the 3′ ladder pool with all of the original 3′-hydroxyl groups, is subjected to 3′ end labeling. For this purpose, biotinylated cytidine bisphosphate (pCp-biotin) is activated by adenylation using ATP and Mth RNA ligase to produce AppCp-biotin. Then the members of the 3′ ladder pool with a free 3′ terminal hydroxyl are ligated to the activated 5′-biotinylated AppCp via T4 RNA ligase, thus resulting in the 3′ end of each sequence in the 3′ ladder pool becoming biotin-labeled. Similarly, streptavidin-coupled beads can be used to isolate the 3′ ladder pool, which can be released for subsequent LC/MS analysis (separate from the 5′ ladder pool) after breaking the biotin-streptavidin interaction.
A series of synthetic RNA oligos (19-nt, 20-nt, and 21-nt RNA; see Methods for sequences) were designed and synthesized as model RNA oligonucleotides for individual and group tests. Biotin-labeled 5′ ends were obtained using the two-step reaction as described above. After acid degradation and bead separation of the 5′ ladder pool for LC/MS analysis, the remaining residue was subjected to 3′-labeling. The members of the 3′ sequence ladder pool were then also biotin end-labeled, streptavidin-captured, and then released for LC/MS analysis as described above.
Experiments were performed focused on tRNA sequencing, as tRNA is very important in protein synthesis and its expression and mutations have major implications in various diseases such as neurological pathologies and cancer development (7-10). However, lack of efficient tRNA sequencing methods has hindered structural and functional studies of tRNA in biological and biochemical processes. tRNA is one class of small cellular RNA for which standard sequencing methods cannot yet be applied efficiently (11); significant obstacles for the sequencing of tRNA include the presence of numerous post-transcriptional modifications and its stable and extensive secondary structure, which can interfere with cDNA synthesis and adaptor ligation. However, as the length of tRNA ranges from 60 to 95 nt, with an average length of 76 nt, it is a very good system to use in the LC/MS-based direct sequencing method disclosed herein.
To directly sequence tRNA with the LC/MS-based method, T1 ribonuclease was used to partially digest the complete tRNA into smaller fragments to allow for successful sequencing. Partial T1 ribonuclease digestion, which specifically cleaves single-stranded RNA phosphodiester bonds after guanosine residues, producing 3′-phosphorylated ends (
The 3′ tRNA portion, which has an OH group at each of the 3′ and 5′ ends, is labeled using T4 RNA ligase and 5′-adenylated biotin-methyl-ddC as a substrate. Streptavidin magnetic beads are used to isolate the biotinylated tRNA fragments and acid degradation is performed on the fragments to create the 3′ ladder for sequencing analysis using LC/MS (
LC/MS data from short oligonucleotides showed that it was possible to observe exactly one sigmoidal curve corresponding to each specific ladder as expected when their masses were plotted against their retention times (tR) (
To determine the labeling efficiency, MALDI-TOF MS was applied to estimate the efficiency of biotinylation at 3′ and 5′ end of RNA, respectively (
Chromatographic separation of sequence ladders simplified identification of reads in the same orientation. The sequencing reads were defined by their mass, RT, and abundance. The nucleotides (A, G, U, C) were determined by mass differences of two adjacent ladder fragments. Thus, the sequence can be read out very easily. For example, the sequence CGGAUUUAGCUCAGU can be read out automatically from the 5′ to 3′ end for the 5′ end biotin labeled 21-nt RNA (
The sequencing method described herein provides a tool for RNA sequence analysis through its ability to isolate biotin labeled fragments from two ends, respectively, that can simplify LC/MS data analysis and help read out sequences from each ladder (either 5′ ladder or 3′ ladder) after its physical separation from the other one. This strategy allows one to sequence more complicated RNA samples with more than one RNA strand as well as tRNA, and subsequently analyze their associated modifications simultaneously.
Enhancing RNA labeling efficiency. It remains a challenge to introduce tags, like biotin or fluorescent dyes, onto RNA with high yield. However, labeling two ends of RNA with selected tags is a step of the direct RNA sequencing method disclosed herein. The labeling efficiency is directly related to how much of an RNA sample can be used to generate MS signals, with a higher labeling efficiency leading to a reduced sample requirement. To increase the labeling efficiency, new labeling strategies have continued to be optimized. A high labeling efficiency (˜90%) was recently observed when labeling the 5′ end of RNA with the 2-step reaction (
Enhancing sequencing read length. In order to increase the read length, the molecular feature extraction (MFE) settings for Agilent MassHunter Qualitative Analysis were optimized. From the MFE data exported out of Agilent software, it was possible to automatically read longer RNAs up to 30 nt using the sequencing algorithm, a significant increase in read length compared to the ˜20-nt RNAs. It was also discovered that with the available software, there are two modes of identification depending on the size of the molecule: (i) a small molecule mode depending on accurate determination of the monoisotopic mass for identification, which works only to about 30-nt or ˜10,000 Da, judged by the RNA samples currently available; and (ii) a large molecule mode requiring accurate determination of the average mass for identification, which works only for molecules larger than about 30-nt.
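The distinction between the two identification modes rests on the difference between monoisotopic mass (computed from the lightest stable isotope of each element) and average mass (computed from natural-abundance atomic weights). The following sketch illustrates how the two values diverge as chain length grows. It uses an adenosine monophosphate residue (C10H12N5O6P) and an illustrative poly-A homopolymer that is not one of the RNAs tested herein; the function names are hypothetical, and the atomic masses are standard literature values rounded to the digits shown.

```python
# Monoisotopic masses (lightest stable isotope) and average atomic weights, in Da.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
                "O": 15.9949146, "P": 30.9737620}
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "P": 30.974}

# Adenosine monophosphate residue as it occurs inside an RNA chain (AMP minus H2O).
A_RESIDUE = {"C": 10, "H": 12, "N": 5, "O": 6, "P": 1}


def mass(formula, table):
    """Sum atomic masses from `table` over an elemental composition dict."""
    return sum(table[element] * count for element, count in formula.items())


def poly_a(n):
    """Composition of n linked A residues plus one H2O (illustrative homopolymer)."""
    formula = {element: count * n for element, count in A_RESIDUE.items()}
    formula["H"] += 2
    formula["O"] += 1
    return formula


for n in (1, 10, 30):
    f = poly_a(n)
    mono, avg = mass(f, MONOISOTOPIC), mass(f, AVERAGE)
    print(f"{n:>2}-nt  monoisotopic={mono:.4f}  average={avg:.3f}  gap={avg - mono:.3f}")
```

The gap between the two values increases roughly in proportion to length, which is consistent with the observation above that different mass definitions are used for smaller and larger molecules.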
Enhancing sequencing throughput to multiple RNA strand sequencing of 5 and 12 RNAs. It has been demonstrated that the LC/MS-based method can sequence not only purified single-stranded RNA, but also RNA samples with multiple RNA strands. Two different RNAs, one 19 nt and one 20 nt, could be read out simultaneously with the novel sample preparation protocol and bead separation described herein. Sample mixtures containing 5 and 12 RNAs have also been tested. With the improvements in labeling efficiency and read length as described above, it was possible to detect all the ladder fragments needed for reading out the complete sequences of all the RNAs in these mixtures. This was achieved by (i) obtaining measurements on an Agilent 6550 ion-funnel Q-TOF LC/MS, and (ii) optimizing the MFE settings for Agilent MassHunter Qualitative Analysis. It was possible to manually read the sequences in the 5 and 12 RNA mixtures (
In order to increase the throughput and robustness of the MS-based sequencing method to enable sequencing of mixed RNA samples with multiple RNA strands, a new strategy was developed, as described herein, to optimize the experimental workflow and to significantly simplify 2D LC/MS data analysis for identifying the ladders needed for sequencing, while testing the efficacy of the new strategies on a series of synthetic RNA oligonucleotides of varying lengths containing both canonical and modified bases as a proof-of-concept study. It was possible to sequence pseudouridine (ψ) and 5-methylcytosine (m5C) simultaneously at single-base resolution. Together with the described end-labeling strategy, it was possible to identify, locate, and quantify these multiple base modifications while accurately sequencing the complete RNA not only in a single purified RNA strand, but also in sample mixtures containing 12 distinct sequences of RNAs.
In the experimental approaches described herein, either one RNA end was labeled and the other end left unlabeled, or the two ends of the RNA were labeled with different tags to better distinguish them in the 2D LC/MS method. In one labeling strategy, a biotin tag was introduced to either the 3′ end or the 5′ end of the RNA prior to LC/MS analysis in order to introduce an RT and mass shift to exactly one mass ladder (14). This method can help simplify LC/MS data analysis and prevent confusion as to which fragment belongs to which ladder when sequencing mixed RNA samples. It increases the masses of RNA ladders so that the terminal bases can be identified, avoiding messy low mass regions where it is difficult to differentiate mononucleotides and dinucleotides from multi-cut internal fragments; improves sequencing accuracy by reading a complete sequence from one single ladder, rather than requiring paired-end reads; simplifies base-calling procedures, making it easier for the ladder components to be identified due to selective RT shifts; and improves sample efficiency by allowing for longer degradation time points (15 min) than reported before (5 min) (14). These improvements can help reduce the minimum RNA sample loading requirement as compared to the first-generation method, increasing the potential to sequence endogenous RNA samples with rare RNA modifications.
For labeling RNAs at their 3′ ends (
As a test example, short RNA oligonucleotides (19 nt and 20 nt RNA: RNA #1 and RNA #2, respectively) were designed and synthesized as model RNA oligonucleotides for individual and group tests. First, RNA #1 was 3′-biotin-labeled and subjected to physical separation by streptavidin bead capture and release. In
In
Because of this end labeling, both complete sequences in a mixture of two RNAs, one 19 nt (RNA #1) and one 20 nt (RNA #2) can be read out, from exactly one curve per RNA strand. In the case of this sample, the algorithm was used to perform crucial mass adduct clustering in order to further simplify the data for finding the complete sets of mass ladder components needed for sequencing. From the sigmoidal curves consisting of all the mass ladder components in the simplified 2D mass-RT plot (
In order to further increase the observed RT shift afforded by end-labeling, an RNA sample may be labeled with other bulky moieties, such as a hydrophobic cyanine 3 (Cy3) or cyanine 5 (Cy5) tag, to magnify the RT difference. Different tags were introduced, such as Cy3, which is bulky and can cause a greater RT shift than biotin (14), at the 5′ end of the original RNA strand to be sequenced; a biotin moiety was introduced to the 3′ end of the RNA as described before. These end labels should systematically affect the RT of all 5′ and 3′ ladder fragments so as to differentiate the two ladder curves for sequencing, which was confirmed by in silico studies (
Despite various reported RNA labeling methods, it remains a challenge to introduce tags, like biotin or fluorescent dyes, onto RNA with high yield. However, labeling two ends of RNA with selected tags is a step of the direct RNA sequencing method disclosed herein. The labeling efficiency directly determines how much of an RNA sample can be used to generate MS signals, with a higher labeling efficiency leading to a reduced sample requirement. To increase the labeling efficiency, new labeling strategies have been explored and high labeling efficiency has been demonstrated at both the 5′ and 3′ end (
The new end labeling-LC/MS sequencing strategy was then applied to a synthetic sample containing a modified nucleobase. Pseudouridine (ψ) is the most abundant and widespread of all modified nucleotides found in RNA. It is present in all species and in many different types of RNAs, including both coding RNAs (mRNAs) and non-coding RNAs (16). However, it is impossible to distinguish ψ from U directly by MS because they have identical masses. An established chemical labeling approach was previously developed to distinguish ψ from U, relying on a nucleophilic addition with N-cyclohexyl-N′-(2-morpholinoethyl)-carbodiimide metho-p-toluenesulfonate (CMC) to form a CMC-ψ adduct (17). The CMC-ψ adduct stalls reverse transcription and terminates the cDNA one nucleotide downstream (towards the 3′ end) of the adduct, and is currently used to detect ψ sites in various RNAs at single-base resolution (18). Here, the same chemistry is adapted to form the same CMC-ψ adduct in our system (
Automated sequencing was applied to RNA #12 and #13 after acid degradation by formic acid. In the 2D mass-RT plot (
Sequencing an RNA Mixture with Multiple Modifications
Finally, with the end-labeling and ψ base-modification methods in hand, it was next sought to increase the throughput of the method in order to sequence a multiplex RNA sample (simultaneous sequencing of a mixed sample containing multiple distinct RNA sequences) containing RNA strands with multiple modifications. A sample mixture of 12 RNAs with distinct sequences, comprising 11 unmodified RNAs and one multiply-modified RNA containing 1 ψ and 1 m5C, was subjected to the protocol. First, the 3′ ends of all RNA samples were chemically labeled with biotin, while sulfo-Cy3 was added to the 5′ ends (except for the RNA strand containing the base modifications). After measurement by LC/MS, the data were analyzed using Agilent MassHunter Qualitative Analysis software with optimized MFE settings to extract data for sequence generation. With the improvements in labeling efficiency described above, it was possible to detect all ladder fragments needed to accurately read out the complete sequences of all RNAs in the mixture. In the analysis of the multiplexed samples, the typical basecalling algorithm (as was used in all previous figures) was not used. These sequences were base-called manually, and all sequences could be read out (
Previous MS-based RNA sequencing methods controlled degradation conditions to generate well-defined mass ladders with single cuts for sequencing, as opposed to the unwanted appearance of multiple-cut fragments (14). As such, a 5 min formic acid treatment was performed to digest ˜10% of a 20 nt (RNA #3) sample into its corresponding 5′- and 3′-sequencing ladders to minimize formation of internal RNA fragments with more than one cut (14). Thus, ˜90% of the starting material remained intact and could not yield any sequence information. For real biological samples with low abundance, the fact that ˜90% of the sample would be unusable for sequencing means that the method cannot generate enough signal to accurately sequence these low-abundance samples. In order to increase the percentage of usable sample, a longer degradation step is required. However, generating more of the desired ladder fragments in a longer chemical/enzymatic degradation step will also produce large amounts of internal fragments, which do not possess a 5′ or 3′ end from the original RNA sequence because more than one cut has occurred on a given strand (this is a stochastically-controlled process). The previous method (14) disregarded internal fragments simply as “noise” as they were not a part of the RNA ladders that were actually used in determining the sequence of bases and modification analysis. Although there is still inherent information in these internal fragments, utilizing information from internal fragments effectively is difficult because these sequences are mixed with the desired ladder compounds, especially for fragments in the lower mass regions with mass less than 2000 Daltons (Da). In this low mass region, monomer, dimer, and trimer nucleotides from any part of a given RNA strand cannot be easily separated in the LC phase of the LC/MS, leading to difficulty in accurate sequence identification and analysis.
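The trade-off between usable sample and internal-fragment formation can be illustrated with a simple probabilistic sketch. It assumes, purely for illustration, that each phosphodiester bond is cleaved independently with the same probability p (an idealization of the random, sequence-independent cleavage described herein, not a kinetic model from this disclosure). Under that assumption, the number of cuts on an n-nt strand with n−1 bonds follows a binomial distribution. The function names are hypothetical.

```python
def per_bond_cleavage_probability(n_nt, fraction_degraded):
    """Per-bond probability p such that (1 - p)**(n_nt - 1) strands stay intact.

    ASSUMPTION: all n_nt - 1 bonds are cleaved independently with equal p.
    """
    bonds = n_nt - 1
    return 1.0 - (1.0 - fraction_degraded) ** (1.0 / bonds)


def cut_distribution(n_nt, p):
    """Return (intact, single_cut, multi_cut) strand fractions for one strand length."""
    bonds = n_nt - 1
    intact = (1.0 - p) ** bonds
    single = bonds * p * (1.0 - p) ** (bonds - 1)
    return intact, single, 1.0 - intact - single


# 20-nt strand at increasing degrees of degradation (0.10 matches the ~10% case above).
for degraded in (0.10, 0.30, 0.50):
    p = per_bond_cleavage_probability(20, degraded)
    intact, single, multi = cut_distribution(20, p)
    share = multi / (single + multi)
    print(f"degraded={degraded:.2f}  single-cut={single:.3f}  "
          f"multi-cut={multi:.3f}  multi-cut share of cleaved strands={share:.3f}")
```

Under this idealized model, strands with exactly one cut each contribute one 5′ and one 3′ ladder fragment, whereas strands with two or more cuts also contribute internal fragments, and the share of such strands rises as degradation is extended.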
However, separation of desired ladder fragments from internal fragments by double-end labeling of the original sample before acid degradation makes it possible to actually take advantage of the previously unused internal fragments. It is proposed to gather and apply information from the internal fragments with more than one cut towards sequence generation/alignment where there are gaps (ironically generated from the same long acidic degradation step that generated the internal fragments) in the reported sequence greater than one missing base as observed in the sequence curve of the 2-D mass-RT plot of an RNA sample which has been subjected to a 60 min degradation step. As shown in
Development of 2D-mass-RT direct RNA sequencing methodology brings the power of MS-based laddering technology to RNA, addressing a long-standing unmet need in the broad field of RNA modification studies. Not only does it provide a direct method for RNA sequencing without the need of a cDNA intermediate, it also provides a general method for sequencing multiple base modifications on multiple RNA strands in one single experiment. The developed method has been proven successful to sequence short single strands of synthetic RNA (˜20 nucleotides) (
Accordingly, the sequencing methods disclosed herein can facilitate the efficient sequencing of modified RNA molecules, including, for example, tRNAs, siRNAs, therapeutic synthetic oligoribonucleotides having pharmacological properties, mixtures of RNA molecules, as well as detection of modifications of such RNA molecules. This approach may be expanded to sequence cellular RNAs with known chemical modifications, such as endogenous tRNA and mRNA, to benchmark the method's efficacy in read length and identification of extensive modifications. It is expected that this direct MS-based RNA sequencing method will facilitate the discovery of more unknown modifications along with their location and abundance information, which no other established sequencing methods are currently capable of. With continued improvements in read length, this direct sequencing strategy can be expanded to sequence longer RNAs, such as mRNA and long non-coding RNA, and pinpoint the chemical identity and position of nucleotide modifications.
The following RNA oligonucleotides were obtained from Integrated DNA Technologies (Coralville, Iowa, USA) and used without further purification.
Formic acid (98-100%) was purchased from Merck (Darmstadt, Germany). Biotinylated cytidine bisphosphate (pCp-biotin), {Phos (H)}C{BioBB}, was obtained from TriLink BioTechnologies (San Diego, Calif., USA). Adenosine-5′-5′-diphosphate-{5′-(cytidine-2′-O-methyl-3′-phosphate-TEG}-biotin, A(5′)pp(5′)Cp-TEG-biotin-3′, was synthesized by ChemGenes (Wilmington, Mass., USA). T4 DNA ligase 1, T4 DNA ligase buffer (10×), the adenylation kit including reaction buffer (10×), 1 mM ATP, and Mth RNA ligase were obtained from New England Biolabs (Ipswich, Mass., USA). ATPγS and T4 polynucleotide kinase (3′-phosphatase free) were obtained from Sigma-Aldrich (St. Louis, Mo., USA). Biotin maleimide was purchased from Vector Laboratories (Burlingame, Calif., USA). Cyanine3 maleimide (Cy3) and sulfonated Cyanine3 maleimide (sulfo-Cy3) were obtained from Lumiprobe (Hunt Valley, Md., USA). The streptavidin magnetic beads were obtained from Thermo Fisher Scientific (Waltham, Mass., USA). Chemicals needed for conversion of pseudouridine including CMC (N-cyclohexyl-N′-(2-morpholinoethyl)-carbodiimide metho-p-toluenesulfonate), bicine, urea, EDTA and Na2CO3 buffer, were obtained from Sigma-Aldrich (St. Louis, Mo., USA).
(1) Chemical conversion of pseudouridine was applied for distinguishing pseudouridine from uridine. (2) Labels were added on one or both ends of RNA strands with optimized experimental procedures. (3) The single RNA strand or mixtures of RNA strands was/were degraded into a series of short, well-defined fragments (sequence ladder), ideally by random, sequence context-independent, and single-cut cleavage of phosphodiester bonds on each RNA strand over its entire length, through a 2′-OH-assisted acidic hydrolysis mechanism. (4) If needed, biotinylated RNA was physically separated from unlabeled RNA using streptavidin-coated magnetic beads. (5) The digested fragments were then subjected to LC/MS analysis and the deconvoluted masses and RTs were analyzed to identify each ladder fragment. (6) Algorithms were applied to automate the data processing and sequence generation process.
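The sequence generation in steps (5) and (6) can be sketched as follows: sort the ladder fragment masses, take the difference between each adjacent pair, and match that difference to a table of nucleotide residue masses. This is a minimal illustration and not the algorithm actually used herein. The function name, the fixed 0.01 Da tolerance, and the synthetic ladder are assumptions; the residue masses are monoisotopic nucleoside monophosphate masses minus water, rounded to four decimals; and the sketch ignores retention time, adduct removal, and ladder orientation.

```python
from itertools import accumulate

# Monoisotopic residue masses (nucleoside monophosphate minus H2O), in Da.
# m5C is C plus one methyl group; isomeric methyl-C modifications share this mass.
RESIDUE_MASS = {
    "A": 329.0525,
    "C": 305.0413,
    "G": 345.0474,
    "U": 306.0253,
    "m5C": 319.0570,
}


def call_bases(ladder_masses, tol_da=0.01):
    """Call one base per adjacent pair of ladder masses.

    Returns a list of residue names, with "?" where no table entry lies
    within tol_da (an ASSUMED tolerance) of the observed mass difference.
    """
    masses = sorted(ladder_masses)
    calls = []
    for lower, upper in zip(masses, masses[1:]):
        delta = upper - lower
        best = min(RESIDUE_MASS, key=lambda base: abs(RESIDUE_MASS[base] - delta))
        calls.append(best if abs(RESIDUE_MASS[best] - delta) <= tol_da else "?")
    return calls


# Synthetic ladder: a HYPOTHETICAL 1000.0 Da starting fragment (e.g. tag plus
# first residues) extended one residue at a time.
true_sequence = ["G", "A", "U", "m5C", "C"]
ladder = list(accumulate([1000.0] + [RESIDUE_MASS[b] for b in true_sequence]))
print(call_bases(ladder))
```

A base that is mass-identical to a canonical nucleotide, such as unconverted pseudouridine, cannot be distinguished by this matching step alone, which is why the chemical conversion in step (1) precedes it.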
A two-step protocol was used. (1) Adenylation: The following reaction was set up with a total reaction volume of 10 μL in an RNase-free, thin walled 0.5 mL PCR tube: 1× adenylation reaction buffer (5′ adenylation kit), 100 μM of ATP, 5.0 μM of Mth RNA ligase, 10.0 μM pCp-biotin, and nuclease-free, deionized water (Thermo Fisher Scientific, USA). The reaction was incubated in a GeneAmp™ PCR System 9700 (Thermo Fisher Scientific, USA) at 65° C. for 1 hour followed by the inactivation of the enzyme Mth RNA ligase at 85° C. for 5 minutes. (2) Ligation: A 30 μL reaction solution contained 10 μL of reaction solution from the adenylation step, 1× reaction buffer, 5 μM target RNA sample, 10% (v/v) DMSO (anhydrous dimethyl sulfoxide, 99.9%, Sigma-Aldrich, USA), T4 RNA ligase (10 units), and nuclease-free, deionized water. The reaction was incubated overnight at 16° C., followed by column purification.
For the one-step protocol, A(5′)pp(5′)Cp-TEG-biotin-3′ was applied to improve the labeling efficiency by eliminating the adenylation step, while simplifying the labeling method. The ligation step was carried out in a 30 μL reaction solution containing 1× reaction buffer, 5 μM target RNA sample, 10 μM A(5′)pp(5′)Cp-TEG-biotin-3′, 10% (v/v) DMSO, T4 RNA ligase (10 units), and nuclease-free, deionized water. The reaction was incubated overnight at 16° C., followed by column purification. Oligo Clean & Concentrator (Zymo Research, Irvine, Calif., U.S.A.) was used to remove enzymes, free biotin, and short oligonucleotides.
Biotin labeling at the 5′ end required two steps. An RNase-free, thin walled PCR tube (0.5 mL) containing 10× reaction buffer, 90 μM of RNA, 1 mM of ATPγS, and 10 units of T4 polynucleotide kinase, with the total reaction volume brought to 10 μL with nuclease-free, deionized water, was incubated for 30 minutes at 37° C. Then 5 μL of biotin maleimide that was dissolved in 312 μL anhydrous DMF (anhydrous dimethyl sulfoxide, 99.9%, Sigma-Aldrich, USA) was added and mixed by vortexing, and the sample was incubated for 30 minutes at 65° C. Column purification using Oligo Clean & Concentrator was performed as described above.
A different tag, such as a hydrophobic Cy3 (cyanine 3) or Cy5 (cyanine 5) tag, was introduced to the 5′ end by the same method as above (except that Cy3 maleimide or sulfo-Cy3 maleimide replaced the biotin maleimide), to distinguish its ladder from the 3′ biotinylated ladder. The reaction conditions were optimized, compared to the above described 2-step protocol, to obtain high labeling efficiency in the following manner: 1) sulfo-Cy3 was used for its high water solubility, with a molar ratio of reactants of 50:1 (sulfo-Cy3 to RNA); 2) the pH of the reaction solution was adjusted to 7.5 with Tris-HCl buffer (1 M) at a final concentration of 50 mM; and 3) the reaction time was lengthened to overnight (16 hrs) with constant stirring.
Unless otherwise indicated, formic acid was applied to degrade full length RNA samples for producing mass ladders (30, 31). Each RNA sample solution was divided into three equal aliquots for formic acid degradation using 50% (v/v) formic acid at 40° C., with one reaction running for 2 min, one for 5 min, and one for 15 min. For the experiments regarding generation of internal fragments (
Biotin/streptavidin capture uses streptavidin-coated magnetic beads to bind biotin-labeled RNAs, which are selectively immobilized on the beads and drawn to a magnet. Bound RNAs are therefore separated from non-biotin-labeled RNAs and impurities (which remain in solution and are washed away) and can later be eluted from the beads for LC-MS sequencing analysis. For the sample in
Chemistry for Differentiating Pseudouridine from Uridine
The experimental approach to modify pseudouridine was performed according to the report by Bakin and Ofengand (Bakin, A.; Ofengand, J. Biochemistry 1993, 32 (37), 9754-62). Each RNA sample (1 nmol) was treated with 0.17 M CMC in 50 mM Bicine, pH 8.3, 4 mM EDTA, and 7 M urea at 37° C. for 20 min in a total reaction volume of 90 μL. The reaction was stopped with 60 μL of 1.5 M NaOAc and 0.5 mM EDTA, pH 5.6 (buffer A). After purification using an Oligo Clean & Concentrator, 60 μL of 0.1 M Na2CO3 buffer, pH 10.4 was added into the solution, brought to a reaction volume of 120 μL, and incubated at 37° C. for 2 h. The reaction was stopped with buffer A and purified by Oligo Clean & Concentrator.
Samples were separated and analyzed on a 6550 Q-TOF mass spectrometer coupled to a 1290 Infinity LC system equipped with a MicroAS autosampler and Surveyor MS Pump Plus HPLC system (Agilent Technologies, Santa Clara, Calif., USA) (Hunter Mass Spectrometry, NY, USA). All separations were performed by reversed-phase HPLC using an aqueous mobile phase (A) of 25 mM hexafluoro-2-propanol (HFIP) (Thermo Fisher Scientific, USA) with 10 mM diisopropylamine (DIPA) (Thermo Fisher Scientific, USA) at pH 9.0 and an organic mobile phase (B) of methanol, across a 50 mm×2.1 mm Xbridge C18 column with a particle size of 1.7 μm (Waters, Milford, Mass., USA). The flow rate was 0.3 mL/min, and all separations were performed with the column temperature maintained at 35° C. Injection volumes were 20 μL, and sample amounts were 15-400 pmol of RNA. Data were recorded in negative polarity. The sample data were acquired using the MassHunter Acquisition software (Agilent Technologies, USA). To extract relevant spectral and chromatographic information from the LC-MS experiments, the Molecular Feature Extraction workflow in MassHunter Qualitative Analysis (Agilent Technologies, USA) was used. This proprietary molecular feature extractor algorithm performs untargeted feature finding in the mass and retention time dimensions. In principle, any software capable of compound identification could be used. The software settings were varied depending on the amount of RNA used in the experiment. In general, the goal was to include as many identified compounds as possible, up to a maximum of 1000. For samples with low concentrations, profile spectral peaks were filtered using a signal-to-noise ratio (SNR) threshold of 5 and, for more concentrated samples, an SNR threshold of up to 20.
The other algorithm settings were as follows: “Small Molecules (chromatographic)” extraction algorithm, charge states from −1 to −15, only loss of hydrogen (—H) ions, “Common Organic Molecules” isotope model, minimum quality score 70 (range 0-100), and minimum ion count 500.
In addition to automated sequence generation, RNA sequences were also read manually to confirm the accuracy of the automated sequencing. These sequences were read out manually from the data extracted by the Molecular Feature Extraction (MFE) algorithm integrated in Agilent's MassHunter Qualitative Analysis software. Tables S1-S38 provide the theoretical mass of each fragment (obtained by ChemDraw), base mass, base name, observed mass, RT, volume (peak intensity), quality score, and ppm mass difference. All figures presented are representative data of multiple experimental trials (n≥3). For ease of visualization, the 5′-sulfo-Cy3 labeled mass ladders and the 3′-biotinylated mass ladders were plotted separately (i.e., 3′-biotinylated mass ladders were all plotted in FIG. 20A and the 5′-sulfo-Cy3 labeled mass ladders were all plotted in
The first step of the LC/MS data analysis is data pre-processing and reduction, which makes the LC/MS data less noisy and therefore makes it easier to read out the RNA sequence(s) in the next step. The multi-dimensional LC/MS data offer several dimensions that can be used to pre-process the data and reduce its volume, such as Retention Time (RT), Intensity (Volume), and Quality Score (QS). Please see Supplementary Information for details on data processing and modifications to the sequencing algorithm. The source code of the revised algorithm is available. Further improvement of the algorithm will enable one to automate base-calling and modification identification when sequencing more complicated cellular RNAs.
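The pre-processing step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the revised algorithm itself: the `Feature` record, the function name `reduce_features`, and the default thresholds are assumptions chosen for the example (the 6 to 10 min RT window and the quality score of 70 echo values quoted elsewhere in this description for a biotin-labeled 20 nt RNA).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One extracted compound (hypothetical record layout)."""
    mass: float     # observed monoisotopic mass, Da
    rt: float       # retention time, min
    volume: float   # peak volume (intensity)
    quality: float  # quality score, 0-100


def reduce_features(features, rt_window=(6.0, 10.0),
                    min_quality=70.0, min_volume=0.0):
    """Keep only features inside the RT window that also meet the
    quality-score and volume thresholds (all thresholds illustrative)."""
    rt_low, rt_high = rt_window
    return [
        f for f in features
        if rt_low <= f.rt <= rt_high
        and f.quality >= min_quality
        and f.volume >= min_volume
    ]
```

Filtering on each dimension independently keeps the sketch easy to adapt: a different label or RNA length would only change the arguments, not the logic.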
Understanding the dynamics of cellular RNA modifications (20, 21) requires a method to quantify the stoichiometry/percentage of RNA with site-specific modifications vs. its canonical counterpart RNA, as base modifications may not occur on 100% of all identical RNA sequences in a cell or sample. When the above quantification strategy is applied to other sequences, this method is expected to allow one to accurately determine the percentage of RNA with any mass-altered modification vs. its corresponding non-modified counterpart. As shown in
To illustrate how adding a 5′ tag spatially separates ladders on a retention time (RT) vs. mass plot, a simulated mass spectrum peak set for both the 5′ and 3′ ladders of a synthetic, unmodified A10 (10-mer of polyadenine) sequence was first generated in silico. Each row represents a given mass ladder peak, and each peak was assigned a unitless retention time (RT) and an arbitrary, constant, unitless peak volume of 1000. The RT assigned for each ladder increased systematically with increasing mass, starting at 0 and increasing in 0.1 unit increments. The peak list for the simulated A10 mass spectrum was as follows:
The mass ladder starting from 347.063065 represents the 5′ mass ladder, while the mass ladder starting from 267.096732 represents the 3′ mass ladder.
Next, a simulated mass spectrum peak set for both the 5′ and 3′ ladders of a synthetic, 5′-cyanine 3 (Cy3)-labeled A10 (10-mer of polyadenine) sequence was generated in silico. This was done by taking the data set above and adding the additional mass afforded by a 5′-Cy3 label (614.3061) to each member of the 5′-ladder in the data set. The peak volumes did not change. The associated RT for this new Cy3-labeled 5′-ladder was generated by starting from an RT of 10 and decreasing in increments of 0.2 with increasing mass. This was done to simulate the potential change to an RT vs. mass spectrum of any end-labeled ladder (in this case, 5′-Cy3-labeled) in absolute RT values, RT trends (a monotonically increasing curve becoming a monotonically decreasing curve, for example), and absolute mass values. Of course, real changes in all of these values in a real system cannot be predicted exactly in silico, so this should be taken only as a proof-of-principle example. The peak list for the simulated 5′-Cy3-labeled A10 mass spectrum was as follows:
The mass ladder starting from 961.369165 represents the 5′-Cy3-labeled mass ladder, while the mass ladder starting from 267.096732 represents the 3′ mass ladder.
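The in silico construction described above can be reproduced approximately with the following sketch. The starting masses, the Cy3 mass shift, the RT rules, and the constant volume of 1000 come from the text; the per-nucleotide mass step (the standard monoisotopic mass of an adenosine monophosphate residue, about 329.0525 Da) and the choice of ten peaks per ladder are assumptions made for illustration, and `simulate_ladder` is a hypothetical helper name.

```python
# Assumed per-step mass: monoisotopic adenosine monophosphate residue (Da).
A_RESIDUE_MASS = 329.0525
CY3_LABEL_MASS = 614.3061   # 5'-Cy3 mass shift quoted in the text
N_PEAKS = 10                # assumption: one peak per position of A10


def simulate_ladder(start_mass, rt_start, rt_step,
                    n_peaks=N_PEAKS, volume=1000.0):
    """Return (mass, RT, volume) rows for one simulated mass ladder."""
    return [
        (round(start_mass + i * A_RESIDUE_MASS, 6),
         round(rt_start + i * rt_step, 6),
         volume)
        for i in range(n_peaks)
    ]


# Unlabeled A10: both ladders start at RT 0 and rise by 0.1 per step.
ladder_5p = simulate_ladder(347.063065, rt_start=0.0, rt_step=0.1)
ladder_3p = simulate_ladder(267.096732, rt_start=0.0, rt_step=0.1)

# 5'-Cy3-labeled A10: shift every 5' mass by the label mass; RT starts
# at 10 and falls by 0.2 per step. The 3' ladder is unchanged.
ladder_5p_cy3 = simulate_ladder(347.063065 + CY3_LABEL_MASS,
                                rt_start=10.0, rt_step=-0.2)
```

Plotting RT against mass for `ladder_5p` and `ladder_3p`, and then for `ladder_5p_cy3` and `ladder_3p`, gives the two plots being compared.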
Comparing these two RT vs. mass plots, one sees that the two mass ladder curves are almost superimposed when there is no end-labeling (
In addition to automated sequence generation, one can also manually search for the mass ladders using the Molecular Feature Extraction (MFE) workflow in MassHunter Qualitative Analysis (Agilent Technologies), to confirm the accuracy of the automated sequencing. Tables S1-S38 provide the theoretical mass of each fragment (obtained by ChemDraw), base mass, base name, observed mass, RT, volume (peak intensity), quality score, and error expressed as ppm (calculated by the equation below). The MFE settings were optimized to extract as many identified compounds as possible while retaining reasonable quality scores. The MFE settings applied were as follows: “centroid data format, small molecules (chromatographic), peak with height≥500, quality score≥70”. Where needed, data reduction was performed to simplify algorithmic sequencing. For instance, the retention time window could be restricted to 6 to 10 min for biotin-labeled samples of a 20 nt RNA. Also, the numbers of input compounds used for algorithm analysis are generally an order of magnitude higher than the numbers of ladder fragments needed for generating complete sequences, unless indicated otherwise; these input compounds are selected from all MFE-extracted compounds, typically as those with higher volumes and/or better quality scores.
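The following formula was used to calculate the ppm mass error described in Example 8:
$$\mathrm{ppm} = 10^{6} \times \frac{\mathrm{Mass}_{\mathrm{theoretical}} - \mathrm{Mass}_{\mathrm{observed}}}{\mathrm{Mass}_{\mathrm{theoretical}}}$$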
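As a small worked illustration of this mass-error calculation, the sketch below uses the conventional parts-per-million scaling, in which the relative mass difference is multiplied by 10^6. The function name `ppm_error` and the example masses are hypothetical and are not taken from Tables S1-S38.

```python
def ppm_error(theoretical_mass, observed_mass):
    """Signed mass error in parts per million, relative to the
    theoretical mass (positive when the observed mass is lower)."""
    if theoretical_mass <= 0:
        raise ValueError("theoretical_mass must be positive")
    return 1e6 * (theoretical_mass - observed_mass) / theoretical_mass


# Hypothetical example: a 1000.000 Da fragment observed at 999.990 Da
# is off by 0.010 Da, i.e. about 10 ppm.
example_ppm = ppm_error(1000.0, 999.99)
```

Because the error is signed, a systematic calibration offset shows up as errors that are consistently positive or consistently negative across a ladder.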
The data were processed as follows:
For
Take the top 500 by volume (above and including 1219)
Take the top 1000 by volume (above and including 33693)
Take the top 500 by volume (above and including 241698)
Take the top 1000 by volume because the CMC-labeling efficiency was somewhat low (above and including 63110)
Take the top 300 by volume (above and including 121230)
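The "top N by volume" reductions listed above amount to ranking the extracted compounds by peak volume and keeping the N largest, with the smallest retained volume acting as the inclusive cutoff. A minimal sketch follows, assuming each compound is a dictionary with a `volume` key; that layout and the name `top_n_by_volume` are illustrative choices, not the format used by the original software.

```python
def top_n_by_volume(compounds, n):
    """Return the n compounds with the largest volume, together with
    the smallest volume that was kept (the inclusive cutoff)."""
    if n <= 0:
        return [], None
    ranked = sorted(compounds, key=lambda c: c["volume"], reverse=True)
    kept = ranked[:n]
    cutoff = kept[-1]["volume"] if kept else None
    return kept, cutoff


# Hypothetical compound list, for illustration only.
compounds = [
    {"mass": 961.37, "volume": 1500.0},
    {"mass": 1290.42, "volume": 900.0},
    {"mass": 1619.47, "volume": 4200.0},
    {"mass": 1948.53, "volume": 300.0},
]
kept, cutoff = top_n_by_volume(compounds, 2)
```

Reporting the cutoff alongside the retained list mirrors the "above and including" volume values quoted for each data set.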
The second step is to analyze the LC/MS data and automatically recognize the RNA sequences.
A modified version of the algorithm from [JACS 2015] was used.
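For orientation only, the general idea of reading a sequence from a mass ladder can be sketched as follows. This is not the algorithm from the cited work or its modified version. It simply matches the mass difference between adjacent ladder peaks against the standard monoisotopic residue masses of the four canonical ribonucleotides, within a ppm tolerance. The mass table, the tolerance, and the function names are assumptions introduced for this example.

```python
# Standard monoisotopic nucleotide residue masses in Da (assumed values:
# nucleoside monophosphate minus water).
NUCLEOTIDE_MASSES = {
    "A": 329.0525,
    "C": 305.0413,
    "G": 345.0474,
    "U": 306.0253,
}


def call_base(mass_difference, tol_ppm=10.0):
    """Return the canonical base whose residue mass matches the
    difference within tol_ppm, or None when nothing matches."""
    for base, residue_mass in NUCLEOTIDE_MASSES.items():
        error_ppm = 1e6 * abs(mass_difference - residue_mass) / residue_mass
        if error_ppm <= tol_ppm:
            return base
    return None


def read_ladder(ladder_masses, tol_ppm=10.0):
    """Call one base per adjacent pair of ladder masses, in order of
    increasing mass. Unmatched steps are reported as None."""
    ordered = sorted(ladder_masses)
    return [call_base(high - low, tol_ppm)
            for low, high in zip(ordered, ordered[1:])]
```

A step that returns `None` would correspond either to a missing ladder fragment or to a mass-altering modification that is absent from the table. Whether the resulting string reads 5′ to 3′ or 3′ to 5′ depends on which end-labeled ladder was supplied.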
A modification was first made to the default.cfg file:
1) the requirement of a strictly monotonically increasing or decreasing sequence plot was deleted
2) a mass filtering step was disabled:
3) For
Additional changes for specific figures were done according to the following:
For
Take the top 500 by Volume (above and including 3486)
The plotting direction was also flipped (changes in bold):
All patents, patent applications and references cited throughout the specification are expressly incorporated by reference.
This application claims benefit and priority to U.S. Provisional Application Nos. 62/676,703, filed May 25, 2018; 62/730,592, filed Sep. 13, 2018; 62/800,054, filed Feb. 1, 2019; and 62/833,964 filed Apr. 15, 2019, which are all incorporated herein by reference in their entireties.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/033920 | 5/24/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62676703 | May 2018 | US | |
62730592 | Sep 2018 | US | |
62800054 | Feb 2019 | US | |
62833964 | Apr 2019 | US |