The present disclosure relates generally to novel algorithms developed for liquid chromatography-mass-spectrometry (LC-MS) based RNA sequencing techniques that rely on end-labeling of the RNA to be sequenced and on fragmented RNA ladders that cover the complete suite of ladder fragments from the first ribonucleotide to the final one. The algorithms simultaneously read a target RNA sequence with single nucleotide resolution and determine the presence, type, location, and quantity of a wide spectrum of target RNA modifications. The disclosed algorithms introduce computational simulations resulting in reciprocal verification between experimental data and simulated data. The simulation provides a means for sequencing RNA molecules of increased length as well as RNA samples with an increased number of strands and greater population diversity.
Mass spectrometry (MS) is a tool for studying protein modifications, where peptide fragmentation produces “ladders” that reveal the identity and position of various amino acid modifications. To date, a similar approach is not feasible for nucleic acids, because in situ fragmentation techniques providing satisfactory sequence coverage do not exist. Aberrant nucleic acid modifications, especially methylations and pseudouridylations in RNA, have been correlated to the development of major diseases like breast cancer, type-2 diabetes, and obesity, each of which affects millions of people around the world. Despite their significance, the available tools to reliably identify, locate, and quantify modifications in RNA are very limited.
Accordingly, new methods are needed to facilitate the efficient sequencing of RNA molecules.
To enable automated direct sequencing of RNA, algorithms with improved accuracy are desired, given that LC/MS data contains data from multiple-cut RNA fragments, making it difficult to analyze, especially for the sequences to be generated from the lower mass regions where smaller degraded RNA fragments are located. The present disclosure relates to development of an algorithm for use with mass RNA laddering sequencing methods.
In accordance with aspects of the present disclosure, a computer implemented method for determining an order of nucleotides of an RNA molecule is presented. The method includes: receiving liquid chromatography-mass-spectrometry (LC-MS) data of an RNA sample, filtering the LC-MS data based on mass, analyzing the filtered LC-MS data to determine an RNA sequence, and reading-out an RNA sequence as a sequence read based on determining no remaining valid nucleotides in the remaining LC-MS data. The RNA sequence includes a sequence order of each identified canonical nucleotide and any identified modified nucleotides. The LC-MS data includes a mass, retention time (RT), volume, and quality score (QS). The filtering includes removing masses smaller than a predetermined size. Analyzing the filtered LC-MS data includes: determining a mass difference between at least two adjacent ladder fragments, and determining whether the mass difference is equal to at least one of a canonical nucleotide, or a modified nucleotide.
In an aspect of the present disclosure, the method may further include: determining whether there are any gaps in the sequenced LC-MS data, determining whether there are any remaining RNA fragments that did not yield a valid nucleotide based on the gaps, performing a hierarchical clustering algorithm on the compounds to identify possible nucleotides from their related mass-adducts, determining the mass of an RNA fragment for each cluster based on item-wise comparison between the identified mass-adducts and the cluster of masses, predicting a ladder fragment based on the determined mass for each cluster, reading-out an RNA sequence based on the predicted ladder fragment, and reporting the RNA sequence. The hierarchical clustering algorithm includes: determining a distance metric based on a mass as well as RT for the RNA fragment; and grouping RNA fragments into a cluster of masses, based on their mass relationship, such that each cluster includes possible mass-adducts of a true ladder fragment. The RNA sequence selected to report out can include the nucleotide identified from any mass-adducts.
In another aspect of the present disclosure, a length of the RNA molecule is more than 20 nucleotides.
In an aspect of the present disclosure, one or more RNA molecules are present in the RNA sample to be sequenced.
In yet another aspect of the present disclosure, the RNA sample includes a purified RNA sample.
In a further aspect of the present disclosure, the RNA sample includes a therapeutic RNA molecule.
In an aspect of the present disclosure, the RNA sequence is determined by correlation of MS data output with a mass of known ribonucleotides.
In a further aspect of the present disclosure, the method further includes determining a type, location, and quantity of modified ribonucleotides based on correlating mass-spectrometry (MS) data output with a mass of known modified ribonucleotides.
In yet another aspect of the present disclosure, the sequencing of the filtered LC-MS data is based on a unique property of the RNA fragment. In a further aspect of the present disclosure, the unique property of an RNA fragment includes at least one of electronic or optical signature signals.
In accordance with aspects of the present disclosure, a system for determining an order of nucleotides of an RNA molecule is presented. The system includes one or more processors and a memory. The memory stores instructions which, when executed by the one or more processors, cause the system to: receive liquid chromatography-mass-spectrometry (LC-MS) data of an RNA sample, the LC-MS data including a mass, retention time (RT), volume, and quality score (QS); filter the LC-MS data based on mass, the filtering including removing masses smaller than a predetermined size; analyze the filtered LC-MS data to determine a plurality of RNA sequences; and read out an RNA sequence as a sequence read after determining no remaining valid nucleotides in the remaining LC-MS data. The RNA sequence includes a sequence of each identified canonical nucleotide and any identified modified nucleotides. Analyzing the filtered LC-MS data includes: determining a mass difference between at least two adjacent ladder fragments; and determining whether the mass difference is equal to at least one of: a canonical nucleotide, or a modified nucleotide.
In accordance with aspects of the present disclosure, a computer implemented method for determining an order of nucleotides of an RNA molecule is presented. The method includes accessing liquid chromatography-mass-spectrometry (LC-MS) data of an RNA sample, the RNA sample including an RNA ladder fragment; accessing a database including theoretical mass, calculated from chemical formula, of all known ribonucleotides including those with modifications to the base; performing anchor-based sub-setting on the LC-MS data, the anchor based sub-setting including selecting a data zone; performing base calling on the selected subset of LC-MS data to generate a dataset of tuples; building trajectories linking tuples in the dataset to generate a draft read of the RNA ladder fragment; and performing a draft read strategy.
In yet a further aspect of the present disclosure, the draft read strategy includes scoring based on at least one of read length, average volume, average QS, or average parts per million (PPM).
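In yet another aspect of the present disclosure, PPM is determined as:
$$\mathrm{PPM} = \frac{\mathrm{Mass}_{\mathrm{experimental}} - \mathrm{Mass}_{\mathrm{theoretical}}}{\mathrm{Mass}_{\mathrm{theoretical}}} \times 10^{6}$$
wherein: $\mathrm{Mass}_{\mathrm{experimental}}$ is an experimental mass corresponding to a molecular tag, and $\mathrm{Mass}_{\mathrm{theoretical}}$ is the theoretical mass.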
In a further aspect of the present disclosure, average PPM is the sum of all PPM values associated with data points contained in a draft read divided by read length.
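The PPM and average PPM quantities can be illustrated with a minimal Python sketch. It assumes the conventional definition of mass error in parts per million (experimental mass minus theoretical mass, divided by the theoretical mass, times one million) and assumes that the read length equals the number of PPM values in the draft read. The function names `ppm_error` and `average_ppm` are hypothetical and are not part of the disclosure.

```python
from typing import Sequence


def ppm_error(mass_experimental: float, mass_theoretical: float) -> float:
    """Relative mass error in parts per million (conventional definition, assumed)."""
    return (mass_experimental - mass_theoretical) / mass_theoretical * 1e6


def average_ppm(ppm_values: Sequence[float]) -> float:
    """Sum of the PPM values in a draft read divided by the read length.

    Assumption: the read length is the number of PPM values supplied.
    """
    if not ppm_values:
        raise ValueError("a draft read must contain at least one data point")
    return sum(ppm_values) / len(ppm_values)
```

A smaller absolute average PPM indicates that the data points in a draft read sit closer to their theoretical masses.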
In yet a further aspect of the present disclosure, building trajectories further includes performing a Depth First Search (DFS) algorithm to ensure that all possible draft reads will be found from the LC-MS data.
In yet another aspect of the present disclosure, the method further includes biochemical labeling of the RNA samples.
In a further aspect of the present disclosure, the draft read strategy includes a global hierarchical ranking strategy.
In an aspect of the present disclosure, the draft read strategy includes a local best score strategy. In another aspect of the present disclosure, the method further includes performing an alignment/assembly algorithm configured to assemble a complete RNA sequence from different fragments of the RNA molecule.
Further details and aspects of exemplary embodiments of the disclosure are described in more detail below with reference to the appended figures. Any of the above aspects and embodiments of the disclosure may be combined without departing from the scope of the disclosure.
Various embodiments of the present methods for RNA sequencing and algorithm are described herein with reference to the drawings wherein:
Although the present disclosure will be described in terms of specific embodiments, it will be readily apparent to those skilled in this art that various modifications, rearrangements, and substitutions may be made without departing from the spirit of the present disclosure. The scope of the present disclosure is defined by the claims appended hereto.
For purposes of promoting an understanding of the principles of the present disclosure, reference will now be made to exemplary embodiments illustrated in the drawings, and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the present disclosure is thereby intended. Any alterations and further modifications of the inventive features illustrated herein, and any additional applications of the principles of the present disclosure as illustrated herein, which would occur to one skilled in the relevant art and having possession of this disclosure, are to be considered within the scope of the present disclosure.
For automation of RNA sequencing, algorithms with improved accuracy are needed. The present disclosure relates to development of an algorithm for use with mass RNA laddering sequencing methods (for example, those described in U.S. Patent Ser. No. 62/833,964 which is incorporated herein by reference in its entirety). For a detailed discussion of LC/MS-based RNA sequencing, reference may be made to U.S. Patent Ser. No. 62/833,964 and “A general LC/MS-based RNA sequencing method for direct analysis of multiple-base modifications in RNA mixtures,” Zhang et al. (available at https://doi.org/10.1101/643387), the entire contents of which are incorporated by reference herein.
RNA sequencing is the process of determining the nucleic acid sequence, that is, the order of nucleotides in RNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and uracil. In addition to determining the nucleic acid sequence, the methods disclosed herein can also identify, locate, and quantify RNA modifications within the nucleic acid sequence.
The disclosed algorithm includes computational simulations resulting in reciprocal verification between experimental data and simulated data. The simulation provides a means for probing RNA molecules of increased length as well as diverse RNA samples having a mixture of RNA. A hierarchical clustering algorithm has been used to automate RNA sequence generation from the monoisotopic mass data obtained, for example, from Agilent's molecular feature algorithm. Although an example Python-based algorithm works well on short RNAs, it was found that when running LC/MS data from tRNA, it slowed down significantly and the error rates increased in the algorithm-generated RNA sequences, likely due to the increased computational workload from the datasets and the complexity of the tRNA samples. The 76 nucleotide long tRNA is substantially longer than the 20 nt RNAs for which this algorithm was originally derived. Furthermore, the tRNA has 11 different chemical modifications (see Table 1 below). The increase in both chemical modifications and RNA length not only challenged the capacity of the Python-based algorithms, but also made the error rate issues pronounced. For short RNAs of about 20 nucleotides, one can manually calculate the mass differences between two adjacent ladder components to verify the accuracy of each sequence readout from the algorithm. For longer RNA, this manual verification becomes more challenging and less efficient. For automation of RNA sequence generation and modification analysis, the development of more robust methods will provide a means for verifying the accuracy of MS-based sequencing data, especially as sequencing of more complicated and longer cellular RNA samples progresses. The algorithm disclosed herein is designed to improve the accuracy of RNA sequencing methods via a two-way sequencing reconfirmation.
The algorithm comprises the steps of (i) reading out from MS data to proposed draft sequence reads, (ii) simulation from the proposed draft sequence reads into ideal ladder patterns, and (iii) re-affirmation to see how well they fit.
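Steps (ii) and (iii) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the residue masses are approximate monoisotopic values for the four canonical nucleotide residues (nucleoside monophosphate minus water) supplied here for illustration, and the names `simulate_ladder`, `fit_fraction`, `start_mass`, and `tol_ppm` are hypothetical.

```python
from itertools import accumulate

# Approximate monoisotopic residue masses in Da (illustrative values, not from the source).
RESIDUE_MASS = {"A": 329.05252, "C": 305.04129, "G": 345.04744, "U": 306.02531}


def simulate_ladder(draft_read, start_mass):
    """Return the ideal ladder masses implied by a draft read.

    start_mass is the mass of the smallest ladder fragment (for example the
    labeled terminal unit); each further base adds one residue mass.
    """
    steps = [RESIDUE_MASS[base] for base in draft_read]
    return list(accumulate(steps, initial=start_mass))


def fit_fraction(simulated, observed, tol_ppm=10.0):
    """Fraction of simulated ladder masses that are found in the observed masses."""
    def found(mass):
        return any(abs(obs - mass) / mass * 1e6 <= tol_ppm for obs in observed)

    return sum(found(mass) for mass in simulated) / len(simulated)


ladder = simulate_ladder("GCA", 1000.0)
observed_masses = [1000.0, 1345.0475, 1650.0888, 1900.0]
fit = fit_fraction(ladder, observed_masses)  # 3 of the 4 simulated masses are observed
```

A draft read whose simulated ladder is poorly represented in the observed data would receive a low fit value and could be rejected or re-examined.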
Although MS-based RNA sequencing methods control degradation conditions to generate well-defined mass ladders for sequencing, the process of generating ladder fragments in the chemical/enzymatic degradation step can lead to the creation of internal fragments that do not possess a 3′ or 5′ end. Use of the algorithm disclosed herein provides a means for utilizing the internal fragments for sequence alignment by piecing them together via clustering undesired RNA oligonucleotide fragments and computational simulation. The algorithm of the disclosure also helps to increase the accuracy of sequence alignment for RNA with long sequences when fragmentation is utilized to produce shorter RNAs for use in, for example, MS-based sequencing.
In one aspect, the algorithm of the disclosure may be used in conjunction with a variety of different RNA sequencing methods. One such non-limiting method comprises the steps of: (i) affinity labeling of the 5′ and 3′ end of the RNA molecules; (ii) random degradation of the labeled RNA; (iii) optionally, 5′ and 3′ end labeled fragment separation; (iv) separation of resultant target RNA fragments using reverse-phase high performance liquid chromatography (HPLC); and (v) sequential analysis of resultant mass ladders with high-resolution mass spectrometry for sequence/modification identification. Such an RNA sequencing method is based on the formation and sequential physical separation of two ladder pools of degraded RNA fragments, referred to herein as 5′ and 3′ ladder pools, which are then subjected to LC/MS for HPLC and MS determination of the RNA sequence as well as the presence, type, location and quantity of RNA modifications. The algorithm disclosed herein is advantageously utilized to analyze the obtained LC/MS derived data.
In one aspect, the algorithm of the present disclosure may be used in conjunction with a variety of different RNA sequencing methods. One such non-limiting method comprises the steps of: (i) chemical labeling of the 5′ and 3′ end of the RNA molecules with different tags; (ii) random degradation of the labeled RNA; (iii) separation of resultant target RNA fragments using reverse-phase high performance liquid chromatography (HPLC); and (iv) sequential analysis of resultant mass ladders with high-resolution mass spectrometry for sequence/modification identification.
The disclosed algorithm recognizes the identities and locations of not only the four canonical ribonucleotides, but also different types of modified ribonucleotides, individually and/or in their sequential order, based on the fact that each type of nucleotide has unique mass and retention time (RT) features in LC-MS data. The algorithms automatically generate sequences that reveal the presence, type, location and quantity of a wide spectrum of different RNA modifications. The algorithms take advantage of the LC/MS characteristic features, including mass, retention time (RT), volume, and quality score, to generate sequence reads, and are able to generate the RNA sequences de novo, revealing the identity and location of each canonical ribonucleotide and non-canonical base modification. The data used for the algorithm development, including the mass, RT, volume and quality score (QS), were directly exported from the LC/MS workstation without any other processing. The algorithms were tested on tRNA (phenylalanine specific, from brewer's yeast), and their sequence readouts were verified to be accurate.
With reference to
In the auxiliary step, a hierarchical clustering algorithm 128 is used to identify related mass-adducts. In various embodiments, using a distance metric that takes into account the mass as well as RT, the hierarchical clustering algorithm 128 groups compounds based on their mass-relationship so that each cluster contains possible mass-adducts of a true ladder fragment. To cut down on the complexity of the data, points that have already been sequenced in the previous step, and thus their related mass clusters, will be excluded from the hierarchical clustering step. At step 130, once mass clusters have been identified, the masses will be tested against the masses of the adducts to determine the true mass of the ladder fragment that gave rise to the different mass-adduct fragments. The algorithm will create a new data point with the mass equal to the mass of the ladder fragment identified through the formula in
With reference to
In the event that there are RNA modifications on the 2′-hydroxyl group that block acidic degradation, a different approach will be adopted to fill the gap caused by the blocking group at the 2′-O position. RNA modifications, e.g., methylation on the 2′-hydroxyl group of RNA, render the adjacent 3′-5′-phosphodiester linkage non-hydrolysable and create a mass gap in both the 5′- and the 3′-mass ladder families that is larger than one nucleotide. As a result, it can be determined that the gap corresponds to a combination of two nucleotides with a single modification on the 2′-O position, but their order is unknown. To resolve such ambiguities, the computational simulation is used to match the observed LC/MS data 102 against the simulated 2′-O-modified sequence; the results from these analyses should match well if there is a modification at the 2′-O position. In addition, the complete nucleotide sequence can be assembled through conventional RNA sequencing platforms. Alternatively, collision induced dissociation (CID) MS can be performed on the 2′-O-modified dimer fragment to elucidate the structure of the dinucleotide fragment.
In various embodiments, the last step of the sequencing process is to harness the presence of multiple internal fragments in the data to function as a new sequence or a check for the final sequence. Masses that are not included in the mass clusters or used in the sequencing reads are divided by the average mass of the four canonical bases to estimate their sequence length. In various embodiments, sequences from 3 to 6 bases in length are compared to a list of generated masses of internal fragments that are 3 to 6 bases in length to find a precise match. These short fragments can be used to fill gaps in the sequence or confirm the accuracy of the sequence.
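The two-nucleotide mass gap described above can be illustrated with a short Python sketch that lists the unordered nucleotide pairs consistent with a gap. It assumes the blocking modification is a single 2′-O-methylation (a mass shift of one CH2, about 14.01565 Da) and uses approximate monoisotopic residue masses supplied for illustration; the names `candidate_dinucleotides`, `METHYL_SHIFT`, and `tol_da` are hypothetical. The order of the two nucleotides is deliberately left unresolved, as in the text.

```python
from itertools import combinations_with_replacement

# Approximate monoisotopic residue masses in Da (illustrative values, not from the source).
RESIDUE_MASS = {"A": 329.05252, "C": 305.04129, "G": 345.04744, "U": 306.02531}
METHYL_SHIFT = 14.01565  # mass added by one methyl group (CH2), assumed modification


def candidate_dinucleotides(gap_mass, tol_da=0.005):
    """Return unordered base pairs whose combined mass plus one methyl matches the gap."""
    hits = []
    for first, second in combinations_with_replacement(sorted(RESIDUE_MASS), 2):
        expected = RESIDUE_MASS[first] + RESIDUE_MASS[second] + METHYL_SHIFT
        if abs(gap_mass - expected) <= tol_da:
            hits.append((first, second))
    return hits


pairs = candidate_dinucleotides(648.10946)  # one A and one C, with one methyl group
```

Each returned pair still corresponds to two possible orders and two possible methylation sites, which is why simulation or CID MS is needed to resolve the ambiguity.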
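The length estimate and internal-fragment lookup can be sketched as follows. This is an illustrative sketch only: the residue masses are approximate monoisotopic values, `terminal_mass` is a hypothetical placeholder for whatever end-group mass an internal fragment carries (set to zero here), and the function names are assumptions. The sketch matches base composition only; it does not determine base order.

```python
from itertools import combinations_with_replacement

# Approximate monoisotopic residue masses in Da (illustrative values, not from the source).
RESIDUE_MASS = {"A": 329.05252, "C": 305.04129, "G": 345.04744, "U": 306.02531}
AVERAGE_RESIDUE_MASS = sum(RESIDUE_MASS.values()) / len(RESIDUE_MASS)


def estimate_length(mass):
    """Estimate fragment length by dividing by the average canonical residue mass."""
    return round(mass / AVERAGE_RESIDUE_MASS)


def match_internal_fragment(mass, terminal_mass=0.0, tol_ppm=10.0):
    """Return base compositions (3 to 6 bases) whose theoretical mass matches `mass`."""
    length = estimate_length(mass)
    if not 3 <= length <= 6:
        return []
    hits = []
    for combo in combinations_with_replacement(sorted(RESIDUE_MASS), length):
        theoretical = sum(RESIDUE_MASS[base] for base in combo) + terminal_mass
        if abs(mass - theoretical) / theoretical * 1e6 <= tol_ppm:
            hits.append("".join(combo))
    return hits
```

For example, a mass equal to the sum of the four residue masses is estimated to be four bases long and matches only the composition containing one each of A, C, G, and U.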
In various embodiments, the raw data derived from LC-MS, which contains the m/z data of the desired fragments and/or the undesired fragments bearing more than one cleavage, may be deconvoluted over the entire LC run using Agilent's molecular feature algorithm built into MassHunter™ software, and the deconvoluted data is subsequently used for sequence alignment. Mass adducts can be removed from the deconvoluted data, and the sequences will be predicted/generated using both mass and retention time data. The retention time-coupled m/z data for the fragments is analyzed and classified using a developed support vector machine (SVM) classifier algorithm to determine which data points are “valid” and are to be used for subsequent sequence determination and which data points are to be filtered out. After the data reduction step, the mass difference (m) between two adjacent RNA ladder fragments is calculated [m=m(i)−m(i−1), 1<i<n, n=RNA length], where m(i) is the mass of any ladder fragment and m(i−1) is the preceding lower mass ladder fragment, and such mass differences are matched with the exact masses of known nucleotide fragments using search algorithms designed to correlate the derived RNA sequencing information based on mass differences to determine the identity of canonical nucleotides and their modifications. As long as the structural modification on an RNA nucleoside is mass-altering, the search algorithms and the dynamic programming method together will permit the RNA sequence and its modifications to be identified. In various embodiments, the mass of the known modified ribonucleotides can be conveniently retrieved from known RNA modification database or through use of the table shown in
With reference to
With reference to
Next, at step 804, the system filters the LC-MS data based on mass, the filtering including removing masses smaller than a predetermined size. In various embodiments, the data is filtered based on mass, eliminating masses that are too small to be useful in sequencing.
Next, at step 806, the system sequences the filtered LC-MS data, to generate an RNA sequence. The sequencing includes steps 808 thru 812. At step 808, the system determines whether two adjacent compounds are close together in RT. Next, at step 810, the system determines a mass difference between the two adjacent ladder fragments. In various embodiments, the system may, starting at a random compound, identify a neighboring compound that is close in RT and calculate the mass difference between the two (See
Next, at step 812, the system determines whether the mass difference is equal to at least one of a canonical nucleotide, or a modified nucleotide. In various embodiments, the system determines whether the mass difference matches the mass of one of the four canonical nucleotides: A, U, C, G, or a modified base from a database of over 110 known modified RNA bases. Next, at step 814, the system stores in a memory, as part of a sequencing read, the result as a valid nucleotide based on the determined mass difference.
Next, at step 816, the system determines whether any two adjacent compounds remain in the LC-MS data that will produce a mass difference that yields a valid nucleotide. In various embodiments, the algorithm then continues following the same set of rules for steps 808 thru 812 for finding the next compound, until no more valid compounds can be found or no more compounds can be found that will produce a mass difference that yields a valid canonical nucleotide or modified nucleotide. In various embodiments, the system determines if it is able to read out all of the bases. In various embodiments, if there are any gaps in the sequence, then the algorithm proceeds to an auxiliary step.
In various embodiments, in the auxiliary step the system determines whether there are any remaining compounds that did not yield a valid nucleotide based on the gaps. If there are any gaps, the system performs a hierarchical clustering algorithm on the compounds to identify related mass-adducts. In various embodiments, the hierarchical clustering algorithm includes determining a distance metric based on a mass as well as RT for the compound, and grouping compounds into a cluster of masses, based on their mass relationship, such that each cluster includes possible mass-adducts of a true ladder fragment. In various embodiments, points that have already been sequenced in the previous step, and thus their related mass clusters, will be excluded from the hierarchical clustering step.
In various embodiments, the system then determines the mass of a fragment for each cluster, based on item-wise comparison between the identified mass-adducts and the cluster of masses. In various embodiments, the system then predicts a ladder fragment based on the determined mass for each cluster. In various embodiments, the system then reads-out an RNA sequence based on the predicted ladder fragment, and reports the RNA sequence.
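Steps 804 through 816 can be sketched in Python as a filter followed by a walk up the mass ladder. This is a simplified illustration under stated assumptions: only the four canonical residues are listed (modified nucleotides could be added to the same table), the residue masses are approximate monoisotopic values, the walk starts at the lowest-mass point rather than a random compound so that the result is deterministic, and `min_mass`, `max_rt_gap`, and `tol_da` are hypothetical thresholds.

```python
# Approximate monoisotopic residue masses in Da (illustrative values, not from the source).
RESIDUE_MASS = {"A": 329.05252, "C": 305.04129, "G": 345.04744, "U": 306.02531}


def filter_by_mass(points, min_mass):
    """Step 804: drop (mass, rt) points whose mass is below the threshold."""
    return [point for point in points if point[0] >= min_mass]


def call_base(mass_difference, tol_da=0.005):
    """Step 812: return the nucleotide matching a mass difference, or None."""
    for base, residue in RESIDUE_MASS.items():
        if abs(mass_difference - residue) <= tol_da:
            return base
    return None


def read_ladder(points, min_mass=500.0, max_rt_gap=1.0):
    """Steps 806 to 816: extend a read until no valid next compound remains."""
    ladder = sorted(filter_by_mass(points, min_mass))
    if not ladder:
        return ""
    current, read = ladder[0], []
    while True:
        for candidate in ladder:
            if candidate[0] <= current[0]:
                continue  # only move up the mass ladder
            if abs(candidate[1] - current[1]) > max_rt_gap:
                continue  # step 808: neighbors must be close in RT
            base = call_base(candidate[0] - current[0])  # steps 810 and 812
            if base is not None:
                read.append(base)  # step 814: store the valid nucleotide
                current = candidate
                break
        else:
            return "".join(read)  # step 816: nothing left that yields a nucleotide
```

With points at 1000.0, 1329.05252, 1634.09381, and 1979.14125 Da and closely spaced retention times, the walk reads A, then C, then G, and ignores unrelated points.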
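The auxiliary clustering step can be illustrated with a small pure-Python single-linkage sketch. The choices below are assumptions made for illustration, not the disclosed method: the adduct list (sodium and potassium replacing a proton), the form of the distance metric, the `rt_weight` and `cutoff` values, and the rule that the lowest mass in a cluster is taken as the true ladder fragment mass. Only points not already used in a sequencing read would be passed in.

```python
# Mass shifts in Da for a cation replacing a proton (assumed adduct list for illustration).
ADDUCT_SHIFTS = {"Na": 21.98194, "K": 37.95588}


def distance(p, q, rt_weight=0.01):
    """Distance between two (mass, rt) points: small when they differ by an adduct shift."""
    mass_gap = abs(p[0] - q[0])
    mass_term = min(abs(mass_gap - shift) for shift in ADDUCT_SHIFTS.values())
    return mass_term + rt_weight * abs(p[1] - q[1])


def cluster_adducts(points, cutoff=0.02):
    """Single-linkage agglomerative clustering, stopped at a distance cutoff."""
    clusters = [[point] for point in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cutoff:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


def ladder_mass(cluster):
    """Assumed rule: adducts only add mass, so the lowest mass is the ladder fragment."""
    return min(point[0] for point in cluster)
```

Single linkage is used here so that a fragment and its sodium and potassium adducts end up in one cluster even though the two adducts do not differ from each other by a listed shift.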
Next, at step 818, the system reads-out an RNA sequence based on determining there are no remaining valid nucleotides in the remaining LC-MS data. Next, at step 820, the system reports the RNA sequence. In various embodiments, the system may display on a display the RNA sequence.
In various embodiments, a liquid chromatography-mass spectrometry (herein referred to as LC-MS) based RNA sequencing method may be used to simultaneously determine the nucleotide sequence of a target RNA molecule with single nucleotide resolution, as well as to detect the presence of target RNA modifications. The disclosed method can be used to determine the type, location and quantity of each modification within the target RNA sample. Such techniques can be used advantageously to correlate the biological functions of any given RNA molecule with its associated modifications and for quality control of RNA-based therapeutics.
In various embodiments, the above method 800 of
With reference to
With reference to
In various embodiments, there are at least two reasons to subset the dataset 904 before passing it to the algorithm. The first is to identify the mass ladders needed for sequencing and to eliminate noise data from the dataset. The second is to let the algorithm process a partial dataset rather than the complete dataset. In various embodiments, sub-setting is possible because a hydrophobic tag such as biotin or Cy3 has been experimentally introduced to the RNA to be sequenced. The hydrophobicity of the label causes a significant increase in the RT values of the labeled ladder components when compared to the unlabeled ladder components, and helps all the labeled mass ladder components shift up to the top zone so that the labeled mass ladders can be easily identified in the 2-D mass-RT plot. Here we show the graphical distribution of data points from the test tRNA sequencing (
With reference to
The algorithm performs base calling for all data points until all possible tuples are stored in set V. Note that each tuple in set V represents an individual base-calling possibility.
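Base calling over all data points can be sketched as a pairwise search that fills set V. The tuple layout `(i, j, base, ppm)`, the function name `base_call`, the `tol_ppm` threshold, and the choice to compute PPM against the expected mass of the next ladder fragment are all assumptions made for this illustration; the residue masses are approximate monoisotopic values for the canonical nucleotides only.

```python
# Approximate monoisotopic residue masses in Da (illustrative values, not from the source).
RESIDUE_MASS = {"A": 329.05252, "C": 305.04129, "G": 345.04744, "U": 306.02531}


def base_call(points, tol_ppm=10.0):
    """Return set V of (i, j, base, ppm) tuples for a list of (mass, rt) points.

    Each tuple records one base-calling possibility: point j is heavier than
    point i by the mass of `base`, within the PPM tolerance.
    """
    V = set()
    for i, (mass_i, _) in enumerate(points):
        for j, (mass_j, _) in enumerate(points):
            if mass_j <= mass_i:
                continue
            for base, residue in RESIDUE_MASS.items():
                expected = mass_i + residue
                ppm = (mass_j - expected) / expected * 1e6
                if abs(ppm) <= tol_ppm:
                    V.add((i, j, base, round(ppm, 3)))
    return V
```

Because every pair is examined, a single data point may appear in several tuples, which is what later gives rise to multiple competing draft reads.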
With reference to
In various embodiments, because the outputs from LC-MS contain a huge number of data points, graph G contains the same number of vertices and also a huge number of edges, resulting in a tremendous number of total paths, each representing a draft read. To effectively filter the draft reads for reporting correct sequences, two draft read selection strategies have been developed, namely the global hierarchical ranking strategy 900 and the local best score strategy 1000. Both strategies use the same parameters acquired from the LC-MS dataset to score the draft reads 914, which include PPM, RT, volume, quality score (QS), and read length.
With reference to
With reference to
With reference to
Next, at step 2004, the system accesses a database which includes theoretical masses, calculated from chemical formula, of all known ribonucleotides including those with modifications to the base. Next, at step 2004, the system performs anchor-based sub-setting on the LC-MS data, the anchor-based sub-setting including selecting a data zone.
Next, at step 2006, the system performs base calling on the subset of LC-MS data to generate a dataset of tuples. Next, at step 2008, the system builds trajectories linking tuples in the dataset to generate a draft read of the RNA fragment. In various embodiments, the draft read strategy includes a global hierarchical ranking strategy or a local best score strategy. In various embodiments, building trajectories further includes performing a Depth First Search (DFS) algorithm to ensure that all possible draft reads will be found from the LC-MS data.
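Trajectory building with a depth first search can be sketched as follows. The sketch assumes each base-calling tuple has been reduced to a hypothetical `(i, j, base)` edge from data point i to the heavier data point j, and that a draft read is a maximal path starting at a vertex with no incoming edge; `build_draft_reads` is an illustrative name. Because mass strictly increases along every edge, the graph has no cycles and the search terminates.

```python
from collections import defaultdict


def build_draft_reads(tuples):
    """Enumerate every maximal path through the base-call graph by DFS."""
    edges = defaultdict(list)
    targets = set()
    for i, j, base in sorted(tuples):
        edges[i].append((j, base))
        targets.add(j)
    # Start only from vertices that no edge points to, so each read is maximal.
    starts = [vertex for vertex in sorted(edges) if vertex not in targets]
    reads = []

    def dfs(vertex, bases):
        if vertex not in edges:  # no outgoing edge: the path ends here
            reads.append("".join(bases))
            return
        for next_vertex, base in edges[vertex]:
            bases.append(base)
            dfs(next_vertex, bases)
            bases.pop()  # backtrack before trying the next branch

    for start in starts:
        dfs(start, [])
    return reads


draft_reads = build_draft_reads({(0, 1, "A"), (1, 2, "C"), (1, 3, "U"), (3, 4, "G")})
```

In this example the branch at data point 1 yields two draft reads, one ending after C and one continuing through U and G, which the draft read strategy would then score against each other.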
Next at step 2010, the system performs a draft read strategy. With reference to
In various embodiments, the draft read strategy includes scoring based on at least one of read length, average volume, average QS, or average PPM.
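Scoring and ranking of draft reads can be sketched as a sort on several criteria. The priority order used below (longer read first, then higher average volume, then higher average QS, then smaller absolute average PPM) is an assumption chosen for illustration; the disclosure does not fix that order in this passage. The dictionary keys and the names `score` and `rank_reads` are likewise hypothetical.

```python
from statistics import mean


def score(read):
    """Summarize a draft read given as a list of dicts with 'volume', 'qs', 'ppm'."""
    return {
        "length": len(read),
        "avg_volume": mean(point["volume"] for point in read),
        "avg_qs": mean(point["qs"] for point in read),
        "avg_ppm": sum(point["ppm"] for point in read) / len(read),
    }


def rank_reads(reads):
    """Sort draft reads best-first using an assumed hierarchy of criteria."""
    def key(read):
        s = score(read)
        return (-s["length"], -s["avg_volume"], -s["avg_qs"], abs(s["avg_ppm"]))

    return sorted(reads, key=key)
```

Sorting on a tuple means a later criterion is consulted only when all earlier criteria tie, which is one simple way to realize a hierarchical ranking.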
The systems described herein may also utilize one or more controllers to receive various information and transform the received information to generate an output. The controller may include any type of computing device, computational circuit, or any type of processor or processing circuit capable of executing a series of instructions that are stored in a memory. The controller may include multiple processors and/or multicore central processing units (CPUs) and may include any type of processor, such as a microprocessor, digital signal processor, microcontroller, programmable logic device (PLD), field programmable gate array (FPGA), or the like. The controller may also include a memory to store data and/or instructions that, when executed by the one or more processors, causes the one or more processors to perform one or more methods and/or algorithms.
Any of the herein described methods, programs, algorithms or codes may be contained on one or more machine-readable media or memory. The term “memory” may include a mechanism that provides (for example, stores and/or transmits) information in a form readable by a machine such as a processor, computer, or a digital processing device. For example, a memory may include a read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, or any other volatile or non-volatile memory storage device. Code or instructions contained thereon can be represented by carrier wave signals, infrared signals, digital signals, and by other like signals.
The embodiments disclosed herein are examples of the disclosure and may be embodied in various forms. For instance, although certain embodiments herein are described as separate embodiments, each of the embodiments herein may be combined with one or more of the other embodiments herein. Specific structural and functional details disclosed herein are not to be interpreted as limiting, but as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present disclosure in virtually any appropriately detailed structure.
The phrases “in an embodiment,” “in embodiments,” “in various embodiments,” “in some embodiments,” or “in other embodiments” may each refer to one or more of the same and/or different embodiments in accordance with the present disclosure. A phrase in the form “A or B” means “(A), (B), or (A and B).” A phrase in the form “at least one of A, B, or C” means “(A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).”
It should be understood that the description herein is only illustrative of the present disclosure. Various alternatives and modifications can be devised by those skilled in the art without departing from the disclosure. Accordingly, the present disclosure is intended to embrace all such alternatives, modifications and variances. The embodiments described are presented only to demonstrate certain examples of the disclosure. Other elements, steps, methods, and techniques that are insubstantially different from those described above and/or in the appended claims are also intended to be within the scope of the present disclosure.
This application claims benefit and priority to U.S. Provisional Application No. 62/676,754, filed May 25, 2018, which is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/033895 | 5/24/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62676754 | May 2018 | US |