When sequencing is used to analyze the composition of nucleic acid mixtures with a large dynamic range of concentrations of individual components, the reliability of the results differs significantly between abundant and rare components. This is a common problem in the study of transcriptomes and in the analysis of biodiversity by sequencing of environmental and clinical samples. We suggest a method of analysis which allows the reliability of results to be adjusted individually for each component of the nucleic acid mixture in a highly reproducible manner: Controllable Oligonucleotide-Based Ratio Adjustment (COBRA). The method is based on using locus-specific oligonucleotides to change the relative abundance of individual components of the nucleic acid mixture before sequencing.
The method is especially useful for routine analysis of biodiversity and routine expression profiling, for example in clinical studies.
RNA-Seq (RNA Sequencing) is a hypothesis-free approach for studying the transcriptome by sequencing millions of cDNA fragments. The abundance of cDNA fragments matches the abundance of the corresponding transcript. The sequencing results make it possible to retrieve information about the abundance and structure of transcripts.
RNA-Seq is complicated by two problems:
Only a portion of the reads mapped to similar transcripts can be used to characterize the expression levels of individual homologues: namely those reads which overlap sites that differ between the homologues. The other reads can be used only to characterize the cumulative expression level.
Usually only a part of the RNA-Seq library is sequenced. The concentration of abundant transcripts is determined with excessive reliability, but the concentration of rare transcripts with insufficient reliability. Sequencing the rest of the library would improve the reliability of concentration measurement for rare transcripts. However, only a small part of the additional sequencing reads would correspond to rare transcripts; most of them would correspond to abundant transcripts.
It would be more attractive to reduce the number of sequencing reads corresponding to abundant transcripts (which are analyzed with excessive reliability). In this case more reads would correspond to rare transcripts, and the reliability of analysis of rare transcripts would increase.
COBRA-Approach
In this invention we propose changing the way massively parallel sequencing is used for the analysis of mixtures containing different nucleic acids, in particular for the determination of concentrations of individual components.
Currently, a sequencing library is prepared from the mixture under analysis in such a way that the relative abundances of the individual components in the library match as closely as possible the abundances of the corresponding components in the mixture under analysis. Thus, when sequencing reveals the abundances of the components of the sequencing library, it also determines the abundances of the components in the mixture under analysis. The problem is that the reliability of results differs significantly between abundant and rare components.
We suggest preparing sequencing libraries in which the abundances of individual components are selectively and controllably modified.
The idea is to controllably and reproducibly modify the abundances of some components of the mixture before sequencing: to decrease the abundances of those components which are analyzed with excessive reliability and/or to increase the abundances of those components which are analyzed with insufficient reliability. As a result, the desired accuracy of concentration measurement (for all analyzed components) would be achieved with fewer sequencing reads than in sequencing without preliminary abundance modification.
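The saving in sequencing reads can be illustrated with a minimal sketch. It assumes, purely for illustration, that the expected number of reads for a component is proportional to its share of the library, and that the required accuracy is expressed as a minimum expected read count for the rarest component. The function name, the component names and the numbers are hypothetical and do not come from any protocol described here.

```python
import math


def reads_needed(abundances, factors, min_reads):
    """Total reads required so that the rarest library component
    is expected to receive at least `min_reads` reads.

    abundances: relative abundances in the original mixture (hypothetical values)
    factors:    abundance change factor per component (1.0 if not listed)
    """
    # Abundance of each component in the library after modification.
    adjusted = {name: value * factors.get(name, 1.0)
                for name, value in abundances.items()}
    total = sum(adjusted.values())
    rarest = min(adjusted.values())
    # Expected reads for the rarest component = total_reads * rarest / total.
    return math.ceil(min_reads * total / rarest)


# Hypothetical mixture spanning five orders of magnitude.
abundances = {"abundant": 100_000.0, "medium": 1_000.0, "rare": 1.0}

no_change = reads_needed(abundances, {}, min_reads=100)
with_cobra = reads_needed(abundances, {"abundant": 0.01}, min_reads=100)

print(no_change, with_cobra)
```

In this toy case, suppressing only the most abundant component a hundredfold reduces the required sequencing depth by well over an order of magnitude, while the rare component still reaches the same expected read count.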
Locus-specific oligonucleotides make it possible to affect individual components of a nucleic acid mixture independently. Once individual components can be addressed, a number of molecular biology techniques can be applied to vary the efficiency with which molecules of the analyzed mixture are converted into molecules of the sequencing library.
In this application we describe three methods for reproducible and predictable regulation of the abundance of sequencing library molecules corresponding to different components of a nucleic acid mixture:
It is quite possible that there are other methodological solutions for the COBRA approach. But even these three approaches and their combinations provide a variety of protocols for the preparation of COBRA sequencing libraries.
The present invention refers in particular to a method for analysis of concentrations of components of nucleic acid mixtures by sequencing, wherein relative abundances of at least two components for which concentrations should be measured are changed before sequencing in a reproducible way using locus-specific oligonucleotides and wherein said change of abundances comprises the following steps:
Within the methods of the present invention the analysis of concentrations of components of nucleic acid mixtures with changed abundance by sequencing takes place subsequently to step ii). Thus, the present invention refers to a method for analysis of concentrations of components of nucleic acid mixtures by sequencing, wherein relative abundances of at least two components for which concentrations should be measured are changed before sequencing in a reproducible way using locus-specific oligonucleotides and wherein said method comprises the following steps:
Within the inventive method it is preferred that the relative abundances of components corresponding to the components selected on step i) are changed on step ii) by
Preferred are methods according to the present invention, wherein relative abundances of components selected on step i) are changed in such a way, that the dynamic range of concentrations of components under analysis in the subsequent nucleic acid mixture is lower than the dynamic range of concentrations of components under analysis in the original mixture containing nucleic acids or in a way which decreases the abundance of components for which concentration without change of abundances is measured with excessive accuracy and/or increases the abundance of components for which it is desirable to increase the accuracy of concentration measurement if compared with measurement of concentration without change of abundances.
The present invention refers further to a method of analysis of concentrations of nucleic acid components in mixtures containing nucleic acids, comprising the following steps:
An alternative formulation of this method of the present invention is:
To determine the concentrations of components in the original nucleic acid (NA) mixture, their concentrations in the sequencing library should be divided by the corresponding abundance change factors. Thus it is possible to compare not only experiments of the same series with each other, but also experiments performed by different people using different COBRA-based protocols.
Because the relative abundances of the at least two components for which concentrations should be measured are changed in a reproducible and preferably also predictable way, it is possible to calculate the concentration of each component in the original mixture by dividing by the corresponding abundance change factor. Preferred are methods according to the invention wherein relative concentrations of components under analysis in the original nucleic acid mixture are calculated by dividing the results obtained after the change of abundances by the corresponding abundance change factors.
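The back-calculation described above can be sketched as follows. This is only an illustration: the function name, the gene names, the read counts and the factors are hypothetical, and components without a listed factor are assumed to have an abundance change factor of 1.

```python
def original_relative_abundances(library_counts, change_factors):
    """Convert read counts from a COBRA library into relative abundances
    in the original mixture by dividing by the abundance change factors."""
    # Undo the abundance modification component by component.
    corrected = {name: count / change_factors.get(name, 1.0)
                 for name, count in library_counts.items()}
    # Renormalize so that the relative abundances sum to 1.
    total = sum(corrected.values())
    return {name: value / total for name, value in corrected.items()}


# Hypothetical read counts obtained from a COBRA library.
library_counts = {"geneA": 500, "geneB": 400, "geneC": 100}
# Hypothetical factors: geneA suppressed 100-fold, geneB 10-fold, geneC unchanged.
change_factors = {"geneA": 0.01, "geneB": 0.1}

original = original_relative_abundances(library_counts, change_factors)
for name, share in original.items():
    print(f"{name}: {share:.4f}")
```

Although geneA and geneB give similar read counts in the library, the corrected values show that geneA was far more abundant in the original mixture.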
Locus-specific oligonucleotides make it possible not only to affect individual components of a mixture of nucleic acids but also to select certain parts of these components for sequencing, so that regions which are difficult to analyze are avoided. For expression profiling, locus-specific oligonucleotides make it possible to select only non-repetitive regions of genes for sequencing. For analysis of biodiversity it is preferred to exclude evolutionarily conserved regions from the sequencing library.
The use of locus-specific oligonucleotides makes it possible to combine the selectivity of microarrays with the accuracy and sensitivity of massively parallel sequencing. As in microarray technologies, the COBRA procedure requires hundreds or thousands of locus-specific oligonucleotides. That is why the COBRA procedure may not be relevant for the preparation of single libraries. But for massive screening or for routine analyses, a large set of locus-specific oligonucleotides is not a big inconvenience, because such a set has to be prepared only once.
Besides, for many applications the COBRA oligonucleotide set is determined mainly by the type of tissue under analysis, because the particular type of tissue defines which genes are over-expressed and consequently over-sequenced. In clinical analyses only a few types of human tissues are easily available (such as blood, saliva, buccal cells, sperm). For each of these tissues, appropriate locus-specific COBRA oligonucleotides may be designed.
Practical Implementation
Although we propose to use a new type of library (with altered abundances of individual components) for the analysis of nucleic acid mixtures, this does not mean that new molecular methods are needed. Already known and proven approaches can be adapted for COBRA. Two issues must be addressed for the adaptation:
Locus-specific oligonucleotides are widely used in biomedicine. They allow specifically targeting components with definite known nucleotide sequences in complex mixtures of nucleic acids. Specificity of targeting is based on specificity of hybridization of nucleic acids: the most stable hybrid is formed with perfectly matched sequences. Locus-specific oligonucleotides provide specificity of many types of molecular biology reactions:
All these methods are associated with some background because of unspecific hybridization. Unspecific hybridization may appear because of repetitive regions of the genome. Besides, some completely unique sequences may interact too strongly with imperfectly matched sequences. But for all the mentioned procedures and for most non-repetitive regions, a person skilled in the art is able to select locus-specific oligonucleotides which provide an acceptable background level. When results are analyzed by sequencing, a significant part of the non-specific products may be eliminated at the analysis stage, for example because the extension reaction results in a wrong nucleotide sequence or because an incorrect primer combination appeared as a result of ligation.
The term “locus-specific oligonucleotides” or “site-specific oligonucleotides” as used herein refers to short, chemically synthesized nucleic acids complementary to the sequence of a site in the component of the nucleic acid mixture. The locus-specific oligonucleotides hybridize in a sequence-specific manner to a specified locus, portion or region of a selected component of the nucleic acid mixture. Therefore the locus-specific oligonucleotides can be used to determine the locus, region or fragment of the selected component of the nucleic acid mixture. The locus, region or fragment is determined to be targeted by a subsequent enzymatic reaction such as amplification or sequencing. Locus-specific oligonucleotides may be, for example, primers serving as a starting point for DNA synthesis (e.g. during PCR), or probes or oligonucleotides for hybridization or ligation reactions.
If a library preparation method already uses locus-specific oligonucleotides, it is possible to use those oligonucleotides for regulation of the abundances of the corresponding sequencing library molecules. For example, Illumina TruSeq™ Targeted RNA Expression Kits are based on extension/ligation of locus-specific oligonucleotides on cDNA. These oligonucleotides can be used as an instrument for abundance regulation.
If there are no locus-specific oligonucleotides in the protocol, it is possible to introduce them at some stage. The classic protocol for preparing RNA-Seq libraries does not involve any locus-specific oligonucleotides. But they may be included in the protocol, for example, in the following way:
The following paragraphs describe the second issue necessary for implementation of COBRA-libraries, namely procedures for reproducible and predictable modification of the abundances. Three approaches with easily predictable abundance change factors are described in detail: (i) using a different number of loci per transcript; (ii) using different library-preparation protocols for different groups of loci; (iii) using a mixture of “functional” and “blocked” locus-specific oligonucleotides. In addition, approaches are outlined for which it is difficult to predict the abundance change factors in advance, but which can provide a reproducible change of abundance.
Using a method according to the invention a subsequent nucleic acid mixture is created which is preferably selected from the group comprising or consisting of: sequencing library, set of ligated locus-specific oligonucleotides, set of locus-specific oligonucleotides extended in a template-dependent reaction, set of fluorescently labeled molecules, nucleic acid molecules selected with the help of hybridization with locus-specific oligonucleotides.
Number of Detectable Loci per Transcript
If not one but several detectable loci or sites (preferably, located in a way that they do not compete with each other during library preparation) are selected for a certain component of the nucleic acid mixture, the number of sequencing reads matching this component would increase proportionately. This will increase the reliability of concentration measurement of the component.
Selecting a different number of detectable loci to regulate the abundance of the corresponding molecules in the sequencing library has certain advantages and disadvantages.
Advantages:
Disadvantages:
Combining Loci in “Change of Abundance” Groups
If it is not necessary to provide a precise value of the abundance change factor for each selected component, loci with similar required adjustment levels may be combined in groups. Then the COBRA-library may be planned as follows:
a) select the desired abundance change factor for each locus;
b) combine loci with similar abundance change factors into groups and choose a common factor for each group;
c) select for each group of loci a library preparation protocol with the required abundance change factor value.
Groupwise regulation of the relative abundances allows the dynamic range of concentrations to be reduced. For example, one can combine transcripts into three groups, “without suppression”, “10× suppression” and “100× suppression”, according to their expression level; then the dynamic range is reduced from five to three orders of magnitude.
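The reduction of the dynamic range by groupwise suppression can be sketched as follows. The group boundaries, the abundance values and all names are assumptions chosen only to reproduce the five-to-three orders of magnitude example; a real plan would derive the groups from measured expression levels.

```python
import math

# Hypothetical group boundaries: (lower abundance threshold, abundance change factor).
# Checked from the highest threshold down; anything below all thresholds is not suppressed.
GROUP_FACTORS = [
    (1e4, 0.01),  # "100x suppression" group
    (1e3, 0.1),   # "10x suppression" group
]


def group_factor(abundance):
    """Return the common abundance change factor of the group a locus falls into."""
    for threshold, factor in GROUP_FACTORS:
        if abundance >= threshold:
            return factor
    return 1.0  # "without suppression" group


def dynamic_range_orders(values):
    """Dynamic range expressed in orders of magnitude (log10 of max/min)."""
    return math.log10(max(values) / min(values))


# Hypothetical relative abundances spanning five orders of magnitude.
abundances = [1, 10, 100, 1_000, 10_000, 100_000]
adjusted = [a * group_factor(a) for a in abundances]

print(round(dynamic_range_orders(abundances), 2))
print(round(dynamic_range_orders(adjusted), 2))
```

Because every locus in a group shares one factor, the later conversion back to original concentrations needs only one factor per group.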
Locus-specific oligonucleotides corresponding to different adjustment levels (and participating in different protocols) should be somehow grouped. This can be done in two ways:
Spatial isolation of locus-specific oligonucleotides makes it possible to perform spatially isolated reactions. Library preparation reactions corresponding to different adjustment level groups may be completely independent from each other or may differ only at a certain stage. Independent preparation of libraries for loci with different adjustment levels gives full freedom in choosing the protocol (different principles, different enzymes), but requires more labor and can lead to unstable results when expression levels of genes from different adjustment level groups are compared. Minimizing the number of differing stages decreases labor costs and makes the comparison of abundances of different components more reproducible. A spatially separated stage can be introduced at any point of the library preparation protocol:
The reaction conditions would be as similar as possible, if locus-specific oligonucleotides for different groups are added subsequently to the same reaction (see Examples 5 and 6).
Markers of abundance level correction introduced into the locus-specific oligonucleotides make it possible to minimize differences in the reaction conditions and even to synthesize a sequencing library for all groups together. There is a variety of experimental realizations of using marker regions for abundance level correction, ranging from primitive ones such as “divide the mixture into fractions by hybridization with a marker region and then take the appropriate part of the volume of each fraction” to sophisticated methods such as marker-specific PCR with a different number of cycles for different markers (see the next paragraph).
Groupwise abundance level correction has certain advantages and disadvantages. Advantages are:
Disadvantages are:
One aspect of the present invention is that the relative abundances of components corresponding to the components selected on step i) are changed on step ii) by using for these components differing reaction conditions. Thereby it is preferred that said differences in reaction conditions are selected from the group consisting of or comprising: different amounts of original mixture containing nucleic acids used in reactions; different number of cycles in cyclic amplification reactions; different reaction times in linear amplification reactions. Implementation of different reaction conditions may comprise grouping of several components selected in step i) according to similar abundance change factor.
Functional and Blocked Locus-Specific Oligonucleotides
In fact, any oligonucleotide which competes with the locus-specific oligonucleotides suppresses the reaction. But when blocked and functional locus-specific oligonucleotides have the same nucleotide sequence, the degree of suppression is easily predictable: it is determined only by the ratio of concentrations of functional to blocked locus-specific oligonucleotides and does not depend on the reaction conditions (temperature, time, buffer, etc.). Thus it is preferred that the functional and blocked locus-specific oligonucleotides specific for a certain locus or site of a component have an identical sequence.
The ratio of functional to blocked oligonucleotides can be selected independently for each locus of the selected nucleic acid component. As a result, the efficiency of conversion of original molecules into molecules of the library can be tuned independently for each locus.
Different blocking approaches may be used for locus-specific oligonucleotide-dependent reactions:
According to the invention it is preferred that the relative abundances of components corresponding to the components selected on step i) are changed on step ii) using for these components mixtures of functional and blocked locus-specific oligonucleotides with differing ratios of said “functional to blocked” locus-specific oligonucleotides. Thereby it is further preferred that the functional locus-specific oligonucleotides can be elongated, while the blocked locus-specific oligonucleotides, because they have a 3′ end modification, cannot be elongated in a primer extension reaction, or in a first-strand synthesis reaction, or in a second-strand synthesis reaction, or in PCR, or in a gap-filling reaction.
One further aspect of the present invention relates to methods wherein the functional oligonucleotides can participate, while the blocked oligonucleotides, because they have 3′ or 5′ end modifications, cannot participate in the ligation steps of a ligation detection reaction, or in a gap-filling reaction, or in LCR, or in DANSR.
One further aspect of the present invention relates to methods wherein functional and/or blocked locus-specific oligonucleotides have markers. These markers allow the separation of subsequent molecules containing the functional locus-specific oligonucleotides or their markers from subsequent molecules containing the blocked locus-specific oligonucleotides or their markers. Such subsequent molecules can, for example, be hybrids of functional locus-specific oligonucleotides with target nucleic acid components or products of reactions involving functional oligonucleotides and, respectively, hybrids of blocked locus-specific oligonucleotides with target nucleic acid components or products of reactions involving blocked locus-specific oligonucleotides.
Functional and blocked locus-specific oligonucleotides that produce separable reaction products make it possible to work both with a suppressed sequencing library (after removal of the sequencing library molecules synthesized by using the “blocked” primers) and with a non-suppressed sequencing library (without separation of the sequencing library molecules synthesized by using the “blocked” primers). Besides, enzymatic reactions are provided with a high concentration of substrate at all stages of library preparation (some enzymes do not work well with the substrate at low concentrations).
Different approaches make it possible to separate the reaction products obtained from functional and blocked locus-specific oligonucleotides, for example:
Therefore within the methods of the present invention it is preferred that functional and/or blocked locus-specific oligonucleotides have markers selected from the group comprising or consisting of:
Advantages of COBRA methods based on using a mixture of functional and blocked locus-specific oligonucleotides are:
Disadvantages are:
The “abundance change factor” is introduced to characterize the amount of change of the relative abundance of an individual component of the nucleic acid mixture. It is calculated by dividing the relative abundance of this component after the change of abundances by the relative abundance of the component before the change of abundances. Thus, if 80% of the copies of one particular component in the nucleic acid mixture are blocked because the ratio of functional to blocked locus-specific oligonucleotides is 1:4, the abundance change factor for this component is 0.2. The abundance change factor for a component is 1 if the relative abundance of this component did not change.
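For functional and blocked oligonucleotides with identical sequences, the relationship between the mixing ratio and the abundance change factor can be written down directly. The sketch below assumes that functional and blocked oligonucleotides compete equally for the template, so that the factor equals the functional fraction of the oligonucleotide mixture; the function names are hypothetical.

```python
def abundance_change_factor(functional_parts, blocked_parts):
    """Abundance change factor for a given functional:blocked mixing ratio,
    assuming both oligonucleotide types bind the template equally well."""
    total = functional_parts + blocked_parts
    if functional_parts < 0 or blocked_parts < 0 or total == 0:
        raise ValueError("parts must be non-negative and not both zero")
    return functional_parts / total


def blocked_parts_for_factor(target_factor, functional_parts=1.0):
    """Blocked parts to mix with `functional_parts` to reach `target_factor`."""
    if not 0 < target_factor <= 1:
        raise ValueError("target_factor must be in (0, 1]")
    return functional_parts * (1 - target_factor) / target_factor


# A 1:4 functional-to-blocked ratio gives 0.2, as in the example in the text.
print(abundance_change_factor(1, 4))
# Parts of blocked oligonucleotide per part of functional one for 10x suppression.
print(blocked_parts_for_factor(0.1))
```

The second function is the inverse of the first and may be convenient when a desired suppression level is chosen before the oligonucleotide mixture is prepared.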
Other Approaches for Regulation of Abundance Using Locus-Specific Oligonucleotides
In the three approaches described above the abundance change factor is known in advance. For example, when two detectable loci are selected instead of one for a certain component of the nucleic acid mixture, the abundance of this component in the sequencing library doubles and the abundance change factor is 2. For a 1:1 mixture of functional and blocked locus-specific oligonucleotides, the abundance of the corresponding component in the sequencing library is halved and the abundance change factor is 0.5. This feature (predictability of the abundance change factor) is convenient, but not obligatory for the preparation of libraries with modified abundances of components.
It is possible to use techniques for changing the abundance of components for which the abundance change factor is difficult to predict theoretically but can be determined experimentally. The main thing is that the abundance change factors remain the same in different experiments. If necessary, the values of the abundance change factors may be determined in a control experiment. Below we describe some examples of such techniques.
Functional and Blocked Locus-Specific Oligonucleotides with Differing Nucleotide Sequences.
When the nucleotide sequences of functional and blocked locus-specific oligonucleotides are identical, the abundance change factor depends only on the ratio of concentrations of functional and blocked locus-specific oligonucleotides and remains the same under any experimental conditions. Blocked locus-specific oligonucleotides with a length or nucleotide sequence that differs from that of the functional ones would still suppress the conversion of components of the analyzed mixture into library molecules, but the suppression rate would depend to some extent on the reaction conditions (temperature, buffer, etc.). Nevertheless, provided that standard conditions are used, it may be possible to preserve the same abundance change factors in different experiments. Thus, functional and blocked locus-specific oligonucleotides with different sequences can also be used for the preparation of COBRA-libraries.
Locus-Specific Oligonucleotides with Impaired Hybridization Properties.
It is possible to change the nucleotide sequence of locus-specific oligonucleotides (by nucleotide substitutions or by changing the length) in order to weaken the binding of the oligonucleotides to the template and thus to suppress the conversion of the corresponding components of the analyzed mixture into library molecules. The suppression level is hardly predictable, but it may be determined in a control experiment.
Change of Concentration of Locus-Specific Oligonucleotides
The influence of the concentration of locus-specific oligonucleotides on the efficiency of conversion of components of the analyzed mixture into library molecules is nonlinear and difficult to predict. But from general considerations it is clear that decreasing the concentration would at some point lead to suppression of the conversion of components of the analyzed mixture into library molecules. The suppression level can be established in control experiments.
It is possible to use a combination of abundance change methods.
Therefore the present invention refers to a kit suitable for analysis of concentrations of nucleic acid according to any one of claims 1-13, which produces from an original mixture containing nucleic acids a subsequent nucleic acid mixture, wherein the abundance of a definite set of components is decreased in a reproducible manner using functional and blocked locus-specific oligonucleotide sets.
Discussion
Sequencing is one of the most powerful methods of analysis of nucleic acid mixtures. The method makes it possible to identify the composition of nucleic acid mixtures and to determine the concentrations of individual components. In this case the sequencer is used not for revealing unknown nucleotide sequences, but for recognizing known molecules. Analysis of concentrations of components of nucleic acid mixtures by sequencing is widely used for studies of biodiversity and for expression profiling in medicine, veterinary medicine, agriculture, and ecological studies.
Expression profiling is used for analysis of mixtures of RNA molecules: which molecules are present in the mixture and in what proportion. Sequencers cannot read RNA molecules directly. First, RNA molecules have to be converted into sequencing library molecules. Depending on the method, different parts of RNA molecules are converted into sequencing library molecules (DNA): random fragments of RNA molecules (RNA-Seq method), terminal regions of RNA molecules (5′- or 3′-terminal regions), or specifically selected internal fragments of RNA molecules (e.g. Illumina TruSeq™ Targeted RNA Expression Kits). Sequencing libraries may contain a very large number of molecules. The entire library or some portion of the library is sequenced. Usually not the full-length library molecule but just a part of it is sequenced (depending on the type and operation mode of a sequencer).
Some effort is required to obtain information about the composition of the mixture and the concentrations of its components from a set of sequencing reads. Each read should be associated with the corresponding transcript. More reads are associated with highly expressed transcripts, fewer reads with weakly expressed ones. The frequency of read occurrence is directly proportional to the abundance of the corresponding transcript.
Sequencing provides relative abundances of transcripts (usually referred to as the number of a certain type of transcript per million RNA molecules). Additional work is required to determine the absolute number of transcripts per cell. Analysis of the mixture by sequencing is very sensitive and specific. Even a single rare molecule in the initial mixture has a chance to be sequenced, and accurate sequencing would leave no doubt that the transcript is exactly identified. Very similar isoforms can be distinguished by sequencing the regions in which they differ.
In practice, there may be problems both with the identification of molecules and with calculation of their abundances. For accurate identification of molecules it is necessary to know the nucleotide sequences of possible transcripts. Inaccurate description of the transcriptome in the database will cause problems with identification of sequencing reads. Reads corresponding to the repetitive regions cannot be unambiguously ascribed to certain transcripts. Recognition of transcripts from organisms with large genomes requires analysis of large volumes of data, use of powerful computers and complex algorithms.
The main problem in determining abundances is rare transcripts. In principle, concentration analysis by sequencing is a scalable method. The greater the total number of reads, the more accurately rare transcripts will be analyzed. The problem is that the bulk of the additional reads would correspond to common transcripts for which the abundances are already determined with sufficient accuracy.
Another problem is that the abundances of transcripts are distorted in the course of RNA isolation and library preparation. This may be due to the different efficiency of isolation of long and short RNA molecules, to the different conversion efficiency of RNA molecules into library molecules (5′-regions are converted into cDNA less effectively than 3′-regions) or to different efficiencies of amplification during library preparation (amplification depends on GC-composition, the presence of palindromes, etc.). For proper evaluation of abundances it is also necessary to consider that longer transcripts give more library molecules than shorter ones, unique sites result in more recognizable library molecules than areas with repeats, areas of RNA with secondary structure result in fewer library molecules than areas without it, and so on. Not all of these factors are taken into account in practice, and abundances of transcripts are systematically over- or underrepresented. In most cases this is not a problem, since researchers are interested not in the absolute values of abundances, but in how changes in transcription level correlate with various biomedical effects. For example, how gene expression levels change in tumor tissue compared to healthy tissue, or how gene expression levels change in an ill patient compared to healthy persons. Accordingly, not absolute but relative abundances are normally of interest: the ratios of expression levels in the sample to the expression levels in the control.
The emergence of new generations of sequencing technologies has significantly reduced the sequencing price per nucleotide, but has not changed the fact that the bulk of the funds in massive screenings is still spent on sequencing itself. Introduction of the COBRA approach may improve the sequencing efficiency in routine clinical and environmental analyses and in research studies.
During routine clinical and environmental analyses part of the sequencing data is useless, such as:
COBRA approaches with positive selection (wherein the sequencing library contains only components corresponding to locus-specific oligonucleotides, because all other nucleic acid components of the original mixture are lost) make it possible to solve these problems and provide the following advantages:
Besides, positive selection makes it possible to get rid of ribosomal RNA, which is especially convenient for the analysis of bacterial transcription, where polyA+ selection cannot be applied.
As a result, useless sequencing results will be eliminated and it will be possible to achieve the same accuracy of concentration measurements with a smaller total number of sequencing reads.
Therefore within the present invention one preferred aspect relates to methods wherein the subsequent nucleic acid mixture is created by positive selection with locus-specific oligonucleotides and contains only components corresponding to locus-specific oligonucleotides, while all other nucleic acid components of the original mixture are removed.
An essential requirement for research studies is the hypothesis-free nature of the analysis so that information about all components of the mixture should be obtained. The sources of useless sequencing results in research studies are:
Positive selection cannot be applied in research studies, where it is not known in advance which components of the mixture are important. It is possible, however, to apply negative COBRA selection, in which locus-specific oligonucleotides are used to reduce the number of unwanted nucleic acid components while the sequencing library preserves all components which have no corresponding locus-specific oligonucleotides.
Negative COBRA selection has the following advantages:
Therefore within the present invention another preferred aspect are methods wherein subsequent nucleic acid mixture is created by negative selection with locus-specific oligonucleotides and in the subsequent nucleic acid mixture relative abundances are changed only for components corresponding to locus-specific oligonucleotides.
The goal of changing abundances within the inventive methods is not to bring all components to the same concentrations. The idea of the COBRA approach is to let the researcher choose the reliability of abundance measurement depending on the experimental goal and on the properties of the biological system under study. Different nucleic acid components may:
As already noted in the discussion of COBRA-techniques for which the abundance change factor is hardly predictable, knowing the abundance change factor in advance is convenient but not necessary. What is important is that abundance change factors remain the same in different experiments; this is what is meant by the term “reproducible”. That is, the abundance change factor for each selected nucleic acid component can be reproduced, either by the same researcher or by someone else working independently (in distinct experimental trials), by following the same experimental description and procedure. The exact values of abundance change factors may be measured in a control experiment.
Abundance change factors are required to convert concentrations of components in the subsequent nucleic acid mixture, namely the COBRA-library, into the concentrations of the corresponding components in the analyzed, original mixture. It is worth noting that for some tasks it is enough to know only the concentrations of components in the COBRA-library, so that abundance change factors are not needed. For example:
Besides, as already mentioned, in most cases researchers are interested in relative rather than absolute abundances.
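The conversion from COBRA-library concentrations back to the original mixture can be sketched as follows. This is a minimal illustration, not a prescribed procedure: it assumes that the abundance change factor of a component is defined as its abundance in the COBRA-library divided by its abundance in the original mixture, and the function name, component names and numbers are hypothetical.

```python
def restore_original_abundances(cobra_counts, change_factors):
    """Estimate relative abundances in the original mixture from COBRA-library counts.

    cobra_counts   -- reads per component observed in the COBRA-library
    change_factors -- ASSUMED definition: abundance in the COBRA-library divided
                      by abundance in the original mixture, as measured in a
                      control experiment; unlisted components default to 1.0
    """
    # Undo the adjustment for every component.
    corrected = {
        name: count / change_factors.get(name, 1.0)
        for name, count in cobra_counts.items()
    }
    # Renormalize so that the estimated fractions sum to one.
    total = sum(corrected.values())
    return {name: value / total for name, value in corrected.items()}


# Hypothetical numbers: rRNA was suppressed 100-fold, the two genes were untouched.
cobra_counts = {"rRNA": 500, "gene_A": 300, "gene_B": 200}
change_factors = {"rRNA": 0.01}
original = restore_original_abundances(cobra_counts, change_factors)
```

Components without a locus-specific oligonucleotide keep their ratios to one another (here gene_A to gene_B), so for them the COBRA-library counts can be compared directly.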
Useless Sequencing Reads
Over the past two decades, analysis of individual components of nucleic acid mixtures (Northern blotting, RT-PCR, digital PCR) has increasingly been replaced by massive analysis of all or substantially all components of a mixture, for example for expression profiling, analysis of biodiversity, etc. Currently such massive analysis is most often performed using microarrays or high-throughput sequencing machines.
The inventors have noticed that microarrays and high-throughput sequencers respond differently to changes in the composition of the analyzed nucleic acid mixture. If some components were removed from the analyzed mixture, this would have practically no effect on the analysis of the other components on a microarray. In contrast, in massively parallel sequencing the removal of some component would give the other components more sequencing reads. Similarly, if some additional component were added to the analyzed mixture, it would not affect a microarray assay but would hurt massively parallel sequencing, because this component would “occupy” some of the sequencing reads.
So, unlike microarrays, the efficiency of massively parallel sequencing may be improved by excluding useless components from the analyzed mixture of nucleic acids. Useless components are those (i) which are completely uninteresting for the researcher, or (ii) which are overrepresented in the mixture. In the first case it would be desirable to remove the components from the mixture completely; in the second case it would be desirable to decrease their abundances. We found that a controllable change of abundance may be accomplished by relatively simple molecular biology procedures.
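The effect of an overrepresented component on the reads available to the others can be illustrated with a simple proportional model. The sketch below assumes that a fixed number of reads is distributed among components in proportion to their abundance in the library; the mixture composition and the 100-fold reduction are hypothetical numbers chosen for illustration.

```python
def expected_reads(abundances, total_reads):
    """Distribute a fixed read budget in proportion to component abundance."""
    total = sum(abundances.values())
    return {name: total_reads * value / total for name, value in abundances.items()}


# Hypothetical mixture (arbitrary abundance units).
mixture = {"rRNA": 90.0, "abundant_mRNA": 9.0, "rare_mRNA": 1.0}
before = expected_reads(mixture, 1_000_000)

# The same mixture after a 100-fold reduction of the overrepresented component.
adjusted = dict(mixture, rRNA=mixture["rRNA"] / 100)
after = expected_reads(adjusted, 1_000_000)

# Reads gained by the rare component for the same sequencing effort.
gain = after["rare_mRNA"] / before["rare_mRNA"]
```

In this model the rare component receives roughly nine times more reads after the adjustment, while rRNA is still represented in the library and can still be quantified.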
Although the controllable change of abundance may be accomplished by relatively simple molecular biology procedures, this method has never been used to analyze the concentrations of components of nucleic acid mixtures. The generally accepted strategy was either to preserve the composition of the mixture as accurately as possible or to remove some components completely. A good example is ribosomal RNA. Although ribosomal RNA makes up most of the cellular RNA, analysis of rRNA concentrations is almost never carried out; instead, rRNA is discarded from the analysis. At the same time, it is known that rRNA content is not constant and might be important in some biological processes or serve as a diagnostic marker. rRNA could remain in the analysis if its abundance were reproducibly reduced to some acceptable level, which is possible using the inventive methods. According to the present invention it is preferred to change the relative abundance of a component in a controllable manner instead of eliminating it completely from the analyzed nucleic acid mixture.
Useless reads also occur when sequencers are used for other biomedical applications. Sequencing machines of the previous generation (Sanger sequencing) were used for the construction of EST libraries. To catch rare transcripts it was necessary to repeatedly sequence clones corresponding to abundant transcripts (useless reads). To solve the problem, it was proposed to use normalized libraries. In a normalized DNA library all DNAs are represented at comparable frequencies. During the preparation of such libraries no information about the concentration of an individual molecule in the original mixture is preserved. There are several protocols for the preparation of normalized libraries based on the dependence of the rehybridization rate of nucleic acids on concentration. Attempts have been made to use normalized libraries for comparison of expression profiles. However, this approach is not widely used because of a number of drawbacks:
Useless reads may appear when sequencing machines are used for sequencing or resequencing of genomic DNA.
Each region in a genome should be read a certain number of times (sequencing coverage). Insufficient coverage is unacceptable because it would lead to inaccurate results. If for a particular genomic region excessive (relative to a required coverage) reads are generated, they will be useless.
Useless reads may occur due to errors in sequencing planning, if the total number of reads is too large for the size of a particular genome.
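The planning arithmetic behind this statement can be sketched as follows. The example assumes reads of a fixed length distributed uniformly over the genome, which is a simplification; the function names and numbers are hypothetical.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average number of times each position is read (uniform-coverage assumption)."""
    return n_reads * read_length / genome_size


def reads_needed(required_coverage, genome_size, read_length):
    """Smallest number of reads that reaches the required mean coverage."""
    return math.ceil(required_coverage * genome_size / read_length)


def excess_reads(n_reads, required_coverage, genome_size, read_length):
    """Reads generated beyond the required coverage, i.e. useless reads."""
    return max(0, n_reads - reads_needed(required_coverage, genome_size, read_length))


# Hypothetical run: 5 Mb genome, 100-nt reads, 30x coverage required.
planned = reads_needed(30, 5_000_000, 100)
wasted = excess_reads(2_000_000, 30, 5_000_000, 100)
```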
In some cases sequencing of the entire genomic DNA would inevitably result in too large a portion of useless reads. For example, in clinical studies it is required to know the nucleotide sequences not of the entire genome but of certain areas of the genome. Special methods have been developed to prepare sequencing libraries containing only particular regions of the genome, e.g. multiplex PCR and hybridization-based enrichment.
Another source of useless reads is a distortion of the uniform representation of the components of the mixture to be sequenced. Distortion may be a result of non-uniform amplification or of non-uniform hybridization-based selection. Before distortion, all genomic regions have the same abundance. After distortion, some regions become more abundant than others. To reach the required sequencing coverage for rare components, the abundant ones have to be over-sequenced. Usually, researchers make efforts to prevent such distortion, for example:
Although the methods discussed above for sequencing and resequencing of genomic DNA share with the current invention the idea of avoiding useless sequencing reads, they differ from the methods of the present invention. “Sequencing/resequencing of genomic DNA” on the one hand and “analyzing concentrations of the components of a nucleic acid mixture” on the other are different research tasks. Besides, the main idea in “sequencing/resequencing” is to preserve the abundance of the analyzed components: either of the entire genome or of the regions required to be sequenced.
Thus the methods according to the present invention suitable for expression profiling preferably refer to nucleic acids in the original mixture selected from the group comprising or consisting of: RNA, total RNA, mRNA, mtRNA, rRNA, tRNA, dsRNA, small RNA/micro RNA, and cDNA.
If the method according to the present invention is used for analysis of biodiversity, it is preferred that the nucleic acid of the original mixture is selected from the group comprising or consisting of: RNA or DNA from an environmental or clinical sample.
Different next-generation sequencing platforms are used in biomedicine. The effectiveness of all of them may be improved by decreasing the number of useless sequencing reads. Besides, there are other detection technologies which are sensitive to the presence of useless components in the analyzed mixture, for example the long-known serial analysis of gene expression or the recently introduced digital color-coded barcode technology. The efficiency of all methods of concentration measurement which are sensitive to the presence of useless components in the analyzed mixture may be improved by using the COBRA approach.
The scheme of the sequencing library preparation is shown in
Following ligation and removal of most of the non-ligated oligonucleotides, library amplification is performed. During amplification, ligated molecules acquire full-size sequencing adapters.
Sequencing is used for detection, accounting and quality control of library molecules. If a sequenced molecule contains fragments belonging to different loci or fragments are ligated in the wrong order, it is excluded from the further analysis.
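The exclusion rule above can be sketched as a simple filter. The sketch assumes that each sequenced molecule has already been decomposed into an ordered list of (locus, oligonucleotide role) pairs and that a valid molecule consists of an upstream fragment followed by a downstream fragment of the same locus; this representation, the role names and the locus names are hypothetical.

```python
from collections import Counter


def passes_quality_control(fragments, expected_order=("upstream", "downstream")):
    """Accept a molecule only if all fragments share one locus and are in the expected order."""
    loci = {locus for locus, _ in fragments}
    order = tuple(role for _, role in fragments)
    return len(loci) == 1 and order == tuple(expected_order)


def count_valid_molecules(molecules):
    """Count accepted library molecules per locus; rejected molecules are ignored."""
    counts = Counter()
    for fragments in molecules:
        if passes_quality_control(fragments):
            counts[fragments[0][0]] += 1
    return counts


# Hypothetical sequenced molecules.
molecules = [
    [("locus_A", "upstream"), ("locus_A", "downstream")],  # valid
    [("locus_A", "upstream"), ("locus_B", "downstream")],  # fragments from different loci
    [("locus_B", "downstream"), ("locus_B", "upstream")],  # wrong ligation order
    [("locus_B", "upstream"), ("locus_B", "downstream")],  # valid
]
counts = count_valid_molecules(molecules)
```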
T4Rnl2 RNA ligase enzyme can be used for ligation of detector oligonucleotides directly on the RNA template [2].
Following ligation, removal of most of the non-ligated oligonucleotides and reverse transcription, library amplification is performed. During amplification, ligated molecules acquire full-size sequencing adapters.
Sequencing is used for detection, accounting and quality control of library molecules. If a sequenced molecule contains fragments belonging to different loci or fragments are ligated in the wrong order, it is excluded from the further analysis.
Genes with “high”, “intermediate” and “low” levels of expression were selected, 10 genes in each group. Using the procedure described in Example 1, two sequencing libraries were prepared. When preparing the first library, primers for all loci were used together. For the preparation of the second library, the reaction mixture was divided into three separate reactions, as shown in
It was found that the frequency of sequencing reads corresponding to genes with “high” and “intermediate” levels of expression was reduced in the second library 100-fold and 10-fold, respectively.
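The relation between the size of a portion and the resulting reduction can be sketched as follows. The split shown is an assumption made for illustration, since the actual portion sizes are given only in the referenced scheme: 1% of the reaction mixture receives the primers for the “high” group, 10% the primers for the “intermediate” group and the remaining 89% the primers for the “low” group.

```python
def reduction_factors(portions):
    """Fold reduction per primer group for a split reaction mixture.

    portions -- list of (portion_size, groups) pairs, where groups is the set of
                primer groups added to that portion; sizes may be in any unit
    """
    total = sum(size for size, _ in portions)
    accessible = {}
    for size, groups in portions:
        for group in groups:
            accessible[group] = accessible.get(group, 0) + size
    # A group that sees only part of the template yields proportionally fewer molecules.
    return {group: total / size for group, size in accessible.items()}


# ASSUMED split, in percent of the reaction mixture.
design = [(1, {"high"}), (10, {"intermediate"}), (89, {"low"})]
factors = reduction_factors(design)
```

Under this assumed split the “high” and “intermediate” groups are reduced 100-fold and 10-fold, while the “low” group is also reduced slightly (about 1.12-fold), because part of the starting material is never exposed to its primers.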
A COBRA library was prepared using the same primers as in Example 3, but the reaction mixture was divided, as shown in
In Example 3 some unwanted suppression occurs, since ˜10% of the starting material is inaccessible to the primers corresponding to low expressed genes. On the scheme shown in
When a thermostable ligase (e.g. Pfu or Taq ligase) is used, the detection reaction described in Example 1 can be performed cyclically, each cycle consisting of steps of denaturation, annealing and ligation. This makes it possible to obtain several library molecules from each template cDNA.
It is possible to change the relative abundance of the library molecules corresponding to different adjustment level groups if the corresponding groups of locus-specific detector oligonucleotides are introduced into a cyclic ligase reaction after different numbers of cycles. The earlier detector oligonucleotides are introduced into the cyclic ligase reaction, the more library molecules are obtained from each template cDNA.
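The dependence of yield on the time of introduction can be sketched with a linear model. The sketch assumes that in every cycle in which the detector oligonucleotides are present, each template cDNA gives rise to at most one library molecule with a constant efficiency; the cycle numbers and group names are hypothetical.

```python
def molecules_per_template(total_cycles, cycles_before_addition, efficiency=1.0):
    """Expected library molecules per template in a cyclic ligase reaction.

    Linear model (assumption): one ligation product per template per cycle,
    scaled by a constant per-cycle efficiency.
    """
    if not 0 <= cycles_before_addition <= total_cycles:
        raise ValueError("oligonucleotides must be added within the reaction")
    return efficiency * (total_cycles - cycles_before_addition)


# Hypothetical schedule for a 100-cycle reaction.
schedule = {"low": 0, "intermediate": 90, "high": 99}
yields = {
    group: molecules_per_template(100, added_after)
    for group, added_after in schedule.items()
}
```

With this schedule the groups introduced after 90 and 99 cycles yield 10 and 100 times fewer library molecules per template than the group present from the start.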
On the scheme shown in
If oligonucleotides used in the reaction described in Example 1 have the structure shown in
As in the previous example, group-specific PCR-primers should be added on different PCR cycles (
On the scheme shown in
Examples 5 and 6 show how stepwise level adjustment can be carried out in a common reaction mixture. In Example 5, spatial isolation of oligonucleotides from different adjustment level groups is used, and in Example 6 oligonucleotides of different adjustment level groups have different markers.
COBRA changing of abundance can be carried out directly prior to sequencing of a standard RNA-Seq library. The scheme is shown in
The library is divided into portions. To each portion, selector primers belonging to the groups with the corresponding adjustment levels are applied. The relative abundance of different transcripts is changed because only a certain part of the library is available to selector oligonucleotides from a particular adjustment level group.
Performing COBRA-procedure prior to sequencing is convenient because:
For example, in the routine clinical analysis only a few types of human tissues are easily available (blood, saliva, buccal cells, sperm, feces). For each of these tissues, appropriate COBRA selector oligonucleotides can be designed.
Examples of using blocked primers which are unable to participate in primer extension reaction are shown in
To reduce the number of library molecules synthesized from non-specific primers, it makes sense to use primers whose 5′ part corresponds to the sequencing adapter. Then, during the preparation of the library, only the second sequencing adapter has to be ligated.
Examples of using blocked primers which are unable to participate in ligation reaction are shown in
The use of two (or three) specific primers for each locus reduces the number of non-specific molecules in the library. If it is necessary to analyze a polymorphic region, the ligation can be combined with a gap-filling reaction (
If the 5′-parts of upstream primers and the 3′-parts of downstream primers are conserved and correspond to sequencing adapters, library molecules are obtained immediately after ligation.
COBRA library was made according to the protocol described in Example 1 using a mixture of functional/blocked primers (
It was found that the frequency of sequencing reads corresponding to genes with “high” and “intermediate” levels of expression was reduced 100-fold and 10-fold, respectively.
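The link between the composition of a functional/blocked primer mixture and the resulting reduction can be sketched as follows. The sketch assumes that functional and blocked primers for a locus hybridize equally well, so that the share of templates giving a library molecule equals the share of functional primers; the 1:99 and 1:9 ratios are hypothetical values chosen to reproduce 100-fold and 10-fold reductions.

```python
def reduction_from_primer_mix(functional_parts, blocked_parts):
    """Fold reduction expected from a functional/blocked primer mixture.

    Assumes equal hybridization of both primer types (competition model).
    """
    if functional_parts <= 0:
        raise ValueError("at least some functional primer is required")
    return (functional_parts + blocked_parts) / functional_parts


# Hypothetical mixtures for the three expression groups.
high = reduction_from_primer_mix(1, 99)
intermediate = reduction_from_primer_mix(1, 9)
low = reduction_from_primer_mix(1, 0)
```

Because the ratio is set per locus by the primer mixture itself, all groups can share a single reaction.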
Schemes of methods that make it possible to separate the molecules produced in reactions with functional primers from the molecules derived from reactions with blocked primers are shown in
The use of functional/blocked oligonucleotides makes it possible to perform COBRA-selection of RNA-Seq library molecules before sequencing without splitting the reaction mixture into portions (as in Example 7). Sets of “functional” and “blocked” selector oligonucleotides for different suppression levels are shown in
For stepwise level adjustment, the original mixture is divided into portions and selector oligonucleotides corresponding to different adjustment levels are added to the portions, as shown in
Negative selection can be performed in a single tube without division into portions, if functional/blocked selector oligonucleotides are used (
1. Bullard D R, Bowater R P. Direct comparison of nick-joining activity of the nucleic acid ligases from bacteriophage T4. Biochem J. 2006 Aug. 15; 398(1):135-44.
2. Sparks A B, Wang E T, Struble C A, Barrett W, Stokowski R, McBride C, Zahn J, Lee K, Shen N, Doshi J, Sun M, Garrison J, Sandler J, Hollemon D, Pattee P, Tomita-Mitchell A, Mitchell M, Stuelpnagel J, Song K, Oliphant A. Selective analysis of cell-free DNA in maternal blood for evaluation of fetal trisomy. Prenat Diagn. 2012 January; 32(1):3-9. doi: 10.1002/pd.2922. Epub 2012 Jan. 6.
3. US 2009/1246760A1 (Harris Timothy et al.)
Number | Date | Country | Kind |
---|---|---|---|
12199784.5 | Dec 2012 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2013/078177 | 12/31/2013 | WO | 00 |