To study macromolecules and molecular complexes, researchers often have to fragment them. Afterwards, the composition of the macromolecules (molecular complexes) before fragmentation has to be reconstructed. In the present invention we propose labeling macromolecules (or molecular complexes) prior to fragmentation so that the components of each macromolecule (or molecular complex) receive identical codes. In further analysis, the code allows grouping together the fragments that belonged to the same macromolecule (or molecular complex) before dissociation.
Molecular complexes can be of any scale: from proteins consisting of multiple subunits and long nucleic acid molecules to the content of cells and cell compartments. Based on this invention we present protocols for next generation sequencing (NGS) that make it possible to determine haplotypes, to analyze whole RNA molecules, and to reveal accurate sequences of repetitive genomic regions.
Many biological methods are not applicable to the analysis of large macromolecules (MM) and molecular complexes (MC) as a whole. MM/MC have to be fragmented before being analyzed by those methods. For example, proteins have to be digested before mass-spectrometry analysis, and nucleic acids have to be fragmented for preparation of sequencing libraries. The problem is then to reconstruct the original content and structure of the MM/MC after analysis of the fragments.
The present invention allows preserving information about the content of MM/MC despite fragmentation and mixing together of fragments from different MM/MC. We propose labeling MM/MC prior to mixing of fragments so that the components of each individual MM/MC receive identical codes. In the subsequent analysis, the codes allow grouping of fragments that belonged to the same MM/MC before dissociation (
For the implementation of the proposed approach it is necessary:
In this invention we propose several approaches for the introduction of specific codes into MM/MC (requirement number 2). The essential part of these approaches is the preservation of MM/MC integrity up to the labeling reaction. We use oligonucleotides with specific nucleotide sequences as code molecules and describe methods for creating a huge set of oligonucleotide codes.
There are several advantages to using oligonucleotides with specific nucleotide sequences as markers or code molecules: (i) an individual oligonucleotide molecule can be sequenced (requirement number 3); (ii) comparatively short oligonucleotides provide a large variety of nucleic acid sequence variants (codes), because each position of an oligonucleotide can hold one of four nucleotides; (iii) many chemical and molecular biology methods exist for handling oligonucleotides (synthesis, cloning, amplification, covalent and non-covalent attachment of oligonucleotides to surfaces and macromolecules); and (iv) it is common practice to use oligonucleotide sequences as barcodes in large-scale sequencing.
There are special methods of combinatorial chemistry (combinatorial synthesis, synthesis of compounds on microarray) and molecular biology (amplification of library of random molecules) which may be applied for creation of library of oligonucleotide markers suitable as codes (separate sets of oligonucleotide molecules with the identical sequence) on (i) solid supports (microbeads or microarrays) or (ii) directly on MM/MC. This refers to requirement number 1.
We suggest the following approaches for introduction of specific oligonucleotide markers into MM/MC (requirement number 2):
The essential part of all these approaches is keeping the spatial integrity of MM/MC up to the labeling reaction. This provides a possibility for the highly parallel independent labeling of a huge number of MM/MC. The spatial integrity may be preserved either by avoiding fragmentation of MM/MC before labeling or by avoiding dissociation of fragments of MM (fragments/components of MC) before labeling. It is possible to keep fragments of MM (fragments/components of MC) in close proximity with each other in droplets of water-in-oil emulsion, associated with microbeads, or associated with each other.
MM/MC may be of the same or different nature as molecules used as markers or codes. Therefore oligonucleotides may be used for coding not only of nucleic acids, but also of protein complexes, nucleic acid-protein complexes and macromolecules of other nature. When the nature of coding molecules and MM/MC is the same, the same approach can be used for determination of the code, and for analysis of the fragments of MM/MC. If the nature of coding molecules and MM/MC is different, different analysis methods have to be applied.
Therefore the present invention refers to a method for identification of fragments originating from individual macromolecules (MM) or molecular complexes (MC) in a mixture of fragments of different MM or MC using labeling of MM or MC with oligonucleotide markers comprising the following steps:
a) labeling of MM or MC with oligonucleotide markers wherein each particular MM or MC is labeled with identical oligonucleotide markers and preferentially the different MM or MC are labeled with different oligonucleotide markers and wherein the number of identical oligonucleotide markers is sufficient that after subsequent fragmentation or dissociation of fragments of the MM or the MC each fragment is preferentially labeled with at least one of the oligonucleotide markers;
b) fragmentation or dissociation of MM or MC, wherein steps a) and b) are optionally done in parallel;
c) mixing labeled fragments of different MM or MC together;
d) analyzing of fragments and determining the nucleotide sequence of the at least one oligonucleotide marker associated with each fragment;
e) identification of fragments originating from individual MM or MC, based on the fact that fragments associated with different oligonucleotide markers were part of different MM or MC before said fragmentation.
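The grouping in step e) can be illustrated with a minimal Python sketch. It assumes that step d) has already produced, for every fragment, a pair consisting of the marker sequence and the fragment read; the record layout, the function name and the example data are hypothetical and are given for illustration only.

```python
from collections import defaultdict


def group_fragments_by_code(records):
    """Group fragments by the marker code determined together with each fragment.

    `records` is an iterable of (code, fragment) pairs (assumed layout).
    Fragments without a readable code (None or "") are skipped, because
    their MM/MC of origin cannot be assigned.
    """
    groups = defaultdict(list)
    for code, fragment in records:
        if not code:
            continue
        groups[code].append(fragment)
    return dict(groups)


# Hypothetical example: two original molecules carrying codes "ACGT" and "TTAG".
records = [
    ("ACGT", "fragment_1"),
    ("TTAG", "fragment_2"),
    ("ACGT", "fragment_3"),
    (None, "fragment_4"),
]
groups = group_fragments_by_code(records)
```

In this sketch, fragments sharing a code are treated as originating from the same MM or MC; this holds only to the extent that different MM or MC received different markers in step a).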
The present invention refers further to a method, wherein labeling of MM or MC with oligonucleotide markers in step a) is performed by mix-and-split combinatorial synthesis of oligonucleotide markers directly on MM or MC. Another preferred embodiment of the present invention is a method, wherein labeling of MM or MC with oligonucleotide markers in step a) is performed by automated parallel synthesis of said oligonucleotide markers directly on MM or MC distributed on a surface. Thereby it is possible that the synthesis of oligonucleotide markers is performed from short oligonucleotides either by ligation or primer extension or from phosphoramidites by chemical synthesis. Another embodiment of the present invention are further methods, wherein labeling of MM or MC with oligonucleotide markers in step a) is performed by attachment of prepared-in-advance oligonucleotide markers to MM or MC by ligation or primer extension or by chemical reactions.
In step c) of the inventive method the fragments of different MM or MC labeled in step a) and fragmented and/or dissociated in step b) are mixed, for example to generate a sequencing library. This means individual labeled fragments are added to the same solution.
Within the method of the invention for identification of fragments originating from individual macromolecules (MM) or molecular complexes (MC), the objective is to label a particular MM or MC with many identical oligonucleotide markers, wherein the number of identical oligonucleotide markers is sufficient that after subsequent fragmentation or dissociation of fragments of the MM or the MC each fragment is labeled with at least one of the oligonucleotide markers. Furthermore, different MM or MC should be labeled with different oligonucleotide markers. The number of markers sufficient for nearly each fragment to be labeled with at least one of the oligonucleotide markers after subsequent fragmentation or dissociation of fragments of the MM or the MC can be determined according to known rules of statistics. Likewise, the number of different oligonucleotide markers relative to the number of MM or MC to be labeled should be chosen so that there is a sufficiently high probability that each MM or MC to be labeled is labeled with a different oligonucleotide marker.
Thereby the term “preferentially the different MM or MC are labeled with different oligonucleotide markers” refers to the case that at least 80%, more preferably at least 85%, further preferably at least 90%, and even more preferably at least 98% of the different MM or MC are labeled with different oligonucleotide markers. The term “each fragment is preferentially labeled with at least one of the oligonucleotide markers” refers correspondingly to the case that at least 80%, more preferably at least 85%, further preferably at least 90%, and even more preferably at least 98% of the fragments are labeled with at least one of the oligonucleotide markers.
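The statistical estimates mentioned above can be sketched as follows. The sketch rests on two assumptions that are not stated in the text: (1) each MM or MC receives one code drawn uniformly and independently from N possible codes, so that a given MM or MC has a unique code with probability (1 - 1/N)^(M-1) among M molecules; and (2) the number of markers per fragment is Poisson-distributed, so that a fragment carries at least one marker with probability 1 - exp(-m) for a mean of m markers per fragment. All function names are illustrative.

```python
import math


def expected_unique_fraction(num_codes, num_molecules):
    """Expected fraction of MM/MC whose code is shared with no other MM/MC
    (assumes uniform, independent code assignment)."""
    return (1.0 - 1.0 / num_codes) ** (num_molecules - 1)


def min_codes_for_unique_fraction(num_molecules, target):
    """Smallest number of codes N with (1 - 1/N)^(M-1) >= target."""
    if num_molecules <= 1:
        return 1
    # 1 - target^(1/(M-1)), computed with expm1 for numerical accuracy.
    one_minus_root = -math.expm1(math.log(target) / (num_molecules - 1))
    return math.ceil(1.0 / one_minus_root)


def labeled_fraction(mean_markers_per_fragment):
    """Fraction of fragments with at least one marker (Poisson assumption)."""
    return -math.expm1(-mean_markers_per_fragment)


def min_mean_markers(target):
    """Mean number of markers per fragment needed to label `target` of fragments."""
    return -math.log(1.0 - target)


# Example: 1,000 molecules, 98% of which should carry a unique code.
codes_needed = min_codes_for_unique_fraction(1000, 0.98)
markers_needed = min_mean_markers(0.98)
```

Under these assumptions the required number of codes grows roughly in proportion to the number of MM or MC, and about four markers per fragment on average suffice for the 98% labeling threshold.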
The term “macromolecule” as used herein refers to the conventional biopolymers, like nucleic acids, proteins, and carbohydrates, as well as non-polymeric molecules with large molecular mass such as lipids and macrocycles having more than 500 atoms, or preferably more than 1,000 atoms. Macromolecules consist of many smaller structural units linked together.
The term “molecular complex” or “macromolecule complex” refers to a loose association involving two or more molecules, wherein at least one is a macromolecule. The attractive bonding between the molecules of such a complex is normally weaker than in a covalent bond.
The term “oligonucleotide marker” as used herein refers to an oligonucleotide having a definite sequence which can be used to code macromolecules. Synonymously used herein is the term “oligonucleotide code” or “coding oligonucleotide”.
Application of the Invention for NGS Sequencing
Fragmented nucleic acids should be used for preparation of NGS (Next generation sequencing) libraries, in part because the length of sequencing library molecules is restricted. Besides, sequencing read length is limited. Reconstruction of genomes and transcriptomes using those short sequences is a complex task, and obtained results have a restricted value.
Problems appearing during sequencing of genomic DNA:
These problems make it impossible to determine the exact sequence of chromosomes. The uncertainty depends only partly on the accuracy of sequencing itself; the other reason is the ambiguous nature of assembling short sequencing reads into the genomic sequence.
For transcriptome analysis it is necessary to determine the composition and the quantity of all transcripts present in the sample. Currently there are difficulties both with structure assessment and gene expression analysis:
Listed problems lead to “incompleteness” of genome and transcriptome sequencing. It is impossible to be sure that the sequencing experiments would not have to be repeated on another sequencing platform to provide the lacking data.
It is a common opinion that most sequencing problems could be solved by increasing the length of sequenced fragments up to tens of kilobases. The longer the sequencing reads are, the easier it is to assemble them into a genome/transcriptome.
In the framework of the present invention we propose to label nucleic acid (NA) molecules before sequencing-related fragmentation and, after sequencing, to group together sequencing reads originating from individual NA molecules. This allows (on the ranges corresponding to the length of NA molecules before sequencing-related fragmentation):
Therefore the present invention refers to methods, wherein the MM or MC are nucleic acid macromolecules or complexes which include nucleic acid molecules and wherein step d) comprises sequencing of fragments and oligonucleotide markers associated with said fragments. Furthermore it is preferred that the method according to the invention is applied for genome de novo sequencing, resequencing, haplotyping or analysis of transcriptome.
The full sequence of the original NA molecules (before sequencing-related fragmentation) may be reconstructed only under certain conditions: (i) high enough redundancy and (ii) absence of multiple repetitive regions within the original macromolecule. But even without reconstruction of the relative positions of sequencing reads, information about their linkage would significantly facilitate analysis of NGS sequencing data. Information obtained from coded or marked sequencing libraries produced according to the present invention is quite similar to the information produced by first-generation sequencing methods, where long genomic DNA fragments have to be cloned before sequencing. The typical linkage distance reachable by coding of nucleic acid molecules is up to hundreds of kilobases, and may be expanded up to the full-chromosome range for isolates of metaphase chromosomes.
Another aspect relates to the competition between second- and third-generation sequencing platforms. Currently, high-performance second-generation sequencing platforms can produce reads up to ~200 nucleotides long. Although the price per nucleotide for third-generation platforms is considerably higher, some third-generation platforms have a unique feature: they can generate longer sequencing reads, namely up to several thousand or tens of thousands of bases. The present invention allows second-generation sequencing platforms to produce sequencing data linked within the range of hundreds of thousands of bases and to be competitive with third-generation machines.
Haplotyping
One of the main application areas of linkage information is a whole-genome resequencing and haplotyping. Currently resequencing is performed mostly without haplotyping, because existing haplotyping methods are too inconvenient and expensive. Existing haplotyping methods involve:
The first method produces high-quality data (full-chromosome sequence, excluding highly repetitive centromere and telomere regions), but is too expensive to be used routinely. The other methods reduce the data output (excluding repetitive regions from the analysis) and simultaneously significantly reduce the price of the analysis.
Using metaphase chromosomes as a starting material it is impossible to reconstruct the sequence of repetitive regions within individual chromosomes.
If parental DNA fragments are separated into physically distinct pools in such a way that “the statistical likelihood of having corresponding fragment from both parental chromosomes in the same pool markedly diminishes” [3], then only sequencing fragments that map uniquely to the reference genome can be successfully haplotyped. Similar to the approach used in the present invention, sequencing reads originating from individual parental DNA molecules are grouped together after sequencing. The grouping methods are different. In the present invention grouping is performed on the basis of MM/MC-specific codes only. In the case of [3] grouping is based on two attributes: (i) belonging to the same original physically distinct pool and (ii) the close position of sequencing reads after mapping to the reference genome.
Information obtained from coded sequencing libraries produced according to the present invention is quite similar to the information produced when long genomic DNA fragments are cloned before sequencing. In this respect it is quite close to the first method, but with cheap and handy procedure for library production.
Practical Implementations
There are two major approaches in combinatorial chemistry, a technology for synthesizing and characterizing collections of compounds and screening them for useful properties. The first method is called the “mix-and-split method” and involves attaching the starting compounds to polymer beads. The beads are then split into groups and reacted with the second set of reagents (e.g. a specific nucleotide). After this reaction, all the beads are pooled, mixed together, and split into groups again. The groups of beads are then reacted with the next set of reagents (e.g. another nucleotide). Additional rounds of pooling and splitting allow libraries with millions of compounds (here oligonucleotides) to be generated.
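The pooling and splitting rounds described above can be illustrated by a small simulation. This is a simplified sketch, assuming that in every round the pooled beads are shuffled and divided into four groups of nearly equal size, one per nucleotide; the function name, the bead count and the random seed are illustrative only.

```python
import random

NUCLEOTIDES = "ACGT"


def mix_and_split(num_beads, num_rounds, seed=0):
    """Simulate mix-and-split synthesis; returns the sequence grown on each bead."""
    rng = random.Random(seed)
    beads = [""] * num_beads
    for _ in range(num_rounds):
        # Pool and mix: a random order stands in for physical mixing.
        order = list(range(num_beads))
        rng.shuffle(order)
        # Split into four groups; each group is extended by one nucleotide.
        for position, bead_index in enumerate(order):
            beads[bead_index] += NUCLEOTIDES[position % 4]
    return beads


beads = mix_and_split(num_beads=1000, num_rounds=3)
distinct_codes = len(set(beads))  # cannot exceed 4**3 = 64
```

Every bead ends up with exactly one sequence, while the number of distinct sequences across the bead population is bounded by 4 raised to the number of rounds.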
A second method is called “parallel synthesis”. All the different chemical structure combinations are prepared separately, in parallel, using thousands of reaction vessels and a robot programmed to add the appropriate reagents to each one. This method is unsuitable for the creation of very diverse libraries but is very useful for the development of smaller and more specialized libraries.
A code in the form of oligonucleotide markers may be (i) a single uninterrupted nucleotide sequence; (ii) a set of nucleotide sequence blocks separated by conservative nucleotide sequence regions (standard or commonly used sequences for sequencing primers such as M13, T7, poly A or poly T); or (iii) several nucleotide sequence blocks attached separately to fragments of MM or MC.
Sequencing library molecules have common flanking sequencing library adaptors, which are used for the clonal amplification of the library molecules in the sequencing machine (Illumina, SOLiD).
It is possible to suggest a lot of practical approaches for analysis of MM/MC composition using molecular coding.
Using of coding oligonucleotides for sorting of sequencing data is well established and can be carried out by standard methods. For example, bar-coding is used for the simultaneous sequencing of several libraries. During library preparation a specific oligonucleotide (barcode) is introduced into each molecule. Nucleotide sequences of barcodes are different for different libraries. Bar-coded libraries are pooled and sequenced together. Nucleotide sequence of barcode is determined for each fragment (either as an initial part of one of the sequencing reads,
What is inventive is the introduction of identical oligonucleotide markers into MM/MC, and there are many ways to do it. The proposed and preferred approaches are summarized in Table 1. The rows of the table list approaches to creating a library of oligonucleotide codes: two methods of combinatorial chemistry (“mix-and-split synthesis” and “parallel synthesis on a microarray”) and one method of molecular biology (clonal amplification, where each single molecule gives rise to an isolated set of identical copies: rolling-circle amplification, bridge amplification, and methods of amplification in emulsion (exponential and linear)). The columns correspond to the methods of association of codes with MM/MC: (i) creation/synthesis of codes directly on the MM/MC and (ii) transfer to the MM/MC of pre-synthesized codes or marker oligonucleotides. For all combinations of <<how to create a library of codes>>-<<how to associate codes with MM/MC>> it is possible to offer an experimental protocol.
Therefore the present invention refers preferably to methods, wherein oligonucleotide markers are prepared in advance using:
<<mix-and-split synthesis of oligonucleotide codes>>-<<directly on MM/MC>> (cf. Examples 2-6, 10, 11)
Mix-and-split synthesis is a standard approach of combinatorial chemistry for the synthesis of sets of chemical compounds. The scheme of mix-and-split synthesis is shown in
If using individual MM/MC as carriers (see
In combinatorial chemistry chemical synthesis is usually used. For oligonucleotide-based codes, not only chemical but also enzymatic synthesis (ligation or template-directed primer extension) is possible. The advantage of enzymatic synthesis is that it is a “soft” process (if compared to chemical synthesis), which does not damage macromolecules. Chemical synthesis of coding oligonucleotides allows only four synthesis variants at each split stage (according to the number of possible nucleotides). For ligation-based code extension, the number of variants (number of parallel reactions at each split stage) can be much larger. If codes ligated at each split stage have a length of “n” nucleotides, there are 4^n variants of codes possible. Accordingly the same number (4^n) of ligation reactions may be performed in parallel at each split stage. For “k” stages of ligation-based combinatorial coding, 4^(n·k) versions of the code can be obtained (Table 2).
Oligonucleotide adapters (the reagent added in each stage of ligation-based code extension) may contain not only a code, but also a part that varies from one split stage to another (see
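Table 2. Number of codes, 4^(n·k), after “k” cycles of coding with codes of “n” nucleotides per cycle

n (nt)   k = 2         k = 3         k = 4         k = 5         k = 6
4        6.6 × 10^4    1.7 × 10^7    4.3 × 10^9    1.1 × 10^12   2.8 × 10^14
5        1.0 × 10^6    1.1 × 10^9    1.1 × 10^12   1.1 × 10^15   1.2 × 10^18
6        1.7 × 10^7    6.9 × 10^10   2.8 × 10^14   1.2 × 10^18   4.7 × 10^21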
Ligation-based combinatorial synthesis is capable of providing almost any desired number of codes in a few stages. Table 3 shows the number of fragments of different lengths in 1 μg of ds DNA. When constructing libraries using the inventive method, it is desirable that the number of codes or oligonucleotide markers is an order of magnitude greater than the number of MM/MC. Thus, using adapters with 5-6 nt coding regions it is possible to obtain, in only a few steps (2-5), a number of codes sufficient for any practical application.
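The relation between code length, number of stages and number of MM/MC can be checked with a short calculation. The sketch below uses the 4^(n·k) rule for “k” stages with “n”-nucleotide codes and the tenfold excess of codes over MM/MC recommended above. The average mass of about 650 g/mol per base pair of double-stranded DNA is an assumption introduced here for the estimate of fragment numbers; it is not taken from the text, and all function names are illustrative.

```python
AVOGADRO = 6.022e23
GRAMS_PER_MOLE_PER_BP = 650.0  # assumed average value for double-stranded DNA


def code_space(n, k):
    """Number of possible codes after k stages with n-nucleotide codes per stage."""
    return 4 ** (n * k)


def min_stages(n, num_molecules, excess=10):
    """Smallest k for which the code space is at least excess * num_molecules."""
    k = 1
    while code_space(n, k) < excess * num_molecules:
        k += 1
    return k


def fragments_per_microgram(fragment_length_bp):
    """Approximate number of ds DNA fragments of a given length in 1 microgram."""
    moles = 1e-6 / (fragment_length_bp * GRAMS_PER_MOLE_PER_BP)
    return moles * AVOGADRO


# Example: stages needed to code all 10 kb fragments in 1 microgram of ds DNA
# with 6-nt codes per stage and a tenfold excess of codes.
stages = min_stages(6, fragments_per_microgram(10_000))
```

Because the code space grows exponentially with the number of stages, increasing either the code length per stage or the number of stages by one multiplies the number of available codes by several orders of magnitude.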
<<Synthesis of Oligonucleotide Codes on Array>>-<<Directly on MM/MC>>
The second standard combinatorial chemistry approach for creating libraries of coding oligonucleotides is the synthesis on an array. This approach can also be used for the synthesis of coding oligonucleotides directly on the MM/MC. If MM/MC are distributed on a two-dimensional surface so that they rarely overlap with each other, and the synthesis of oligonucleotide codes is carried out on such a surface, each component of a particular MM/MC will receive identical codes (or a set of codes that are located close to each other), see
<<Clonal Amplification>>-<<Directly on MM/MC>>
Clonal amplification may be used as an alternative method for construction of mate-paired (MP) libraries. Oligonucleotides containing a coding region and a conservative region for sequencing of this code are used as adapters for circularization of the original nucleic acid fragments. The resulting circular molecules are amplified by rolling-circle amplification (RCA) or branched rolling-circle amplification (BRCA). In this way, both nucleic acid fragments and codes are replicated. Coded concatemers are then randomly fragmented. Only code-containing fragments are selected for construction of the NGS library (for example, by hybridization to an oligonucleotide corresponding to the code-sequencing primer). PE-sequencing and sequencing of codes are performed. Nucleotide sequences of codes are used to group clones corresponding to the same original molecules.
MP-library preparation based on clonal amplification has some advantages compared to the traditional protocol. For traditional MP libraries: “original fragment->1 library molecule->2 sequencing reads”. For the described method: “original fragment->set of library molecules->multiple reads covering terminal regions of the original fragment”,
Transfer of Pre-Synthesized Oligonucleotide Marker on MM/MC
The second column of Table 1 corresponds to experimental approaches, in which the collection of codes is synthesized in advance, and during preparation of coded sequencing library is transferred to MM/MC. Since codes are synthesized in advance, the protocol of library preparation might be shorter and more stable. Collection of codes may be prepared according to the methods listed in rows of the Table 1:
Some approaches to transfer pre-synthesized codes to MM/MC are described in the examples 1, 7-9, 12, and 15. In many cases, these approaches are applicable to any way of preparation of collection of oligonucleotide markers.
Technical Implementations
One preferred embodiment of the invention refers to methods, wherein oligonucleotide markers are prepared on a microarray in a form of spatially isolated groups with identical oligonucleotides and association of particular MM or MC with particular oligonucleotide marker is achieved by adsorption of MM or MC to said microarray.
Further embodiments of the present invention are methods, wherein oligonucleotide markers are prepared in solution as individual oligonucleotide molecules, or as self-associated identical oligonucleotide molecules, or as associates of identical oligonucleotide molecules with microbeads and association of particular MM or MC with particular oligonucleotide marker is achieved in water-in-oil emulsion or by adsorption of MM or MC with said oligonucleotide markers in solution.
Introduction of oligonucleotide markers into MM/MC often involves performing of multiple parallel reactions.
Parallel reactions may be organized in a common reaction solution:
(i) in spatially isolated droplets in water-in-oil emulsion;
(ii) by adsorption to each other of equivalent amounts of presynthesized oligonucleotide markers (on microbeads or on a microarray) and MM/MC (2D adsorption on a microarray or 3D adsorption to beads in a diluted solution);
(iii) by using MM/MC as carriers for the synthesis of a library of codes (in combinatorial synthesis, in synthesis on a 2D surface (microarray), or in an amplification reaction).
Current robotics and automation also permit to organize a number of physically separated aliquots:
It is inconvenient to add enzymes/chemicals to many separate reactions. It is better to work with a common inactivated mixture (master mix) and to start the reaction after splitting. The reaction may be kept inactive by external conditions (for example, a decreased temperature) or by excluding some key component from the reaction (divalent ions, cofactors, etc.), which is later introduced together with the split component (usually, coded oligonucleotides).
For many examples described in this invention large sets of oligonucleotides are required. If oligonucleotides consist of conservative and variable parts and the total number of oligonucleotides is too large for the direct synthesis, the collection of oligonucleotides might be produced by ligation of a common part to locus-specific oligonucleotides. A double-stranded common region may be introduced using ligation-based oligonucleotide synthesis. This is convenient for many applications, because the common part is masked from non-specific hybridization.
Coded Libraries
Coded (prepared by a method according to this invention) libraries differ from traditional ones. Traditional libraries consist of completely independent clones, whereas the coded libraries consist of sets of clones with the same code.
Traditional libraries are prepared preferably with a large excess: number of independent molecules is much larger than the expected number of sequencing reads. Only a small part of the library is sequenced. This helps to minimize the resequencing of the same clones.
This approach is not applicable for coded libraries, where the relationship of clones should be revealed. If only a small portion of the library is sequenced, then only a small fraction of existing relationships would be detected. In the extreme case—when just one clone is sequenced from each set of clones with the same code—no relationships between clones would be revealed at all.
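The dependence of detectable relationships on sequencing depth can be made concrete with a simple probability estimate. The sketch assumes, as a simplification not stated in the text, that each clone of a set sharing one code is sequenced independently with the same probability; a relationship within the set is revealed only if at least two of its clones are sequenced. The function name is illustrative.

```python
def prob_linkage_detected(set_size, sampled_fraction):
    """Probability that at least two clones from a set of `set_size` clones
    with the same code are sequenced, when each clone is sequenced
    independently with probability `sampled_fraction` (assumed model)."""
    p = sampled_fraction
    none_sequenced = (1 - p) ** set_size
    exactly_one = set_size * p * (1 - p) ** (set_size - 1)
    return 1 - none_sequenced - exactly_one


# Example: sets of 10 clones, sequencing 5% versus 50% of the library.
shallow = prob_linkage_detected(10, 0.05)
deep = prob_linkage_detected(10, 0.50)
```

In this model a set of a single clone never reveals a relationship, and for small sampled fractions the probability of detection falls roughly with the square of the fraction sequenced.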
The ideal solution would be complete sequencing of the coded library. In practice, it would be necessary:
(i) in case of non-amplified libraries: to develop a method of loading of the whole library into a flowcell (without loss of molecules in liquid-handling system and in non-readable regions of a flowcell);
(ii) in case of amplified libraries: to find a compromise between the desires (i) to sequence the whole library and (ii) to avoid an unacceptably large number of resequencing of the same clones.
The simplest way to compensate for the losses during preparation of the traditional library is to increase the amount of starting material. If the starting material is available in excess then this approach has no negative effects. On the contrary, loss of clones during preparation of coded library is equivalent to the loss of information about components of a MM/MC. Ideally, the coded library should be constructed from the minimal amount of material with minimal losses.
The critical step, which is sensitive to the demand for “a minimum of material,” is the step of fragmentation (dissociation) of MM/MC. Up to this point it is safe to work with an excess of material, but before dissociation only as much material should be taken as will actually be sequenced; excess should be avoided. In this respect it is convenient to use for library preparation those methods which preserve fragment association until the very end of the protocol (whole-genome amplification within water-in-oil emulsion, as described in Example 15; fragmentation without dissociation, as described in Example 10). In this case it is possible
(i) to prepare coded libraries with a large excess, as for traditional ones;
(ii) to determine a library titer taking an aliquot of emulsion (bead suspension);
(iii) to take the necessary volume of emulsion (bead suspension) for sequencing.
Coded libraries are more useful for haplotyping than traditional ones. In order to reveal that two particular alleles are located on the same chromosome using traditional libraries, they have to be found in the same library molecule. Since only a small part of sequencing reads cover two heterozygous sites at once, only a small part of sequencing data contains information useful for haplotyping. Besides, it is impossible to straddle homozygous regions, which are longer than the fragments used for preparation of PE- (or MP-) libraries. In order to reveal that two distinct alleles are located on the same chromosome using coded libraries, they have to be discovered in the library as molecules with the same code.
This means that:
Coded libraries might simplify de novo sequencing. Codes make it possible to reconstruct the content of parental NA molecules. Besides, if coding is associated with NA amplification (see Examples 1A, 7) and the redundancy of sequencing reads originating from parental NA molecules is high enough, the relative positions of sequencing reads may be reconstructed; as a result the whole parental NA molecule would be sequenced. If multiple repetitive regions are present within the original NA molecule, analysis of overlapping parental NA molecules would be required for sequence reconstruction.
When using coded libraries for transcriptome research it would be necessary to choose which type of analysis is more important: analysis of the structure or analysis of the expression level, since they place contrary demands on library construction. To get more detailed information about the structure of transcripts it is desirable that as many library molecules as possible originate from the same RNA molecule and thus have the same code. However, when analyzing expression levels, all molecules with the same code should be counted as one original molecule. Therefore, to increase the statistical reliability of expression analysis it is desirable that as few library molecules as possible have the same code.
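The counting rule for expression analysis, in which all library molecules with the same code count as one original molecule, can be sketched as follows. The sketch assumes that each sequencing read has already been assigned to a transcript and that its code has been determined; the record layout, the function name and the example data are hypothetical.

```python
from collections import defaultdict


def count_original_molecules(reads):
    """Count distinct codes per transcript.

    `reads` is an iterable of (transcript, code) pairs (assumed layout).
    Reads sharing a code are collapsed into one original RNA molecule.
    """
    codes_per_transcript = defaultdict(set)
    for transcript, code in reads:
        codes_per_transcript[transcript].add(code)
    return {t: len(codes) for t, codes in codes_per_transcript.items()}


# Hypothetical example: three reads of geneA come from only two molecules.
reads = [
    ("geneA", "AAC"),
    ("geneA", "AAC"),
    ("geneA", "GTT"),
    ("geneB", "CCA"),
]
expression = count_original_molecules(reads)
```

The same grouping by code that collapses reads for expression counting supplies, for structural analysis, the set of reads belonging to each individual RNA molecule.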
As already mentioned, it is desirable that the possible number of codes is significantly larger than the number of MM/MC in the sample, since this reduces the likelihood that independent MM/MC get the same code. However, useful results can be obtained even when the number of codes or different marker oligonucleotides is less than or comparable to the number of MM/MC. In this case, some of the MM/MC will get the same codes and extra effort is required to resolve the linkage of fragments. However, the analysis would still be simpler than without the inventive method, when the sequencing data is analyzed without any additional information about the linkage of fragments to each other.
Locus-Specific Sequencing of Coded Sequencing Libraries
It is often required to sequence not the entire genome, but only a certain part of it. Currently locus-specific sequencing is based on enrichment: oligonucleotides which cover the desired area are synthesized and used for hybridization-based selection of relevant clones from the sequencing library. Coded libraries allow another way of locus-specific sequencing: after low-coverage sequencing, the codes corresponding to the original fragments that overlap the area of interest are identified. These identified codes are then used for selection of library molecules for further sequencing.
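The code-based selection can be sketched in two steps: identifying the codes seen in the area of interest after low-coverage sequencing, and selecting library molecules carrying those codes. The sketch makes a simplifying assumption that an original fragment is considered to overlap the area if at least one of its mapped reads does; the tuple layouts, function names and coordinates are hypothetical, and coordinates are treated as half-open intervals.

```python
def codes_overlapping_region(mapped_reads, chrom, start, end):
    """Return the codes of reads that overlap the region [start, end) on `chrom`.

    `mapped_reads` is an iterable of (code, chrom, read_start, read_end)
    tuples (assumed layout, half-open coordinates).
    """
    selected = set()
    for code, read_chrom, read_start, read_end in mapped_reads:
        if read_chrom == chrom and read_start < end and read_end > start:
            selected.add(code)
    return selected


def select_library_molecules(library, codes):
    """Keep the (code, molecule) pairs whose code is in `codes`."""
    return [entry for entry in library if entry[0] in codes]


# Hypothetical low-coverage data and library content.
mapped_reads = [
    ("C1", "chr1", 100, 200),
    ("C2", "chr1", 5000, 5100),
    ("C3", "chr2", 150, 250),
    ("C1", "chr1", 9000, 9100),
]
library = [("C1", "mol_a"), ("C2", "mol_b"), ("C1", "mol_c")]

codes = codes_overlapping_region(mapped_reads, "chr1", 150, 300)
selected = select_library_molecules(library, codes)
```

In the example, code C1 is selected because one of its reads overlaps the region, so the other molecule carrying C1 is selected as well, although its own read maps outside the region.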
A particular case of locus-specific sequencing is the task of bringing genome sequencing projects to completion. Due to the random nature of fragmentation and because of some experimental limitations (such as GC content) it is impossible to obtain an absolutely uniform distribution of sequencing reads. By using marker oligonucleotides it is possible to fish out of the library only those fragments that correspond to areas with low coverage.
Barcoding of Combinatorial Coded Sequencing Libraries
In parallel with the coding of individual molecules, other parameters of the fragments can be coded too. For example, it is possible to combine coding of molecules with coding of samples (barcoding). Barcodes may be introduced at the earliest stages of the coded library preparation. The samples are then combined and only one library is prepared for the entire project. This approach makes it possible to create one sequencing library for the whole project, to check it with low-coverage sequencing, and to perform large-scale sequencing only if the library quality is good.
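When sample barcodes and molecule codes are both present, reads can be grouped on the pair of the two. The sketch below is an assumption about how such data might be handled downstream, not a step taken from the protocol; the names and data are hypothetical.

```python
from collections import defaultdict


def group_reads(reads):
    """Group read IDs by (sample barcode, molecule code).

    reads: iterable of (sample_barcode, molecule_code, read_id).
    """
    groups = defaultdict(list)
    for sample_barcode, molecule_code, read_id in reads:
        groups[(sample_barcode, molecule_code)].append(read_id)
    return dict(groups)


# Hypothetical reads: the same molecule code "c7" occurs in two samples.
reads = [("S1", "c7", "r1"), ("S2", "c7", "r2"), ("S1", "c7", "r3")]
groups = group_reads(reads)
```

Keying on the pair keeps reads from different samples apart even if the same molecule code happens to occur in both.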
Molecular Complexes
Another aspect is methods according to the invention applied to the analysis of the composition of protein molecules and/or protein molecular complexes, wherein said complexes which include nucleic acid molecules are aptamers or proximity ligation probes associated with said protein molecules and/or protein molecular complexes.
A molecular complex is a set of molecules associated with each other. Molecular complexes may have a natural origin (for example, a protein consisting of several subunits) or may be produced during an experiment (for example, a single-stranded nucleic acid molecule with hybridized oligonucleotides).
Depending on the type of analysis, different entities may be understood as the content of the same MM/MC. For example, if peptide-specific aptamers are used for the analysis of multi-subunit proteins, then the content is “an individual protein subunit”. If proximity-ligation probes are used for the analysis of multi-subunit proteins, then the content is “an individual protein-protein contact”. In both cases only those protein subunits (or protein-protein contacts) are analyzed for which the user has a specific probe.
Sometimes it is inconvenient to introduce codes directly into an intact MM/MC. It might be easier to produce a derivative molecular complex (MC), which preserves the association of the entities under study but is more convenient for coding. For example, it is a non-trivial task to introduce a number of codes into double-stranded DNA molecules. In Example 4 this task is solved by conversion of dsDNA into ssDNA with hybridized random primers; in Example 10 this task is solved by conversion of dsDNA into dsDNA fragments attached to microbeads.
Molecular complexes can be of almost any nature, such as proteins consisting of multiple subunits and nucleic acids associated to cell content (proteins or cell compartments) or cells. For solving of different tasks it might be necessary to analyze the same molecules (for example, genomic DNA), but organized in MM/MC of different nature:
It is known that cancer tumors are very heterogeneous. Molecular coding allows labeling of individual cells. In the subsequent analysis the codes make it possible to identify components (nucleic acids or proteins) that belonged to the same cell. Thus it will be possible to reconstruct the contents of heterogeneous cells. It would be too expensive to determine the whole genomic sequence of each individual cell, but it is a reasonable task to determine the sequence of all oncogenes within the cells. Currently, cell sorters are used to study colocalization of cell surface markers. Colocalization analysis can also be conducted using molecular coding as described herein.
Therefore one preferred embodiment is methods of the present invention applied to the analysis of the composition of individual cells, organelles or cell compartments, wherein said complexes which include nucleic acid molecules are nucleic acids originating from said individual cells, organelles or cell compartments. It is further preferred that the method according to the present invention is applied to the analysis of the genotype of individual cells or cell compartments, wherein the complexes which include nucleic acid molecules are DNA molecules originating from said individual cells or cell compartments trapped within agarose beads.
Another aspect of the present invention are kits suitable for labeling of MM or MC with oligonucleotide markers according to the invention, wherein each particular MM or MC is labeled with identical oligonucleotide markers and preferentially the different MM or MC are labeled with different oligonucleotide markers comprising either set of prepared in advance oligonucleotides for direct labeling of MM or MC or set of oligonucleotides for combinatorial coding of MM or MC by “split-and-mix” method.
The protocol of preparation of coded NGS library based on a random primer whole genome PCR amplification is shown in
To obtain “N” types of binary combinatorial codes, a minimum of “square root of N” types of primers (and separate split-reactions) would be required for each of the two coding steps. That is, if ~10⁶ different binary codes are required (this is the number of 1 Mb dsDNA molecules in 1 ng), two oligonucleotide sets each containing ~10³ types of oligonucleotides would have to be used, which is acceptable for the existing methods of oligonucleotide synthesis.
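The arithmetic in this paragraph can be checked with a short calculation. The sketch assumes an average mass of about 650 g/mol per base pair, a commonly used approximation that is not stated in the text; the function names and the `steps` parameter are hypothetical.

```python
AVOGADRO = 6.022e23
BP_MASS_G_PER_MOL = 650.0  # assumed average mass of one base pair of dsDNA


def molecules_in_sample(mass_ng, length_bp):
    """Number of dsDNA molecules of the given length in mass_ng nanograms."""
    moles = (mass_ng * 1e-9) / (length_bp * BP_MASS_G_PER_MOL)
    return moles * AVOGADRO


def primers_per_step(num_codes, steps=2):
    """Smallest set size k such that k**steps combinations cover num_codes."""
    k = max(1, round(num_codes ** (1.0 / steps)))
    while k ** steps < num_codes:        # correct for rounding down
        k += 1
    while k > 1 and (k - 1) ** steps >= num_codes:  # correct for rounding up
        k -= 1
    return k


n_molecules = molecules_in_sample(mass_ng=1, length_bp=10**6)
set_size = primers_per_step(10**6)
```

With these assumptions 1 ng of 1 Mb dsDNA contains a little under one million molecules, and two sets of 1000 oligonucleotides give the required million binary combinations.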
The structure of the molecules obtained as the result of two primer extensions is shown in
Multiplex PCR is used for the preparation of a sequencing library from a defined set of loci. Mix-and-split combinatorial coding may be introduced into the PCR reaction as in Example 1A. As a result, it would be possible not only to sequence the selected loci but also to determine the cis/trans location of allelic variants that are separated by distances smaller than the length of the template nucleic acid molecules used for the PCR reaction.
Large sets of primers may be used in non-coding multiplex PCR: up to thousands of primer pairs [7]. To perform two-stage binary coding, each such set should be converted into a collection of sets with different codes. If the total number of primers would be too large for direct synthesis, the collection of coded primer sets might be obtained by ligation of a common coding part to locus-specific oligonucleotides (ligation-based oligonucleotide synthesis). The double-stranded primer region that results from ligation-based oligonucleotide synthesis effectively blocks the common parts of the primers, preventing non-specific hybridization.
To demonstrate that identical codes are generated on each MM/MC by the mix-and-split combinatorial coding, we have applied the mix-and-split combinatorial ligation for coding of the ends of double-stranded DNA molecules (
1. shear 1 μg of mouse genomic DNA on a Covaris® ultrasonicator, so that the mean size of fragments is ˜400 bp
2. end repair
3. ligate common adapters
4. 3-stage mix-and-split ligation of coding adaptors (CA):
5. preparation of sequencing library
6. PE-sequencing
7. comparison of codes.
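The logic of steps 4 and 7 can be illustrated with a small simulation. This is a sketch under stated assumptions and not the experimental procedure: the number of coding adapters per stage (`adapters_per_stage`) is a hypothetical parameter, ligation is assumed to be complete at both ends, and each molecule is assigned to a random split reaction at every stage.

```python
import random


def mix_and_split(molecule_ids, adapters_per_stage, stages=3, seed=0):
    """Simulate multi-stage mix-and-split coding of both ends of each molecule."""
    rng = random.Random(seed)
    ends = {(m, side): [] for m in molecule_ids for side in ("left", "right")}
    for _ in range(stages):
        for m in molecule_ids:
            # Split: the molecule lands in one reaction with one adapter type.
            adapter = rng.randrange(adapters_per_stage)
            # Both ends of the same molecule are ligated in the same reaction.
            ends[(m, "left")].append(adapter)
            ends[(m, "right")].append(adapter)
        # Mix: all reactions are pooled before the next split (implicit here).
    return {key: tuple(code) for key, code in ends.items()}


def ends_agree(ends, molecule_ids):
    """Step 7: compare the codes read from the two ends of each molecule."""
    return all(ends[(m, "left")] == ends[(m, "right")] for m in molecule_ids)


molecules = list(range(1000))
ends = mix_and_split(molecules, adapters_per_stage=8)
all_match = ends_agree(ends, molecules)
distinct_codes = len({ends[(m, "left")] for m in molecules})
```

With 8 adapters per stage and 3 stages at most 8³ = 512 codes exist, so in this toy run some of the 1000 molecules necessarily share a code; the two ends of any single molecule nevertheless always agree.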
The experimental scheme is shown in
The structure of coding adapters is shown in
The structure of the resulting PE library molecules is shown in
Since the different non-palindromic cohesive ends of the CAs prevent ligation of adapters at the wrong stages, it is, in principle, possible to proceed from one split stage to the next without removing non-ligated adapters from the previous stage. Two things should be taken into account:
Using the idea of the present invention, a new method of MP library construction may be suggested. Instead of keeping the ends of DNA molecules physically connected, they can be labeled with the same code. The scheme of preparation of a coded MP library is shown in
The traditional method of construction of MP libraries is inefficient for long initial fragments. Coded MP-libraries may be prepared from any initial fragments which are stable in the solution.
Coded terminal fragments may be selected in different ways:
In Examples 2 and 3 combinatorial coding is used to label the ends of DNA fragments. A similar approach may be used for labeling the inner parts of nucleic acid molecules. An example of such a protocol is shown in
In the first step, primers with a random 3′ part and a predetermined 5′ part (designed for attachment of coding adapters) are annealed to the single-stranded nucleic acid molecules.
After “primer extension” and “mix-and-split combinatorial coding” (as in Examples 2 and 3) a molecular complex is obtained, which consists of the original nucleic acid molecule and extended random primers, where the random primers are marked by identical codes. After dissociation, the codes make it possible to determine which fragments belonged to the same molecular complex.
Depending on the particular application, it is possible to choose the order in which the “primer extension” and “mix-and-split coding” operations are performed.
The approach with extended RPs is applicable both to DNA and RNA molecules (first-strand synthesis by reverse transcriptase).
Gap filling (a primer extension followed by ligation) is used if a specific set of loci needs to be analyzed (a version without primer extension, with allele-specific ligation, also exists). For each locus two primers are used, corresponding to the boundaries of the locus (in contrast to PCR, they are complementary to the same strand), see
The original molecule and the annealed primers remain associated in a complex during both the primer extension and the ligation reactions. Coding of the obtained complexes would make it possible to determine the cis/trans location of allelic variants that are separated by distances smaller than the length of the original nucleic acid molecules, and thus to determine haplotypes.
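How shared codes reveal cis/trans relationships can be sketched as follows. The example assumes that variant calling has already produced one allele per locus per code; the function names, locus names and data are hypothetical, and the sketch ignores code collisions and sequencing errors.

```python
from collections import defaultdict


def alleles_by_code(observations):
    """Collect the alleles seen under each code.

    observations: iterable of (code, locus, allele).
    """
    by_code = defaultdict(dict)
    for code, locus, allele in observations:
        by_code[code][locus] = allele
    return by_code


def cis_pairs(observations, locus_a, locus_b):
    """Count allele pairs at two loci that were observed on the same molecule."""
    counts = defaultdict(int)
    for alleles in alleles_by_code(observations).values():
        if locus_a in alleles and locus_b in alleles:
            counts[(alleles[locus_a], alleles[locus_b])] += 1
    return dict(counts)


# Hypothetical observations from three coded molecules.
observations = [
    ("c1", "locus1", "A"), ("c1", "locus2", "G"),
    ("c2", "locus1", "T"), ("c2", "locus2", "C"),
    ("c3", "locus1", "A"), ("c3", "locus2", "G"),
]
pairs = cis_pairs(observations, "locus1", "locus2")
```

In this toy data allele A at the first locus is always found with G at the second, and T with C, which suggests the haplotypes A-G and T-C.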
Codes may be attached to the primers (to one or both) after hybridization (e.g., using ligation-based combinatorial coding). In addition, binary combinatorial codes, analogous to the codes in Example 1, may be prepared by using two sets of coded primers. As in Example 1B, a set of coded primers can be generated by ligation-based oligonucleotide synthesis. The structure of molecules resulting from the binary coding is shown in
For analysis of protein complexes it is necessary to mark protein subunits. This can be done as shown in
Collection of codes attached to the microbeads can be transferred to the nucleic acid molecules without the use of the emulsion (
To demonstrate the possibility of creating coded libraries by adsorbing DNA to microbeads in dilute solution, an experiment was conducted with two types of DNA (from Drosophila and Arabidopsis) and two types of microbeads covered with coded random primers (“code I” and “code II”). Each type of DNA was adsorbed to one type of microbead: Drosophila with “code I” and Arabidopsis with “code II”. The mixtures were then combined and elongation of the random primers was performed. The resulting molecular complexes were used for NGS library preparation, and the obtained clones were sequenced from both ends (PE sequencing). Analysis of the obtained sequences showed that Drosophila DNA was always elongated from “code I” primers and Arabidopsis DNA from “code II” primers. This demonstrates that in the elongation reaction each DNA molecule is associated with only one microbead. If a large collection of coded beads (instead of only two types) is used in the reaction, each DNA molecule would receive a unique code.
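The kind of check described in this experiment can be sketched in a few lines. The sketch assumes that each sequenced clone has already been reduced to a pair of (code read from one end, species assigned by alignment of the other end); the clone list below is hypothetical toy data, not the experimental result.

```python
from collections import Counter

# Expected pairing from the experimental design described above.
EXPECTED_SPECIES = {"code I": "Drosophila", "code II": "Arabidopsis"}


def code_species_table(clones):
    """Tabulate how often each (code, species) combination occurs."""
    return Counter(clones)


def discordant_clones(clones, expected=EXPECTED_SPECIES):
    """Return clones whose species does not match the species expected for the code."""
    return [(code, species) for code, species in clones if expected[code] != species]


# Hypothetical clones for illustration only.
clones = [
    ("code I", "Drosophila"),
    ("code II", "Arabidopsis"),
    ("code I", "Drosophila"),
]
table = code_species_table(clones)
discordant = discordant_clones(clones)
```

An empty list of discordant clones corresponds to the outcome reported in the text, where each species was found with only one code.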
If nucleic acid molecules are adsorbed on a support so that after fragmentation individual parts remain associated with each other, then the coded library can be constructed as shown in
One of the advantages of this approach is that molecules may remain associated with each other until the end of the library construction. Dissociation can be carried out immediately prior to sequencing. This means that, as in the traditional method of NGS library preparation, the library can be prepared in excess.
A coding oligonucleotide does not necessarily have to form a single molecule with the MM/MC; it can be merely associated with the MM/MC. Two examples are shown in
For the analysis of such associates a modified NGS platform is required. It should be able to sequence two different molecules at the same position of the flow cell: the library molecule itself and the code molecule. Such modifications could be, for example:
In
DNA can be adsorbed not only on microbeads (as in Example 8), but also on a microarray (
Microarrays have an additional advantage: the distribution of the coding oligonucleotides on the surface is known in advance. This can be used for DNA mapping. If the adsorbed nucleic acid molecule is stretched along the surface of the microarray, then the codes of the extended random primers change along the molecule in a predictable manner, revealing not only which fragments belong to the same initial macromolecule, but also the location of the fragments relative to each other. Given that a 1 kb DNA region has a length of ~0.3 μm, the mapping resolution may be in the range of several kb to tens of kb.
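The resolution estimate follows from dividing the spacing between array features by the length of DNA per kilobase. The sketch below assumes a fully stretched molecule and uses hypothetical feature spacings and code positions; only the value of ~0.3 μm per kb comes from the text.

```python
KB_LENGTH_UM = 0.3  # from the text: 1 kb of DNA is about 0.3 micrometers long


def resolution_kb(feature_pitch_um):
    """Approximate mapping resolution for a given spacing between array features."""
    return feature_pitch_um / KB_LENGTH_UM


def order_fragments(fragments, code_position_um):
    """Order fragments along the stretched molecule by the known position of their codes.

    fragments: iterable of (fragment_id, code).
    code_position_um: mapping from code to its position along the stretch axis.
    """
    ranked = sorted(fragments, key=lambda frag: code_position_um[frag[1]])
    return [fragment_id for fragment_id, _ in ranked]


# Hypothetical feature spacings of 1 and 5 micrometers.
fine = resolution_kb(1.0)
coarse = resolution_kb(5.0)

# Hypothetical code layout and fragments.
positions = {"cA": 0.0, "cB": 5.0, "cC": 10.0}
order = order_fragments([("f2", "cC"), ("f1", "cA"), ("f3", "cB")], positions)
```

Under these assumptions a spacing of 1 μm corresponds to about 3 kb and a spacing of 5 μm to about 17 kb, consistent with the range given in the text.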
Nucleic acids may be included into agarose beads (
Nucleic acids from individual cells are enclosed in individual agarose beads as shown in
Whole-genome PCR amplification in emulsion makes it possible to spatially isolate the amplification of individual parental DNA fragments.
To perform 5′-coding (
The structure of the synthesized molecules is shown in
For 3′-coding, whole-genome amplification primers are included in the water phase of the water-in-oil emulsion because they have no codes. Special primers with codes may be delivered into droplets in different ways:
Number | Date | Country | Kind |
---|---|---|---|
12199781.1 | Dec 2012 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2013/078174 | 12/31/2013 | WO | 00 |