The present invention relates to a purification process of nascent DNA.
In metazoans, thousands of chromosomal sites are activated at each cell cycle to initiate DNA synthesis and permit total duplication of the genome. They all should be activated only once to avoid any amplification and maintain genome integrity. How these sites are defined remains elusive despite considerable efforts trying to unravel a possible replication origin code.
In Saccharomyces cerevisiae, DNA replication origins are identified by specific DNA elements, called Autonomous Replication Sequence (ARS) elements, which share a common AT-rich 11 bp consensus. However, sequence specificity identifies but does not determine origin selection.
In multicellular organisms, common features of DNA replication origins have been more difficult to identify. No strict consensus sequence element with predictive value has been found, although specific sites are recognized as DNA replication origins in chromosomes of somatic cells.
The identification of the sequence of DNA replication origins offers new perspectives for understanding pathologies involving misregulation of DNA replication, and new perspectives in cell therapy through the use of “humanized” vectors.
International application WO 98/27200 discloses a putative consensus sequence of human and mammalian replication origins. However, the consensus sequences disclosed in WO 98/27200 appear not to be representative of all the replication origins normally used in multicellular eukaryotic cells.
The prior art also discloses methods for purifying nascent DNA for mapping of DNA replication origins in multicellular eukaryotic cells [Prioleau et al. 2003, Molecular and Cellular Biology, 23(10), pages 3536-3549; Cadoret et al. 2008, P.N.A.S., 105(41), pages 15837-15842; Gomez et al. 2008, Genes & Development, 22(3), pages 375-385; Sequeira-Mendes et al. 2009, PLoS Genetics, 5(4)].
Prioleau et al. 2003, Molecular and Cellular Biology, 23(10), pages 3536-3549 disclose methods for purification of nascent strands. The method included sucrose gradients, heat denaturation and exonuclease digestion. The heat denaturation step opens the DNA duplex so that the nascent DNA strands can be separated on a sucrose gradient.
So, there is a need to provide a new consensus sequence representing all the DNA replication origins of a multicellular eukaryotic cell.
There is also a need to provide a new method for determining the DNA replication origins of a multicellular eukaryotic cell.
One aim of the invention is to provide a method for purifying RNA-primed nascent DNA in a large amount and with a very high purity.
One aim of the invention is to provide a method for identifying eukaryotic replication origin.
Another aim of the invention is to provide the sequence of said eukaryotic replication origin.
Another aim is the use of RNA-primed nascent DNA produced by said replication origin for providing a method of diagnosis.
The invention relates to the use of purified RNA-primed nascent DNA for the implementation of a process allowing the mapping and the numbering of the active DNA replication origins of multicellular eukaryotic cells, and the characterization of the sequence of said replication origins,
said process comprising
The initiation of new DNA strands at origins of replication in multicellular eukaryotic cells requires de novo synthesis of RNA primers by the DNA primase activity of DNA polymerase alpha and subsequent elongation from the RNA primers, also by DNA polymerase alpha. These RNA-primed nascent DNAs are thus hybrid molecules consisting of a short RNA molecule fused at its 3′ end to a DNA molecule.
The inventors have demonstrated that a double treatment by lambda exonuclease in exhaustive digestion conditions drastically enhances the purity of isolated nascent DNA.
The RNA-primed nascent DNAs are purified, which means that said nascent DNAs are substantially pure: after one step of lambda exonuclease digestion, contaminant DNA may represent up to 25% of the purified DNA.
According to the invention, at least two exonuclease treatments make it possible to eliminate contaminant DNA (after 2 steps, less than 5% of contaminating DNA is present in the mixture; after 3 steps, less than 2% of contaminating DNA is present in the mixture).
In one advantageous embodiment, the invention relates to the use as defined above, wherein said RNA-primed nascent DNA are produced by the active replication origins.
In one advantageous embodiment, the invention relates to the use of purified RNA-primed nascent DNA for mapping and numbering the active DNA replication origins as defined above, wherein said process is carried out by using multicellular organism totipotent cells.
In one advantageous embodiment, the invention relates to the use of purified RNA-primed nascent DNA for mapping and numbering the active DNA replication origins as defined above, wherein said process is carried out by using totipotent cells, such as ES cells, or multicellular organism differentiated cells.
In one advantageous embodiment, the invention relates to the use of purified RNA-primed nascent DNA for the characterization of the sequence of the active DNA replication origins as defined above, wherein said sequence consists of
wherein c varies from 3 to 20
wherein N7 and N8 represent any nucleotide,
wherein a and e, independently from each other, can be equal to 0, 1, 2 or 3, or vary from about 15 to 30, and
wherein b and d, independently from each other, can be equal to 0, 1, 2 or 3, or vary from about 10 to 300,
N8 being such that if b varies from 10 to 300, (N8)b represents a nucleic acid chain which is such that
N9 being such that if d varies from 10 to 300, (N9)d represents a nucleic acid chain which is such that
In the invention, pyrimidine means T or C, or U for RNA.
In one other embodiment, the invention relates to the use of purified nascent DNA as defined above, wherein said nucleic acid sequence being such that
wherein N1 is a G or an A and N2 is a pyrimidine or an A
wherein N3 is a T or a G base and N4 is a G or a C, and
wherein N5 is different from N6, N5 is a G or a C and N6 is a T or an A, said minimal consensus sequence being repeated from 3 to 20 times without interruption between said repeated minimal consensus sequences.
In one other embodiment, the invention relates to the use of purified RNA-primed nascent DNA as defined above, wherein said nucleic acid sequence consists of the following sequence SEQ ID NO: 4:
In the invention, the following nomenclature in nucleic acid sequence is used:
R represents A or G
Y represents C or T
M represents A or C
K represents G or T
S represents G or C
W represents A or T
B represents G, T or C
D represents G, A or T
H represents A, C or T
V represents G, C or A, and
N represents any nucleotide (A, T, G or C)
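The nomenclature above can be turned into a small lookup table for scanning sequences. The following is a minimal sketch, assuming only the DNA alphabet; the helper names `iupac_to_regex` and `matches` are hypothetical and are not part of the invention.

```python
import re

# Ambiguity codes exactly as listed in the nomenclature above (DNA only).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "ATGC",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate a string of nomenclature symbols into a regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]  # raises KeyError on an unknown symbol
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def matches(pattern: str, sequence: str) -> bool:
    """Return True if the whole sequence is compatible with the pattern."""
    return re.fullmatch(iupac_to_regex(pattern), sequence.upper()) is not None
```

For instance, `matches("KGS", "TGC")` is true because K allows T and S allows C, whereas `matches("KGS", "AGC")` is false.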
In one other embodiment, the invention relates to the use of purified nascent DNA as defined above, wherein said nucleic acid sequence consists of the following sequence SEQ ID NO: 5:
In one other embodiment, the invention relates to the use of purified nascent DNA as defined above, wherein said nucleic acid sequence consists of one of the following sequences
The invention also relates to an isolated nucleic acid sequence representing multicellular DNA replication origins, wherein said nucleic acid sequence consists of one of the following sequences
wherein c varies from 3 to 20
wherein N7 and N8 represent any nucleotide,
wherein a and e, independently from each other, can be equal to 0, 1, 2 or 3, or vary from about 15 to 30, and
wherein b and d, independently from each other, can be equal to 0, 1, 2 or 3, or vary from about 10 to 300,
N8 being such that if b varies from 10 to 300, (N8)b represents a nucleic acid chain which is such that
N9 being such that if d varies from 10 to 300, (N9)d represents a nucleic acid chain which is such that
The invention relates to the use of an isolated nucleic acid sequence as a multicellular DNA replication origin, wherein said nucleic acid sequence consists of
The above sequences that correspond to DNA eukaryotic origins are novel.
The invention relates to the isolated nucleic acid sequence according to claim 10,
wherein said nucleic acid sequence being such that
wherein N1 is a G or an A and N2 is a pyrimidine or an A
wherein N3 is a T or a G base and N4 is a G or a C, and
wherein N5 is different from N6, N5 is a G or a C and N6 is a T or an A, said minimal consensus sequence being repeated from 3 to 20 times without interruption between said repeated minimal consensus sequences.
In one advantageous embodiment, the invention relates to the isolated nucleic acid sequence as defined above, wherein said nucleic acid sequence consists of the following sequence SEQ ID NO: 4:
In one advantageous embodiment, the invention relates to the isolated nucleic acid sequence as defined above, wherein said nucleic acid sequence consists of the following sequence SEQ ID NO: 5:
In one advantageous embodiment, the invention relates to the isolated nucleic acid sequence as defined above, wherein said nucleic acid sequence consists of one of the following sequences
The invention also relates to a recombinant vector comprising at least one isolated nucleic acid sequence as defined above.
The above vector contains at least one origin of replication that replicates in the same way as the endogenous chromosomal DNA replication origins. Therefore, the vector is duplicated as an “endogenous chromosome”. The Inventors have shown that this replication is effective (the above origins are active).
The invention also relates to a method for controlling the replication of a nucleotidic sequence in a pluricellular eukaryotic cell, including mammalian cells, comprising the insertion, into said nucleotidic sequence, of a nucleic acid sequence as defined above.
In one advantageous embodiment, the invention relates to the method as defined above, comprising a step of introducing said nucleotidic sequence into a pluricellular eukaryotic cell.
In one advantageous embodiment, the invention relates to the method as defined above for treating pathologies involving a deregulation of DNA replication, said method comprising the administration to an individual in need thereof of a pharmaceutically effective amount of a nucleic acid sequence as defined above.
In one advantageous embodiment, the invention relates to the use of a nucleic acid sequence as defined above, for the preparation of a drug intended for the treatment of pathologies involving a deregulation of DNA replication.
In one advantageous embodiment, the invention relates to a nucleic acid sequence as defined above, for its use for the treatment of pathologies involving a deregulation of DNA replication.
The invention also relates to a pharmaceutical composition comprising, in particular as active substance, a nucleic acid sequence as defined above, in association with a pharmaceutically acceptable carrier.
The invention also relates to a method for initiating the replication of a deoxyribonucleic acid molecule in a pluricellular eukaryotic cell or in a eukaryotic cell extract, said method comprising a step of inserting, into said deoxyribonucleic acid molecule, at least one nucleic acid molecule representing a multicellular DNA replication origin, the replication origin comprising a sequence of at least nine nucleotides, said sequence consisting of at least three uninterrupted origin repeating elements (OGRE) having the sequence N3GN4,
wherein N3 is T or G and N4 is G or C.
In the invention, “initiating the replication of a deoxyribonucleic acid molecule” means that all steps necessary for replicating a double strand DNA molecule are carried out.
Also, in the invention, the OGRE, repeated at least 3 times constitutes the core of the DNA replication origin of multicellular eukaryotic cells.
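The core defined above (at least three uninterrupted OGRE of the form N3GN4, with N3 being T or G and N4 being G or C) can be expressed as a regular expression. The sketch below is illustrative only; the function name `find_ogre_cores` is hypothetical, and the scan reports non-overlapping, leftmost matches on the given strand only.

```python
import re

# One OGRE of the form N3GN4: N3 is T or G, then G, then N4 is G or C.
OGRE_N3GN4 = "[TG]G[GC]"

# Core of the origin: at least three OGRE with nothing between them,
# i.e. a stretch of at least nine nucleotides.
CORE = re.compile("(?:" + OGRE_N3GN4 + "){3,}")


def find_ogre_cores(sequence: str):
    """Return (start, matched_text) for each non-overlapping core found."""
    seq = sequence.upper()
    return [(m.start(), m.group()) for m in CORE.finditer(seq)]
```

Applied to SEQ ID NO: 32 (GGGGGCGGGGAGGGAAGGGGG), the scan reports one core at position 0, GGGGGCGGG, which reads as the three elements GGG, GGC and GGG.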
Advantageously, the invention relates to the method above-mentioned, wherein the replication origin comprises one of the following sequences:
Advantageously, the invention relates to the method above mentioned, wherein the ratio G/C in the replication origin is greater than 1.
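As an illustration of the G/C criterion, the ratio can be computed by counting bases on one strand. This is a minimal sketch, assuming that "ratio G/C" means the number of G divided by the number of C on the strand as written; the function name `g_to_c_ratio` is hypothetical.

```python
def g_to_c_ratio(sequence: str) -> float:
    """Number of G divided by number of C on the given strand.

    Returns infinity when the strand contains G but no C, and NaN when it
    contains neither, so that the caller can handle these cases explicitly.
    """
    seq = sequence.upper()
    g_count = seq.count("G")
    c_count = seq.count("C")
    if c_count == 0:
        return float("inf") if g_count > 0 else float("nan")
    return g_count / c_count
```

A replication origin would meet the criterion when `g_to_c_ratio(origin) > 1`.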
The inventors have shown that a better efficiency is obtained when the replication origin is able to form a ternary structure that forms a G-quadruplex.
In molecular biology, G-quadruplexes (also known as G-tetrads or G4-DNA) are nucleic acid sequences that are rich in guanine and are capable of forming a four-stranded structure. Four guanine bases can associate through Hoogsteen hydrogen bonding to form a square planar structure called a guanine tetrad, and two or more guanine tetrads can stack on top of each other to form a G-quadruplex. The quadruplex structure is further stabilized by the presence of a cation.
In one advantageous embodiment, the invention relates to the method according to the above definition, wherein the replication origin comprises one of the following sequences:
In another advantageous embodiment, the invention relates to the above-mentioned method, wherein the replication origin comprises one of the following sequences:
GGGGGCGGGGAGGGAAGGGGG (SEQ ID NO: 32), which is the replication origin of the mouse cc4 gene and
GGGGGATGGGGTTGGAATGGGGGCGGG (SEQ ID NO: 33), which is the replication origin of the mouse cc2 gene.
The invention also relates to a method for conferring autonomous replicative properties to a non self-replicating deoxyribonucleic acid molecule in a pluricellular eukaryotic cell or cell extract, said method comprising a step of inserting, into said deoxyribonucleic acid molecule, at least one nucleic acid molecule representing a multicellular DNA replication origin, the replication origin comprising a sequence of at least nine nucleotides, said sequence consisting of at least three uninterrupted origin G-rich Repeated elements (OGRE) having the sequence N3GN4,
wherein N3 is T or G and N4 is G or C.
Advantageously, the invention relates to the method above mentioned, wherein the ratio G/C in the replication origin is greater than 1.
Advantageously, the invention relates to the method above mentioned, wherein the replication origin comprises one of the following sequences:
Advantageously, the invention relates to the method above mentioned, wherein the replication origin comprises one of the following sequences:
wherein the replication origin comprises one of the following sequences:
The invention also relates to a process for preparing a recombinant non-naturally occurring DNA vector comprising as the unique means for replicating DNA at least one multicellular DNA replication origin, said process comprising a step of inserting into a vector at least one nucleic acid molecule representing a multicellular DNA replication origin, the replication origin comprising a sequence of at least nine nucleotides, said sequence consisting of at least three uninterrupted origin repeating elements (OGRE) having the sequence N3GN4,
wherein N3 is T or G and N4 is G or C,
wherein the replication origin is originated from a nucleic acid molecule, the nucleic acid molecule being absent in the vector before its insertion.
By “recombinant non-naturally occurring DNA vector” it is meant in the invention a vector that does not exist without human intervention.
In other words, the vectors encompassed by the invention are artificially constructed by biologists. Vectors such as artificial chromosomes (of mouse for instance) are not encompassed by the invention.
Vectors of the invention are commonly constituted by a backbone from prokaryotic or yeast vectors (such as pBR322 vector, cosmid vector, or yeast artificial chromosomes) in which has been introduced at least one replication origin according to the invention.
The invention also relates to a map referencing all the DNA replication origins of multicellular eukaryotic cells, said map being obtainable by the process as defined above.
The invention also relates to a map referencing all the DNA replication origins of multicellular eukaryotic totipotent cells, said map being obtainable by the process as defined above.
The invention also relates to a map referencing all the DNA replication origins activated in multicellular eukaryotic differentiated cells, said map being obtainable by the process as defined above.
The invention also relates to a method for the diagnosis, preferably in vitro or ex vivo, of pathologies involving a deregulation of DNA replication in an individual, or in a biological sample from an individual, said method comprising the steps:
The invention also relates to a method for the diagnosis, preferably in vitro or ex vivo, of the genetic modification of a cell of an individual, preferably a pluripotent cell, said method comprising the steps:
The invention also relates to a process for purifying nascent DNA, said process comprising
In one advantageous embodiment, the invention relates to the process as defined above, for purifying RNA-primed nascent DNA allowing the localization and the numbering of the active DNA replication origins of multicellular eukaryotic cells, said process comprising the steps:
In one advantageous embodiment, the invention relates to the process as defined above, for purifying nascent DNA allowing the localization and the numbering of all the DNA replication origins of multicellular eukaryotic cells, said process being carried out in totipotent cells as well as on a variety of differentiating cells or cancer cells.
The invention also relates to a method for initiating the replication of a first double stranded deoxyribonucleic acid (DNA) molecule in a pluricellular eukaryotic cell, said first molecule being devoid of self-replication capabilities in a pluricellular eukaryotic cell, said method comprising
Advantageously, the above-mentioned step of identifying replicated molecules is carried out by identifying the nascent DNA synthesized from the IS of the inserted DNA replication origin in said second double stranded DNA molecule, the nascent DNA identifying said IS.
According to the method disclosed above, it is possible to induce the replication of a double stranded DNA molecule (dsDNA) devoid of replicative properties, by inserting a replication origin characterized by the inventors.
The replication origin identified by the inventors is a sequence of about 50 nucleotides to about 800 nucleotides, containing two specific domains involved in the replication of DNA:
According to the process extensively discussed in the invention, it is easy to determine an IS, since this site can be identified by the nascent DNA, i.e. the DNA newly synthesized by the DNA polymerase during the progression of the replication loop.
The inventors have also identified that the IS is located downstream of a regulatory region having a large number of G nucleotides (called “G-rich region”), which is characterized by the fact that it contains at least 3 uninterrupted repeats of a core element (OGRE), each OGRE having one of the following sequences:
Unfortunately, and contrary to lower eukaryotes and bacteria, the replication origins of the pluricellular eukaryotes, including mice and humans, are not defined by a unique strict sequence, but by specific domains having a relative flexibility of sequence to encompass the complex structure of chromosomes and the extreme variability of situations where DNA has to be replicated.
This is why, before the invention, the skilled person was not able to characterize such replication origins.
Moreover, advantageously, the inventors have identified that the RE may be able to form G-quadruplex structures.
The inherent propensity of G to self-associate forming four-stranded helical structures has been known since the early 1960s (1). Subsequently it was demonstrated that the conserved DNA sequence repeats of telomeres form G-quadruplex (or G4) structures in vitro. Since then numerous biochemical and structural analyses have established that nucleic acid sequences, both DNA and RNA, containing runs of guanines (G-tracts) separated by other bases spontaneously fold into G-quadruplex structures in vitro. The building blocks of G-quadruplexes are G-quartets that are formed through a cyclic Hoogsteen hydrogen-bonding arrangement of four guanines with each other. The planar G-quartets stack on top of one another forming four-stranded helical structures. G-quadruplex formation is driven by monovalent cations such as Na+ and K+, and hence physiological buffer conditions favor their formation. G-quadruplex structures are topologically very polymorphic and can arise from the intra- or inter-molecular folding of G-rich strands. Intra-molecular folding requires the presence of four or more G-tracts in one strand, whereas inter-molecular folding can arise from two or four strands giving rise to parallel or antiparallel structures depending on the orientation of the strands in a G-quadruplex.
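The requirement of four or more G-tracts in one strand for intra-molecular folding can be screened with a simple pattern search. The sketch below uses a common heuristic (four runs of at least three G separated by loops of 1 to 7 nucleotides); the run length and loop bounds are assumptions chosen for illustration, the function name `find_g4_motifs` is hypothetical, and this is not the prediction tool used by the inventors. A match only indicates a candidate motif, not demonstrated quadruplex formation.

```python
import re

# Four G-tracts (>= 3 G each) separated by three loops of 1-7 nucleotides.
# Loop and tract bounds are illustrative assumptions, not values from the text.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}?G{3,}){3}")


def find_g4_motifs(sequence: str):
    """Return (start, end) of each non-overlapping candidate G4 motif."""
    seq = sequence.upper()
    return [(m.start(), m.end()) for m in G4_PATTERN.finditer(seq)]


def has_g4_motif(sequence: str) -> bool:
    """True if at least one candidate motif is present on this strand."""
    return bool(find_g4_motifs(sequence))
```

The telomeric-type repeat GGGTTAGGGTTAGGGTTAGGG contains four G-tracts and is reported as a candidate, whereas GGGTTAGGGTTAGGG, with only three tracts, is not.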
In an advantageous embodiment, the invention relates to the method as defined above, wherein the replication origin comprises one of the following sequences:
wherein for instance [AG] represents A or G, and where a square bracket represents only one nucleotide.
Advantageously, the invention relates to the method mentioned above, wherein said RE controls progression of a replication loop initiated in said IS.
The inventors have identified that the RE region may control the activity and progression of the initiation loop, i.e. the loop which occurs when the two strands of a dsDNA open and which allows replication of DNA by a DNA polymerase.
The invention also relates to a process for preparing a recombinant non-naturally occurring double stranded DNA multicellular eukaryotic replicative vector, or replicative vector, comprising at least one multicellular DNA replication origin as the unique means for replicating the vector in a pluricellular eukaryotic cell or cell extract, said process comprising a step of
inserting into a first recombinant non-naturally occurring double stranded DNA vector at least one DNA molecule comprising at least one multicellular DNA replication origin in order to obtain a multicellular eukaryotic replicative vector, said replication origin consisting essentially of a sequence from about 50 to about 800 nucleotides, said replication origin comprising at least a regulatory element (RE), and an initiation site (IS), wherein said IS is located downstream to the RE at about 50 to 800 nucleotides, the RE comprising at least a nine-nucleotide sequence consisting of at least three uninterrupted origin repeating elements (OGRE), each OGRE having one of the following sequences:
N1N2G, wherein N1 is a G or a A and N2 is a pyrimidine or a A;
N3GN4, wherein N3 is T or G and N4 is G or C; and
GN5N6, wherein N5 is different from N6, N5 is a G or a C and N6 is a T or a A;
introducing said replicative vector into a pluricellular eukaryotic cell; and then recovering the vectors resulting from the replication;
wherein said first recombinant non-naturally occurring double stranded DNA vector is devoid of self-replicative capabilities in a pluricellular eukaryotic cell, and
wherein the inserted at least one multicellular DNA replication origin allows said DNA vector to self-replicate in a pluricellular eukaryotic cell or cell extract.
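The three OGRE forms recited above can be combined into one pattern for locating a candidate RE core. This is a minimal sketch, assuming that different OGRE forms may be mixed within one uninterrupted run (the text states only that each OGRE has one of the three sequences); the function name `find_re_core` is hypothetical. Note that for GN5N6 the condition "N5 different from N6" is automatically satisfied, since N5 is G or C and N6 is T or A.

```python
import re

OGRE_FORMS = [
    "[GA][CTA]G",  # N1N2G: N1 is G or A, N2 is a pyrimidine (C, T) or A
    "[TG]G[GC]",   # N3GN4: N3 is T or G, N4 is G or C
    "G[GC][TA]",   # GN5N6: N5 is G or C, N6 is T or A
]

# At least three uninterrupted OGRE, i.e. at least nine nucleotides.
RE_CORE = re.compile("(?:" + "|".join(OGRE_FORMS) + "){3,}")


def find_re_core(sequence: str):
    """Return (start, matched_text) of the leftmost candidate core, or None."""
    match = RE_CORE.search(sequence.upper())
    return None if match is None else (match.start(), match.group())
```

For example, in TTGCGAAGGGCTT the leftmost core starts at position 2 and reads GCG, AAG, GGC, which are N1N2G, N1N2G and N3GN4 elements respectively.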
Advantageously, the invention relates to the method as defined above, wherein the at least one multicellular DNA replication origin comprises at least one of the sequences as set forth in SEQ ID NO: 39 to SEQ ID NO: 81.
Advantageously, the invention relates to the method as defined above, wherein said RE forms a potential G quadruplex structure.
Advantageously, the invention relates to the method as defined above, wherein said RE interacts with a preRC complex.
The invention also relates to a double stranded DNA vector comprising as its unique replicative DNA replication origin a replication origin consisting essentially of a sequence from about 50 to about 800 nucleotides, said replication origin comprising at least a regulatory element (RE), and an initiation site (IS), wherein said IS is located downstream to the RE at about 50 to 800 nucleotides, the RE comprising at least a nine-nucleotide sequence consisting of at least three uninterrupted origin repeating elements (OGRE), each OGRE having one of the following sequences:
N1N2G, wherein N1 is a G or a A and N2 is a pyrimidine or a A;
N3GN4, wherein N3 is T or G and N4 is G or C; and
GN5N6, wherein N5 is different from N6, N5 is a G or a C and N6 is a T or a A;
wherein said vector is devoid of bacterial or unicellular eukaryotic replication origin.
Advantageously, the invention relates to the vector as defined above, wherein the at least one multicellular DNA replication origin comprises at least one of the sequences as set forth in SEQ ID NO: 39 to SEQ ID NO: 81.
Advantageously, the invention relates to the vector as defined above, wherein said RE forms a potential G quadruplex structure.
Advantageously, the invention relates to the vector as defined above, wherein said RE interacts with the preRC complex.
Represented is a qPCR analysis of the replication origin of the c-myc gene.
The DNA content of individual cells is stained and quantified using a flow cytometer. The populations of cells before (2n) and after (4n) DNA replication are indicated. Cells in between 2n and 4n are replicating DNA.
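The gating described above can be sketched as a simple classification on per-cell DNA content. This is an illustration only: the position of the 2n peak must be supplied by the user, the 10% tolerance is an arbitrary assumption, and the function names are hypothetical.

```python
from collections import Counter


def classify_cell(dna_content: float, g1_peak: float, tolerance: float = 0.10) -> str:
    """Assign a cell to '2n', 'S' or '4n' from its DNA content signal.

    g1_peak is the signal of the 2n population; the 4n population is
    assumed to sit at twice that value. tolerance is an assumed gate width.
    """
    g2_peak = 2.0 * g1_peak
    if dna_content <= g1_peak * (1.0 + tolerance):
        return "2n"
    if dna_content >= g2_peak * (1.0 - tolerance):
        return "4n"
    return "S"  # between 2n and 4n: replicating DNA


def summarize(values, g1_peak: float, tolerance: float = 0.10) -> Counter:
    """Count the cells falling in each class."""
    return Counter(classify_cell(v, g1_peak, tolerance) for v in values)
```

With a 2n peak at 100 arbitrary units, a cell at 150 is counted as replicating, while cells at 105 and 190 fall in the 2n and 4n gates respectively.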
Short nascent DNA strands (0.5-2 kb) were isolated by purification and denaturation of the genomic DNA and isolation of nascent strands on sucrose gradients. The nascent strand population was further treated by exhaustive λ-exonuclease digestion, as described in Cayrou et al (2015) and Methods. The background level which might be left after the λ-exonuclease digestion was measured by treating half of the sample containing the nascent DNA strands with RNase A/RNase T2 prior to another λ-exonuclease digestion. Purified nascent strands were then analysed by qPCR or high-throughput whole-genome sequencing.
The potential to form a G4 by each sequence was predicted by G4H and confirmed by a combination of two spectroscopic techniques—thermal difference spectra and circular dichroism. All sequences except their mutated and randomised counterparts exhibit the hallmarks of quadruplex formation.
Association with EzH2 and RBPP5a sites
Precipitation DNA
DNA is resuspended in 2 ml of TEN20 (Tris 10 mM, pH 7.9 final) at 70° C.
The solution was boiled for 10-15 min, then chilled on ice.
Sucrose Gradient NS Purification
Load 1 ml onto a single neutral 5 to 30% sucrose gradient prepared in TEN500 (Tris 10 mM, pH 7.9 final) in a 38.5-ml centrifuge tube.
Gradients were centrifuged in a Beckman SW28 rotor for 20 h at 24 000 rpm at 4° C.
1 ml fractions were withdrawn from the top of the gradient using a wide-bore pipette tip. 50 μl of each fraction was run with appropriate size markers on a 2% alkaline agarose gel, overnight at 4° C. at 40-50 volts.
The gel was neutralized with 1X TBE and stained with GelRed.
Fractions corresponding to 0.5-1 kb, 1-1.5 kb, 1.5-2 kb and 2-3 kb were recovered and precipitated with 2.5 vol of 100% ethanol for 15 min at −80° C.
Pellets were washed with 1 ml of 70% ethanol and resuspended in 20 μl of water with 100 U of RNasin.
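When the gradient run has to be transferred to another rotor, the rotor speed can be converted to relative centrifugal force with the standard relation RCF = 1.118 × 10⁻⁵ × r × rpm², where r is the radius in centimetres. The sketch below is illustrative; the radius is a parameter that must be taken from the rotor documentation, and the value used in the usage note is a placeholder, not a rotor specification.

```python
def relative_centrifugal_force(rpm: float, radius_cm: float) -> float:
    """Relative centrifugal force (in multiples of g).

    Standard relation: RCF = 1.118e-5 * radius (cm) * rpm^2.
    radius_cm must come from the rotor documentation (assumption of the
    caller); it is not stated in the protocol above.
    """
    return 1.118e-5 * radius_cm * rpm ** 2


def rpm_for_rcf(rcf: float, radius_cm: float) -> float:
    """Inverse relation: speed needed to reach a given RCF at a given radius."""
    return (rcf / (1.118e-5 * radius_cm)) ** 0.5
```

For example, with a placeholder radius of 10 cm, `relative_centrifugal_force(24000, 10.0)` gives about 64,400 × g.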
Removal of Contaminant DNA
1—After addition of 2 μl Buffer PNK (New England Biolabs), fractions were boiled for 5 min, chilled on ice,
2—phosphorylation with T4 polynucleotide kinase in a final volume of 100 μl. T4 mix:
The reaction is incubated at 37° C. for 1 h, then for 15 min at 75° C., and directly precipitated with ethanol (2.5 vol)-Na-acetate (0.3 M) for 15 min at −80° C.
3—Pellets were washed with 1 ml of ethanol 70% and resuspended in 50 μl of water with 100 U of RNasin.
4—The remainder is digested with 5 μl of lambda exonuclease in a final volume of 100 μl Lambda exo mix:
Fermentas L-Exo Buffer:
67 mM glycine-KOH (pH 9.4)
2.5 mM MgCl2
50 μg of bovine serum albumin per ml
The reaction is incubated overnight at 37° C.
Aliquots of both the digested DNA and the undigested control were run on a 2% agarose gel.
5—the nascent strands were extracted once with phenol/chloroform/IAA and once with chloroform/IAA, and ethanol (2.5 vol)-Na-acetate (0.3 M) precipitated for 15 min at −80° C.
6—Pellets were washed with 1 ml of ethanol 70% and resuspended in 20 μl of water.
7—The NS is subjected to another round of phosphorylation by T4 PNK and lambda-exo digestion (steps 2 to 5)
8—The final NS resuspended in 50 μl of tris 10 mM is directly quantified with Roche-LC480.
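The ethanol/Na-acetate precipitations in the steps above can be set up with a short calculation. This is a minimal sketch under stated assumptions: a 3 M Na-acetate stock (not specified in the protocol), 0.3 M taken as the concentration in the sample-plus-salt mixture before ethanol is added, and 2.5 volumes of ethanol relative to that mixture. The function name `precipitation_volumes` is hypothetical.

```python
def precipitation_volumes(sample_ul: float,
                          salt_stock_molar: float = 3.0,   # assumed stock
                          salt_final_molar: float = 0.3,   # from the protocol
                          ethanol_vol: float = 2.5):       # from the protocol
    """Return (Na-acetate volume, ethanol volume) in microlitres.

    The salt volume solves C_stock * V_salt = C_final * (V_sample + V_salt).
    Ethanol is counted in volumes of the sample-plus-salt mixture, which is
    an interpretation of '2.5 vol', not a statement from the protocol.
    """
    if salt_stock_molar <= salt_final_molar:
        raise ValueError("stock must be more concentrated than the target")
    salt_ul = sample_ul * salt_final_molar / (salt_stock_molar - salt_final_molar)
    ethanol_ul = ethanol_vol * (sample_ul + salt_ul)
    return salt_ul, ethanol_ul
```

Under these assumptions, a 90 μl sample takes 10 μl of 3 M Na-acetate and 250 μl of ethanol.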
Purification of Nascent Strand with CyScribe GFX kit
Elution in 50 μl
Use 10 μl and amplify with WGA-Sigma kit without the first fragmentation step.
Purify amplicons with nucleospin kit with a 1/5 dilution in NBA buffer prior to binding on the column.
Elution in 50 μl.
LC480 (LightCycler 480) on 0.1 to 0.5 μl of the amplicon.
In metazoans, thousands of chromosomal sites are activated at each cell cycle to initiate DNA synthesis and permit total duplication of the genome. They all should be activated only once to avoid any amplification and maintain genome integrity. How these sites are defined remains elusive despite considerable efforts trying to unravel a possible replication origin code. In Saccharomyces cerevisiae, DNA replication origins are identified by specific DNA elements, called Autonomous Replication Sequence (ARS) elements, which share a common AT-rich 11 bp consensus. However, sequence specificity identifies but does not determine origin selection. Indeed, of the 12,000 ACS sites present in the S. cerevisiae genome, only 400 are functional [Nieduszynski C A, et al. Genes Dev. 2006 Jul. 15; 20(14):1874-9]. In S. pombe, ARS elements were also identified but they do not share a specific consensus sequence as in S. cerevisiae. Here, DNA replication origins are characterized by AT-rich islands [Dai J, et al. Proc Natl Acad Sci USA. 2005 Jan. 11; 102(2):337-42; Heichinger C, et al. EMBO J. 2006 Nov. 1; 25(21):5171-9] and poly-dA/dT tracts.
In multicellular organisms, common features of DNA replication origins have been more difficult to identify. No consensus sequence element with predictive value has been found, although specific sites are recognized as DNA replication origins in chromosomes of somatic cells. It was soon suspected that metazoan ORis might be linked to other genetic features of complex organisms, such as the requirement to coordinate DNA replication not only with cell growth but also with cell differentiation, and correlations with transcription and/or chromatin status have been found [Cayrou C, et al. Chromosome Res. 2010 January; 18(1):137-45]. However, identification of replication origins has been hampered by the lack of a genetic test such as the ARS test in yeast, and by mapping methods that were not always suited to a robust genome-wide analysis. The first recent genome-wide studies mapping origins in mouse and human cells (Cadoret et al., 2008; Sequeira-Mendes et al., 2009) observed a correlation with unmethylated CpG island regions as well as some overlap with promoter regions [Sequeira-Mendes J, et al. PLoS Genet. 2009 April; 5(4):e1000446]. However, it is not clear whether CpG islands are here a specific mark of replication origins or of the associated transcription promoters.
The Inventors tried to reveal new features of eukaryotic origins, first by upgrading the method used to map nascent strand DNA at origins to a specificity and reproducibility compatible with a genome-wide analysis using tiling arrays, then by a further upgrade compatible with next-generation high-throughput sequencing. The Inventors first used this method for four kinds of cell systems: mouse embryonic stem cells (ES), mouse teratocarcinoma cells (P19), mouse differentiated fibroblasts (MEFs), and Drosophila cells (Kc cells). The aim of using mouse cells and Drosophila cells was to detect features possibly conserved in evolution, and the aim of using mouse cells with different cell behaviours was to analyze the contribution of differentiation as opposed to pluripotency.
Genome-Wide Replication Origins Maps
The procedure for preparing RNA-primed nascent DNA was initially improved using P19 cells, which grow in large amounts, and the method is detailed in Supplementary material and
Nascent strand preparations were hybridized on tiling micro-array (Nimblegen, oligonucleotides spaced every 100 bp). The full data set consists of continuous 60.4 Mbp on mouse chromosome 11 and 118.3 Mbp of Drosophila genome. Origins maps show enrichment at specific genomic locations with a high degree of reproducibility (
Replication Origins Distribution
The method used allows scoring of potentially all origins activated during the whole S-phase, since exponentially growing cells were used. If the origins activated vary from one cell to another within the same growing cell population, all the potential replication initiation sites will be scored. In such conditions, the Inventors scored 146700 potential origins per genome, similar for both mouse pluripotent cell types (
Replication origins of Drosophila cells display the same length as those of MEF cells, but at a higher density than in mouse cells (see later).
With regard to genes, mouse replication origins were found to be significantly associated with genes (p<0.001; Fig. 1D). More particularly, origins significantly overlap (p<0.001) promoter and exonic sequences in all murine cell types (Fig. 1E). Drosophila origins were found to be significantly associated with exonic sequences (
Replication Origins are Determined by CpG Island-Like Regions
Given their association with transcriptional units and with promoter regions, the Inventors examined the distribution of replication origins around the transcription start sites (TSS) in mouse cells. Overall, TSS are highly associated with nascent strands signals (
Mammalian promoters, particularly those of highly expressed genes, are CpG-rich, while genes highly regulated during development are often CpG-poor or CpG-free. CpG-rich sequences are known as CpG islands (CGI). To better understand the bimodal distribution, the Inventors analyzed CpG-positive TSS (n=820) and CpG-free TSS (n=434) separately. Notably, nascent strand-specific signals are strongly associated with CGI-positive promoters, while CGI-negative promoters are devoid of such signals in all three mouse cell lines (
CGI are usually defined as regions at least 200 bp in length with 60% GC content and an observed/expected CpG ratio >0.6. Because cytosine methylation is almost absent in Drosophila melanogaster, there is no genome-wide bias toward eliminating CpG dinucleotides during evolution. The Drosophila genome nevertheless contains regions with the same properties as mammalian CGI. The Inventors defined these regions as CGI-like sequences. More than half of the CGI-like regions (54%) in Drosophila cells and more than 70% of these sequences in mouse cell lines are associated with replication origins. These values drop to 32% and 43% for the randomized origins dataset. Moreover, the population of origins that is longer than average is even more associated with this sequence (82% in mice,
The Inventors concluded that sequences related to CGI are determinant for the localization of origins in mice as well as in Drosophila, regardless of their genomic position, e.g. not only in promoter regions, consistent with the presence of CGI-like sequences in exonic regions of the Drosophila genome. These results also suggest a possible novel function for CGI sequences, conserved in both vertebrate and invertebrate species.
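As an illustration only, the CGI criteria stated above (minimum length of 200 bp, 60% GC content, observed/expected CpG ratio above 0.6) can be sketched in Python. The function names are hypothetical, and the expected CpG count is assumed to follow the commonly used formula (number of C × number of G) / length; the text does not state which formula or software the Inventors used.

```python
def cpg_stats(seq):
    """Return (length, GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    # Assumption: expected CpG count if C and G were independent is (C * G) / N.
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return n, gc_fraction, obs_exp


def is_cgi_like(seq, min_len=200, min_gc=0.60, min_obs_exp=0.6):
    """Apply the three thresholds quoted in the text to a candidate region."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n >= min_len and gc_fraction >= min_gc and obs_exp > min_obs_exp
```

A region such as `"CG" * 100` passes all three thresholds, whereas a GC-rich but CpG-depleted region such as `"C" * 100 + "G" * 100` fails on the observed/expected ratio.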
Nevertheless, CpG island-rich sequences do not identify the majority of replication origins (see
The Majority of Metazoan Replication Origins Share a Common Motif
No consensus sequence is known to be associated with metazoan origins. Nevertheless, the Inventors hypothesized that such a sequence could potentially be identified in Drosophila origins because of the compactness of the fly genome. As a first approach, fifteen origin sequences, each 3 kb long, were submitted to the MEME server (http://meme.sdsc.edu/meme4_4_O/intro.html) using default settings. A repetitive G-rich motif was found. When matched against the Drosophila genome, this motif detected a large proportion (>50%) of replication origins. Several rounds of optimization gave rise to a repeated G-rich sequence that contained G every three nucleotides along the repeat (
Hierarchical Organization of Replication Origins in Metazoans
Genome-wide data make it possible to identify sites that can serve as DNA replication origins, but do not give a view of origin usage along individual DNA molecules. Analysis at the single-molecule level can be performed by DNA combing, in which replicating DNA is labeled in vivo with pulses of modified nucleotides, and high-molecular-weight DNA is then stretched at a constant rate onto a slide. This method allows the precise determination of replication speed and inter-origin distances (
Sequential dual nucleotide labeling was performed to determine fork direction and bi-directional origins of replication. The Inventors observed a near two-fold difference in inter-origin distances between mouse cells (139 kb) and Drosophila cells (73 kb) (
If all mapped origins were activated (firing efficiency of 100%) the resulting very short inter-origin distance distribution would be significantly different from the distribution observed by DNA combing (
Density of Replication Origins in Chromosome 11
DNA replication origins are often synchronously activated in clusters. The Inventors examined origin density over areas of 70 kb in mice and 50 kb in Drosophila, using a sliding window moved every 10 bp. First, the Inventors observed that zones of high origin density were at similar positions along chromosome 11 in all three mouse cell lines (Fig. 5A). Then, the Inventors compared these areas with other genomic features, such as the density of genes, promoters and CpG islands. For example, the areas of high origin density coincide well with areas of high CpG island density in MEF cells (Fig. 5A). A similar trend was observed for ES and P19 cells (data not shown). The replication timing of different ESC cells was recently published, and showed a very high conservation of the profile between distantly related pluripotent cells. The Inventors observed a strong correlation between early replicated regions and areas of high origin density in ES and P19 cells (
The inventors thus propose that a replication cluster includes consecutive groups of adjacent flexible Oris, each set constituting a replicon, that are activated synchronously (see
Materials and Methods
Cell Culture
The HEK-293 cell line stably expressing EBNA1 (HEK293 EBNA1+) was cultivated in Dulbecco's modified Eagle's minimal medium containing 10% fetal calf serum and 220 μg/ml Neomycin.
Plasmid Replication Assay.
2 μg of the reporter plasmid containing the various origin variants were transfected into the HEK293 EBNA1+ cell line using Lipofectamine 2000 (Life Technologies) according to the manufacturer's instructions. Comparable transfection efficiencies were verified by visualizing GFP-positive cells. Six days post-transfection, cells were harvested according to the HIRT protocol (Hirt B. (1967) J Mol Biol. 26, 365-9). After washing with PBS, cells were first equilibrated in 5 ml TEN buffer (10 mM Tris-HCl pH 7.5, 1 mM EDTA, 150 mM NaCl). After resuspension in 1.5 ml TEN, an equal volume of 2×HIRT buffer (1.2% SDS, 20 mM Tris-HCl pH 7.5, 20 mM EDTA) was added for cell lysis. The lysate was then incubated at 4° C. for 16 h in the presence of 1.25 M NaCl. After centrifugation for 1 h at 20000×g at 4° C., DNA was purified by phenol-chloroform extraction and digested with 40 U DpnI (NEB) in the presence of RNase (Roche). Digested DNA (300 ng) was electroporated into Electromax DH10B competent cells (Invitrogen), and ampicillin-resistant colonies, representing the number of recovered plasmids, were counted. The wild-type oriP plasmid was always transfected in parallel, and the number of resulting colonies was used for normalization.
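The normalization described above amounts to dividing each colony count by the count obtained for the wild-type oriP plasmid transfected in parallel. The following Python sketch illustrates that arithmetic; the function name, dictionary layout and colony numbers are hypothetical and are not taken from the experiments.

```python
def normalize_to_wild_type(colony_counts, wild_type_key="oriP_wt"):
    """Express each colony count relative to the wild-type oriP control.

    colony_counts maps a plasmid name to its number of ampicillin-resistant
    colonies from the same experiment (hypothetical data layout).
    """
    reference = colony_counts[wild_type_key]
    if reference <= 0:
        raise ValueError("wild-type control produced no colonies")
    return {name: count / reference for name, count in colony_counts.items()}


# Hypothetical counts, for illustration only.
example = normalize_to_wild_type({"oriP_wt": 200, "variant_A": 50})
```

With these illustrative numbers, `variant_A` would be reported at 0.25 of the wild-type replication efficiency.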
The deletion was tested using a specific restriction enzyme, MslI, which recognizes a motif found in close proximity to the G quadruplex.
Cell and Tissue Culture
hESC cells were maintained in an appropriate medium under 5% CO2 at 37° C. CD34(+) hematopoietic progenitor cells were isolated from human cord blood using previously established protocols. These cells were treated with erythropoietin for 3 or 6 days (day3, day6). HMEC cells were generated as previously described (ref). HMEC cells were initially immortalized using a stably transfected shRNA against the p53 gene (ImM-1). Subclones of the ImM-1 cell line were later generated by stably transfecting human RAS (ImM-2) or WNT (ImM-3).
Nascent Strand Isolation
Nascent strands were purified and sequenced as previously described with the following modifications:
SNS-Seq Analysis
Illumina reads (50 bp, single-end) from each SNS-seq replicate were trimmed and aligned to hg38 using bowtie (as previously described, Cayrou et al). Peaks were called using a combination of two peak-calling programs, MACS2 and SICER. Peaks were first called using MACS2 (default parameters plus --bw 500 -p 1e-5 -s 60 -m 10 30 --gsize 2.7e9), followed by peak calling with SICER (parameters: redundancy threshold=1, window size (bp)=200, fragment size=150, effective genome fraction=0.85, gap size (bp)=600, FDR=1e-3). MACS2 peaks that intersected SICER peaks from each sample were merged using bedtools intersect to generate a comprehensive list of all human DNA initiation sites (IS). Blacklisted regions as defined by the ENCODE project (hg38, ENCSR636HFF) were subtracted from the final list of human DNA replication origins, leaving 320,748 regions. Summits of origins were defined by finding the highest number of SNS-seq reads in bins of 50 bp from 25 bp sliding windows. The midpoint of the bin with the highest number of reads was considered the summit of the IS.
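The summit definition above can be sketched as follows, under one possible reading of the text: 50 bp bins are moved along each peak in 25 bp steps, reads are counted per bin, and the midpoint of the fullest bin is reported. The function name, the use of a single coordinate per read, and the tie-breaking rule (first bin wins) are assumptions made for illustration.

```python
def call_summit(peak_start, peak_end, read_positions, bin_size=50, step=25):
    """Return (summit, read_count) for one peak.

    read_positions holds one coordinate per read (assumption: e.g. the read
    5' end). Bins are half-open intervals [start, start + bin_size).
    """
    best_count, best_mid = -1, None
    start = peak_start
    while start < peak_end:
        end = min(start + bin_size, peak_end)
        count = sum(1 for p in read_positions if start <= p < end)
        # Strict '>' keeps the first bin in case of ties (assumption).
        if count > best_count:
            best_count, best_mid = count, (start + end) // 2
        start += step
    return best_mid, best_count
```

This brute-force version recounts reads for every bin, which is adequate for a single peak; a genome-wide run would more likely rely on pre-binned coverage.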
Quantification and Classification of DNA Replication Origins
Quantification of DNA replication origins was performed using the R package DiffBind (TMM_minus_background), using all human/mouse origin coordinates. Following TMM normalization, we calculated the average normalized SNS-seq counts across 19 samples for each origin and assigned each origin to a quantile (Q1-Q10) accordingly. Each quantile consisted of 32,074 origins. Super origins were defined as having >50 normalized SNS-seq counts in 18 or more samples. Tissue-specific origins were determined by selecting origins that had >50 average normalised SNS-seq counts in the tissue of interest, a value that was more than 2 standard deviations away from the average normalised SNS-seq counts in other untransformed cell types.
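A minimal Python sketch of the quantile assignment and super-origin rule described above is given below. It operates on already normalized counts; the DiffBind/TMM step itself is not reproduced. The function names are hypothetical, and the convention that Q1 holds the lowest averages is an assumption, since the text does not state the direction of the ranking.

```python
def assign_quantiles(mean_counts, n_quantiles=10):
    """Rank origins by mean normalized count and label them Q1..Qn.

    Assumption: Q1 contains the origins with the lowest mean counts.
    """
    total = len(mean_counts)
    order = sorted(range(total), key=lambda i: mean_counts[i])
    labels = [None] * total
    for rank, idx in enumerate(order):
        labels[idx] = "Q{}".format(rank * n_quantiles // total + 1)
    return labels


def is_super_origin(sample_counts, threshold=50, min_samples=18):
    """True if the origin exceeds the count threshold in enough samples."""
    return sum(1 for c in sample_counts if c > threshold) >= min_samples
```

For example, an origin with more than 50 normalized counts in 18 of 19 samples is classified as a super origin, while one reaching that level in only 17 samples is not.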
Data Analysis
Heatmaps, boxplots and other plots were generated using ggplot2 and heatmap.2. Both Pearson's and Spearman's correlation matrices were calculated in R using the cor() command. Comparisons between genomic coordinates (quantiles, alternative origin mapping methods, histone/TF/ORC binding sites), as well as the generation of randomized genomic coordinates, were computed using the bedtools suite (intersectBed with a minimum overlap of 1 bp, bedtools shuffle -chrom). Principal component analysis (PCA) was carried out in R. SNS-seq read density plots and heatmaps were generated using deeptools (plotProfile, plotHeatmap). Where necessary, genome coordinates were converted between different genome assemblies using UCSC LiftOver (UCSC Toolkit).
Analysis of Base Composition in Genomic Regions
Base composition analysis was done using HOMER (REF), with a window size of 100 bp and the IS summit as the peak center. The data were then visualized using Microsoft Excel.
Evolutionary Conservation Analysis
RefSeq exons, introns and promoter regions (−500 to 0 bp upstream of transcription start sites) and phastCons scores (phastCons20way) were downloaded from the UCSC browser (last update December 2017). Mean cumulative phastCons scores of each set of regions were calculated using R and the bedtools suite (bedtools coverage). Human origin coordinates were converted to mouse coordinates using LiftOver (UCSC toolkit).
Prediction of DNA Replication Origins in the Human Genome
The human and mouse genomes were divided into paired 500 bp windows (Watson and Crick strands separately) with a sliding window size of 100 bp using the bedtools suite (makewindows). We then calculated the number of each nucleotide (A, C, G, T) in each paired window (bedtools nuc). Paired 500 bp windows were evaluated with permissive criteria (min 0.25% G in the first window, followed by another 500 bp window in which G content drops by 8-40%, with a max A/T content of 0.21) or strict criteria (min 0.28% G in the first window, followed by another 500 bp window in which G content drops by 8-40%; in addition, window pairs were retained if and only if the total G content +/− the center of the window pair was >25%). The window pairs that were retained were then merged using bedtools merge to identify non-overlapping putative origin regions.
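The paired-window screen described above can be sketched in Python for a single strand. This is an illustration under several assumptions and not the Inventors' pipeline: the G thresholds are read as fractions (0.25 for permissive, 0.28 for strict), the 8-40% drop is read as a relative decrease in G content from the first window to the second, and the A/T and center G-content criteria are left out because their exact definitions are not given. All function names are hypothetical.

```python
def g_fraction(seq):
    """Fraction of G in a DNA string (0.0 for an empty string)."""
    return seq.upper().count("G") / len(seq) if seq else 0.0


def candidate_window_pairs(seq, window=500, step=100, min_g=0.25,
                           min_drop=0.08, max_drop=0.40):
    """Return (start, end) of adjacent window pairs passing the G criteria.

    Assumptions: min_g is a fraction, and the drop is relative to the
    G fraction of the first window.
    """
    hits = []
    for start in range(0, len(seq) - 2 * window + 1, step):
        g1 = g_fraction(seq[start:start + window])
        g2 = g_fraction(seq[start + window:start + 2 * window])
        if g1 < min_g:
            continue
        drop = (g1 - g2) / g1
        if min_drop <= drop <= max_drop:
            hits.append((start, start + 2 * window))
    return hits


def merge_intervals(intervals):
    """Merge overlapping or touching intervals, similar in spirit to bedtools merge."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Passing `min_g=0.28` gives the stricter variant of the first criterion; the complementary strand could be screened by running the same function on the reverse complement.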
This application is a continuation-in-part of the patent application U.S. Ser. No. 14/681,351, filed on Apr. 8, 2015, which is a continuation-in-part of the patent application U.S. Ser. No. 13/393,259 filed on Aug. 31, 2010.
Number | Date | Country | |
---|---|---|---|
61238315 | Aug 2009 | US |
 | Number | Date | Country |
---|---|---|---|
Parent | 14681351 | Apr 2015 | US |
Child | 16155421 | | US |
Parent | 13393259 | May 2012 | US |
Child | 14681351 | | US |