SYSTEMS AND METHODS OF GENETIC ANALYSIS

FIELD OF THE INVENTION

This disclosure relates to systems and methods for determining copy number variations, chromosomal abnormalities or micro-deletions in a subject in need thereof.

BACKGROUND OF THE INVENTION

Genetic carrier screening is a type of testing that can identify the risk that individual subjects, typically prospective parents, will have a child with a hereditary disease that can cause death or disability. A person who has one normal gene and one abnormal gene that can cause a genetic disorder is called a carrier. A carrier is not affected with the disorder, but can pass on the abnormal gene to the next generation. For example, genetic carrier screening can determine if a prospective parent is a carrier of a recessive genetic disorder, such as cystic fibrosis, sickle cell disease, thalassemia, Tay-Sachs disease, and spinal muscular atrophy (SMA). If both prospective parents are carriers of a defective gene for a recessive genetic disorder, then they are at risk for having children with that genetic disorder. If neither parent is a carrier, then they can rule out such risk. Therefore, genetic carrier screening is very informative to prospective parents.

Spinal muscular atrophy (SMA) is one of the most common inherited causes of infant death. It affects a person's ability to control their muscles, including those involved in breathing, eating, crawling and walking. SMA has different levels of severity, none of which affects intelligence. However, the most common form of the disorder causes death by age two. About one in every 6,000 to one in every 10,000 babies born in the U.S. has SMA.

SMA is a recessive genetic disorder. It is caused by mutations in the SMN (Survival Motor Neuron) genes, SMN1 and SMN2, that are located on chromosome 5. The SMN gene is composed of 9 exons, with a stop codon near the end of exon 7. Two almost identical SMN genes are present on chromosome 5q13: the telomeric or SMN1 gene, which is the SMA-determining gene, and the centromeric or SMN2 gene. The gene sequences of SMN1 and SMN2 differ by only 5 base pairs, and the coding sequence differs by a single nucleotide (840C>T). This single nucleotide difference does not alter an amino acid, but it does affect splicing and causes about 90% of transcripts from SMN2 to lack exon 7. Consequently, in contrast to the SMN1 gene, which produces a full-length SMN protein, the SMN2 gene produces predominantly a shortened, unstable and rapidly degraded isoform.

Individuals having SMA typically have inherited a mutant SMN1 gene from each of their parents. The majority of mutations responsible for SMA are either deletions or gene conversions. A deletion involves partial or complete removal of the SMN1 gene. In a gene conversion, the SMN1 gene is converted into an SMN2-like gene because the “C” in exon 7 is mutated to a “T”. In both cases, SMA patients are missing SMN1 exon 7 and make insufficient amounts of full-length SMN protein. Therefore, SMA carrier testing can determine whether each parent is a carrier or not based on the copy numbers of the SMN1 and SMN2 genes in the parent.

Current methods for genetic carrier screening, such as SMA carrier testing, are time-consuming or expensive, or require extensive bioinformatics analysis. In addition, current methods for detecting exonic deletions or duplications are also time-consuming or expensive, or require extensive bioinformatics analysis.

Pharmacogenomics testing (also referred to as drug-gene testing) refers to the study of how a subject's genes affect the body's response to medications. Pharmacogenomic tests look for changes or variants in one or more genes that may determine whether a medication could be an effective treatment for an individual or whether an individual could have side effects to a specific medication.

Therefore, there is a need for developing cost-effective and efficient tests that have high sensitivities and specificities.

SUMMARY OF THE INVENTION

Some embodiments of the disclosure are:

1. A method of detecting copy number variation in a subject comprising:

a) obtaining a nucleic acid sample isolated from the subject;

b) capturing one or more target sequences in the nucleic acid sample obtained in step a) by using one or more target populations of targeting molecular inversion probes (MIPs) to produce a plurality of targeting MIPs replicons for each target sequence,

wherein each of the targeting MIPs in each of the target populations comprises in sequence the following components:

first targeting polynucleotide arm—first unique targeting molecular tag—polynucleotide linker—second unique targeting molecular tag—second targeting polynucleotide arm;

wherein the pair of first and second targeting polynucleotide arms in each of the targeting MIPs in each target population are identical, and are substantially complementary to first and second regions in the nucleic acid that, respectively, flank the target sequence that is targeted by the one or more targeting MIPs;

wherein the first and second unique targeting molecular tags in each of the targeting MIPs in each target population are distinct in each of the targeting MIPs, in each member of the target population, and in each of the target populations;

c) capturing a plurality of control sequences in the nucleic acid sample obtained in step a) by using a plurality of control populations of control MIPs to produce a plurality of control MIPs replicons, each control population of control MIPs being capable of amplifying a distinct control sequence in the nucleic acid sample obtained in step a),

wherein each of the control MIPs in each control population comprises in sequence the following components:

first control polynucleotide arm—first unique control molecular tag—polynucleotide linker—second unique control molecular tag—second control polynucleotide arm;

wherein the pair of first and second control polynucleotide arms in each of the control MIPs in each control population are identical, and are substantially complementary to first and second regions in the nucleic acid that, respectively, flank each control sequence;

wherein the first and second unique control molecular tags in each of the control MIPs in each control population are distinct in each of the control MIPs and in each member of the control population, and are different from the unique targeting molecular tags;

d) sequencing the targeting and control MIPs amplicons that are amplified from the targeting and control MIPs replicons obtained in steps b) and c);

e) determining, for each target population, the number of the unique targeting molecular tags present in the targeting MIPs amplicons sequenced in step d);

f) determining, for each control population, the number of the unique control molecular tags present in the control MIPs amplicons sequenced in step d);

g) computing a target probe capture metric, for each of the one or more target sequences, based at least in part on the number of the unique targeting molecular tags determined in step e) and a plurality of control probe capture metrics based at least in part on the numbers of the unique control molecular tags determined in step f);

h) identifying a subset of the control populations of control MIPs that have control probe capture metrics satisfying at least one criterion;

i) normalizing each of the one or more target probe capture metrics by a factor computed from the subset of control probe capture metrics satisfying the at least one criterion, to obtain a test normalized target probe capture metric for each of the one or more target sequences;

j) comparing each test normalized target probe capture metric obtained in step i) to a plurality of reference normalized target probe capture metrics that are computed based on reference nucleic acid samples obtained from reference subjects exhibiting known genotypes using the same target and control sequences, target population, one subset of control populations in steps b)-g) and i); and

k) determining, based on the comparing in step j) and the known genotypes of reference subjects, the copy number variation of each of the one or more target sequences of interest.
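
By way of non-limiting illustration, the counting of unique molecular tags in steps e) and f) of embodiment 1 might be sketched in Python as follows. The sketch assumes that each sequenced amplicon has already been assigned to a probe population and reduced to a (population, first tag, second tag) triple; that input format and the function name count_unique_tags are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict


def count_unique_tags(reads):
    """Count distinct (first tag, second tag) pairs per probe population.

    `reads` is an iterable of (population, first_tag, second_tag) triples.
    Amplicons carrying the same tag pair are treated as PCR copies of one
    capture event and are counted once.
    """
    seen = defaultdict(set)
    for population, first_tag, second_tag in reads:
        seen[population].add((first_tag, second_tag))
    return {population: len(tags) for population, tags in seen.items()}
```

Counting tag pairs rather than reads is what allows the later capture metrics to reflect capture events instead of amplification depth.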

2. The method of embodiment 1, wherein the nucleic acid sample is DNA or RNA.

3. The method of embodiment 1 or 2, wherein the nucleic acid sample is genomic DNA.

4. The method of any one of embodiments 1-3, wherein the subject is a carrier screening candidate for one or more diseases or conditions.

5. The method of any one of embodiments 1-3, wherein the subject is a candidate for:

a) a pharmacogenomics test;

b) a targeted tumor test;

c) an exonic deletion test; or

d) an exonic duplication test.

6. The method of any one of embodiments 1-5, wherein the length of each of the targeting polynucleotide arms is between 18 and 35 base pairs.

7. The method of any one of embodiments 1-5, wherein the length of each of the control polynucleotide arms is between 18 and 35 base pairs.

8. The method of any one of embodiments 1-7, wherein each of the targeting polynucleotide arms has a melting temperature between 57° C. and 63° C.

9. The method of any one of embodiments 1-7, wherein each of the control polynucleotide arms has a melting temperature between 57° C. and 63° C.

10. The method of any one of embodiments 1-9, wherein each of the targeting polynucleotide arms has a GC content between 30% and 70%.

11. The method of any one of embodiments 1-9, wherein each of the control polynucleotide arms has a GC content between 30% and 70%.
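
As a non-limiting sketch, the arm length and GC content ranges recited in embodiments 6, 7, 10 and 11 might be checked in Python as shown below. The function names are hypothetical, the ranges are assumed to be inclusive, and melting temperature is deliberately not checked because no calculation method for it is specified here.

```python
def gc_fraction(seq):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper().replace(" ", "")
    if not seq:
        raise ValueError("sequence is empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def arm_meets_length_and_gc(seq, min_len=18, max_len=35,
                            min_gc=0.30, max_gc=0.70):
    """Check an arm against the recited length and GC content ranges.

    Range endpoints are assumed to be inclusive; melting temperature is
    not evaluated.
    """
    bases = seq.upper().replace(" ", "")
    if not (min_len <= len(bases) <= max_len):
        return False
    return min_gc <= gc_fraction(bases) <= max_gc
```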

12. The method of any one of embodiments 1-11, wherein the length of each of the unique targeting molecular tags is between 12 and 20 base pairs.

13. The method of any one of embodiments 1-11, wherein the length of each of the unique control molecular tags is between 12 and 20 base pairs.

14. The method of any one of embodiments 1-13, wherein each of the unique targeting or control molecular tags is not substantially complementary to any genomic region of the subject.

15. The method of any one of embodiments 1-13, wherein the polynucleotide linker is not substantially complementary to any genomic region of the subject.

16. The method of any one of embodiments 1-15, wherein the polynucleotide linker has a length of between 30 and 40 base pairs.

17. The method of any one of embodiments 1-15, wherein the polynucleotide linker has a melting temperature of between 60° C. and 80° C.

18. The method of any one of embodiments 1-15, wherein the polynucleotide linker has a GC content between 30% and 70%.

19. The method of any one of embodiments 1-15, wherein the polynucleotide linker comprises 5′-CTTCAGCTTCCCGATATCCGACGGTAGTGT-3′ (SEQ ID NO: 1).

20. The method of any one of embodiments 1-19, wherein the plurality of target populations of targeting MIPs and the plurality of control populations of control MIPs are in a probe mixture.

21. The method of embodiment 20, wherein the probe mixture has a concentration between 1-100 pM; 10-100 pM; 50-100 pM; or 10-50 pM.

22. The method of any one of embodiments 1-21, wherein each of the targeting MIPs replicons is a single-stranded circular nucleic acid molecule.

23. The method of embodiment 22, wherein each of the targeting MIPs replicons provided in step b) is produced by:

i) the first and second targeting polynucleotide arms, respectively, hybridizing to the first and second regions in the nucleic acid that, respectively, flank the target sequence; and

ii) after the hybridization, using a ligation/extension mixture to extend and ligate the gap region between the two targeting polynucleotide arms to form single-stranded circular nucleic acid molecules.

24. The method of any one of embodiments 1-23, wherein each of the control MIPs replicons is a single-stranded circular nucleic acid molecule.

25. The method of embodiment 24, wherein each of the control MIPs replicons provided in step c) is produced by:

i) the first and second control polynucleotide arms, respectively, hybridizing to the first and second regions in the nucleic acid that, respectively, flank the control sequence; and

ii) after the hybridization, using a ligation/extension mixture to extend and ligate the gap region between the two control polynucleotide arms to form single-stranded circular nucleic acid molecules.

26. The method of any one of embodiments 1-25, wherein the sequencing step of d) comprises a next-generation sequencing method.

27. The method of embodiment 26, wherein the next-generation sequencing method comprises a massively parallel sequencing method, or a massively parallel short-read sequencing method.

28. The method of any one of embodiments 1-27, wherein the method comprises, before the sequencing step of d), a PCR reaction to amplify the targeting and control MIPs replicons to produce the targeting and control MIPs amplicons for sequencing.

29. The method of embodiment 28, wherein the PCR reaction is an indexing PCR reaction.

30. The method of embodiment 29, wherein the indexing PCR reaction introduces the following components: a pair of indexing primers, a unique sample barcode and a pair of sequencing adaptors, into each of the targeting or control MIPs replicons to produce barcoded targeting or control MIPs amplicons.

31. The method of embodiment 30, wherein the barcoded targeting MIPs amplicons comprise in sequence the following components:

a first sequencing adaptor—a first sequencing primer—the first unique targeting molecular tag—the first targeting polynucleotide arm—captured target nucleic acid—the second targeting polynucleotide arm—the second unique targeting molecular tag—a unique sample barcode—a second sequencing primer—a second sequencing adaptor; or

wherein the barcoded control MIPs amplicons comprise in sequence the following components:

a first sequencing adaptor—a first sequencing primer—the first unique control molecular tag—the first control polynucleotide arm—captured control nucleic acid—the second control polynucleotide arm—the second unique control molecular tag—a unique sample barcode—a second sequencing primer—a second sequencing adaptor.

32. The method of any one of embodiments 1-31, wherein at least one of the one or more target sequences and at least one of the control sequences are on the same chromosome.

33. The method of any one of embodiments 1-31, wherein at least one of the one or more target sequences and at least one of the control sequences are on different chromosomes.

34. The method of any one of embodiments 1-33, wherein the target sequence is SMN1/SMN2.

35. The method of embodiment 34, wherein the first targeting polynucleotide primer for the target sequence of SMN1/SMN2 comprises the sequence of 5′-AGG AGT AAG TCT GCC AGC ATT-3′ (SEQ ID NO: 2).

36. The method of embodiment 34 or 35, wherein the second targeting polynucleotide primer for the target sequence of SMN1/SMN2 comprises the sequence of 5′-AAA TGT CTT GTG AAA CAA AAT GCT-3′ (SEQ ID NO: 3).

37. The method of any one of embodiments 34-36, wherein the polynucleotide linker comprises 5′-CTT CAG CTT CCC GAT ATC CGA CGG TAG TGT-3′ (SEQ ID NO: 1).

38. The method of any one of embodiments 34-37, wherein the MIP for the target sequence of SMN1/SMN2 comprises the sequence of 5′-AGG AGT AAG TCT GCC AGC ATT NNN NNN NNN NCT TCA GCT TCC CGA TTA CGG GTA CGA TCC GAC GGT AGT GTN NNN NNN NNN AAA TGT CTT GTG AAA CAA AAT GCT-3′ (SEQ ID NO: 4).

39. The method of any one of embodiments 1-38, wherein the control sequences comprise one or more genes or sequences selected from the group consisting of CFTR, HEXA, HFE, HBB, BLM, IDS, IDUA, LCAS, LPL, MEFV, GBA, MPL, PEX6, PCCB, ATM, NBN, FANCC, F8, CBS, CPT1, CPT2, FKTN, G6PD, GALC, ABCC8, ASPA, MCOLN1, SMPD1, CLRN1, NEB, G6PC, TMEM216, BCKDHA, BCKDHB, DLD, IKBKAP, PCDH15, TTN, GAMT, KCNJ11, IL2RG, and GLA.

40. A method of detecting copy number variation in a subject comprising:

a) isolating a genomic DNA sample from the subject;

b) adding the genomic DNA sample into each well of a multi-well plate, wherein each well of the multi-well plate comprises a probe mixture, wherein the probe mixture comprises a plurality of target populations of targeting molecular inversion probes (MIPs), a plurality of control populations of control MIPs and buffer;

wherein each targeting population of targeting MIPs is capable of amplifying a distinct target sequence in the genomic DNA sample obtained in step a),

wherein each of the targeting MIPs in each target population comprises in sequence the following components:

first targeting polynucleotide arm—first unique targeting molecular tag—polynucleotide linker—second unique targeting molecular tag—second targeting polynucleotide arm;

wherein the pair of first and second targeting polynucleotide arms in each of the targeting MIPs in each target population are identical, and are substantially complementary to first and second regions in the genomic DNA that, respectively, flank each target sequence;

wherein the first and second unique targeting molecular tags in each of the targeting MIPs in each target population are distinct in each of the targeting MIPs and in each member of the target population;

wherein each control population of control MIPs is capable of amplifying a distinct control sequence in the genomic DNA sample obtained in step a),

wherein each of the control MIPs in each control population comprises in sequence the following components:

first control polynucleotide arm—first unique control molecular tag—polynucleotide linker—second unique control molecular tag—second control polynucleotide arm;

wherein the pair of first and second control polynucleotide arms in each of the control MIPs in each control population are identical, and are substantially complementary to first and second regions in the genomic DNA that, respectively, flank each control sequence;

c) incubating the genomic DNA sample with the probe mixture for the targeting MIPs to capture the target sequence and for the control MIPs to capture the control sequences;

d) adding an extension/ligation mixture to the sample of c) for the targeting MIPs and the captured target sequence to form the targeting MIPs replicons and for the control MIPs and the captured control sequences to form the control MIPs replicons, wherein the extension/ligation mixture comprises a polymerase, a plurality of dNTPs, a ligase, and buffer;

e) adding an exonuclease mixture to the targeting and control MIPs replicons to remove excess probes or excess genomic DNA;

f) adding an indexing PCR mixture to the sample of e) to add a pair of indexing primers, a unique sample barcode and a pair of sequencing adaptors to the targeting and control MIPs replicons to produce the targeting and control MIPs amplicons;

g) using a massively parallel sequencing method to determine, for each target population, the number of the unique targeting molecular tags present in the barcoded targeting MIPs amplicons provided in step f);

h) using a massively parallel sequencing method to determine, for each control population, the number of the unique control molecular tags present in the barcoded control MIPs amplicons provided in step f);

i) computing a target probe capture metric for each target sequence based at least in part on the number of the unique targeting molecular tags determined in step g) and a plurality of control probe capture metrics based at least in part on the numbers of the unique control molecular tags determined in step h);

j) identifying a subset of the control populations of control MIPs that have control probe capture metrics satisfying at least one criterion;

k) normalizing each target probe capture metric by a factor computed from the subset of control probe capture metrics satisfying the at least one criterion, to obtain a test normalized target probe capture metric for each target sequence;

l) comparing each test normalized target probe capture metric to a plurality of reference normalized target probe capture metrics that are computed based on reference genomic DNA samples obtained from reference subjects exhibiting known genotypes using the same target and control sequences, target population, one subset of control populations in steps b)-h); and

m) determining, based on the comparing in step l) and the known genotypes of reference subjects, the copy number variation for each target sequence.

41. A nucleic acid molecule comprising the sequence of:

(SEQ ID NO: 4)

5′-AGG AGT AAG TCT GCC AGC ATT NNN NNN NNN NCT

TCA GCT TCC CGA TTA CGG GTA CGA TCC GAC GGT AGT

GTN NNN NNN NNN AAA TGT CTT GTG AAA CAA AAT

GCT-3′.

42. The nucleic acid molecule of embodiment 41, wherein the nucleic acid is 5′ phosphorylated.
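
For illustration only, the layout of SEQ ID NO: 4 (arm, tag, linker, tag, arm) might be assembled in Python as sketched below. The component boundaries are read directly from SEQ ID NO: 4 as printed, with each run of ten N positions treated as one molecular tag. Filling the N positions by uniform random draws, and the names random_tag and assemble_mip, are assumptions of this sketch; the 5′ phosphate of embodiment 42 is not modeled.

```python
import random

# Fixed components read from SEQ ID NO: 4 as printed (spaces removed).
ARM_1 = "AGGAGTAAGTCTGCCAGCATT"
LINKER = "CTTCAGCTTCCCGATTACGGGTACGATCCGACGGTAGTGT"
ARM_2 = "AAATGTCTTGTGAAACAAAATGCT"
TAG_LENGTH = 10  # each run of N in SEQ ID NO: 4 is ten positions long


def random_tag(rng, length=TAG_LENGTH):
    """Draw a tag uniformly at random over A, C, G, T (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def assemble_mip(rng):
    """Return one probe: arm 1, tag 1, linker, tag 2, arm 2."""
    return ARM_1 + random_tag(rng) + LINKER + random_tag(rng) + ARM_2


if __name__ == "__main__":
    # A seeded generator keeps the illustration reproducible.
    print(assemble_mip(random.Random(0)))
```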

43. A method for producing a genotype cluster, the method comprising:

a) receiving sequencing data obtained from a plurality of nucleic acid samples from a plurality of subsets of a plurality of subjects, each sample in the plurality of samples being obtained from a different subject, and each subset being characterized by subjects exhibiting a same known genotype for a gene of interest, wherein the sequencing data for the nucleic acid sample from each subject in the plurality of subsets is obtained by:

- i) obtaining a nucleic acid sample isolated from the subject;
- ii) capturing one or more target sequences of interest in the nucleic acid sample obtained in step a.i) by using one or more target populations of targeting molecular inversion probes (MIPs) to produce targeting MIPs replicons for each target sequence,
- wherein each of the targeting MIPs in each of the target populations comprises in sequence the following components:
- first targeting polynucleotide arm—first unique targeting molecular tag—polynucleotide linker—second unique targeting molecular tag—second targeting polynucleotide arm;
- wherein the pair of first and second targeting polynucleotide arms in each of the targeting MIPs in each target population are identical, and are substantially complementary to first and second regions in the nucleic acid that, respectively, flank the target sequence of interest that is targeted by the one or more targeting MIPs;
- wherein the first and second unique targeting molecular tags in each of the targeting MIPs in each target population are distinct in each of the targeting MIPs and in each member of the target population;
- iii) capturing a plurality of control sequences in the nucleic acid sample obtained in step a) by using a plurality of control populations of control MIPs to produce a plurality of control MIPs replicons, each control population of control MIPs being capable of amplifying a distinct control sequence in the nucleic acid sample obtained in step a),
- wherein each of the control MIPs in each control population comprises in sequence the following components:
- first control polynucleotide arm—first unique control molecular tag—polynucleotide linker—second unique control molecular tag—second control polynucleotide arm;
- wherein the pair of first and second control polynucleotide arms in each of the control MIPs in each control population are identical, and are substantially complementary to first and second regions in the nucleic acid that, respectively, flank each control sequence;
- wherein the first and second unique control molecular tags in each of the control MIPs in each control population are distinct in each of the control MIPs and in each member of the control population, and are different from the unique targeting molecular tags;
- iv) sequencing the targeting and control MIPs amplicons that are amplified from the targeting and control MIPs replicons obtained in steps a.ii) and a.iii);

b) for each respective sample obtained from a subset in the plurality of subsets:

- i) determining, for each target population, the number of the unique targeting molecular tags present in the targeting MIPs amplicons sequenced in step a.iv);
- ii) determining, for each control population, the number of the unique control molecular tags present in the control MIPs amplicons sequenced in step a.iv);
- iii) computing a target probe capture metric, for each target sequence, based at least in part on the number of the unique targeting molecular tags determined in step b.i) and a plurality of control probe capture metrics based at least in part on the numbers of the unique control molecular tags determined in step b.ii);
- iv) identifying a subset of the control populations of control MIPs that have control probe capture metrics satisfying at least one criterion;
- v) normalizing each target probe capture metric by a factor computed from the control probe capture metrics satisfying the at least one criterion, to obtain a normalized target probe capture metric for each of the one or more target sites; and

c) grouping, across the samples obtained from each subset of subjects, the normalized target probe capture metrics to obtain the genotype cluster for the known genotype.
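
As a minimal, non-limiting sketch, the grouping of step c) of embodiment 43 might be expressed in Python as follows. The sketch assumes that each reference sample has already been reduced to a (known genotype, normalized target probe capture metric) pair for a single target sequence; that representation and the name build_genotype_clusters are hypothetical.

```python
from collections import defaultdict


def build_genotype_clusters(samples):
    """Group normalized target probe capture metrics by known genotype.

    `samples` is an iterable of (genotype, normalized_metric) pairs, one
    per reference sample. The result maps each genotype to the list of
    metrics observed for it, i.e. one cluster per known genotype.
    """
    clusters = defaultdict(list)
    for genotype, normalized_metric in samples:
        clusters[genotype].append(normalized_metric)
    return dict(clusters)
```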

44. The method of embodiment 43, wherein computing the target probe capture metric at step b.iii) comprises normalizing the number of the unique targeting molecular tags determined in step b.i) by a sum of the number of the unique targeting molecular tags and the numbers of the unique control molecular tags.

45. The method of embodiment 43, wherein computing the plurality of control probe capture metrics at step b.iii) comprises normalizing, for each control population, the number of unique control molecular tags determined in step b.ii) by a sum of the number of the unique targeting molecular tags and the numbers of the unique control molecular tags.
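
The normalizations of embodiments 44 and 45 might be sketched in Python as shown below, for the case of a single target population. Each unique tag count is divided by the sum of the target count and all control counts, so that the resulting metrics sum to one. The function name capture_metrics and the dictionary input are hypothetical choices made for this illustration.

```python
def capture_metrics(target_count, control_counts):
    """Compute target and control probe capture metrics for one sample.

    `target_count` is the number of unique targeting molecular tags and
    `control_counts` maps each control population to its number of unique
    control molecular tags. Every count is divided by the same total.
    """
    total = target_count + sum(control_counts.values())
    if total == 0:
        raise ValueError("no unique molecular tags were counted")
    target_metric = target_count / total
    control_metrics = {name: count / total
                       for name, count in control_counts.items()}
    return target_metric, control_metrics
```

Dividing by a shared total expresses each population's capture relative to all other populations in the same sample, which is consistent with the relative interpretation given in embodiments 46 and 47.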

46. The method of any of embodiments 43-45, wherein the target probe capture metric for the target population is indicative of the target population's ability to hybridize to the target sequence of interest, relative to the abilities of the plurality of control populations to hybridize to the distinct control sequences.

47. The method of any of embodiments 43-46, wherein each control probe capture metric for a respective control population is indicative of the respective control population's ability to hybridize to one of the control sequences, relative to the abilities of 1) the target population to hybridize to the target sequence and 2) remaining control populations to hybridize to respective control sequences.

48. The method of any of embodiments 43-47, wherein the target sequence of interest is located on the gene of interest, and the control sequences correspond to one or more reference genes that are different from the gene of interest.

49. The method of any of embodiments 43-48, wherein the gene of interest is a survival of motor neuron 1 (SMN1) gene and/or a survival of motor neuron 2 (SMN2) gene.

50. The method of any of embodiments 43-48, wherein the gene of interest is a BRCA1 gene.

51. The method of any of embodiments 43-48, wherein the gene of interest is a DMD gene.

52. The method of any of embodiments 43-51, wherein the at least one criterion includes a requirement that the control probe capture metric is above a first threshold and below a second threshold.

53. The method of embodiment 52, further comprising determining the first threshold and the second threshold based at least in part on the target probe capture metric computed at step b.iii).

54. The method of embodiment 53, wherein the first threshold and the second threshold are determined further based at least in part on the plurality of control probe capture metrics computed at step b.iii).

55. The method of any of embodiments 43-54, further comprising, for each control population, computing a variability coefficient for the control probe capture metrics computed at step b.iii) across the samples obtained from each subset in the plurality of subsets.

56. The method of embodiment 55, wherein the at least one criterion includes a requirement that the variability coefficient is below a threshold.

57. The method of any of embodiments 43-56, wherein the factor computed at step b.v) is an average of the control probe capture metrics satisfying the at least one criterion.
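
One possible, non-limiting reading of embodiments 52, 55, 56 and 57 is sketched below in Python. A control population is retained when its capture metric in the sample lies strictly between two thresholds and its variability across samples is below a third threshold; the normalizing factor is then the average of the retained control metrics. The variability coefficient is assumed here to be the population standard deviation divided by the mean, which is not stated in the embodiments, and all function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def variability_coefficient(values):
    """Standard deviation divided by mean (assumed definition)."""
    center = mean(values)
    return pstdev(values) / center if center else float("inf")


def select_controls(sample_metrics, cohort_metrics, low, high, max_cv):
    """Return the control populations that satisfy both criteria.

    `sample_metrics` maps control name to its metric in this sample;
    `cohort_metrics` maps control name to its metrics across samples.
    """
    selected = []
    for name, value in sample_metrics.items():
        in_range = low < value < high
        stable = variability_coefficient(cohort_metrics[name]) < max_cv
        if in_range and stable:
            selected.append(name)
    return selected


def normalize_target(target_metric, sample_metrics, selected):
    """Divide the target metric by the mean of the selected controls."""
    if not selected:
        raise ValueError("no control population satisfied the criteria")
    factor = mean(sample_metrics[name] for name in selected)
    return target_metric / factor
```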

58. The method of any of embodiments 43-57, wherein a first subset is characterized by subjects exhibiting a known copy count of a survival of motor neuron 1 (SMN1) gene, and a second subset is characterized by subjects exhibiting a known copy count of a survival motor neuron 2 (SMN2) gene.

59. The method of any of embodiments 43-58, wherein the known genotype corresponds to a known copy count of a survival of motor neuron 1 (SMN1) gene or of a survival of motor neuron 2 (SMN2) gene.

60. The method of any of embodiments 43-57, wherein a first subset is characterized by subjects exhibiting a known copy count of exon 11 on a BRCA1 gene.

61. The method of any of embodiments 43-57 and 60, wherein the known genotype corresponds to a known copy count of exon 11 on a BRCA1 gene.

62. The method of any of embodiments 43-57, wherein a first subset is characterized by subjects exhibiting a known copy count of a DMD gene.

63. The method of any of embodiments 43-57 and 62, wherein the known genotype corresponds to a known copy count of a DMD gene.

64. The method of any of embodiments 43-63, wherein the first and second unique targeting molecular tags and the first and second unique control molecular tags are generated randomly for each MIP in the targeting population of targeting MIPs and in the control populations of control MIPs.

65. A system configured to perform the method of any of embodiments 43-64.

66. A computer program product comprising computer-readable instructions that, when executed in a computerized system comprising at least one processor, cause the processor to carry out one or more steps of the method of any of embodiments 43-64.

67. A method of selecting a genotype for a test subject, the method comprising:

a) receiving sequencing data obtained from a nucleic acid sample from the test subject, wherein the sequencing data for the nucleic acid sample is obtained by:

- i) obtaining a nucleic acid sample isolated from the test subject;
- ii) capturing one or more target sequences of interest in the nucleic acid sample obtained in step a) by using one or more target populations of targeting molecular inversion probes (MIPs) to produce a plurality of targeting MIPs replicons for each target sequence,
- wherein each of the targeting MIPs in the target population comprises in sequence the following components:
- first targeting polynucleotide arm—first unique targeting molecular tag—polynucleotide linker—second unique targeting molecular tag—second targeting polynucleotide arm;
- wherein the pair of first and second targeting polynucleotide arms in each of the targeting MIPs in each target population are identical, and are substantially complementary to first and second regions in the nucleic acid that, respectively, flank the target sequence of interest that is targeted by the one or more targeting MIPs;
- wherein the first and second unique targeting molecular tags in each of the targeting MIPs in each target population are distinct in each of the targeting MIPs and in each member of the target population;
- iii) capturing a plurality of control sequences in the nucleic acid sample obtained in step a) by using a plurality of control populations of control MIPs to produce a plurality of control MIPs replicons, each control population of control MIPs being capable of amplifying a distinct control sequence in the nucleic acid sample obtained in step a),
- wherein each of the control MIPs in each control population comprises in sequence the following components:
- first control polynucleotide arm—first unique control molecular tag—polynucleotide linker—second unique control molecular tag—second control polynucleotide arm;
- wherein the pair of first and second control polynucleotide arms in each of the control MIPs in each control population are identical, and are substantially complementary to first and second regions in the nucleic acid that, respectively, flank each control sequence;
- wherein the first and second unique control molecular tags in each of the control MIPs in each control population are distinct in each of the control MIPs and in each member of the control population, and are different from the unique targeting molecular tags;
- iv) sequencing the targeting and control MIPs amplicons that are amplified from the targeting and control MIPs replicons obtained in steps a.ii) and a.iii);

b) determining, for each target population, the number of the unique targeting molecular tags present in the targeting MIPs amplicons sequenced in step a.iv);

c) determining, for each control population, the number of the unique control molecular tags present in the control MIPs amplicons sequenced in step a.iv);

d) computing a target probe capture metric, for each target site, based at least in part on the number of the unique targeting molecular tags determined in step b) and a plurality of control probe capture metrics based at least in part on the numbers of the unique control molecular tags determined in step c);

e) identifying a subset of the control populations of control MIPs that have control probe capture metrics satisfying at least one criterion;

f) normalizing each of the one or more target probe capture metrics by a factor computed from the control probe capture metrics satisfying the at least one criterion, to obtain a normalized target probe capture metric for each of the one or more target sequences;

g) receiving a group of values corresponding to normalized target probe capture metrics computed from nucleic acid samples from a first plurality of reference subjects exhibiting a same known genotype for a gene of interest;

h) comparing each of the one or more normalized target probe capture metrics obtained in step f) to the group of values received in step g); and

i) determining, based on the comparing in step h), whether the test subject exhibits the same known genotype for the gene of interest in each of the one or more target sequences.

68. The method of embodiment 67, wherein the group of values is a first group of values, the same known genotype is a first copy number of the target sequence of interest, the method further comprising:

j) receiving a second group of values corresponding to normalized target probe capture metrics computed from nucleic acid samples from a second plurality of reference subjects exhibiting a second copy number of the target sequence of interest; and

k) comparing the normalized target probe capture metric obtained in step f) to the second group of values, wherein the determining in step i) comprises selecting between the first copy number and the second copy number for the test subject.

69. The method of embodiment 68, wherein:

the comparing in step h) comprises computing a first distance metric between the normalized probe capture metric obtained in step f) and the first group of values;

the comparing in step k) comprises computing a second distance metric between the normalized probe capture metric obtained in step f) and the second group of values; and

the selecting between the first copy number and second copy number comprises selecting the first copy number if the first distance metric is less than the second distance metric, and selecting the second copy number if the first distance metric exceeds the second distance metric.
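The selection rule of embodiment 69 can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and the distance metric is assumed here to be the absolute difference between the test metric and the mean of each reference group, since the embodiment does not fix a particular distance metric.

```python
from statistics import mean


def distance_to_group(metric, group_values):
    """Assumed distance metric: absolute difference from the group mean."""
    return abs(metric - mean(group_values))


def select_copy_number(metric, first_group, second_group, first_cn, second_cn):
    """Pick the copy number whose reference group lies closer to the test metric."""
    d1 = distance_to_group(metric, first_group)
    d2 = distance_to_group(metric, second_group)
    if d1 < d2:
        return first_cn
    if d1 > d2:
        return second_cn
    # Equal distances are not addressed by the embodiment; report no call.
    return None
```

Other distance metrics (for example, one scaled by the spread of each group) would fit the same structure by replacing `distance_to_group`.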

70. The method of embodiment 69, wherein the first group of values and the second group of values are computed by:

repeating steps a)-f) for each subject in the first and second pluralities of reference subjects;

grouping the normalized target probe capture metrics for the first plurality of reference subjects to obtain the first group of values; and

grouping the normalized target probe capture metrics for the second plurality of reference subjects to obtain the second group of values.
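The grouping step of embodiment 70 amounts to collecting the normalized metrics of reference subjects by their known copy number. The sketch below assumes each reference result is available as a hypothetical `(known_copy_number, normalized_metric)` pair; that input layout is an illustration choice, not something the disclosure specifies.

```python
from collections import defaultdict


def group_reference_metrics(reference_results):
    """Group normalized metrics by the known copy number of each reference subject.

    reference_results: iterable of (known_copy_number, normalized_metric) pairs.
    Returns a dict mapping copy number -> list of metrics (one group of values each).
    """
    groups = defaultdict(list)
    for copy_number, metric in reference_results:
        groups[copy_number].append(metric)
    return dict(groups)
```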

71. The method of any of embodiments 67-70, wherein the computing the target probe capture metric at step d) comprises normalizing the number of the unique targeting molecular tags determined in step b) by a sum of the number of the unique targeting molecular tags and the numbers of the unique control molecular tags.

72. The method of any of embodiments 67-71, wherein computing the plurality of control probe capture metrics at step d) comprises normalizing, for each control population, the number of the unique control molecular tags determined in step c) by a sum of the number of the unique targeting molecular tags and the numbers of the unique control molecular tags.
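The normalization described in embodiments 71 and 72 divides each unique-tag count by the total of the target and control unique-tag counts. A minimal sketch follows, for a single target population; the function name and the dictionary keyed by control population name are hypothetical.

```python
def probe_capture_metrics(target_tag_count, control_tag_counts):
    """Compute capture metrics as fractions of all unique tags observed.

    target_tag_count: number of unique targeting molecular tags (step b).
    control_tag_counts: dict of control population name -> unique tag count (step c).
    Returns (target_metric, {control name: control_metric}).
    """
    total = target_tag_count + sum(control_tag_counts.values())
    if total == 0:
        raise ValueError("no unique molecular tags were observed")
    target_metric = target_tag_count / total
    control_metrics = {
        name: count / total for name, count in control_tag_counts.items()
    }
    return target_metric, control_metrics
```

With 200 target tags and control counts of 300 and 500, the total is 1,000, so the target metric is 0.2 and the control metrics are 0.3 and 0.5.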

73. The method of any of embodiments 67-72, wherein the target probe capture metric for the target population is indicative of the target population's ability to hybridize to the target sequence of interest, relative to the abilities of the plurality of control populations to hybridize to the control sequences.

74. The method of any of embodiments 67-73, wherein the target sequence of interest is on the gene of interest, and the control sequences correspond to one or more reference genes that are different from the gene of interest.

75. The method of any of embodiments 67-74, wherein the gene of interest is a survival of motor neuron 1 (SMN1) gene and/or a survival of motor neuron 2 (SMN2) gene.

76. The method of any of embodiments 67-74, wherein the gene of interest is a BRCA1 gene.

77. The method of any of embodiments 67-74, wherein the gene of interest is a DMD gene.

78. The method of any of embodiments 67-77, wherein the at least one criterion includes a requirement that the control probe capture metrics are above a first threshold and below a second threshold.

79. The method of embodiment 78, further comprising determining the first threshold and the second threshold based at least in part on the target probe capture metric computed at step d).

80. The method of embodiment 79, wherein the first threshold and the second threshold are determined further based at least in part on the plurality of control probe capture metrics computed at step d).

81. The method of any of embodiments 67-80, further comprising, for each control population, computing a variability coefficient for the control probe capture metrics computed at step d).

82. The method of embodiment 81, wherein the at least one criterion includes a requirement that the variability coefficient is below a threshold.

83. The method of any of embodiments 67-82, wherein the factor computed at step f) is an average of the control probe capture metrics satisfying the at least one criterion.
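Embodiments 78 through 83 describe selecting control populations by threshold and variability criteria and then normalizing by the average of the selected control metrics. The sketch below is one possible reading under stated assumptions: the thresholds are passed in rather than derived from the metrics, the variability coefficient is taken to be the population standard deviation divided by the mean over a set of repeated observations, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Assumed variability coefficient: population standard deviation / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean of zero; coefficient of variation is undefined")
    return pstdev(values) / m


def normalize_target_metric(target_metric, control_metrics, low, high):
    """Normalize a target metric by the average of the in-range control metrics.

    control_metrics: dict of control population name -> control capture metric.
    low, high: first and second thresholds (exclusive bounds).
    Returns (normalized_target_metric, sorted names of the controls that were kept).
    """
    kept = {name: m for name, m in control_metrics.items() if low < m < high}
    if not kept:
        raise ValueError("no control population satisfies the criterion")
    factor = mean(list(kept.values()))
    return target_metric / factor, sorted(kept)
```

A variability criterion as in embodiment 82 could be layered on by dropping any control population whose `coefficient_of_variation` exceeds a chosen threshold before calling `normalize_target_metric`.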

84. The method of any of embodiments 67-83, wherein the target sequence of interest is on a survival of motor neuron 1 (SMN1) gene and/or a survival of motor neuron 2 (SMN2) gene.

85. The method of embodiment 84, wherein the same known genotype corresponds to a known copy count of an SMN1 gene or an SMN2 gene.

86. The method of any of embodiments 67-83, wherein the target sequence of interest is on exon 11 of a BRCA1 gene.

87. The method of embodiment 86, wherein the same known genotype corresponds to a known copy count of exon 11 of the BRCA1 gene.

88. The method of any of embodiments 67-83, wherein the target sequence of interest is on a DMD gene.

89. The method of embodiment 88, wherein the same known genotype corresponds to a known copy count of the DMD gene.

90. A system configured to perform the method of any of embodiments 67-89.

91. A computer program product comprising computer-readable instructions that, when executed in a computerized system comprising at least one processor, cause the processor to carry out one or more steps of the method of any of embodiments 67-89.

92. The method of any one of embodiments 1-40, 43-64, and 67-89, wherein the subject or the test subject is a candidate for carrier screening of one or more diseases or conditions.

93. The method of any one of embodiments 1-40, 43-64, and 67-89, wherein the subject or the test subject is a candidate for:

a) a pharmacogenomics test;

b) a targeted tumor test;

c) an exonic deletion test; or

d) an exonic duplication test.

94. The method of any one of embodiments 1-40, 43-64, 67-89, 92, and 93, wherein the method is for detecting a) a single nucleotide polymorphism; or b) an exonic deletion; or c) an exonic duplication.

95. The method of any one of embodiments 1-40, 43-64, 67-89, and 92-94, wherein the one or more target sequences are one or more deleted exons in a gene of interest.

96. The method of any one of embodiments 1-40, 43-64, 67-89, and 92-94, wherein the one or more target sequences are one or more duplicated exons in a gene of interest.

97. The method of embodiment 95 or 96, wherein the gene of interest is a BRCA1 or a BRCA2 gene.

98. The method of embodiment 95 or 96, wherein the gene of interest is a DMD gene.

99. The method of embodiment 97, wherein the targeting MIP comprises the sequence of

(SEQ ID NO: 9)

5′-GTCTGAATCAAATGCCAAAGTNNNNNNNNNNCTTCAGCTTCCCGATT

ACGGGTACGATCCGACGGTAGTGTNNNNNNNNNNTCCCCTGTGTGAGA

GAAAAGA-3′.

100. The method of embodiment 98, wherein the targeting MIPs are selected from Table 3.

101. A nucleic acid molecule comprising the sequences selected from Table 3.

102. A nucleic acid molecule comprising the sequence of

(SEQ ID NO: 9)

5′-GTCTGAATCAAATGCCAAAGTNNNNNNNNNNCTTCAGCTTCCCGATT

ACGGGTACGATCCGACGGTAGTGTNNNNNNNNNNTCCCCTGTGTGAGA

GAAAAGA-3′.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the sequence of a molecular inversion probe (MIP) used in some embodiments of the methods of the disclosure (e.g., for a specific target site or sequence in SMN1/SMN2). The MIP comprises in sequence the following components: a first targeting polynucleotide arm, a first unique targeting molecular tag, a polynucleotide linker, a second unique targeting molecular tag, and a second targeting polynucleotide arm. The first and second targeting polynucleotide arms in each MIP are substantially complementary to first and second regions in the nucleic acid that, respectively, flank a site or sequence of interest (a target site or sequence or control site or sequence). The unique molecular tags are random polynucleotide sequences. In some embodiments, e.g., when the targeting polynucleotide arms hybridize to the first and second regions in the nucleic acid that, respectively, flank a site of interest, “substantially complementary” refers to 0 mismatches in both arms, or at most 1 mismatch in only one arm. In other embodiments, “substantially complementary” refers to at most a small number of mismatches in both arms, such as 1, 2, 3, 4, 5, or any other suitable number.

FIG. 2 is a representative process flow diagram for determining a copy number variant according to some embodiments of the disclosure.

FIG. 3 is a block diagram of a computing device for performing any of the processes described herein.

FIG. 4 is a representative process flow diagram for determining a copy count number for a test subject, according to an illustrative embodiment.

FIG. 5 is a representative process flow diagram for forming a genotype cluster, according to an illustrative embodiment.

FIG. 6 is a plot of six illustrative genotype clusters that are used for comparison to a test metric evaluated from a test subject, according to an illustrative embodiment.

FIG. 7 is a representative process flow diagram for handling the sample and practicing some embodiments of the disclosure.

FIG. 8 is a diagram of a MIP and DNA captured between two targeting polynucleotide arms of the MIP, according to an illustrative embodiment.

FIG. 9 is a diagram of an example MIP and captured DNA, according to an illustrative embodiment.

FIG. 10 is a boxplot of results of an assay for estimating a copy number of the BRCA1 exon 11, according to an illustrative embodiment.

FIGS. 11-14 are plots of averaged probe capture metrics vs. 79 exons in the DMD gene that exhibit duplication or deletion, according to an illustrative embodiment.

DETAILED DESCRIPTION OF THE INVENTION

This disclosure provides systems and methods for determining, inter alia, copy number variations, chromosomal abnormalities or micro-deletions in a subject in need thereof. In some embodiments, the subject is a candidate for a disease or condition carrier screening. In some embodiments, the subject is a candidate for pharmacogenomics testing. In some embodiments, the subject is a candidate for targeted tumor testing (e.g., targeted tumor sequencing or targeted tumor analysis). In some embodiments, the subject is a candidate for pediatric diagnostic testing, such as for Duchenne's muscular dystrophy.

Embodiments of the disclosure relate to systems and methods that enable accurate and robust copy counting at any particular targeted site or sequence of interest, or targeted gene of interest, or targeted sequence of interest, in a genome using circular capture probes (e.g., molecular inversion probes) and short read sequencing technology. The systems and methods of embodiments of this disclosure allow one to get an accurate representation of how many copies of any targeted site or sequence of interest, or targeted gene of interest, or targeted sequence of interest, exist in the genome. The systems and methods of embodiments of this disclosure are useful for determining the copy count of targeted site or sequence of interest, or targeted gene of interest, or targeted sequence of interest in the context of carrier screening for a variety of diseases (e.g., spinal muscular atrophy) or risk factors.

The systems and methods of embodiments of this disclosure are also useful in other genomic applications where copy count variations or copy number variations are important variables, such as determining exonic deletions, exonic duplications, pharmacogenomics testing, or targeted tumor testing (e.g., sequencing).

The systems and methods of embodiments described herein are useful for examining or determining exonic deletions or duplications in disease-causing genes. For example, the systems and methods of embodiments of this disclosure can be used to determine exonic deletions in BRCA1 and BRCA2, where large exonic deletions account for a significant percentage of all causative variants. The systems and methods of embodiments of this disclosure can also be used to determine or examine exonic deletions or duplications in the DMD gene associated with Duchenne and Becker muscular dystrophy.

The systems and methods of embodiments of this disclosure are also applicable to pharmacogenomic testing. For example, the systems and methods of embodiments of this disclosure may be used to determine the copy count of the p450 enzyme CYP2D6, where approximately 5% of the population has a duplication of this gene, causing them to more rapidly metabolize certain drugs such as codeine.

The systems and methods of embodiments of this disclosure are also applicable to targeted tumor testing. For example, the systems and methods of embodiments of this disclosure may be used to determine the duplication of certain genes that are known to be important for tumor progression, such as MYC, MYCN, RET, EGFR, etc.

The systems and methods of embodiments of this disclosure offer a simple and cost-effective approach for determining copy count in the context of a sequencing assay. Many variants of interest can be jointly and accurately assessed for copy count and sequence variation in a single assay. The systems and methods of embodiments of this disclosure allow for sequencing information to be combined with copy number variation information at a single site or sequence, which results in a simpler and more cost-effective workflow. The systems and methods of embodiments of this disclosure use unique identifiers on each probe (e.g., unique molecular tags) to determine, inter alia, a maximum likelihood estimate (k), which allows one to estimate probe capture efficiency, thereby increasing accuracy and reducing the need for extraneous sequencing. The systems and methods of embodiments of this disclosure use circular capture probes, which allow for the combination of multiple additional probes in a single, multiplexed assay with minimal interference or cross-assay reactions. Combining the information from several probes and their unique reads greatly reduces errors in the system and improves efficiency.

In some embodiments, the systems and methods of embodiments of this disclosure count the number of unique molecular tags and use such counting to estimate a probe capture efficiency and further to determine the copy count of a gene or site or sequence of interest. Counting the number of unique molecular tags provides a more accurate picture of the relative abundance of each sequence in the original nucleic acid sample when compared to counting sequencing reads.
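The difference between counting reads and counting unique molecular tags can be illustrated with a short sketch. It assumes, for illustration only, that each sequencing read has already been assigned to a probe and that its two tag sequences have been extracted, so that a read is represented as a hypothetical `(probe_id, tag_1, tag_2)` tuple. Reads that share a probe and tag pair are treated as amplification copies of a single capture event.

```python
from collections import defaultdict


def count_unique_tags(reads):
    """Count distinct molecular tag pairs per probe.

    reads: iterable of (probe_id, tag_1, tag_2) tuples, one per sequencing read.
    Returns a dict mapping probe_id -> number of distinct (tag_1, tag_2) pairs.
    """
    tags_by_probe = defaultdict(set)
    for probe_id, tag_1, tag_2 in reads:
        # Duplicate reads of the same capture event collapse into one set entry.
        tags_by_probe[probe_id].add((tag_1, tag_2))
    return {probe_id: len(tags) for probe_id, tags in tags_by_probe.items()}
```

For example, three reads of a probe carrying the same tag pair contribute a count of one, whereas a read-based count would report three.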

In order that the disclosure herein described may be fully understood, the following detailed description is set forth.

Unless otherwise defined herein, scientific and technical terms used in this application shall have the meanings that are commonly understood by those of ordinary skill in the art to which this disclosure belongs. Generally, nomenclature used in connection with, and techniques of, cell and tissue culture, molecular biology, cell biology, cancer biology, neurobiology, neurochemistry, virology, immunology, microbiology, genetics, protein and nucleic acid chemistry, chemistry, and pharmacology described herein, are those well known and commonly used in the art. Each embodiment described herein may be taken alone or in combination with one or more other embodiments of the disclosure.

The methods and techniques of various embodiments of the present disclosure are generally performed, unless otherwise indicated, according to methods well known in the art and as described in various general and more specific references that are cited and discussed throughout this specification. See, e.g. Motulsky, “Intuitive Biostatistics”, Oxford University Press, Inc. (1995); Lodish et al., “Molecular Cell Biology, 4th ed.”, W. H. Freeman & Co., New York (2000); Griffiths et al., “Introduction to Genetic Analysis, 7th ed.”, W. H. Freeman & Co., N.Y. (1999); Gilbert et al., “Developmental Biology, 6th ed.”, Sinauer Associates, Inc., Sunderland, Mass. (2000).

Chemistry terms used herein are used according to conventional usage in the art, as exemplified by “The McGraw-Hill Dictionary of Chemical Terms”, Parker S., Ed., McGraw-Hill, San Francisco, Calif. (1985).

All of the above, and any other publications, patents and published patent applications referred to in this application are specifically incorporated by reference herein. In case of conflict, the present specification, including its specific definitions, will control.

Throughout this specification, the word “comprise” or variations such as “comprises” or “comprising” will be understood to imply the inclusion of a stated integer (or components) or group of integers (or components), but not the exclusion of any other integer (or components) or group of integers (or components).

The singular forms “a,” “an,” and “the” include the plurals unless the context clearly dictates otherwise.

The term “including” is used to mean “including but not limited to”. “Including” and “including but not limited to” are used interchangeably.

In order to further define the disclosure, the following terms and definitions are provided herein.

Definitions

The term “copy number variation,” “CNV,” “a copy number variant,” or “a gene copy number variant,” as used herein, refers to variation in the number of copies of a nucleic acid sequence present in a test sample (e.g., a nucleic acid sample isolated from, or derived from, or obtained from a carrier screening candidate) in comparison with the copy number of the nucleic acid sequence present in a reference sample (e.g., a nucleic acid sample isolated from, or derived from, or obtained from a reference subject exhibiting known genotypes). In some embodiments, the nucleic acid sequence is 1 kb or larger. In some embodiments, the nucleic acid sequence is a whole chromosome or significant portion thereof. In some embodiments, copy number differences are identified by comparison of a sequence of interest in a test sample with an expected level of the sequence of interest. For example, the level of the sequence of interest in the test sample is compared to that present in a reference sample. In some embodiments, copy number variation refers to a form of structural variation of the DNA of a genome that results in a cell having an abnormal or, for certain genes, a normal variation in the number of copies of one or more sections of the DNA.

In some embodiments, copy number variations (“CNVs”) refer to relatively large regions of the genome that have been deleted (fewer than the normal number) or duplicated (more than the normal number) on certain chromosomes. For example, the chromosome that normally has sections in order as A-B-C-D-E might instead have sections A-B-C-C-D-E (a duplication of “C”) or A-B-D-E (a deletion of “C”). This variation accounts for roughly 12% of human genomic DNA and each variation may range from about 500 base pairs (500 nucleotide bases) to several megabases in size (e.g., between 5,000 and 5 million bases). In some embodiments, copy number variations refer to relatively small regions of the genome that have been deleted (e.g., micro-deletions) or duplicated on certain chromosomes. In some embodiments, copy number variations refer to genetic variants due to presence of single-nucleotide polymorphisms (SNPs), which affect only one single nucleotide base. In some embodiments, copy number variants/variations include deletions, including micro-deletions, insertions, including micro-insertions, duplications, multiplications, inversions, translocations and complex multi-site variants. In some embodiments, copy number variants/variations encompass chromosomal aneuploidies and partial aneuploidies.

In some embodiments a copy number variation is a fetal copy number variation. Often, a fetal copy number variation is a copy number variation in the genome of a fetus. In some embodiments a copy number variation is a maternal and/or fetal copy number variation. In certain embodiments a maternal and/or fetal copy number variation is a copy number variation within the genome of a pregnant female (e.g., a female subject bearing a fetus), a female subject that gave birth or a female capable of bearing a fetus.

A copy number variation can be a heterozygous copy number variation where the variation (e.g., a duplication or deletion) is present on one allele of a genome. A copy number variation can be a homozygous copy number variation where the variation is present on both alleles of a genome. In some embodiments a copy number variation is a heterozygous or homozygous fetal copy number variation. In some embodiments a copy number variation is a heterozygous or homozygous maternal and/or fetal copy number variation. A copy number variation sometimes is present in a maternal genome and a fetal genome, a maternal genome and not a fetal genome, or a fetal genome and not a maternal genome.

The term “aneuploidy,” as used herein, refers to a chromosomal abnormality characterized by an abnormal variation in chromosome number, e.g., a number of chromosomes that is not an exact multiple of the haploid number of chromosomes. For example, a euploid individual will have a number of chromosomes equaling 2n, where n is the number of chromosomes in the haploid individual. In humans, the haploid number is 23. Thus, a diploid individual will have 46 chromosomes. An aneuploid individual may contain an extra copy of a chromosome (trisomy of that chromosome) or lack a copy of the chromosome (monosomy of that chromosome). The abnormal variation is with respect to each individual chromosome. Thus, an individual with both a trisomy and a monosomy is aneuploid despite having 46 chromosomes. Examples of aneuploidy diseases or conditions include, but are not limited to, Down syndrome (trisomy of chromosome 21), Edwards syndrome (trisomy of chromosome 18), Patau syndrome (trisomy of chromosome 13), Turner syndrome (monosomy of the X chromosome in a female), and Klinefelter syndrome (an extra copy of the X chromosome in a male). Other, non-aneuploid chromosomal abnormalities include translocation (wherein a segment of a chromosome has been transferred to another chromosome) and deletion (wherein a piece of a chromosome has been lost), and other types of chromosomal damage.

The terms “subject” and “patient”, as used herein, refer to any animal, such as a dog, a cat, a bird, livestock, and particularly a mammal, and preferably a human. The terms “reference subject” and “reference patients” refer to any subject or patient that exhibits known genotypes (e.g., known copy number of a site of interest, or a gene of interest, or a sequence of interest). The terms “test subject”, “test patients”, “candidate”, “candidate subject”, “targeted subject” and “targeted individual” refer to any subject or patient or individual that exhibits known genotypes (e.g., known copy number of a site of interest, or a gene of interest, or a sequence of interest).

The terms “polynucleotide”, “nucleic acid” and “nucleic acid molecules”, as used herein, are used interchangeably and refer to DNA molecules (e.g., cDNA or genomic DNA), RNA molecules (e.g., mRNA), DNA-RNA hybrids, and analogs of the DNA or RNA generated using nucleotide analogs. The nucleic acid molecule can be a nucleotide, oligonucleotide, double-stranded DNA, single-stranded DNA, multi-stranded DNA, complementary DNA, genomic DNA, non-coding DNA, messenger RNA (mRNAs), microRNA (miRNAs), small nucleolar RNA (snoRNAs), ribosomal RNA (rRNA), transfer RNA (tRNA), small interfering RNA (siRNA), heterogeneous nuclear RNAs (hnRNA), or small hairpin RNA (shRNA).

The term “sample”, as used herein, refers to a sample typically derived from a biological fluid, cell, tissue, organ, or organism, comprising a nucleic acid or a mixture of nucleic acids comprising at least one nucleic acid sequence that is to be screened for copy number variation (including aneuploidy or micro-deletions). In some embodiments the sample comprises at least one nucleic acid sequence whose copy number is suspected of having undergone variation. Such samples include, but are not limited to, sputum/oral fluid, amniotic fluid, blood, a blood fraction, biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, and the like. Although the sample is often taken from a human subject (e.g., a candidate for a disease or condition carrier screening), the assays can be used to detect copy number variations (CNVs) in samples from any mammal, including, but not limited to dogs, cats, horses, goats, sheep, cattle, pigs, etc. The sample may be used directly as obtained from the biological source or following a pretreatment to modify the character of the sample. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids and so forth. Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, etc. If such methods of pretreatment are employed with respect to the sample, such pretreatment methods are typically such that the nucleic acid(s) of interest remain in the test sample, preferably at a concentration proportional to that in an untreated test sample (namely, a sample that is not subjected to any such pretreatment method(s)).
Depending on the type of sample used, additional processing and/or purification steps may be performed to obtain nucleic acid fragments of a desired purity or size, using processing methods including but not limited to sonication, nebulization, gel purification, PCR purification systems, nuclease cleavage, size-specific capture or exclusion, targeted capture or a combination of these methods. Optionally, cell-free DNA may be isolated from, or derived from, or obtained from the sample prior to further analysis. In some embodiments, the sample is from the subject whose copy number variation is to be determined by the systems and methods of embodiments of this disclosure, also referred to as “a test sample.”

In some embodiments, the sample is from a subject exhibiting a known genotype or copy number variation, also referred to as a reference sample. A reference sample refers to a sample comprising a mixture of nucleic acids that are present in a known copy number to which the nucleic acids in a test sample are to be compared. In some embodiments, it is a sample that is normal, i.e. not aneuploid, for the sequence of interest. In some embodiments, it is a sample that is abnormal for the sequence of interest. In some embodiments, reference samples are used for identifying one or more normalizing sites or sequences of interest, or genes of interest, or chromosomes of interest.

The term “MIP” as used herein, refers to a molecular inversion probe (or a circular capture probe). Molecular inversion probes (or circular capture probes) are nucleic acid molecules that comprise a pair of unique polynucleotide arms, one or more unique molecular tags (or unique molecular identifiers), and a polynucleotide linker (e.g., a universal backbone linker). See, for example, FIG. 1. In some embodiments, a MIP may comprise more than one unique molecular tags, such as, two unique molecular tags, three unique molecular tags, or more. In some embodiments, the unique polynucleotide arms in each MIP are located at the 5′ and 3′ ends of the MIP, while the unique molecular tag(s) and the polynucleotide linker are located internal to the 5′ and 3′ ends of the MIP. For example, the MIPs that are used in some embodiments of this disclosure comprise in sequence the following components: first unique polynucleotide arm—first unique molecular tag—polynucleotide linker—second unique molecular tag—second unique polynucleotide arm. In some embodiments, the MIP is a 5′ phosphorylated single-stranded nucleic acid (e.g., DNA) molecule.
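The arm, tag, linker, tag, arm layout described above can be illustrated by assembling a probe from its parts. The sketch below uses the three fixed segments of SEQ ID NO: 9 listed earlier in this document, under the assumption that the line-wrapped pieces of that listing join contiguously and that the two runs of N mark the unique molecular tags. The function names and the use of Python's pseudo-random generator to stand in for randomly synthesized tags are illustration choices only.

```python
import random

# Fixed segments read from SEQ ID NO: 9 (assumes wrapped lines join contiguously).
ARM_1 = "GTCTGAATCAAATGCCAAAGT"
LINKER = "CTTCAGCTTCCCGATTACGGGTACGATCCGACGGTAGTGT"
ARM_2 = "TCCCCTGTGTGAGAGAAAAGA"
TAG_LENGTH = 10  # each tag is a run of 10 N positions in SEQ ID NO: 9


def random_tag(rng, length=TAG_LENGTH):
    """Draw a random nucleotide tag; stands in for a randomly synthesized tag."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_mip(rng):
    """Assemble arm 1, tag 1, linker, tag 2, arm 2 in 5' to 3' order."""
    return ARM_1 + random_tag(rng) + LINKER + random_tag(rng) + ARM_2


def split_mip(mip):
    """Split an assembled MIP back into its five components by position."""
    a, t, k = len(ARM_1), TAG_LENGTH, len(LINKER)
    return {
        "arm_1": mip[:a],
        "tag_1": mip[a:a + t],
        "linker": mip[a + t:a + t + k],
        "tag_2": mip[a + t + k:a + 2 * t + k],
        "arm_2": mip[a + 2 * t + k:],
    }
```

Calling `split_mip(build_mip(random.Random(0)))` returns the two fixed arms and the linker unchanged, together with the two randomly drawn tags.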

The unique molecular tag may be any tag that is detectable and can be incorporated into or attached to a nucleic acid (e.g., a polynucleotide) and allows detection and/or identification of nucleic acids that comprise the tag. In some embodiments the tag is incorporated into or attached to a nucleic acid during sequencing (e.g., by a polymerase). Non-limiting examples of tags include nucleic acid tags, nucleic acid indexes or barcodes, radiolabels (e.g., isotopes), metallic labels, fluorescent labels, chemiluminescent labels, phosphorescent labels, fluorophore quenchers, dyes, proteins (e.g., enzymes, antibodies or parts thereof, linkers, members of a binding pair), the like or combinations thereof. In some embodiments, particularly sequencing embodiments, the tag (e.g., a molecular tag) is a unique, known and/or identifiable sequence of nucleotides or nucleotide analogues (e.g., nucleotides comprising a nucleic acid analogue, a sugar and one to three phosphate groups). In some embodiments, tags are six or more contiguous nucleotides. A multitude of fluorophore-based tags are available with a variety of different excitation and emission spectra. Any suitable type and/or number of fluorophores can be used as a tag. In some embodiments 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 20 or more, 30 or more, 50 or more, 100 or more, 500 or more, 1000 or more, 10,000 or more, 100,000 or more different tags are utilized in a method described herein (e.g., a nucleic acid detection and/or sequencing method). In some embodiments, one or two types of tags (e.g., different fluorescent labels) are linked to each nucleic acid in a library. In some embodiments, chromosome-specific tags are used to make chromosomal counting faster or more efficient. 
Detection and/or quantification of a tag can be performed by a suitable method, machine or apparatus, non-limiting examples of which include flow cytometry, quantitative polymerase chain reaction (qPCR), gel electrophoresis, a luminometer, a fluorometer, a spectrophotometer, a suitable gene-chip or microarray analysis, Western blot, mass spectrometry, chromatography, cytofluorimetric analysis, fluorescence microscopy, a suitable fluorescence or digital imaging method, confocal laser scanning microscopy, laser scanning cytometry, affinity chromatography, manual batch mode separation, electric field suspension, a suitable nucleic acid sequencing method and/or nucleic acid sequencing apparatus, the like and combinations thereof.

In the MIPs, the unique polynucleotide arms are designed to hybridize immediately upstream and downstream of a specific target sequence (or site) in a genomic nucleic acid sample. The unique molecular tags are short nucleotide sequences that are randomly generated. In some embodiments, the unique molecular tags do not hybridize to any sequence or site located on a genomic nucleic acid fragment or in a genomic nucleic acid sample. In some embodiments, the polynucleotide linker (or the backbone linker) is universal across all the MIPs used in embodiments of this disclosure.
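As an illustration only, the component order described above (arm, tag, linker, tag, arm) can be sketched in Python. The class name `Mip`, the helper names, the placeholder arm and linker sequences, and the tag length of six nucleotides are assumptions chosen for the example; they are not sequences or parameters taken from this disclosure.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Mip:
    """Hypothetical container for the five MIP components, 5' to 3'."""
    first_arm: str
    first_tag: str
    linker: str
    second_tag: str
    second_arm: str

    @property
    def sequence(self) -> str:
        # Arms sit at the 5' and 3' ends; tags and linker are internal.
        return (self.first_arm + self.first_tag + self.linker
                + self.second_tag + self.second_arm)


def random_tag(length: int, rng: random.Random) -> str:
    """Return a randomly generated nucleotide tag of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_mip(first_arm: str, second_arm: str, linker: str,
              tag_length: int, rng: random.Random) -> Mip:
    """Assemble a MIP with two independently generated random tags."""
    return Mip(first_arm, random_tag(tag_length, rng), linker,
               random_tag(tag_length, rng), second_arm)


# Placeholder sequences (assumptions); the same linker is reused for every MIP.
UNIVERSAL_LINKER = "GGGGCCCCAAAATTTT"
mip = build_mip("ACGTACGTAC", "TTGGCCAATT", UNIVERSAL_LINKER, 6,
                random.Random(0))
print(mip.sequence)
```

Passing a seeded `random.Random` instance keeps the example reproducible; a real probe design would also need to check that the generated tags do not hybridize to the genomic sample.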

In some embodiments, the MIPs are introduced to nucleic acid fragments derived from a test subject (or a reference subject) to perform capture of target sequences or sites (or control sequences or sites) located on a nucleic acid sample (e.g., a genomic DNA). In some embodiments, fragmenting aids in capture of target nucleic acid by molecular inversion probes. In some embodiments, for example, when the nucleic acid sample is comprised of cell free nucleic acid, fragmenting may not be necessary to improve capture of target nucleic acid by molecular inversion probes. As described in greater detail herein, after capture of the target sequence (e.g., locus) of interest, the captured target may be subjected to enzymatic gap-filling and ligation steps, such that a copy of the target sequence is incorporated into a circle-like structure. Capture efficiency of the MIP to the target sequence on the nucleic acid fragment can, in some embodiments, be improved by lengthening the hybridization and gap-filling incubation periods. (See, e.g., Turner E H, et al., Nat Methods. 2009 Apr. 6:1-2.).

In some embodiments, the MIPs that are used according to the disclosure to capture a target site or target sequence comprise in sequence the following components:

- first targeting polynucleotide arm—first unique targeting molecular tag—polynucleotide linker—second unique targeting molecular tag—second targeting polynucleotide arm.

In some embodiments, the MIPs that are used in the disclosure to capture a control site or control sequence comprise in sequence the following components:

- first control polynucleotide arm—first unique control molecular tag—polynucleotide linker—second unique control molecular tag—second control polynucleotide arm.

MIP technology may be used to detect or amplify particular nucleic acid sequences in complex mixtures. One of the advantages of using the MIP technology is in its capacity for a high degree of multiplexing, which allows thousands of target sequences to be captured in a single reaction containing thousands of MIPs. Various aspects of MIP technology are described in, for example, Hardenbol et al., “Multiplexed genotyping with sequence-tagged molecular inversion probes,” Nature Biotechnology, 21(6): 673-678 (2003); Hardenbol et al., “Highly multiplexed molecular inversion probe genotyping: Over 10,000 targeted SNPs genotyped in a single tube assay,” Genome Research, 15: 269-275 (2005); Burmester et al., “DMET microarray technology for pharmacogenomics-based personalized medicine,” Methods in Molecular Biology, 632: 99-124 (2010); Sissung et al., “Clinical pharmacology and pharmacogenetics in a genomics era: the DMET platform,” Pharmacogenomics, 11(1): 89-103 (2010); Deeken, “The Affymetrix DMET platform and pharmacogenetics in drug development,” Current Opinion in Molecular Therapeutics, 11(3): 260-268 (2009); Wang et al., “High quality copy number and genotype data from FFPE samples using Molecular Inversion Probe (MIP) microarrays,” BMC Medical Genomics, 2:8 (2009); Wang et al., “Analysis of molecular inversion probe performance for allele copy number determination,” Genome Biology, 8(11): R246 (2007); Ji et al., “Molecular inversion probe analysis of gene copy alterations reveals distinct categories of colorectal carcinoma,” Cancer Research, 66(16): 7910-7919 (2006); and Wang et al., “Allele quantification using molecular inversion probes (MIP),” Nucleic Acids Research, 33(21): e183 (2005), each of which is hereby incorporated by reference in its entirety for all purposes. See also U.S. Pat. Nos. 6,858,412; 5,817,921; 6,558,928; 7,320,860; 7,351,528; 5,866,337; 6,027,889 and 6,852,487, each of which is hereby incorporated by reference in its entirety for all purposes.

MIP technology has previously been successfully applied to other areas of research, including the novel identification and subclassification of biomarkers in cancers. See, e.g., Brewster et al., “Copy number imbalances between screen- and symptom-detected breast cancers and impact on disease-free survival,” Cancer Prevention Research, 4(10): 1609-1616 (2011); Geiersbach et al., “Unknown partner for USP6 and unusual SS18 rearrangement detected by fluorescence in situ hybridization in a solid aneurysmal bone cyst,” Cancer Genetics, 204(4): 195-202 (2011); Schiffman et al., “Oncogenic BRAF mutation with CDKN2A inactivation is characteristic of a subset of pediatric malignant astrocytomas,” Cancer Research, 70(2): 512-519 (2010); Schiffman et al., “Molecular inversion probes reveal patterns of 9p21 deletion and copy number aberrations in childhood leukemia,” Cancer Genetics and Cytogenetics, 193(1): 9-18 (2009); Press et al., “Ovarian carcinomas with genetic and epigenetic BRCA1 loss have distinct molecular abnormalities,” BMC Cancer, 8:17 (2008); and Deeken et al., “A pharmacogenetic study of docetaxel and thalidomide in patients with castration-resistant prostate cancer using the DMET genotyping platform,” Pharmacogenomics, 10(3): 191-199 (2009), each of which is hereby incorporated by reference in its entirety for all purposes.

MIP technology has also been applied to the identification of new drug-related biomarkers. See, e.g., Caldwell et al., “CYP4F2 genetic variant alters required warfarin dose,” Blood, 111(8): 4106-4112 (2008); and McDonald et al., “CYP4F2 Is a Vitamin K1 Oxidase: An Explanation for Altered Warfarin Dose in Carriers of the V433M Variant,” Molecular Pharmacology, 75: 1337-1346 (2009), each of which is hereby incorporated by reference in its entirety for all purposes. Other MIP applications include drug development and safety research. See, e.g., Mega et al., “Cytochrome P-450 Polymorphisms and Response to Clopidogrel,” New England Journal of Medicine, 360(4): 354-362 (2009); Dumaual et al., “Comprehensive assessment of metabolic enzyme and transporter genes using the Affymetrix Targeted Genotyping System,” Pharmacogenomics, 8(3): 293-305 (2007); and Daly et al., “Multiplex assay for comprehensive genotyping of genes involved in drug metabolism, excretion, and transport,” Clinical Chemistry, 53(7): 1222-1230 (2007), each of which is hereby incorporated by reference in its entirety for all purposes. Further applications of MIP technology include genotype and phenotype databasing. See, e.g., Man et al., “Genetic Variation in Metabolizing Enzyme and Transporter Genes: Comprehensive Assessment in 3 Major East Asian Subpopulations With Comparison to Caucasians and Africans,” Journal of Clinical Pharmacology, 50(8): 929-940 (2010), which is hereby incorporated by reference in its entirety for all purposes.

The term “capture” or “capturing”, as used herein, refers to the binding or hybridization reaction between a molecular inversion probe and its corresponding targeting site. In some embodiments, upon capturing, a circular replicon or a MIP replicon is produced or formed. In some embodiments, the targeting site is a deletion (e.g., partial or full deletion of one or more exons). In some embodiments, a target MIP is designed to bind to or hybridize with a naturally-occurring (e.g., wild-type) genomic region of interest where a target deletion is expected to be located. The target MIP is designed to not bind to a genomic region exhibiting the deletion. In these embodiments, binding or hybridization between a target MIP and the target site of deletion is expected to not occur. The absence of such binding or hybridization indicates the presence of the target deletion. In these embodiments, the phrase “capturing a target site” or the phrase “capturing a target sequence” refers to detection of a target deletion by detecting the absence of such binding or hybridization.
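The deletion-detection logic described above, in which the absence of capture by a target MIP indicates the target deletion, can be illustrated with a minimal Python sketch. The function name, the site labels, the example counts, and the `min_count` threshold are hypothetical and serve only to show the reasoning; they do not represent a calling criterion stated in this disclosure.

```python
from typing import Dict, List


def flag_candidate_deletions(capture_counts: Dict[str, int],
                             min_count: int = 1) -> List[str]:
    """Return target sites whose capture count falls below min_count.

    capture_counts maps a target-site label to the number of capture
    events observed for the MIP designed against the wild-type site.
    A site with fewer than min_count events is reported as a candidate
    deletion, because the MIP is designed not to bind the deleted region.
    """
    return sorted(site for site, count in capture_counts.items()
                  if count < min_count)


# Hypothetical counts for three target sites in one sample.
counts = {"target_site_A": 412, "target_site_B": 0, "target_site_C": 388}
print(flag_candidate_deletions(counts))
```

In practice an absent signal could also reflect a failed reaction, which is one reason control MIPs directed at control sites are captured alongside the target MIPs.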

The term “MIP replicon” or “circular replicon”, as used herein, refers to a circular nucleic acid molecule generated via a capturing reaction (e.g., a binding or hybridization reaction between a MIP and its targeted sequence). In some embodiments, the MIP replicon is a single-stranded circular nucleic acid molecule. In some embodiments, a targeting MIP captures or hybridizes to a target sequence or site. After the capturing reaction or hybridization, a ligation/extension mixture is introduced to extend and ligate the gap region between the two targeting polynucleotide arms to form single-stranded circular nucleotide molecules, i.e., a targeting MIP replicon. In some embodiments, a control MIP captures or hybridizes to a control sequence or site. After the capturing reaction or hybridization, a ligation/extension mixture is introduced to extend and ligate the gap region between the two control polynucleotide arms to form single-stranded circular nucleotide molecules, i.e., a control MIP replicon. MIP replicons may be amplified through a polymerase chain reaction (PCR) to produce a plurality of targeting MIP amplicons, which are double-stranded nucleotide molecules.
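For illustration, the capture, extension, and ligation sequence described above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: hybridization is treated as exact reverse-complement matching, the template is a single strand written 5′ to 3′, and the names `simulate_capture`, `five_prime_arm`, and `three_prime_arm` are hypothetical. It is not the enzymatic protocol of this disclosure.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_capture(template: str, five_prime_arm: str,
                     three_prime_arm: str) -> Optional[str]:
    """Return the gap-fill sequence that would close the MIP into a circle.

    The 3' arm anneals downstream of the target on the template and is
    extended across the gap; the 5' arm anneals upstream, where ligation
    closes the circle. Returns None when either arm finds no binding
    site, i.e. no replicon would form.
    """
    upstream_site = reverse_complement(five_prime_arm)
    downstream_site = reverse_complement(three_prime_arm)

    up_start = template.find(upstream_site)
    if up_start < 0:
        return None
    gap_start = up_start + len(upstream_site)

    down_start = template.find(downstream_site, gap_start)
    if down_start < 0:
        return None

    # The newly synthesized strand is complementary to the template gap.
    return reverse_complement(template[gap_start:down_start])


# Hypothetical template: upstream site + target + downstream site.
template = "ACGTTGCA" + "GGATCCTT" + "TTAGGCAT"
arm5 = reverse_complement("ACGTTGCA")
arm3 = reverse_complement("TTAGGCAT")
print(simulate_capture(template, arm5, arm3))
```

A template that lacks one of the arm binding sites yields `None` in this model, which mirrors the case in which no circular replicon is produced.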

The term “amplicon”, as used herein, refers to a nucleic acid generated via amplification reaction (e.g., a PCR reaction). In some embodiments, the amplicon is a single-stranded nucleic acid molecule. In some embodiments, the amplicon is a double-stranded nucleic acid molecule. In some embodiments, a targeting MIP replicon is amplified using conventional techniques to produce a plurality of targeting MIP amplicons, which are double-stranded nucleotide molecules. In some embodiments, a control MIP replicon is amplified using conventional techniques to produce a plurality of control MIP amplicons, which are double-stranded nucleotide molecules.

The term “sequencing”, as used herein, is used in a broad sense and may refer to any technique known in the art that allows the order of at least some consecutive nucleotides in at least part of a nucleic acid to be identified, including without limitation at least part of an extension product or a vector insert. In some embodiments, sequencing allows the distinguishing of sequence differences between different target sequences. Exemplary sequencing techniques include targeted sequencing, single molecule real-time sequencing, electron microscopy-based sequencing, transistor-mediated sequencing, direct sequencing, random shotgun sequencing, Sanger dideoxy termination sequencing, exon sequencing, whole-genome sequencing, sequencing by hybridization, pyrosequencing, capillary electrophoresis, gel electrophoresis, duplex sequencing, cycle sequencing, single-base extension sequencing, solid-phase sequencing, high-throughput sequencing, massively parallel signature sequencing, emulsion PCR, co-amplification at lower denaturation temperature-PCR (COLD-PCR), multiplex PCR, sequencing by reversible dye terminator, paired-end sequencing, near-term sequencing, exonuclease sequencing, sequencing by ligation, short-read sequencing, single-molecule sequencing, sequencing-by-synthesis, real-time sequencing, reverse-terminator sequencing, ion semiconductor sequencing, nanoball sequencing, nanopore sequencing, 454 sequencing, Solexa Genome Analyzer sequencing, MiSeq (Illumina), HiSeq 2000 (Illumina), HiSeq 2500 (Illumina), Illumina Genome Analyzer (Illumina), Ion Torrent PGM™ (Life Technologies), MinION™ (Oxford Nanopore Technologies), real-time SMRT™ technology (Pacific Biosciences), combinatorial Probe-Anchor Ligation (cPAL™) (Complete Genomics/BGI), SOLiD® sequencing, MS-PET sequencing, mass spectrometry, and combinations thereof.
In some embodiments, sequencing comprises detecting the sequencing product using an instrument, for example but not limited to an ABI PRISM® 377 DNA Sequencer, an ABI PRISM® 310, 3100, 3100-Avant, 3730, or 3730xl Genetic Analyzer, an ABI PRISM® 3700 DNA Analyzer, or an Applied Biosystems SOLiD™ System (all from Applied Biosystems), a Genome Sequencer 20 System (Roche Applied Science), or a mass spectrometer. In certain embodiments, sequencing comprises emulsion PCR. In certain embodiments, sequencing comprises a high throughput sequencing technique, for example but not limited to, massively parallel signature sequencing (MPSS).

It will be understood by one of ordinary skill in the art that the compositions and methods described herein may be adapted and modified as is appropriate for the application being addressed and that the compositions and methods described herein may be employed in other suitable applications, and that such other additions and modifications will not depart from the scope hereof.

This disclosure will be better understood from the Experimental Details which follow. However, one skilled in the art will readily appreciate that the specific methods and results discussed are merely illustrative of various embodiments of the disclosure as described more fully as follows.

Methods of the Disclosure

In one aspect, the disclosure provides a method of detecting copy number variation (e.g., single-nucleotide polymorphism, or exonic deletion, or exonic duplication) in a subject in need thereof. In some embodiments, the method comprises:

a) obtaining a nucleic acid sample isolated from the subject;

b) capturing or detecting one or more target sequences (e.g., a genomic region comprising the single nucleotide polymorphism, or one or more deleted exons, or one or more duplicated exons) in the nucleic acid sample obtained in step a) by using one or more target populations of targeting molecular inversion probes (MIPs) to produce a plurality of targeting MIP replicons for each target sequence,