This disclosure relates generally to the field of calling copy number variants, and more particularly to calling overlapping copy number variants.
In the population there exist common CNVs of a gene that overlap in positions. Due to the overlapping positions, a genome-wide CNV caller may make wrong calls when there is a mixture of signals from more than one CNV in a single sample. There is a need for a targeted method that calls the genotype of overlapping CNVs accurately.
Disclosed herein include methods of determining alleles of a gene (or genotyping a gene) of a subject. In some embodiments, a method for determining alleles of a gene of a subject is under control of a processor (e.g., a hardware processor) and comprises: receiving a plurality of sequence reads generated from a sample obtained from a subject. The method can comprise: aligning the plurality of sequence reads to a reference genome sequence to obtain a plurality of aligned sequence reads comprising sequence reads aligned to a gene in the reference genome sequence. The gene can comprise a plurality of regions. Two copy number variants (CNVs) of a plurality of CNVs (or variants) of the gene can each comprise one or more regions of the plurality of regions. The two CNVs can differ by at least one region of the plurality of regions. The method can include: determining a number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference genome sequence. The method can include: determining a number of copies (or observed or estimated copies) of each region of the plurality of regions based on the number of the sequence reads aligned to the region. The method can include: determining two alleles of the gene of the subject based on the number of copies of each region of the plurality of regions and all CNVs (or each CNV) of the plurality of CNVs comprising the region. Each of the two alleles of the gene of the subject can comprise one or more regions of the plurality of regions.
In some embodiments, the plurality of regions comprises a plurality of consecutive and/or non-overlapping regions. A number of the plurality of regions can be 2 to 10. One, one or more, or each of the plurality of regions can be 1 kilobase (kb) to 100 kb in length. In some embodiments, a number of the plurality of CNVs is 2 to 10. In some embodiments, one CNV of the plurality of CNVs does not overlap with one or more other CNVs of the plurality of CNVs. Two or more of the plurality of CNVs do not overlap (or are non-overlapping CNVs). The two CNVs of the plurality of CNVs of the gene do not overlap (or are non-overlapping CNVs). No CNVs of the plurality of CNVs overlap (or all CNVs of the plurality of CNVs are non-overlapping CNVs). In some embodiments, one CNV of the plurality of CNVs overlaps with one or more other CNVs of the plurality of CNVs. Two CNVs of the plurality of CNVs of the gene overlap (or are overlapping CNVs). The two CNVs of the plurality of CNVs of the gene overlap (or are overlapping CNVs). The two CNVs of the plurality of CNVs of the gene can comprise an identical region of the plurality of regions (or are overlapping CNVs). In some embodiments, each CNV of the plurality of CNVs of the gene comprises one or more regions of the plurality of regions. Each CNV of the plurality of CNVs can differ from every other CNV of the plurality of CNVs by at least one region of the plurality of regions.
In some embodiments, a first region, a second region, and a third region of the plurality of regions are consecutive and non-overlapping. A first CNV of the two CNVs can comprise the first region and the second region, not the third region. A second CNV of the two CNVs can comprise the second region and the third region, not the first region. In some embodiments, a first region, a second region, and a third region of the plurality of regions are consecutive and non-overlapping. A first CNV of the two CNVs can comprise the first region, the second region, and the third region. A second CNV of the two CNVs can comprise the second region, not the first region and the third region. Determining the two alleles of the gene of the subject can comprise: determining two alleles of the gene of the subject based on the number of copies of the first region and the number of copies of the second region, not the number of copies of the third region. The third region can be shorter or substantially shorter than the first region. In some embodiments, a first CNV and a second CNV of the plurality of CNVs comprise no common region.
In some embodiments, the number of the sequence reads aligned to each region of the plurality of regions of the gene comprises a raw number or a normalized and/or GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene. In some embodiments, determining the number of copies of each region of the plurality of regions comprises: determining the number of copies of each region of the plurality of regions using the number of the sequence reads aligned to the region based on a normalized and/or GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference genome sequence. The method can further comprise: determining the normalized number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference genome sequence using (1a) a depth of the sequence reads aligned to each region of the plurality of regions of the gene in the reference genome sequence, (1b) a length of the region of the gene, (2a) a depth of sequence reads of the plurality of sequence reads aligned to each of a plurality of regions of the reference genome sequence other than a genetic locus comprising the gene, and/or (2b) a length of each of the plurality of regions of the reference genome sequence other than the genetic locus comprising the gene. The method can further comprise: determining the GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference genome sequence from the number or the normalized number of the sequence reads aligned to the region of the gene in the reference genome sequence using a GC content of the region of the gene in the reference genome sequence.
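The normalization and GC correction described above can be illustrated with a minimal Python sketch. The function names, the use of a median over the regions outside the locus, and the `gc_bias` lookup table are assumptions made for illustration only; the disclosure does not prescribe a particular formula.

```python
from statistics import median

def normalized_depth(region_reads, region_len, baseline):
    """Length-normalize a region's read count against regions outside the locus.

    baseline: list of (read_count, length) pairs for regions of the reference
    other than the genetic locus comprising the gene (hypothetical input).
    """
    per_base = region_reads / region_len
    # Assumption: the median per-base depth of the baseline regions
    # represents the depth expected for the reference copy number.
    baseline_per_base = median(reads / length for reads, length in baseline)
    return per_base / baseline_per_base

def gc_corrected(norm_depth, gc_fraction, gc_bias):
    """Divide out an expected relative depth for the region's GC bin.

    gc_bias: hypothetical dict mapping a GC bin (GC fraction * 10, rounded)
    to the relative depth expected at that GC content.
    """
    return norm_depth / gc_bias[round(gc_fraction * 10)]

def estimated_copies(norm_depth, reference_cn=2):
    """Scale a normalized depth to an estimated number of copies."""
    return reference_cn * norm_depth

baseline = [(1000, 1000), (2000, 2000), (3000, 3000)]
depth = normalized_depth(1500, 1000, baseline)   # 1.5
print(estimated_copies(depth))                   # 3.0
```

Under these assumptions, a region with 1.5 times the baseline per-base depth is estimated to have three copies relative to a reference of two.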
In some embodiments, the number of copies of each region comprises the number of copies of each region relative to a reference number of copies of the region. The reference number of copies of the region can be 2. In some embodiments, determining the number of copies of each region of the plurality of regions comprises: determining a difference in the number of copies of each region of the plurality of regions, relative to a reference number of copies of the region, based on the number of the sequence reads aligned to the region. Determining the two alleles of the gene of the subject can comprise: determining the two alleles of the gene of the subject using the difference in the number of copies of each region of the plurality of regions, relative to the reference number of copies of the region, and all CNVs of the plurality of CNVs comprising the region.
In some embodiments, a first allele of the two alleles comprises a duplication (e.g., having two copies) of a CNV of the plurality of CNVs. A first allele of the two alleles can comprise a deletion (e.g., having zero copies) of a CNV of the plurality of CNVs. In some embodiments, a first allele of the two alleles can comprise one copy of a CNV of the plurality of CNVs. In some embodiments, a second allele of the two alleles comprises a duplication (e.g., having two copies) of a CNV of the plurality of CNVs. A second allele of the two alleles can comprise a deletion (e.g., having zero copies) of a CNV of the plurality of CNVs. In some embodiments, a second allele of the two alleles comprises one copy of a CNV of the plurality of CNVs.
In some embodiments, determining the two alleles of the gene of the subject comprises: determining (i) a number of copies of a first CNV in a first allele of the two alleles of the gene of the subject and (ii) a number of copies of a second CNV in a second allele of the two alleles of the gene of the subject such that (a) the number of copies of a region of the plurality of regions in the first CNV and not the second CNV is the number of copies of the first CNV, (b) the number of copies of a region of the plurality of regions in the first CNV and the second CNV is the sum of the number of copies of the first CNV and the number of copies of the second CNV, and/or (c) the number of copies of a region of the plurality of regions in the second CNV and not the first CNV is the number of copies of the second CNV.
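The constraints (a) to (c) above can be sketched in Python as a small search problem. This is an illustrative sketch only: the brute-force least-squares search, the candidate range of copy changes, and the names `fit_cnv_changes`, `CNV1`, `CNV2`, `R1`, `R2`, and `R3` are assumptions, not the solver of the disclosure. The sketch works on copy-number changes relative to the reference and does not attempt to assign the CNVs to particular alleles.

```python
from itertools import product

def fit_cnv_changes(region_delta, cnv_regions, candidates=(-2, -1, 0, 1, 2)):
    """Find integer copy changes per CNV that best explain per-region changes.

    region_delta: {region: observed copy change relative to the reference}
    cnv_regions:  {cnv_name: set of regions the CNV comprises}
    The expected change of a region is the sum of the changes of all CNVs
    comprising that region.
    """
    names = sorted(cnv_regions)
    best, best_err = None, float("inf")
    for combo in product(candidates, repeat=len(names)):
        err = 0.0
        for region, observed in region_delta.items():
            expected = sum(
                change for name, change in zip(names, combo)
                if region in cnv_regions[name]
            )
            err += (observed - expected) ** 2
        if err < best_err:
            best, best_err = dict(zip(names, combo)), err
    return best

# First CNV spans regions R1 and R2; second CNV spans regions R2 and R3.
cnv_regions = {"CNV1": {"R1", "R2"}, "CNV2": {"R2", "R3"}}
# Noisy observed copy changes: R1 lost a copy, R2 looks unchanged, R3 gained one.
observed = {"R1": -0.9, "R2": 0.1, "R3": 1.05}
print(fit_cnv_changes(observed, cnv_regions))  # {'CNV1': -1, 'CNV2': 1}
```

In this example the shared region R2 shows no net change because a deletion of the first CNV and a duplication of the second CNV cancel there, which is the mixed-signal case a genome-wide caller may miscall.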
In some embodiments, the plurality of CNVs is predetermined (or the plurality of CNVs is known). The plurality of regions can be predetermined. In some embodiments, the method further comprises: receiving the plurality of CNVs. The method can further comprise: determining the plurality of regions using the plurality of CNVs. Receiving the plurality of CNVs can comprise: determining the plurality of CNVs.
In some embodiments, the method further comprises: creating a file or a report representing or comprising alleles of the gene of the subject and/or the one or more regions of the plurality of regions in each of the two alleles. In some embodiments, the method further comprises: generating a user interface (UI) comprising a UI element representing or comprising alleles of the gene of the subject and/or the one or more regions of the plurality of regions in each of the two alleles.
In some embodiments, the plurality of sequence reads comprises sequence reads that are about 100 base pairs to about 1000 base pairs in length each. The plurality of sequence reads can comprise paired-end sequence reads and/or single-end sequence reads. The plurality of sequence reads is generated by whole genome sequencing (WGS), such as clinical WGS (cWGS). In some embodiments, the sample comprises cells, cell-free DNA, cell-free fetal DNA, amniotic fluid, a blood sample, a biopsy sample, or a combination thereof. The sample can be obtained directly from a subject. The sample can be generated from another sample obtained from a subject. The other sample can be obtained directly from the subject.
Disclosed herein include systems of determining alleles of a gene of a subject. In some embodiments, a system for determining alleles of a gene of a subject comprises: non-transitory memory configured to store executable instructions. The non-transitory memory can be configured to store a plurality of regions of a gene, and a plurality of copy number variants (CNVs) of the gene. Two CNVs of the plurality of CNVs of the gene each can comprise one or more regions of the plurality of regions and differ by at least one region of the plurality of regions. The system can comprise: a hardware processor in communication with the non-transitory memory. The hardware processor can be programmed by the executable instructions to perform: receiving a plurality of sequence reads generated from a sample obtained from a subject. The hardware processor can be programmed by the executable instructions to perform: aligning the plurality of sequence reads to a reference sequence to obtain a plurality of aligned sequence reads comprising sequence reads aligned to the gene in the reference genome sequence. The hardware processor can be programmed by the executable instructions to perform: determining a number of copies of each region of the plurality of regions based on a number of the sequence reads aligned to the region. The hardware processor can be programmed by the executable instructions to perform: determining two alleles of the gene of the subject, each comprising one or more regions of the plurality of regions, based on the number of copies of each region of the plurality of regions and all CNVs of the plurality of CNVs comprising the region. In some embodiments, the reference sequence comprises a reference genome sequence.
In some embodiments, the plurality of regions comprises a plurality of consecutive and/or non-overlapping regions. A number of the plurality of regions can be 2 to 10. One, one or more, or each of the plurality of regions can be 1 kilobase (kb) to 100 kb in length. In some embodiments, a number of the plurality of CNVs is 2 to 10. In some embodiments, one CNV of the plurality of CNVs does not overlap with one or more other CNVs of the plurality of CNVs. Two or more of the plurality of CNVs do not overlap (or are non-overlapping CNVs). The two CNVs of the plurality of CNVs of the gene do not overlap (or are non-overlapping CNVs). No CNVs of the plurality of CNVs overlap (or all CNVs of the plurality of CNVs are non-overlapping CNVs). In some embodiments, one CNV of the plurality of CNVs overlaps with one or more other CNVs of the plurality of CNVs. Two CNVs of the plurality of CNVs of the gene overlap (or are overlapping CNVs). The two CNVs of the plurality of CNVs of the gene overlap (or are overlapping CNVs). The two CNVs of the plurality of CNVs of the gene can comprise an identical region of the plurality of regions (or are overlapping CNVs). In some embodiments, each CNV of the plurality of CNVs of the gene comprises one or more regions of the plurality of regions. Each CNV of the plurality of CNVs can differ from every other CNV of the plurality of CNVs by at least one region of the plurality of regions.
In some embodiments, a first region, a second region, and a third region of the plurality of regions are consecutive and non-overlapping. A first CNV of the two CNVs can comprise the first region and the second region, not the third region. A second CNV of the two CNVs can comprise the second region and the third region, not the first region. In some embodiments, a first region, a second region, and a third region of the plurality of regions are consecutive and non-overlapping. A first CNV of the two CNVs can comprise the first region, the second region, and the third region. A second CNV of the two CNVs can comprise the second region, not the first region and the third region. Determining the two alleles of the gene of the subject can comprise: determining two alleles of the gene of the subject based on the number of copies of the first region and the number of copies of the second region, not the number of copies of the third region. The third region can be shorter or substantially shorter than the first region. In some embodiments, a first CNV and a second CNV of the plurality of CNVs comprise no common region.
In some embodiments, the hardware processor is further programmed by the executable instructions to perform: determining the number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence. In some embodiments, the number of the sequence reads aligned to each region of the plurality of regions of the gene comprises a raw number or a normalized and/or GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene. In some embodiments, determining the number of copies of each region of the plurality of regions comprises: determining the number of copies of each region of the plurality of regions using the number of the sequence reads aligned to the region based on a normalized and/or GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence. In some embodiments, the hardware processor is further programmed by the executable instructions to perform: determining the normalized number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence using (1a) a depth of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence, (1b) a length of the region of the gene, (2a) a depth of sequence reads of the plurality of sequence reads aligned to each of a plurality of regions of the reference sequence other than a genetic locus comprising the gene, and (2b) a length of each of the plurality of regions of the reference sequence other than the genetic locus comprising the gene. 
The hardware processor can be further programmed by the executable instructions to perform: determining the GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference genome sequence from the number or the normalized number of the sequence reads aligned to the region of the gene in the reference genome sequence using a GC content of the region of the gene in the reference genome sequence.
In some embodiments, the number of copies of each region comprises the number of copies of each region relative to a reference number of copies of the region. The reference number of copies of the region can be 2. In some embodiments, determining the number of copies of each region of the plurality of regions comprises: determining a difference in the number of copies of each region of the plurality of regions, relative to a reference number of copies of the region, based on the number of the sequence reads aligned to the region. Determining the two alleles of the gene of the subject comprises: determining the two alleles of the gene of the subject using the difference in the number of copies of each region of the plurality of regions, relative to the reference number of copies of the region, and all CNVs of the plurality of CNVs comprising the region.
In some embodiments, a first allele of the two alleles comprises a duplication (e.g., having two copies) of a CNV of the plurality of CNVs. A first allele of the two alleles can comprise a deletion (e.g., having zero copies) of a CNV of the plurality of CNVs. In some embodiments, a first allele of the two alleles can comprise one copy of a CNV of the plurality of CNVs. In some embodiments, a second allele of the two alleles comprises a duplication (e.g., having two copies) of a CNV of the plurality of CNVs. A second allele of the two alleles can comprise a deletion (e.g., having zero copies) of a CNV of the plurality of CNVs. In some embodiments, a second allele of the two alleles comprises one copy of a CNV of the plurality of CNVs.
In some embodiments, determining the two alleles of the gene of the subject comprises: determining (i) a number of copies of a first CNV in a first allele of the two alleles of the gene of the subject and (ii) a number of copies of a second CNV in a second allele of the two alleles of the gene of the subject such that (a) the number of copies of a region of the plurality of regions in the first CNV and not the second CNV is the number of copies of the first CNV, (b) the number of copies of a region of the plurality of regions in the first CNV and the second CNV is the sum of the number of copies of the first CNV and the number of copies of the second CNV, and/or (c) the number of copies of a region of the plurality of regions in the second CNV and not the first CNV is the number of copies of the second CNV.
In some embodiments, the plurality of CNVs is predetermined (or the plurality of CNVs is known). The plurality of regions can be predetermined. In some embodiments, the hardware processor is further programmed by the executable instructions to perform: receiving the plurality of CNVs. The hardware processor can be further programmed by the executable instructions to perform: determining the plurality of regions using the plurality of CNVs. Receiving the plurality of CNVs can comprise: determining the plurality of CNVs.
In some embodiments, the hardware processor is further programmed by the executable instructions to perform: creating a file or a report representing or comprising alleles of the gene of the subject and/or the one or more regions of the plurality of regions in each of the two alleles. The hardware processor can be further programmed by the executable instructions to perform: creating a file or a report and/or generating a user interface (UI) comprising a UI element representing or comprising alleles of the gene of the subject and/or the one or more regions of the plurality of regions in each of the two alleles.
In some embodiments, the plurality of sequence reads comprises sequence reads that are about 100 base pairs to about 1000 base pairs in length each. The plurality of sequence reads can comprise paired-end sequence reads and/or single-end sequence reads. The plurality of sequence reads is generated by whole genome sequencing (WGS), such as clinical WGS (cWGS). In some embodiments, the sample comprises cells, cell-free DNA, cell-free fetal DNA, amniotic fluid, a blood sample, a biopsy sample, or a combination thereof. The sample can be obtained directly from a subject. The sample can be generated from another sample obtained from a subject. The other sample can be obtained directly from the subject.
Also disclosed herein is a non-transitory computer-readable medium storing executable instructions that, when executed by a system (e.g., a computing system), cause the system to perform any method or one or more steps of a method disclosed herein.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Neither this summary nor the following detailed description purports to define or limit the scope of the inventive subject matter.
Throughout the drawings, reference numbers may be re-used to indicate correspondence between referenced elements. The drawings are provided to illustrate example embodiments described herein and are not intended to limit the scope of the disclosure.
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments can be utilized, and other changes can be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the Figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein and made part of the disclosure herein.
In the population there exist common CNVs of a gene that overlap in positions. Due to the overlapping positions, genome-wide CNV calling may be inaccurate, for example, when there is a mixture of signals from more than one CNV in a single sample. A targeted method that calls the genotype of overlapping CNVs accurately is described herein. The method can take advantage of a prior knowledge of some or all possible CNVs that could exist in a given region of a gene, such as the CNVs shown in Table 1. The method can comprise receiving a plurality of sequence reads generated from a sample obtained from a subject. The method can comprise aligning the plurality of sequence reads to a reference genome sequence to obtain a plurality of aligned sequence reads comprising sequence reads aligned to a gene in the reference genome sequence. The gene can comprise a plurality of regions. Two copy number variants (CNVs) of a plurality of CNVs (or variants) of the gene can each comprise one or more regions of the plurality of regions. The two CNVs can differ by at least one region of the plurality of regions. The method can comprise determining a number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference genome sequence. The method can comprise determining a number of copies (or observed or estimated copies) of each region of the plurality of regions based on the number of the sequence reads aligned to the region. The method can comprise determining two alleles of the gene of the subject based on the number of copies of each region of the plurality of regions and all CNVs (or each CNV) of the plurality of CNVs comprising the region. Each of the two alleles of the gene of the subject can comprise one or more regions of the plurality of regions.
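The step of determining a number of the sequence reads aligned to each region can be sketched in Python as follows. This is a simplified illustration: assigning each read to a region by its leftmost aligned position, and the name `count_reads_per_region`, are assumptions that the disclosure does not specify.

```python
from bisect import bisect_right

def count_reads_per_region(read_starts, regions):
    """Count reads whose leftmost aligned position falls in each region.

    read_starts: iterable of aligned start positions on one chromosome
    regions:     sorted, non-overlapping list of (start, end), inclusive
    """
    starts = [start for start, _ in regions]
    counts = [0] * len(regions)
    for pos in read_starts:
        # Index of the last region starting at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= regions[i][1]:
            counts[i] += 1
    return counts

regions = [(100, 199), (200, 299), (300, 399)]
reads = [50, 100, 150, 199, 200, 350, 400]
print(count_reads_per_region(reads, regions))  # [3, 1, 1]
```

Reads starting outside every region (positions 50 and 400 in the example) are not counted.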
Disclosed herein include a system of determining alleles of a gene of a subject. In some embodiments, the system comprises non-transitory memory configured to store executable instructions. The non-transitory memory can be configured to store a plurality of regions of a gene, and a plurality of copy number variants (CNVs) of the gene. Two CNVs of the plurality of CNVs of the gene each can comprise one or more regions of the plurality of regions and differ by at least one region of the plurality of regions. The system can comprise a hardware processor in communication with the non-transitory memory. The hardware processor can be programmed by the executable instructions to perform receiving a plurality of sequence reads generated from a sample obtained from a subject. The hardware processor can be programmed by the executable instructions to perform aligning the plurality of sequence reads to a reference sequence to obtain a plurality of aligned sequence reads comprising sequence reads aligned to the gene in the reference genome sequence. The hardware processor can be programmed by the executable instructions to perform determining a number of copies of each region of the plurality of regions based on a number of the sequence reads aligned to the region. The hardware processor can be programmed by the executable instructions to perform: determining two alleles of the gene of the subject, each comprising one or more regions of the plurality of regions, based on the number of copies of each region of the plurality of regions and all CNVs of the plurality of CNVs comprising the region. In some embodiments, the reference sequence comprises a reference genome sequence.
The majority of the copy number variants (CNVs) in an individual are common. Rediscovering the same variant in every sample using, for example, genome-wide CNV calling, can be very inefficient. Such genome-wide CNV calling can have low sensitivity, and the resulting genotypes may be inaccurate. For example, it can be difficult to differentiate between (i) a homozygous duplication (where both alleles of a subject have two copies of a region of a gene) and (ii) one allele with no duplication and one allele with three copies of the region of the gene. Genome-wide CNV calling can be limited to large CNVs (e.g., 10 kb or longer). Breakpoints determined by genome-wide CNV calling can be highly variable (see
Targeted CNV calling can be performed using Gaussian mixture models of the population depth distribution. Use of Gaussian mixture models has been described in PCT Publication No. WO 2021/045947, entitled METHODS AND SYSTEMS FOR DIAGNOSING FROM WHOLE GENOME SEQUENCING DATA and U.S. Provisional Patent Application No. 63/197,936, entitled METHODS AND SYSTEMS FOR IDENTIFYING RECOMBINANT VARIANTS; the content of each of which is incorporated herein by reference in its entirety. Briefly, Gaussian mixture models can include a mixture of one-dimensional Gaussians with constrained means. The constrained means can be, for example, CN of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and more. Use of such Gaussian mixture models can normalize out systemic biases and provide confidence in both variant and reference calls (CN equals 2). As a result, high sensitivity in small CNV regions (e.g., down to 1 kb; see
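A mixture of one-dimensional Gaussians with means constrained to integer copy numbers can be sketched in Python as follows. The sketch only classifies a single normalized depth under fixed parameters; it does not fit the mixture to a population depth distribution. The shared standard deviation `sigma`, the uniform priors, the cap `max_cn`, and the function name are assumptions for illustration and are not taken from the referenced applications.

```python
import math

def call_copy_number(depth, sigma=0.1, max_cn=6, priors=None):
    """Return (most likely CN, posterior) for a normalized depth.

    depth is scaled so that the reference CN of 2 corresponds to 1.0, so the
    Gaussian for copy number cn has its mean constrained to cn / 2.
    """
    cns = list(range(max_cn + 1))
    if priors is None:
        priors = [1.0 / len(cns)] * len(cns)  # assumption: uniform priors
    # Work in log space so depths far from every mean do not underflow to 0.
    logs = [
        math.log(p) - 0.5 * ((depth - cn / 2) / sigma) ** 2
        for cn, p in zip(cns, priors)
    ]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    best = max(cns, key=lambda cn: posteriors[cn])
    return best, posteriors[best]

cn, confidence = call_copy_number(1.02)
print(cn)  # 2
```

Because the reference state (CN equals 2) is one of the mixture components, the posterior gives a confidence for reference calls as well as for variant calls.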
Referring to
Referring to
Table 1 shows exemplary copy number variants. The start and end positions of copy number variants can be used to determine the start and end positions of regions. Example 1 in Table 1 shows that two variants of a gene (or a portion thereof) can be at chr5:140842552-140859343 and chr5:140834702-140848902 (the first CNV and the second CNV, respectively). Thus the gene (or a portion thereof) can have three regions, chr5:140842552-140834701 (i.e., 140834702-1), 140834702-140848902, and 140848903 (i.e., 140848902+1)-140859343 (the first region, the second region, and the third region, respectively). The observed CN change of the first region should be the CN change of the first variant. The observed CN change of the second region should be the sum of the CN change of the first variant and the CN change of the second variant. The observed CN change of the third region should be the CN change of the first variant. Example 47 in Table 1 shows that three variants of a gene (or a portion thereof) can be at chr19:42749348-42862748, chr19:42788173-43042773, and chr19:42748348-42773348. Thus the gene (or a portion thereof) can have five regions, chr19:42748348-42749347 (i.e., 42749348-1), 42749348-42773348, 42773349 (i.e., 42773348+1)-42788172 (i.e., 42788173-1), 42788173-42862748, and 42862749 (i.e., 42862748+1)-43042773.
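The derivation of regions from CNV start and end positions can be sketched in Python using the three chr19 variants of Example 47 quoted above. The function name `regions_from_cnvs` and the labels `V1`, `V2`, and `V3` are hypothetical; coordinates are treated as inclusive, as in the text.

```python
def regions_from_cnvs(cnvs):
    """Split the span covered by CNVs into regions at every breakpoint.

    cnvs: {name: (start, end)} with inclusive coordinates on one chromosome.
    Returns a list of (start, end, names of the CNVs comprising the region).
    """
    # A region begins at each CNV start and at the position after each CNV end.
    cuts = sorted(
        {start for start, _ in cnvs.values()}
        | {end + 1 for _, end in cnvs.values()}
    )
    regions = []
    for lo, next_lo in zip(cuts, cuts[1:]):
        hi = next_lo - 1
        members = frozenset(
            name for name, (start, end) in cnvs.items()
            if start <= lo and hi <= end
        )
        if members:  # skip gaps not covered by any CNV
            regions.append((lo, hi, members))
    return regions

example_47 = {
    "V1": (42749348, 42862748),
    "V2": (42788173, 43042773),
    "V3": (42748348, 42773348),
}
for start, end, members in regions_from_cnvs(example_47):
    print(f"chr19:{start}-{end}", sorted(members))
```

For Example 47 this yields the five regions listed above, together with the CNVs comprising each region, which is the membership needed to relate the observed CN change of a region to the CN changes of the variants.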
Determining Alleles of a Gene with Overlapping CNVs
The method 600 can be efficient compared to other CNV calling methods, such as genome-wide CNV calling methods. Rediscovering the same variant in every sample using, for example, genome-wide CNV calling, can be very inefficient. In contrast, the method 600 can utilize prior knowledge of some or all possible CNVs that could exist in a given region of a gene, such as the CNVs shown in Table 1. Alternatively or additionally, the method 600 can be accurate compared to other CNV calling methods, such as genome-wide CNV calling methods. Due to the overlapping positions of CNVs, genome-wide CNV calling methods may be inaccurate, for example, when there is a mixture of signals from more than one CNV in a single sample. In contrast, the annotations generated or determined by the method 600 can be accurate. For example, the method 600 can determine that a subject (or the subject's sample) has an allele with a V2 deletion and another allele with a V4 deletion and a V5 duplication illustrated in
After the method 600 begins at block 604, the method 600 proceeds to block 608, where a computing system receives a plurality of sequence reads. The plurality of sequence reads can be generated from a sample. The sample can be obtained from a subject. Sequence reads can be, for example, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1250, 1500, 1750, 2000, or more base pairs in length each. For example, sequence reads can be about 100 base pairs to about 1000 base pairs in length each. The sequence reads can comprise paired-end sequence reads. The sequence reads can comprise single-end sequence reads. The sequence reads can be generated by whole genome sequencing (WGS). The WGS can be clinical WGS (cWGS). The sequence reads can be generated by targeted sequencing, such as sequencing of 5, 10, 20, 30, 40, 50, 100, 200, or more genes.
The sample can comprise cells, cell-free DNA, cell-free fetal DNA, amniotic fluid, a blood sample, a biopsy sample, or a combination thereof. The sample can be obtained directly from a subject. The sample can be generated from another sample obtained from a subject. The other sample can be obtained directly from the subject or the other sample can be generated from another sample obtained from the subject. The computing system can store the plurality of sequence reads in its memory. The computing system can load the plurality of sequence reads into its memory. Sequence reads can be generated by techniques such as sequencing by synthesis, sequencing by binding, or sequencing by ligation. Sequence reads can be generated using instruments such as MINISEQ, MISEQ, NEXTSEQ, HISEQ, and NOVASEQ sequencing instruments from Illumina, Inc. (San Diego, CA).
The method 600 proceeds from block 608 to block 612, where the computing system aligns the plurality of sequence reads to a reference sequence to obtain a plurality of aligned sequence reads comprising sequence reads aligned to a gene in the reference sequence. The gene can comprise a plurality of regions. The reference sequence can be, for example, a reference genome sequence, such as hg19 or hg38. Two copy number variants (CNVs) of a plurality of CNVs (or variants) of the gene can each comprise one or more regions of the plurality of regions. The two CNVs can differ by at least one region of the plurality of regions. Each CNV of the plurality of CNVs of the gene can comprise one or more (such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more) regions of the plurality of regions. One, one or more, or each CNV of the plurality of CNVs can differ from every other CNV of the plurality of CNVs by at least one region (such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more) of the plurality of regions.
The plurality of regions can comprise consecutive and/or non-overlapping regions. The number of the plurality of regions can be, or be about, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25, or 30. For example, the plurality of regions can comprise 2 to 10 regions.
The number of the plurality of CNVs can be different in different embodiments, such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, or 20. For example, the number of the plurality of CNVs can be 2 to 10. In some embodiments, one CNV of the plurality of CNVs does not overlap with one or more (such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more) other CNVs of the plurality of CNVs. Two or more (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, or more) of the plurality of CNVs may not overlap or may not comprise an identical region. CNVs of a gene that do not overlap or do not comprise an identical region are non-overlapping CNVs. No CNVs of the plurality of CNVs may overlap. All CNVs of the plurality of CNVs can be non-overlapping CNVs. In some embodiments, one CNV of the plurality of CNVs overlaps with one or more (such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more) other CNVs of the plurality of CNVs. Two CNVs of the plurality of CNVs of the gene can overlap or can comprise an identical region of the plurality of regions. Two CNVs that overlap or comprise an identical region of the plurality of regions are overlapping CNVs.
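The relationship between regions and CNVs described above can be sketched in Python as follows. This is a minimal illustration only: the CNV names "A", "B", and "C", the region labels, and the function names are hypothetical and are not the CNVs of Table 1. Each CNV is modeled as the set of regions it comprises, so two CNVs overlap when their sets share a region and differ when their sets are not equal.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class CNV:
    name: str
    regions: frozenset  # labels of the gene regions this CNV comprises


def overlaps(a: CNV, b: CNV) -> bool:
    """Two CNVs overlap when they comprise at least one identical region."""
    return bool(a.regions & b.regions)


def differs(a: CNV, b: CNV) -> bool:
    """Two CNVs differ when at least one region is in one but not the other."""
    return a.regions != b.regions


# Hypothetical catalogue over three consecutive, non-overlapping regions.
CATALOGUE = [
    CNV("A", frozenset({"r1", "r2"})),
    CNV("B", frozenset({"r2", "r3"})),
    CNV("C", frozenset({"r3"})),
]


def overlapping_pairs(cnvs):
    """Names of every pair of CNVs that share a region."""
    return [(a.name, b.name) for a, b in combinations(cnvs, 2) if overlaps(a, b)]


print(overlapping_pairs(CATALOGUE))  # [('A', 'B'), ('B', 'C')]
```

In this hypothetical catalogue, "A" and "C" are non-overlapping CNVs, while "A" and "B" are overlapping CNVs because both comprise region r2.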
As an example, a first region, a second region, and a third region of the plurality of regions can be consecutive and non-overlapping. A first CNV of the two CNVs can comprise the first region and the second region, not the third region (see
The plurality of CNVs can be predetermined, or the plurality of CNVs can be known (see
The computing system can align sequence reads to the reference sequence using an aligner or an alignment method such as Burrows-Wheeler Aligner (BWA), ISAAC, BarraCUDA, BFAST, BLASTN, BLAT, Bowtie, CASHX, Cloudburst, CUDA-EC, CUSHAW, CUSHAW2, CUSHAW2-GPU, drFAST, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP and GSNAP, Geneious Assembler, LAST, MAQ, mrFAST and mrsFAST, MOM, MOSAIK, MPscan, Novoalign & NovoalignCS, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, PRIMER, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RT Investigator, Segemehl, SeqMap, Shrec, SHRiMP, SLIDER, SOAP, SOAP2, SOAP3 and SOAP3-dp, SOCS, SSAHA and SSAHA2, Stampy, SToRM, Subread and Subjunc, Taipan, UGENE, VelociMapper, XpressAlign, and ZOOM.
The method 600 proceeds from block 612 to block 616, where the computing system determines a number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence. The number of the sequence reads aligned to each region of the plurality of regions of the gene can comprise a raw number or a normalized and/or GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene. Determining the number of copies of each region of the plurality of regions can comprise determining the number of copies of each region of the plurality of regions using the number of the sequence reads aligned to the region based on a normalized and/or GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence.
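As a non-limiting illustration of block 616, the raw number of sequence reads aligned to each region can be counted as in the Python sketch below. The sketch assumes that each aligned read is represented only by its alignment start coordinate, that a read is assigned to the region containing that start, and that regions are half-open intervals; the coordinates and the function name are hypothetical.

```python
from bisect import bisect_left


def count_reads_per_region(read_starts, regions):
    """Count reads whose alignment start falls in each half-open region [start, end)."""
    starts = sorted(read_starts)
    return {
        name: bisect_left(starts, end) - bisect_left(starts, start)
        for name, (start, end) in regions.items()
    }


# Hypothetical consecutive, non-overlapping regions of a gene.
regions = {"r1": (100, 200), "r2": (200, 300), "r3": (300, 400)}
# Hypothetical alignment start positions of reads mapped to the gene locus.
read_starts = [105, 150, 199, 200, 250, 310, 399, 400]

counts = count_reads_per_region(read_starts, regions)
print(counts)  # {'r1': 3, 'r2': 2, 'r3': 2}
```

Sorting the start positions once and using binary search keeps the count for each region logarithmic in the number of reads. Other assignment rules, such as counting a read toward every region it spans, are equally possible.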
The computing system can further determine the normalized number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence. The computing system can determine the normalized number using one or more of: (1a) a depth of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence; (1b) a length of the region of the gene; (2a) a depth of sequence reads of the plurality of sequence reads aligned to each of a plurality of regions of the reference sequence other than a genetic locus comprising the gene; and (2b) a length of each of the plurality of regions of the reference sequence other than the genetic locus comprising the gene. The computing system can further determine the GC-corrected number of the sequence reads aligned to each region of the plurality of regions of the gene in the reference sequence from the number or the normalized number of the sequence reads aligned to the region of the gene in the reference sequence using a GC content of the region of the gene in the reference sequence.
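One possible way to combine items (1a), (1b), (2a), and (2b) and the GC content is sketched below in Python. The specific choices here are assumptions made for illustration and are not stated in this disclosure: the per-base depth of a gene region is divided by the median per-base depth of the regions outside the gene locus, and GC correction divides by a relative depth looked up in a hypothetical table keyed by GC fraction. All names and numbers are hypothetical.

```python
from statistics import median


def normalized_depth(count, length, ref_counts, ref_lengths):
    """Per-base depth of a gene region divided by the median per-base depth
    of reference regions outside the genetic locus comprising the gene."""
    baseline = median(c / l for c, l in zip(ref_counts, ref_lengths))
    return (count / length) / baseline


def gc_corrected(depth, gc_fraction, gc_bias):
    """Divide by the relative depth expected at the nearest tabulated GC fraction."""
    nearest = min(gc_bias, key=lambda gc: abs(gc - gc_fraction))
    return depth / gc_bias[nearest]


# Hypothetical inputs: 150 reads in a 1000 bp gene region, three reference regions,
# and an assumed table of relative depth by GC fraction.
depth = normalized_depth(150, 1000, [200, 410, 590], [1000, 2000, 3000])
corrected = gc_corrected(depth, 0.62, {0.3: 0.9, 0.4: 1.0, 0.5: 1.0, 0.6: 0.8})
print(round(depth, 4), round(corrected, 4))  # 0.75 0.9375
```

The median is used for the baseline so that a few atypical reference regions do not dominate it; a mean or a fitted model could be substituted.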
The method 600 proceeds from block 616 to block 620, where the computing system determines a number of copies (or observed, estimated or determined copies) of each region of the plurality of regions based on the number of the sequence reads aligned to the region. The number of copies of each region comprises the number of copies of each region relative to a reference number of copies of the region. Such a number of copies of a region can be a change in the number of copies of the region, relative to the reference. The reference can be 2 (or 3, 4, 5, 6, 7, 8, 9, 10, or more). For example, the number of copies of the region r1 illustrated in
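A minimal Python sketch of block 620 follows. It assumes that the normalized depth of a region equals 1.0 when the region is present at the reference number of copies, and that the estimated number of copies is obtained by simple rounding; both are illustrative assumptions, and the depth values and function name are hypothetical.

```python
def copy_number_change(normalized_depth, reference_copies=2):
    """Estimated copies of a region minus the reference number of copies."""
    estimated = round(normalized_depth * reference_copies)
    return estimated - reference_copies


# Hypothetical normalized depths for three regions of a gene.
depths = {"r1": 0.52, "r2": 0.97, "r3": 1.49}
changes = {name: copy_number_change(d) for name, d in depths.items()}
print(changes)  # {'r1': -1, 'r2': 0, 'r3': 1}
```

With a reference of 2, a change of -1 corresponds to one copy of the region and a change of +1 corresponds to three copies. A probabilistic model of depth could replace the rounding step.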
The method 600 proceeds from block 620 to block 624, where the computing system determines two alleles of the gene of the subject (e.g., one allele with a V2 deletion and another allele with a V4 deletion and a V5 duplication). The computing system can determine the two alleles of the gene of the subject based on the number of copies (or the change in the number of copies) of each region of the plurality of regions and all CNVs (or each CNV) of the plurality of CNVs comprising the region. For example, the number of copies (or the change in the number of copies) of region r1 in
As an example, a first CNV of two CNVs can comprise the first region and the second region, not the third region (e.g., V1 in
To determine the two alleles of the gene of the subject, the computing system can determine (i) a number of copies of a first CNV in a first allele of the two alleles of the gene of the subject and (ii) a number of copies of a second CNV in a second allele of the two alleles of the gene of the subject such that (a) the number of copies of a region of the plurality of regions in the first CNV and not the second CNV is the number of copies of the first CNV (e.g., region r1 in
The computing system can determine the two alleles of the gene of the subject using the difference in the number of copies of each region of the plurality of regions, relative to the reference number of copies of the region, and one, one or more, or each CNV of the plurality of CNVs comprising the region. For example, as illustrated in
The two alleles of the subject can be identical. The two alleles of the subject can be different. A first allele of the two alleles of the subject can comprise a duplication (e.g., having two copies) of a CNV of the plurality of CNVs. A first allele of the two alleles can comprise a deletion (e.g., having zero copies) of a CNV of the plurality of CNVs. For example, one allele described with reference to
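The determination at block 624 can be illustrated with the brute-force Python sketch below. It is only one possible approach under stated assumptions and is not presented as the implementation of the method 600: the catalogue of CNVs "A" and "B" is hypothetical, each allele carries each CNV as a deletion (-1), no event (0), or a duplication (+1), the change observed for a region is the sum of the states of all CNVs comprising that region over both alleles, one allele cannot lose the same region twice, and pairs of alleles with the fewest CNV events are preferred.

```python
from itertools import combinations_with_replacement, product

REGIONS = ("r1", "r2", "r3")
# Hypothetical catalogue of known CNVs: name -> regions the CNV comprises.
CNVS = {"A": {"r1", "r2"}, "B": {"r2", "r3"}}


def allele_change(allele):
    """Per-region copy change of one allele; the allele gives -1, 0, or +1 per CNV."""
    return tuple(
        sum(state for state, spanned in zip(allele, CNVS.values()) if region in spanned)
        for region in REGIONS
    )


def candidate_allele_pairs(observed):
    """Return the allele pairs with the fewest CNV events that explain `observed`."""
    alleles = list(product((-1, 0, 1), repeat=len(CNVS)))
    best, fewest = [], None
    for a, b in combinations_with_replacement(alleles, 2):
        change_a, change_b = allele_change(a), allele_change(b)
        # Assumption: one allele holds one reference copy, so it cannot lose a region twice.
        if min(change_a + change_b) < -1:
            continue
        if tuple(x + y for x, y in zip(change_a, change_b)) != tuple(observed):
            continue
        events = sum(abs(state) for state in a + b)
        if fewest is None or events < fewest:
            best, fewest = [(a, b)], events
        elif events == fewest:
            best.append((a, b))
    return best


# Both alleles carry a deletion of CNV "A": regions r1 and r2 each lose two copies.
print(candidate_allele_pairs((-2, -2, 0)))  # [((-1, 0), (-1, 0))]
```

In the example, region r1 is in "A" and not in "B", so its change fixes the total number of copies of "A", and region r2 must then equal the combined contribution of "A" and "B". For an observed change of (-1, 0, +1), the sketch returns two candidate pairs, one with the "A" deletion and the "B" duplication on the same allele and one with them on different alleles, because region-level copy numbers alone do not distinguish the two arrangements in this simplified model.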
The computing system can create a file or a report representing or comprising alleles of the gene of the subject and/or the one or more regions of the plurality of regions in each of the two alleles. The computing system can generate a user interface (UI) comprising a UI element representing or comprising alleles of the gene of the subject and/or the one or more regions of the plurality of regions in each of the two alleles. A UI element can be a window (e.g., a container window, browser window, text terminal, child window, or message window), a menu (e.g., a menu bar, context menu, or menu extra), an icon, or a tab. A UI element can be for input control (e.g., a checkbox, radio button, dropdown list, list box, button, toggle, text field, or date field). A UI element can be navigational (e.g., a breadcrumb, slider, search field, pagination, tag, or icon). A UI element can be informational (e.g., a tooltip, icon, progress bar, notification, message box, or modal window). A UI element can be a container (e.g., an accordion).
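As a simple illustration of such a report, the Python sketch below serializes the two alleles and the regions in each allele as JSON. The field names, the sample identifier, the gene name "GENE1", and the CNV name "A" are hypothetical placeholders and do not reflect a prescribed file format.

```python
import json


def build_report(sample_id, gene, alleles):
    """Serialize the called alleles; each allele lists its CNV events and its regions."""
    return json.dumps(
        {"sample": sample_id, "gene": gene, "alleles": alleles},
        indent=2,
        sort_keys=True,
    )


# Hypothetical call: one allele with a deletion of CNV "A" (regions r1 and r2 absent)
# and one allele with no CNV event.
report = build_report(
    "sample-001",
    "GENE1",
    [
        {"events": ["A deletion"], "regions": ["r3"]},
        {"events": [], "regions": ["r1", "r2", "r3"]},
    ],
)
print(report)
```

The resulting string can be written to a file or passed to a UI element for display.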
The method 600 ends at block 628.
The memory 770 may contain computer program instructions (grouped as modules or components in some embodiments) that the processing unit 710 executes in order to implement one or more embodiments. The memory 770 generally includes RAM, ROM and/or other persistent, auxiliary or non-transitory computer-readable media. The memory 770 may store an operating system 772 that provides computer program instructions for use by the processing unit 710 in the general administration and operation of the computing device 700. The memory 770 may further include computer program instructions and other information for implementing aspects of the present disclosure.
For example, in one embodiment, the memory 770 includes an allele determination module 774 for determining (or calling) alleles of a subject, such as by performing the method 600 described with reference to
In at least some of the previously described embodiments, one or more elements used in an embodiment can interchangeably be used in another embodiment unless such a replacement is not technically feasible. It will be appreciated by those skilled in the art that various other omissions, additions and modifications may be made to the methods and structures described above without departing from the scope of the claimed subject matter. All such modifications and changes are intended to fall within the scope of the subject matter, as defined by the appended claims.
One skilled in the art will appreciate that, for this and other processes and methods disclosed herein, the functions performed in the processes and methods can be implemented in differing order. Furthermore, the outlined steps and operations are only provided as examples, and some of the steps and operations can be optional, combined into fewer steps and operations, or expanded into additional steps and operations without detracting from the essence of the disclosed embodiments.
With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Accordingly, phrases such as “a device configured to” are intended to include one or more recited devices. Such one or more recited devices can also be collectively configured to carry out the stated recitations. For example, “a processor configured to carry out recitations A, B and C” can include a first processor configured to carry out recitation A and working in conjunction with a second processor configured to carry out recitations B and C. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to embodiments containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an” (e.g., “a” and/or “an” should be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should be interpreted to mean at least the recited number (e.g., the bare recitation of “two recitations,” without other modifiers, means at least two recitations, or two or more recitations). 
Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
In addition, where features or aspects of the disclosure are described in terms of Markush groups, those skilled in the art will recognize that the disclosure is also thereby described in terms of any individual member or subgroup of members of the Markush group.
As will be understood by one skilled in the art, for any and all purposes, such as in terms of providing a written description, all ranges disclosed herein also encompass any and all possible sub-ranges and combinations of sub-ranges thereof. Any listed range can be easily recognized as sufficiently describing and enabling the same range being broken down into at least equal halves, thirds, quarters, fifths, tenths, etc. As a non-limiting example, each range discussed herein can be readily broken down into a lower third, middle third and upper third, etc. As will also be understood by one skilled in the art all language such as “up to,” “at least,” “greater than,” “less than,” and the like include the number recited and refer to ranges which can be subsequently broken down into sub-ranges as discussed above. Finally, as will be understood by one skilled in the art, a range includes each individual member. Thus, for example, a group having 1-3 articles refers to groups having 1, 2, or 3 articles. Similarly, a group having 1-5 articles refers to groups having 1, 2, 3, 4, or 5 articles, and so forth.
It will be appreciated that various embodiments of the present disclosure have been described herein for purposes of illustration, and that various modifications may be made without departing from the scope and spirit of the present disclosure. Accordingly, the various embodiments disclosed herein are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
It is to be understood that not necessarily all objects or advantages may be achieved in accordance with any particular embodiment described herein. Thus, for example, those skilled in the art will recognize that certain embodiments may be configured to operate in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other objects or advantages as may be taught or suggested herein.
All of the processes described herein may be embodied in, and fully automated via, software code modules executed by a computing system that includes one or more computers or processors. The code modules may be stored in any type of non-transitory computer-readable medium or other computer storage device. Some or all of the methods may be embodied in specialized computer hardware.
Many other variations than those described herein will be apparent from this disclosure. For example, depending on the embodiment, certain acts, events, or functions of any of the algorithms described herein can be performed in a different sequence, can be added, merged, or left out altogether (for example, not all described acts or events are necessary for the practice of the algorithms). Moreover, in certain embodiments, acts or events can be performed concurrently, for example through multi-threaded processing, interrupt processing, or multiple processors or processor cores or on other parallel architectures, rather than sequentially. In addition, different tasks or processes can be performed by different machines and/or computing systems that can function together.
The various illustrative logical blocks and modules described in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a processing unit or processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like. A processor can include electrical circuitry configured to process computer-executable instructions. In another embodiment, a processor includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions. A processor can also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Although described herein primarily with respect to digital technology, a processor may also include primarily analog components. For example, some or all of the signal processing algorithms described herein may be implemented in analog circuitry or mixed analog and digital circuitry. A computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
Any process descriptions, elements or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or elements in the process. Alternate implementations are included within the scope of the embodiments described herein, in which elements or functions may be deleted, or executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved as would be understood by those skilled in the art.
It should be emphasized that many variations and modifications may be made to the above-described embodiments, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.
This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/332,107, filed Apr. 18, 2022. The content of this related application is incorporated herein by reference in its entirety for all purposes.
Number | Date | Country
---|---|---
63332107 | Apr 2022 | US