The following disclosure relates generally to determining genotypes and, more specifically, to determining genotypes associated with a gene having a corresponding highly homologous homolog.
Many diseases result from genes rendered inactive by mutation. Identification of such mutations is, therefore, a fundamental goal of clinical genetic medicine. For many genes, these mutations are relatively easy to find from Next Generation Sequencing (NGS) data. However, for a subset of genes that are the subject of several important and prevalent disorders, it is challenging to identify and count the number of inactivated genes, since these genes are effectively occluded by other homologous parts of the genome.
Resolving the structure and content of genomic regions that are highly homologous to other (typically dysfunctional) regions is exceptionally difficult, even with advanced NGS tools. Unfortunately, these technical obstacles are especially problematic, as many of these difficult regions have disease implications. Indeed, their very homology to dysfunctional regions leads to frequent rearrangements between genes and homologs, which can affect the number of functional copies of the gene.
Thus, there remains a need for methods of detecting and determining the genotype and/or carrier status of a subject with respect to a gene, wherein the gene has a highly homologous homolog.
Current technologies that allow determination of genotypes for highly homologous genes and the corresponding homologs are time- and labor-intensive, as well as expensive, making them unsuitable for widespread clinical use.
The presently disclosed methods may be practiced in an affordable and high-throughput manner. Thus, there are significant time, labor and expense savings. In addition, the present method overcomes the problem of resolving structure/copy-number/genotype in regions where the unique alignment of NGS reads to genes or their homologs is compromised. Importantly, whether a region is “highly homologous” in this compromising sense depends on two features: (1) the length of the NGS reads in the given experiment and (2) the number of mismatches allowed by the alignment software, e.g., BWA.
In an aspect there is provided herein a method for determining the genomic structure (i.e., genotype) of an individual with respect to a gene of interest, wherein the gene of interest has a highly homologous homolog.
In an embodiment, the sequence information for the gene of interest and its homolog is obtained using primers that are directed to an exon. In certain embodiments, the sequence information is from an intron of a gene of interest and/or homolog. In certain embodiments, the sequence information is from intergenic regions.
In a further embodiment, the sequence information is generated by Next Generation Sequencing (NGS). In some embodiments the NGS is high-depth whole-genome shotgun sequencing (i.e., without the use of probes for enrichment). In other embodiments, the NGS is targeted sequencing such as, for example, hybrid-capture technology, multiplex amplicon enrichment, or any other means of enriching specific regions of the genome for the sequencing reaction. In some embodiments, the sequencing is done in a multiplex assay.
In one embodiment, the gene is SMN1 and the pseudogene is SMN2. In an embodiment, the presence of an altered copy number of SMN1 indicates that the subject may be a carrier for the disease spinal muscular atrophy (SMA).
In another embodiment, the gene is CYP21A2 and the pseudogene is CYP21A1P. In an embodiment, the presence of an altered copy number of CYP21A2 indicates that the subject may be a carrier for the disease congenital adrenal hyperplasia (CAH).
In an embodiment, the gene is HBA1 and the homolog is HBA2 (or vice versa). In an embodiment, the presence of an altered copy number of either HBA1 or HBA2 indicates that the subject may be a carrier for the disease alpha-thalassemia.
In a further embodiment, the gene is GBA and the pseudogene is GBAP. In an embodiment, the presence of an altered copy number of GBA indicates that the subject may be a carrier for the disease Gaucher's Disease.
In an embodiment, the gene is PMS2 and the pseudogene is either PMS2CL or one of several other pseudogenes. As of December 2015 there were 15 pseudogenes. The pseudogenes may be selected from, but are not limited to, the 13 pseudogenes consisting of PMS2CL and 12 others numbered PMS2P1 through PMS2P12. In an embodiment, the presence of an altered copy number and/or inversions that alter orientation of the gene and pseudogene (e.g., those that fuse portions of pseudogene with the gene and thus compromise gene function) may indicate that the subject has increased risk for the disease Lynch Syndrome.
In an embodiment, the gene is CHEK2, which has several pseudogenes. As of December 2014, there were seven pseudogenes. The pseudogenes may be selected from, but are not limited to, CHEK2 pseudogenes enumerated in a curated database. In an embodiment, the presence of mutations that arise from recombination with its pseudogenes (e.g., a pseudogene-derived frameshift mutation) may indicate that the subject has increased risk for the disease breast cancer, among other diseases. It is well known in the art that only one of the seven pseudogenes has been named and that risk is primarily associated with one mutation, 1100delC. However, other mutations also contribute to risk of disease. Patients are at risk for Li-Fraumeni syndrome and other heritable cancers.
In an aspect, there is provided a computer system configured to execute instructions for carrying out the methods described herein.
Other objects, features and advantages of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the scope and spirit of the invention will become apparent to one skilled in the art from this detailed description.
The file of this patent contains at least one drawing in color. Copies of this patent or patent publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The invention will now be described in detail by way of reference only using the following definitions and examples. All patents and publications, including all sequences disclosed within such patents and publications, referred to herein are expressly incorporated by reference.
Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Singleton, et al., D
Numeric ranges are inclusive of the numbers defining the range. The term “about” is used herein to mean plus or minus ten percent (10%) of a value. For example, “about 100” refers to any number between 90 and 110.
Unless otherwise indicated, nucleic acids are written left to right in 5′ to 3′ orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.
The headings provided herein are not limitations of the various aspects or embodiments of the invention which can be had by reference to the specification as a whole. Accordingly, the terms defined immediately below are more fully defined by reference to the specification as a whole.
As used herein, “purified” means that a molecule is present in a sample at a concentration of at least 95% by weight, or at least 98% by weight of the sample in which it is contained.
An “isolated” molecule is a nucleic acid molecule that is separated from at least one other molecule with which it is ordinarily associated, for example, in its natural environment. An isolated nucleic acid molecule includes a nucleic acid molecule contained in cells that ordinarily express the nucleic acid molecule, but the nucleic acid molecule is present extrachromosomally or at a chromosomal location that is different from its natural chromosomal location.
The term “% homology” is used interchangeably herein with the term “% identity” and refers to the level of nucleic acid or amino acid sequence identity between the nucleic acid sequence that encodes any one of the inventive polypeptides or the inventive polypeptide's amino acid sequence, when aligned using a sequence alignment program. In the case of a nucleic acid the term also applies to the intronic and/or intergenic regions.
For example, as used herein, 80% homology means the same thing as 80% sequence identity determined by a defined algorithm, and accordingly a homolog of a given sequence has greater than 80% sequence identity over a length of the given sequence. Exemplary levels of sequence identity include, but are not limited to, 80, 85, 90, 95, 98% or more sequence identity to a given sequence, e.g., the coding sequence for any one of the inventive polypeptides, as described herein.
Exemplary computer programs which can be used to determine identity between two sequences include, but are not limited to, the suite of BLAST programs, e.g., BLASTN, BLASTX, TBLASTX, BLASTP and TBLASTN, as well as BLAT, all publicly available on the Internet. See also, Altschul, et al., 1990 and Altschul, et al., 1997.
Sequence searches are typically carried out using the BLASTN program when evaluating a given nucleic acid sequence relative to nucleic acid sequences in the GenBank DNA Sequences and other public databases. The BLASTX program is preferred for searching nucleic acid sequences that have been translated in all reading frames against amino acid sequences in the GenBank Protein Sequences and other public databases. Both BLASTN and BLASTX are run using default parameters of an open gap penalty of 11.0, and an extended gap penalty of 1.0, and utilize the BLOSUM-62 matrix. (See, e.g., Altschul, S. F., et al., Nucleic Acids Res. 25:3389-3402, 1997.)
A preferred alignment of selected sequences, in order to determine “% identity” between two or more sequences, is performed using, for example, the CLUSTAL-W program in MacVector version 13.0.7, operated with default parameters, including an open gap penalty of 10.0, an extended gap penalty of 0.1, and a BLOSUM 30 similarity matrix.
As used herein, “highly homologous” means that the homology between a gene and its corresponding homolog is greater than 90% over a region whose length corresponds to the NGS read length. Thus, a gene and its homolog are referred to as “highly homologous” if any region in the gene is highly homologous to the homolog. An NGS read length may range from 30 nt to 400 nt, from 50 nt to 250 nt, from 50 nt to 150 nt, or from 100 nt to 200 nt. Importantly, the entire gene's sequence need not be “highly homologous” to say a gene has a homolog; only a region in the gene needs to be highly homologous.
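By way of illustration only, the read-length-dependent definition above can be sketched in Python. The sketch assumes the gene and homolog sequences have already been aligned to equal length without gaps; the function names, the gap-free assumption, and the default threshold argument are hypothetical conveniences and not part of the disclosed method.

```python
def window_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_highly_homologous(gene: str, homolog: str, read_length: int,
                         threshold: float = 0.90) -> bool:
    """Return True if ANY read-length window exceeds the identity threshold.

    Assumption: `gene` and `homolog` are pre-aligned, gap-free, equal length.
    """
    if len(gene) != len(homolog):
        raise ValueError("sequences must be pre-aligned to equal length")
    if read_length <= 0 or read_length > len(gene):
        raise ValueError("read_length must be between 1 and the sequence length")
    for start in range(len(gene) - read_length + 1):
        g = gene[start:start + read_length]
        h = homolog[start:start + read_length]
        if window_identity(g, h) > threshold:
            return True  # one qualifying region is sufficient
    return False
```

Because the test is applied per window, the same pair of sequences may qualify as highly homologous at one read length and not at another, consistent with the definition above.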
The term “homolog” as used herein refers to a DNA sequence that is identical or nearly identical to a gene of interest located elsewhere in the subject's genome. The homolog can be either another gene, a “pseudogene,” or a segment of sequence that is not part of a gene.
The term “mutation” as used herein refers to both spontaneous and inherited sequence variations, including, but not limited to, variations between individuals, or between an individual's sequence and a reference sequence. Exemplary mutations include, but are not limited to, SNPs, indel, copy number variants, inversions, translocations, chromosomal fusions, etc.
A “pseudogene” as used herein is a DNA sequence that closely resembles a gene in DNA sequence but harbors at least one change that renders it dysfunctional. The change may be a single residue mutation. The change may result in a splice variant. The change may result in early termination of translation. A pseudogene is a dysfunctional relative of a functional gene. Pseudogenes are characterized by a combination of homology to a known gene (i.e., a gene of interest) and nonfunctionality.
The number of pseudogenes for genes is not limited to those enumerated herein. Pseudogenes are increasingly recognized. Therefore, a person skilled in the art would be able to determine if a sequence is a pseudogene on the basis of sequence homology or by reference to a curated database such as, for example, GeneCards (genecards.org), pseudogenes.org, etc.
As used herein, a “gene of interest” is a gene for which determining the number of functional copies is desired. Generally, a gene of interest has two functional copies due to the two chromosomes each having a copy of the gene of interest. The terms “gene of interest” and “gene” may be used interchangeably herein.
Sequences from the region of interest are enriched, where possible, with hybrid-capture probes or PCR primers, which should be designed such that the captured and sequenced fragments contain at least one sequence that distinguishes a gene from its homolog(s). For example, hybrid-capture probes may be designed to anneal adjacent to the few bases that differ between the gene and the homolog(s)/pseudogene(s) (“diff bases”). Where such distinguishing sequence is scarce, multiple probes should be used to capture distinguishable fragments to diminish the effect of biases inherent to each particular probe's sequence. Amplicon sequencing can be used as an alternative to hybrid-capture as a means to achieve targeted sequencing. High-depth whole-genome sequencing can be used as an alternative to targeted sequencing. Any high-throughput quantitative data that reflects the dose of a particular genomic region may be used, be it from NGS, microarrays, or any other high-throughput quantitative molecular biology technique.
The abundance of NGS sequence reads bearing gene- or homolog-derived bases permits distinction between normal (CN=2) and mutant individuals (CN≠2). Additional useful information is attainable, however, even from sequence reads that cannot distinguish gene from homolog, as in the case of HBA1 and HBA2, where the normal combined CN of the two identical genes is 4, and a deletion in either gene leads to collective CN 3. Note that, in principle, the CN analysis described herein could be applied even to high-depth whole-genome shotgun sequencing (i.e., without the use of probes for enrichment).
Broadly speaking, and in one example, to generate a call for a region, the following process is performed, which is illustrated as process 10 in
Partition reads to gene or homolog(s) based on the presence of the base(s) that distinguish them. The distinguishing base(s) exploited in this partitioning process depend on the particular gene of interest. Further, the partitioning may only use a subset of the distinguishing bases in a given read, again based on the specific application. In an embodiment where the hybrid-capture probe sequence itself becomes part of the sequenced fragment, the hybrid-capture probe is designed such that the distinguishing base is at or near the terminus of one of the ends of a paired-end read. For example, in such a case, the hybrid-capture probe is, e.g., 39 bases long, but the sequencer reads 40 bases from the captured fragment. The probe is designed such that the 40th base is a distinguishing base, thereby allowing the entire read (i.e., both ends of the paired-end read) to be partitioned to gene or homolog(s) based on the 40th position's base. The precise numbers (i.e., 39 and 40) in the example above could change and yield similar results. In principle, the probe could be as short as 10 bp or as long as 1000 bp, though lengths in the range of 20 bp-100 bp are most common. In embodiments like the one above where the probe becomes part of the sequenced fragment, the sequencer must read beyond the length of the probe by at least 1 bp; however, in embodiments where the captured fragment alone contains enough distinguishing bases to partition the read appropriately to gene or homolog, then sequencing need not necessarily extend beyond the length of the probe.
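The partitioning step described above may be sketched as follows. This is a minimal illustration assuming a single distinguishing base at a fixed read position (the 40th base in the example above); the function name, the bin labels, and the handling of unexpected bases as “ambiguous” are assumptions made for the sketch.

```python
from collections import defaultdict


def partition_reads(reads, position, gene_base, homolog_base):
    """Assign each read to 'gene', 'homolog', or 'ambiguous'.

    reads        -- iterable of read sequences (the probe-containing end)
    position     -- 1-based position of the distinguishing base in the read
    gene_base    -- base expected at `position` for gene-derived reads
    homolog_base -- base expected at `position` for homolog-derived reads
    """
    bins = defaultdict(list)
    for read in reads:
        if len(read) < position:
            # The sequencer did not read past the probe; cannot partition.
            bins["ambiguous"].append(read)
            continue
        base = read[position - 1]
        if base == gene_base:
            bins["gene"].append(read)
        elif base == homolog_base:
            bins["homolog"].append(read)
        else:
            # Neither expected base (e.g., sequencing error or novel variant).
            bins["ambiguous"].append(read)
    return dict(bins)
```

In a paired-end setting, the label assigned to the probe-containing end would be carried over to its mate, so that both ends of the read are partitioned together.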
An exemplary treatment of experimental data is shown in
As shown in
The next step is depicted in
xi,j=ri,j/median(ri,CS1:ri,CSX)
where ri,j is the number of raw reads in sample i at site j. The median is evaluated over all sites j that are in the set of CS sites. xi,j is the “sample-normalized depth value” for sample i at site j; xi,j is calculated for all sites j in both CS and TS.
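The sample-level normalization defined above can be sketched in Python as follows. The nested-dictionary data layout, the function name, and the site labels used in the usage note are hypothetical; only the formula itself is taken from the description.

```python
from statistics import median


def sample_normalize(raw_depth, control_sites):
    """Compute x[i][j] = r[i][j] / median(r[i][CS1], ..., r[i][CSX]).

    raw_depth     -- dict: sample -> dict: site -> raw read count r[i][j]
    control_sites -- list of site names belonging to the CS set
    """
    normalized = {}
    for sample, depths in raw_depth.items():
        # The median is taken over control sites only, within one sample.
        control_median = median(depths[site] for site in control_sites)
        if control_median == 0:
            raise ValueError(f"sample {sample!r} has zero median control depth")
        # The normalized value is computed for every site, CS and TS alike.
        normalized[sample] = {
            site: count / control_median for site, count in depths.items()
        }
    return normalized
```

For example, a sample with control depths 100, 200 and 300 has a control median of 200, so a test site with 100 raw reads receives a sample-normalized depth value of 0.5.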
As provided for in
The normalization starts with calculating the median down each column. This is done for both TS and CS columns as shown in
CNi,j=2*xi,j/median(xS1,j:xSX,j)
where xi,j is the “sample-normalized depth value” from above. The median is calculated over all samples for site j. CNi,j is the decimal approximation of the copy number of site j in sample i. Since the copy number of a sequence in the genome is an integer value, each CNi,j can be rounded to its nearest integer value, and confidence in the call can be calculated as described herein.
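The site-level normalization and rounding can be sketched as follows, taking the sample-normalized depth values as input. The function names and data layout are hypothetical. The `baseline` parameter is an assumption added so that the same sketch could cover a locus whose normal combined copy number is not 2 (such as the HBA1/HBA2 case noted earlier); the equation above uses 2.

```python
import math
from statistics import median


def copy_number(x, baseline=2):
    """Compute CN[i][j] = baseline * x[i][j] / median over samples of x[.][j].

    x -- dict: sample -> dict: site -> sample-normalized depth value
    Assumption: every sample reports the same set of sites.
    """
    samples = list(x)
    sites = list(x[samples[0]])
    # Median down each column, i.e., over all samples for a given site.
    site_median = {j: median(x[i][j] for i in samples) for j in sites}
    for j, m in site_median.items():
        if m == 0:
            raise ValueError(f"site {j!r} has zero median depth")
    return {
        i: {j: baseline * x[i][j] / site_median[j] for j in sites}
        for i in samples
    }


def integer_calls(cn):
    """Round each decimal CN to the nearest integer (halves round up)."""
    return {
        i: {j: math.floor(v + 0.5) for j, v in row.items()}
        for i, row in cn.items()
    }
```

Note that this normalization presumes that most samples carry the normal copy number at each site, so that the median across samples corresponds to the baseline.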
Note that the final normalization step indicated in the equation immediately above may be modified for TS's where CN is highly variable (i.e., where a small majority or even a minority of samples have CN=2). For instance, in the right plot of
The final step is interpretation of the data. For each disease (Congenital Adrenal Hyperplasia (CAH), Spinal Muscular Atrophy (SMA), Gaucher's disease, and alpha-thalassemia), we look for contiguous TS's in which the CN signal deviates from 2. Note that “Sample 1” in
It is worth noting that the CN analysis described herein is a critical upstream step for finding other types of clinically relevant mutations in a gene with a homolog. For instance, in addition to CN variants (shown in
Since we typically have multiple TS's for a given test, we can assess confidence in our CN determination using a z-score. Here are the steps that may be used:
One of skill in the art will appreciate that other statistical approaches that are insensitive to outliers and yield an approximation of the standard deviation of the data may be used. Spans of similar copy number (e.g., a series of adjacent sites with CN=1, consistent with a large deletion) can be identified in a supervised manner (e.g., by eye or by matching to known or hypothesized recombination sites) or in an unsupervised manner (e.g., using a Hidden Markov Model).
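As one hedged illustration of an outlier-insensitive approach, the sketch below estimates the standard deviation from the median absolute deviation (MAD, scaled by 1.4826 under an assumption of approximately normal data) and uses it to form a z-score. It also includes a simple scan for runs of adjacent sites whose integer CN differs from normal. The choice of MAD, the function names, and the run-scanning logic are assumptions for illustration; they are not the specific steps of the disclosure, and the run scan is a simple stand-in, not a Hidden Markov Model.

```python
from statistics import median


def robust_sigma(values):
    """Approximate the standard deviation as 1.4826 * MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return 1.4826 * mad


def robust_z(value, reference):
    """z-score of `value` against a reference set of CN values."""
    sigma = robust_sigma(reference)
    if sigma == 0:
        raise ValueError("reference values have zero robust spread")
    return (value - median(reference)) / sigma


def find_cn_runs(calls, normal=2):
    """Return (start, end, cn) for each maximal run of adjacent sites
    sharing the same integer CN that differs from `normal`.
    Indices are 0-based and `end` is inclusive."""
    runs = []
    start = None
    for idx, cn in enumerate(calls):
        if start is not None and cn != calls[start]:
            runs.append((start, idx - 1, calls[start]))  # close current run
            start = None
        if start is None and cn != normal:
            start = idx  # open a new run at this site
    if start is not None:
        runs.append((start, len(calls) - 1, calls[start]))
    return runs
```

A run such as three adjacent sites with CN=1 would then be a candidate for a large deletion, to be compared against known or hypothesized recombination sites.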
An exemplary environment and system in which certain aspects and examples of the systems and processes described herein may operate. As shown in
User devices 102 can communicate with server system 110 through one or more networks 108, which can include the Internet, an intranet, or any other wired or wireless public or private network. The client-side portion of the exemplary system on user device 102 can provide client-side functionalities, such as user-facing input and output processing and communications with server system 110. Server system 110 can provide server-side functionalities for any number of clients residing on a respective user device 102. Further, server system 110 can include one or more caller servers 114 that can include a client-facing I/O interface 122, one or more processing modules 118, data and model storage 120, and an I/O interface to external services 116. The client-facing I/O interface 122 can facilitate the client-facing input and output processing for caller servers 114. The one or more processing modules 118 can include various issue and candidate scoring models as described herein. In some examples, caller server 114 can communicate with external services 124, such as text databases, subscriptions services, government record services, and the like, through network(s) 108 for task completion or information acquisition. The I/O interface to external services 116 can facilitate such communications.
Server system 110 can be implemented on one or more standalone data processing devices or a distributed network of computers. In some examples, server system 110 can employ various virtual devices and/or services of third-party service providers (e.g., third-party cloud service providers) to provide the underlying computing resources and/or infrastructure resources of server system 110.
Although the functionality of the caller server 114 is shown in
It should be noted that server system 110 and clients 102 may further include any one of various types of computer devices, having, e.g., a processing unit, a memory (which may include logic or software for carrying out some or all of the functions described herein), and a communication interface, as well as other conventional computer components (e.g., input device, such as a keyboard/touch screen, and output device, such as display). Further, one or both of server system 110 and clients 102 generally includes logic (e.g., http web server logic) or is programmed to format data, accessed from local or remote databases or other sources of data and content. To this end, server system 110 may utilize various web data interface techniques such as Common Gateway Interface (CGI) protocol and associated applications (or “scripts”), Java® “servlets,” Java® applications running on server system 110, or the like to present information and receive input from clients 102. Server system 110, although described herein in the singular, may actually comprise plural computers, devices, databases, associated backend devices, and the like, communicating (wired and/or wireless) and cooperating to perform some or all of the functions described herein. Server system 110 may further include or communicate with account servers (e.g., email servers), mobile servers, media servers, and the like.
It should further be noted that although the exemplary methods and systems described herein describe use of a separate server and database systems for performing various functions, other embodiments could be implemented by storing the software or programming that operates to cause the described functions on a single device or any combination of multiple devices as a matter of design choice so long as the functionality described is performed. Similarly, the database system described can be implemented as a single database, a distributed database, a collection of distributed databases, a database with redundant online or offline backups or other redundancies, or the like, and can include a distributed database or storage network and associated processing intelligence. Although not depicted in the figures, server system 110 (and other servers and services described herein) generally include such art recognized components as are ordinarily found in server systems, including but not limited to processors, RAM, ROM, clocks, hardware drivers, associated storage, and the like (see, e.g.,
At least some values based on the results of the above-described processes can be saved for subsequent use. Additionally, a non-transitory computer-readable medium can be used to store (e.g., tangibly embody) one or more computer programs for performing any one of the above-described processes by means of a computer. The computer program may be written, for example, in a general-purpose programming language (e.g., Pascal, C, C++, Python, Java) or some specialized application-specific language.
Various exemplary embodiments are described herein. Reference is made to these examples in a non-limiting sense. They are provided to illustrate more broadly applicable aspects of the disclosed technology. Various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the various embodiments. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process act(s) or step(s) to the objective(s), spirit or scope of the various embodiments. Further, as will be appreciated by those with skill in the art, each of the individual variations described and illustrated herein has discrete components and features that may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the various embodiments. All such modifications are intended to be within the scope of claims associated with this disclosure.
The present invention is described in further detail in the following examples, which are not in any way intended to limit the scope of the invention as claimed. The attached Figures are meant to be considered as integral parts of the specification and description of the invention. All references cited are herein specifically incorporated by reference for all that is described therein. The following examples are offered to illustrate, but not to limit the claimed invention.
This example illustrates the method for determining gene/homolog copy number and is schematized in
The method comprises the following steps.
Results for various gene/homolog determinations are shown in
This example illustrates the method for determining gene/homolog copy number for a specific gene using probes that anneal adjacent to a base that is different between the gene and the homolog(s) or pseudogene(s).
Hybrid-capture probes were designed to anneal adjacent to the few bases that differ between CYP21A2 and CYP21A1P (“diff bases”). Paired-end NGS of captured fragments allows designation of reads as being either gene- or pseudogene-derived based on the diff bases. CAH variants were identified using two strategies: SNP-based calling and copy-number analysis. SNP-based calling at a given position searched for deleterious and/or pseudogene-derived bases in a pileup composed of reads with gene-derived diff bases distal from the position of interest. By contrast, copy-number analysis used read depth of diff bases to calculate the relative abundance of each variant, and deleterious variants were identified as those with excess copy number of pseudogene-derived sequence (and, conversely, depleted copy number of gene-derived sequence). Long-range PCR and Sanger sequencing were used to confirm variants in a validation study.
The test correctly identified the genotypes of positive-control samples from affected patients, and we have since run the validated CAH test on nearly 150,000 clinical samples. The variant frequencies observed are consistent with prior studies that sequenced CYP21A2 in affected patients. There is great diversity in the copy number of gene and pseudogene: 38% of patients have at least one haplotype that does not simply have one copy of each. Evidence for recombination between gene and pseudogene is widespread, with at least 83% having a CYP21A2 haplotype containing pseudogene-derived bases. Finally, the test identifies compound variants consistent with specific rare haplotypes, e.g., (1) three copies of CYP21A2 where one has the Q319X mutation, and (2) CYP21A2 with a V282L mutation in cis with two copies of CYP21A1P, a haplotype enriched in Ashkenazi Jewish patients.
It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.
The present application claims priority to U.S. Provisional Patent Application 62/097,139, filed on 29 Dec. 2014, entitled “Method For Determining Genotypes in Regions of High Homology”, and U.S. Provisional Patent Application 62/234,012, filed on 28 Sep. 2015, entitled “Method For Determining Genotypes in Regions of High Homology”.
Number | Date | Country
---|---|---
62097139 | Dec 2014 | US
62234012 | Sep 2015 | US