The present invention generally relates to genomic characteristics that correlate with disease, and in particular, to methods of detecting tandem repeats indicative of disease.
Technical advances in genome analysis have improved clinical diagnosis and gene discovery for human diseases that have a significant genetic component. However, for many complex disorders, including autism spectrum disorder (ASD), the causal genetic variants thus far identified generally confer less risk than expected from empirical estimates of heritability.
ASD refers to a group of neurodevelopmental disorders that are characterized by atypical social function, communication deficits, restricted interests, and repetitive behaviors. Genetic factors contribute to the etiology of ASD; twin studies estimate heritability in the 50-90% range, and recurrence in families is ~20%. Individuals with ASD can have additional medical complications such as intellectual disability or epilepsy, and ASD itself features in many medical genetic conditions, the prototypical example being fragile X syndrome.
Genomic analyses with microarrays, exome sequencing, and genome sequencing have shown that individuals with ASD have a two- to three-fold increase in the number of rare copy number variations (CNVs) and de novo loss-of-function (nonsense, frameshift and splice site) variants, compared to their unaffected siblings. More complex structural DNA variations are also involved in ASD. With all of these studies combined, more than 100 genes and loci are known to be associated with increased likelihood of ASD. However, collectively, these genetic factors can only explain the etiology of ~20% of ASD cases. Genome sequencing is the current state-of-the-art technology for variant detection, but even its application leaves the majority of ASD cases ‘genetically unsolved’. This “missing heritability” could, in part, be attributed to the inaccessibility of repetitive regions in DNA due to technical limitations of generating and/or resolving short-read next-generation sequence data.
Tandem repetitive DNA makes up ~6% of the human genome. Expansions of gene-specific repeats have been linked to >40 neurodegenerative and neuromuscular genetic diseases. Alternative repeat motifs are found in some of these disorders, which complicates identification. A given tandem repeat-related gene can contribute to a variety of clinically distinct conditions. For example, the unstable CGG tract of FMR1 has been linked to intellectual delay in fragile X syndrome, fragile X premature ovarian insufficiency, fragile X-associated ataxia, endocrine, autoimmune and metabolic disease, and ASD.
It would be desirable to develop an improved method for detecting and/or analyzing tandem repeat expansions in a genome.
A comprehensive strategy for genome-wide discovery of repeat sequence expansion has been developed to identify loci as candidates for involvement in disease such as ASD.
Accordingly, in one aspect, a method of detecting one or more outlier tandem repeat sequences associated with a disease in a population of interest is provided. The method includes the following steps:
i) detecting tandem repeat sequences comprising a repeated motif sequence in nucleic acid samples from individuals within a population of interest which have been sequenced using a sequencing platform, wherein the tandem repeat sequences have a length that permits a tandem repeat calling algorithm to detect and/or genotype the tandem repeat sequences;
ii) simulating the length distribution of the tandem repeat sequences in the population of interest to a normal distribution if no tandem repeat sequences are detected in step i); and
iii) detecting one or more outlier tandem repeat sequences, wherein an outlier tandem repeat sequence has a length that is greater than that in 90% of the tandem repeat sequences detected in the population of interest and occurs at a frequency of less than 1% of the tandem repeat sequences detected in a control population.
In another aspect, a method of diagnosing and treating an individual with a disease, such as ASD, is provided. The method comprises:
i) detecting in a nucleic acid sample from the individual the presence of one or more target outlier tandem repeat sequences determined to be prominent for that disease using nucleic acid probes that specifically bind thereto or bind to a region adjacent thereto;
ii) determining that the individual has the disease when the target outlier tandem repeat sequence or sequences are present in the nucleic acid sample; and
iii) optionally, treating the individual using an appropriate therapy for the disease.
These and other aspects of the invention are described by reference to the detailed description and figures.
A method of detecting outlier tandem repeats associated with a disease is provided. The method includes the following steps: i) detecting tandem repeat sequences comprising a repeated motif sequence in nucleic acid samples from individuals within a population of interest which have been sequenced using a sequencing platform, wherein the tandem repeat sequences have a length that permits a tandem repeat calling algorithm to detect and/or genotype the tandem repeat sequences; ii) simulating the length distribution of the tandem repeat sequences in the population of interest to a normal distribution if no tandem repeat sequences are detected in step i); and iii) detecting one or more outlier tandem repeat sequences, wherein an outlier tandem repeat sequence has a length that is greater than that in 90% of the tandem repeat sequences detected in the population of interest and occurs at a frequency of less than 1% of the tandem repeat sequences detected in a control population.
The term “disease” is used herein to refer to a pathological condition in an individual, including human and non-human mammals. The disease may, for example, be a pathological condition in a developmental process of the central nervous system and brain, heart, kidney, or other organs. Examples of disease conditions thus include, but are not limited to, developmental disorders in any developmental process with respect to the brain, blood, heart, kidney, and others, neurological disease such as neuropathies and neuropsychiatric disorders, cardiomyopathy including congenital heart disease, cancer including carcinomas, sarcomas, lymphomas, gliomas, leukemia, melanomas, etc., diabetes, and others.
The term “neuropsychiatric disorder” is used herein to encompass Autism Spectrum Disorder (ASD), i.e. a disorder that results in developmental delay of an individual such as autism, Asperger's Disorder, Childhood Disintegrative Disorder, Pervasive Developmental Disorder-Not Otherwise Specified (PDD-NOS) and Rett Syndrome (APA DSM-IV 2000); schizophrenia (SZ), intellectual disability (ID), Fragile X syndrome, epilepsy and related nervous system disorders such as OMIM (Online Mendelian Inheritance in Man) nervous system disorders.
The term “population of interest” refers to a population within which individuals may have the disease to be detected, but includes individuals which do not have the disease. Populations may include a particular ethnic population, a population in a particular age group, a population in a given locale, a pediatric population, etc.
The term “control population” refers to a population distinct from the population of interest in which prevalence of the disease to be detected is equal to or lower than that in the population of interest.
In the present method, nucleic acid from individuals within a population, e.g. humans, having a disease of interest, is obtained for analysis. Nucleic acid from various biological samples may be studied, depending on the disease of interest, including samples such as blood, serum, plasma, urine, and biopsied tissue from an individual. The sample may be obtained using techniques well-established and known to those of skill in the art, and will vary depending on the sample type as one of skill in the art will appreciate. Examples of different techniques that may be used to obtain a tissue sample include standard biopsy, needle biopsy, endoscopic biopsy, stereotactic biopsy, neuroendoscopy, bone marrow biopsy, and combination techniques which employ biopsy and imaging techniques. Tissue samples from cerebellum, breast, heart, liver, muscle, kidney, thyroid, pancreas, prostate, spleen, and testis may be used. For neurological disease, including neuropsychiatric disorders, preferred biological samples include brain samples, for example, neocortex, amygdala, cerebellar cortex, hippocampus, mediodorsal nucleus of the thalamus, and striatum. For cardiomyopathy, nucleic acid from the heart and related tissue is preferred. The samples may be obtained from any age group, including prenatal, child and adult samples. Generally, a suitable sample will contain up to about 1 μg, typically, in the range of about 50 ng-1 μg of nucleic acid.
Nucleic acid may be extracted from the biological sample using techniques well-known to those of skill in the art, including chemical extraction techniques utilizing phenol-chloroform (Sambrook et al., 1989), guanidine-containing solutions, or CTAB-containing buffers. As a matter of convenience, commercial DNA extraction kits are also widely available from laboratory reagent supply companies, including, for example, the QIAamp DNA Blood Minikit available from QIAGEN® (Chatsworth, Calif.), or the Extract-N-Amp blood kit available from Sigma-Aldrich® (St. Louis, Mo.).
The nucleic acid samples are sequenced using any appropriate nucleic acid sequencing platform, including, but not limited to, Next Generation Sequencing (NGS) methods (e.g. Illumina, BGI/MGI, Ion Torrent and Ion Proton sequencing) and Third Generation Sequencing methods (e.g. Oxford Nanopore Technologies and Pacific Biosciences), to permit detection of tandem repeat sequences. Prior to sequencing, the nucleic acid may be amplified using a PCR-based amplification technique.
Tandem repeat sequences are then detected. Tandem repeat sequences are sequences comprising a repeated motif sequence having about 1-25 base pairs which occur directly adjacent to one another, e.g. ACACAC . . . . The repeated motif sequence preferably comprises at least 2 base pairs up to 5, 10, 15 or 20 base pairs. More preferably, the repeat motif sequences are about 2 to 6 base pairs in length. Tandem repeat sequences useful in the present method are those having a length that permits detection and/or genotyping by a tandem repeat calling algorithm. In one embodiment, the tandem repeat sequence length is greater than 150 base pairs when a sequencing platform such as an Illumina-based platform is used. Methods suitable for detecting tandem repeat sequences include, but are not limited to, methods which use anchored in-repeat reads (paired reads in which the first read maps to a repetitive region and the second “anchor” read maps to an adjacent non-repetitive region) to estimate the size and location of genomic repeats. Algorithms that may be used include, but are not limited to, ExpansionHunter Denovo, NanoSatellite, and others.
One or more outlier tandem repeat sequences within the tandem repeat sequences are then detected. Outlier tandem repeat sequences, also referred to herein as “repeat expansions”, are tandem repeat sequences which have a length that is greater than 90% of the lengths of the tandem repeat sequences detected in the population of interest, e.g. a length that is greater than 95% or more of the lengths of the tandem repeat sequences detected, and which preferably occur at very low frequency, for example, at less than 1% of the tandem repeat sequences detected in a control population, such as less than 0.5% or less than 0.1% of a control population. In embodiments of the present method, outlier tandem repeat sequences associated with disease comprise an increased GC content of at least about 10% in comparison to the tandem repeat sequences detected. In other embodiments, outlier tandem repeat sequences associated with disease are sequences located within intronic regions of a gene, and which may be located close to a transcriptional start site (TSS) or a splice junction, e.g. within about 10,000 base pairs of a TSS or splice junction.
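The length and frequency cut-offs described above can be sketched as follows. This is only an illustrative sketch: the function names, the nearest-rank percentile, and the choice to measure control frequency as the fraction of control samples whose repeat length reaches the candidate length are assumptions made for the example, not features taken from the method itself.

```python
import math

def percentile_cutoff(lengths, pct=90.0):
    """Nearest-rank percentile of repeat lengths (an illustrative choice)."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def find_outliers(population, controls, pct=90.0, max_control_freq=0.01):
    """Return IDs of samples in `population` (dict: sample ID -> repeat length)
    whose length exceeds the pct-th percentile of the population and is matched
    or exceeded by fewer than `max_control_freq` of the control lengths."""
    cutoff = percentile_cutoff(list(population.values()), pct)
    outliers = []
    for sample_id, length in population.items():
        if length <= cutoff:
            continue  # not longer than the bulk of the population
        control_freq = sum(1 for c in controls if c >= length) / len(controls)
        if control_freq < max_control_freq:
            outliers.append(sample_id)
    return outliers
```

For example, with nineteen samples at 160 bp, one sample at 900 bp, and controls that are all 160 bp, only the 900 bp sample would be returned.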
In view of the cut-offs applied in the present method, it is possible that no tandem repeat sequences will be detected by the detection algorithm used. In this case, the length distribution of the missing tandem repeat sequences is adjusted (or simulated) to a normal distribution, for example, with a mean of one, a standard deviation of 0.25, and a maximum of two. This is illustrated in
Outlier tandem repeat sequences are determined to be associated with a given disease, for example, by functional analysis, including identifying the chromosome location of the outlier tandem repeat sequence, and the gene with which it is associated. Based on this information and the function of the particular gene, the disease with which the outlier tandem repeat sequence is associated may be identified.
The method of detecting outlier tandem repeat sequences, and their association with a disease, is useful to permit diagnosis of the disease in an individual such that the individual may then be appropriately treated.
Thus, in another aspect, a method of diagnosing in an individual risk of having a disease, such as ASD, and optionally treating the individual, is provided. The method comprises the steps of: detecting in a nucleic acid sample from the individual the presence of at least one outlier tandem repeat sequence using a method as described above for detection of outlier tandem repeat sequences; and, based on the genomic location of the outlier tandem repeat sequence, diagnosing the individual with the disease associated with that genomic location.
In another embodiment, diagnosis or risk of having a target disease may be based on the detection of a target outlier tandem repeat sequence known to be prominent for the target disease. In this case, a nucleic acid probe is used that specifically binds to the target outlier tandem repeat sequence, or that binds to a sequence adjacent thereto (a target adjacent sequence), i.e. a sequence that provides selectivity and may be a sequence that is unique with respect to the target outlier tandem repeat sequence. If the target outlier tandem repeat sequence, or target adjacent sequence, is detected in the nucleic acid sample, then the individual is determined to have or be at risk of having the target disease. Following such a determination, a diagnosis may be confirmed using other identifiers of the target disease.
The term “prominent” is used herein to refer to a more pronounced presence of the target outlier tandem repeat sequence in the nucleic acid of individuals with the disease than in that of individuals not having the disease. The prominence of the target outlier tandem repeat sequence in individuals with the disease will vary, as one of skill in the art will appreciate. Generally, the target outlier tandem repeat sequence occurs at a greater frequency in a population with the disease than in a control population. For example, target outlier tandem repeat sequences may be present at least two times more frequently in the ASD population than in the control population. In some cases, the target outlier tandem repeat sequence may be present at a much greater frequency in a disease population as compared to a control population, e.g. 5-10 times or more frequently in a disease population than in a control population.
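By way of illustration only, the relative frequency underlying the term “prominent” may be computed as a simple fold enrichment. The function names and the default two-fold threshold argument below are hypothetical and chosen for the example; carrier counts are assumed to be available for both populations.

```python
def fold_enrichment(case_carriers, case_total, control_carriers, control_total):
    """Ratio of carrier frequency in the disease population to that in controls.
    Returns infinity when no control sample carries the repeat."""
    if control_carriers == 0:
        return float("inf") if case_carriers > 0 else 0.0
    # Cross-multiplied form of (case_carriers/case_total)/(control_carriers/control_total)
    return (case_carriers * control_total) / (control_carriers * case_total)

def is_prominent(case_carriers, case_total, control_carriers, control_total,
                 min_fold=2.0):
    """True when the repeat is at least `min_fold` times more frequent in cases."""
    return fold_enrichment(case_carriers, case_total,
                           control_carriers, control_total) >= min_fold
```

A repeat carried by 10 of 1,000 affected individuals and 5 of 1,000 controls would have a fold enrichment of 2.0 under this sketch.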
Methods of detecting target outlier tandem repeat sequences include those based on the use of primers and/or probes, such as PCR-based techniques and/or Southern blotting techniques, that specifically bind to the target tandem repeat sequence, or to a specific sequence adjacent to the target outlier tandem repeat sequence. Thus, primers/probes will vary with each target outlier tandem repeat sequence and each disease to be diagnosed. Probes will generally be detectably labelled for detection, e.g. with fluorescent, luminescent, radioactive, or other detectable label.
In embodiments, a method of diagnosing ASD is provided in which outlier tandem repeat sequences prominent in some genes are detected, such as LINGO3, DMPK, CACNB1 and FXN. For LINGO3, a CGG outlier tandem repeat is detected; a CTG repeat in DMPK is detected; and tandem repeats with multiple different motifs within the same repeat-containing regions are detected in CACNB1 and FXN, for a diagnosis of ASD in an individual.
Embodiments of the invention are described by reference to the following specific example.
The following study exemplifies methodology in accordance with an aspect of the invention.
Samples—Genome sequencing data derived from 8,448 samples from the MSSNG project1, 9,096 samples from the Simons Simplex Collection (SSC)2, and 2,504 samples from the 1000 Genomes Project (1000G)3 was used. All SSC samples used PCR-free DNA library preparation and were sequenced on the Illumina HiSeq X platform (2×150 bp paired-end reads). All 1000G samples used PCR-free library preparation and were sequenced on the Illumina NovaSeq platform (2×150 bp paired-end reads). Each MSSNG sample fell into one of three categories: 1) PCR-based DNA library preparation and sequenced on either the Illumina HiSeq2000 (2×90 bp paired-end reads) or HiSeq2500 (2×126 bp paired-end reads) platforms; 2) PCR-based library preparation and sequenced on the Illumina HiSeq X platform, or 3) PCR-free library preparation and sequenced on the Illumina HiSeq X platform. All samples were aligned to the GRCh38/hg38 reference genome using BWA-mem4. Full details on the MSSNG, 1000G, and SSC alignment pipelines can be obtained from the websites of MSSNG, 1000G and SSC (via Globus; https://www.globus.org), respectively.
Genome-wide tandem repeat identification—To perform reference genome-agnostic detection of tandem repeats, ExpansionHunter Denovo (EHdn) (https://github.com/Illumina/ExpansionHunterDenovo) was used. EHdn uses anchored in-repeat reads (paired reads in which the first read maps to a repetitive region and the second “anchor” read maps to an adjacent non-repetitive region) to estimate the size and location of genomic repeats. EHdn v0.7.0 was run on each sample using default parameters. The per-sample output files were combined using the combine_counts.py script provided with EHdn. The final set of regions was generated using the compare_anchored_irrs.py script with the parameter --minCount=2, thus retaining only regions for which at least one sample had C>=2, where C=A*40/R, A is the raw count of anchored in-repeat reads for that region, and R is the average read depth of the sample calculated by EHdn.
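The depth normalization C=A*40/R and the retention rule (at least one sample with C>=2) may be illustrated with the following sketch. The function names and the input layout (a list of (A, R) pairs for one region) are assumptions made for the example and do not reproduce the EHdn scripts.

```python
def normalized_irr_count(anchored_irr_count, mean_depth, target_depth=40.0):
    """C = A * 40 / R: anchored in-repeat read count scaled to 40x depth."""
    return anchored_irr_count * target_depth / mean_depth

def retain_region(per_sample_counts, min_count=2.0):
    """Keep a region when at least one sample reaches the normalized minimum.
    `per_sample_counts` is a list of (A, R) pairs for that region."""
    return any(normalized_irr_count(a, r) >= min_count
               for a, r in per_sample_counts)
```

For instance, a sample with A=3 anchored in-repeat reads at R=30x average depth gives C=4, which is enough on its own to retain the region.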
Characterization of technical variability in repeat detection—To determine whether the number of repeats estimated by EHdn in a given sample was affected by systematic biases in the sequencing data, the distributions of the raw number of EHdn calls (specifically, the number of regions in the RegionsWithIrrAnchors field of the per-sample JSON output file) were plotted for each combination of cohort (MSSNG, SSC, or 1000G), DNA library preparation method (PCR-based or PCR-free), and sequencing platform (HiSeq 2000/2500, HiSeq X, or NovaSeq).
Sample quality control and ancestry determination—The ancestry of the MSSNG, SSC and 1000G samples was determined using data from 1752 unrelated samples from the 1000 Genomes Project as the reference. The reference samples were genotyped on Illumina HumanOmni2.5-4v1-B and Illumina HumanOmni25M-8v1-1_B chips (http://www.tcag.ca/tools/1000genomes.html). The genotypes for a set of 265,236 autosomal SNPs and 23,171 chromosome X SNPs were extracted for the three cohorts with bcftools v1.6, using the joint-genotyped variant call format (VCF) files as input. For each cohort, the data were sorted, decomposed and normalized, and SNVs were retained for further processing. The resulting VCFs were formatted using PLINK v1.9.b3.42. SNPs with a genotyping rate <99% in the reference set or in any of the three cohorts were removed. PLINK identity-by-descent estimates were calculated for all pairs of individuals in the three cohorts using the autosomal SNPs to check for pedigree and Mendelian errors within each set and sample duplications between sets. SNPs on the X chromosome were used to determine sex, and samples for which the reported sex and estimated sex differed were flagged. Linkage disequilibrium-based pruning of the autosomal SNPs yielded 41,720 SNPs, which were used to estimate model-based ancestry using the program ADMIXTURE5, and the three cohorts were projected onto the population structure learned from the reference panel.
Validation rate of ExpansionHunter Denovo—To assess the accuracy of tandem repeats identified by EHdn, EHdn was used to detect repeats in the HuRef genome6 and then the proportion that could be corroborated by an orthogonal method was determined. Specifically, Illumina HiSeq X reads derived from HuRef blood (NCBI sequence read archive accession number SRR9046649) were aligned to the GRCh38/hg38 reference assembly and EHdn was run as described above. The orthogonal comparison method involved two sources of data: 1) tandem repeats in the human reference genome, derived from tandem repeat finder (TRF)7, and 2) insertions and deletions in the HuRef genome detected from Pacific BioSciences single molecule, real-time (SMRT) long-read sequencing data, derived by de novo assembly using Canu8 and variant detection using AsmVar (https://github.com/bioinformatics-centre/AsmVar). An EHdn region was considered validated if the sum of the size of the largest overlapping TRF region and the size of overlapping Canu/AsmVar insertions/deletions (positive for insertions and negative for deletions) was at least 150 bp (the minimum size detectable by EHdn). A Canu/AsmVar insertion/deletion was considered to overlap the TRF region if it overlapped the region itself or 100 bp on either side. For example, suppose that a hypothetical EHdn region overlapped a TRF region of size 100 bp, where Canu/AsmVar detected a 70 bp insertion inside the TRF region. The total repeat size would be 170 bp, thereby validating the EHdn call. Conversely, if the TRF region was 160 bp along with a 20 bp Canu/AsmVar deletion, then the total size would be 140 bp, and the EHdn region would not be considered validated. If the EHdn region did not overlap a TRF region, but there was a Canu/AsmVar insertion of at least 150 bp within the EHdn region, then the EHdn region was considered validated.
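The validation rule and the worked examples above can be expressed as a short sketch. The closed (start, end) coordinate convention, the tuple layout for insertions/deletions, and the function names are assumptions made for illustration; they are not taken from the TRF, Canu or AsmVar outputs.

```python
MIN_DETECTABLE_BP = 150  # minimum repeat size detectable by EHdn
FLANK_BP = 100           # flank around the TRF region in which indels still count

def interval_overlaps(a, b):
    """True when closed intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def is_validated(ehdn_region, trf_regions, indels):
    """ehdn_region and trf_regions are closed (start, end) intervals; indels are
    (start, end, signed_size) with insertions positive and deletions negative."""
    hits = [t for t in trf_regions if interval_overlaps(ehdn_region, t)]
    if hits:
        # Largest overlapping TRF region plus the signed sizes of nearby indels
        trf = max(hits, key=lambda t: t[1] - t[0] + 1)
        window = (trf[0] - FLANK_BP, trf[1] + FLANK_BP)
        delta = sum(size for start, end, size in indels
                    if interval_overlaps((start, end), window))
        return (trf[1] - trf[0] + 1) + delta >= MIN_DETECTABLE_BP
    # No TRF overlap: require a sufficiently large insertion inside the region
    return any(size >= MIN_DETECTABLE_BP
               and ehdn_region[0] <= start and end <= ehdn_region[1]
               for start, end, size in indels)
```

Under these assumptions, a 100 bp TRF region with a 70 bp insertion is validated (170 bp), whereas a 160 bp TRF region with a 20 bp deletion is not (140 bp).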
Confirmation of repeats detected by ExpansionHunter Denovo—To support the accuracy of EHdn-predicted repeat sizes, the loci listed in Table 2 were genotyped using ExpansionHunter v3.0.2. This program estimates, with high accuracy (precision=0.91, recall=0.99), the allele-specific repeat size for each genomic coordinate and motif supplied by the user. All unique EHdn-detected repeats (each having a different motif) overlapping each locus were identified. To determine more precise coordinates for input to ExpansionHunter, coordinates from TRF that overlapped the locus were identified. For each combination of TRF coordinates and EHdn motif, ExpansionHunter was used to estimate motif-specific (detected by EHdn) repeat sizes for the samples involved. The Spearman correlation coefficient and p value were then calculated between the EHdn-predicted repeat sizes and the ExpansionHunter-estimated size (defined as either the size of the longest allele or the sum of the two allele sizes), aggregated over all of the EHdn-detected motifs for that locus. For the repeats found to be expanded by EHdn in Table 2, the presence of the repeat expansion and the corresponding repeat sequence were confirmed by manual inspection of reads from the BAM files.
Detection of repeat expansions—A repeat expansion was defined as a repeat-containing genomic segment that is much larger than what is observed in the population. Density-based spatial clustering of applications with noise (DBSCAN) was applied to identify repeat expansions9. DBSCAN is a non-parametric clustering algorithm that defines a cluster based on the minimum number of data points (minPts) reachable from each other within a maximum distance (ε). Data points not reachable from any cluster are classified as noise, and were treated as outliers if their value for a particular feature (e.g. size of repeat) was higher than those of the cluster members. By trial and error, the stringent DBSCAN parameters used in detecting repeat expansions were minPts=log2(n)≈14 and ε=2×Mo(Xi), where n is the number of samples, Mo is the mode and Xi is a vector of repeat sizes for repeat i. For a repeat to be detected by EHdn, it must be larger than the sequence read length (e.g., >150 bp). As a result, many samples were left without EHdn's size estimation for the repeat-containing regions that did not meet this size minimum. Similarly, DBSCAN might also fail to detect outliers when only a few samples have genotype data. Therefore, the normalized read depth of the repeat for such samples was simulated by assigning them a normal distribution with a mean of one, a standard deviation of 0.25, and a maximum of two. As a result, a minimum normalized read depth of two was required for a repeat to be identified as an expansion.
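A minimal one-dimensional sketch of this outlier detection is given below, using only the standard library. It is a generic DBSCAN noise test and not the implementation used in the study. Several details are assumptions made for the example: minPts is taken as log2(n) rounded to the nearest integer, the mode is computed on sizes rounded to whole numbers, the maximum of two is applied by capping simulated values, and all function names are hypothetical.

```python
import math
import random
import statistics

def simulate_missing(n_missing, rng, mean=1.0, sd=0.25, maximum=2.0):
    """Placeholder sizes for samples without an estimate, capped at `maximum`."""
    return [min(rng.gauss(mean, sd), maximum) for _ in range(n_missing)]

def dbscan_noise(values, eps, min_pts):
    """Indices of noise points: neither core points nor within eps of one."""
    n = len(values)
    neighbours = [[j for j in range(n) if abs(values[i] - values[j]) <= eps]
                  for i in range(n)]
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    return [i for i in range(n)
            if i not in core and not any(j in core for j in neighbours[i])]

def detect_expansions(values, min_size=2.0):
    """Noise points that are larger than every clustered value and >= min_size."""
    n = len(values)
    min_pts = max(1, round(math.log2(n)))                 # minPts ~ log2(n)
    eps = 2 * statistics.mode([round(v) for v in values])  # eps = 2 x mode
    noise = set(dbscan_noise(values, eps, min_pts))
    clustered = [v for i, v in enumerate(values) if i not in noise]
    ceiling = max(clustered) if clustered else float("-inf")
    return [i for i in sorted(noise)
            if values[i] > ceiling and values[i] >= min_size]
```

As a usage example, 200 simulated values combined with twenty samples of size 3 and a single sample of size 40 would flag only the last sample, since it lies beyond ε of every cluster member. The pairwise neighbour search is quadratic in the number of samples, which is acceptable for a sketch but would need an indexed search for cohort-scale data.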
Experimental validation of repeat expansions—The validation of the repeat length estimated by ExpansionHunter or EHdn was done by fragment analysis with FAM-labelled primers and capillary electrophoresis. PCR was performed with Expand Long Template PCR System™ (Roche), adding dimethylsulfoxide to achieve the final concentration of 5-10%, depending on the GC content of the target region. Capillary electrophoresis with Applied Biosystems' 3730xl™/3130™ capillary sequencers was performed with GeneScan 500LIZ™ size markers. For the CGG repeat in LINGO3, betaine (final concentration: 2 M) was added in the PCR reaction mixtures and the repeat size was determined by Sanger sequencing of PCR products. For samples that appeared to be homozygous for the repeat length, the presence of expanded repeats was validated by repeat-primed PCR (RP-PCR) and/or Southern blot. For RP-PCR, the following repeat-priming primers with the tail sequence of 5′-TACGCATCCCAGTTTGAGACGC-3′ (SEQ ID NO: 1) were used:
Southern blot was performed to determine the sizes of the CGG repeat in LINGO3 and the CTG repeat in DM1 with selected restriction-endonucleases to digest genomic DNA as denoted in
For repeats that were detected with multiple different motifs at the same repeat-containing regions (e.g., CACNB1 and FXN), Sanger sequencing was performed on the PCR-amplified alleles after gel-extraction to confirm the presence of the reported motifs. PCR primers for CACNB1 were 5′-CTTCCTACCGATTTCCCCTC-3′ (SEQ ID NO: 10) and 5′-CTGATTGACTTCCCACCCTT-3′ (SEQ ID NO: 11) and for FXN were 5′-TATTTGTGTTGCTCTCCGGAG-3′ (SEQ ID NO: 12) and 5′-ATAGTGCACAGAAGCCAAGT-3′ (SEQ ID NO: 13).
Burden analysis of repeat expansions in individuals with ASD—To compare the frequency of rare repeat expansions (<0.1% population frequency) between individuals with and without ASD, a logistic regression analysis was performed with the affected status (unaffected=0, affected=1) as the outcome and the number of rare repeat expansions as the predictor. Sex bias was avoided by performing the test only on autosomal regions. Any biases in the number of repeats detected per subject that may have been related to ethnicity were accounted for. From five admixture variables obtained in the ancestry determination step (see “ancestry determination” above), we included as covariates in the final model only the two variables that showed significant correlation with the number of EHdn-detected repeats (p<0.05).
Besides this burden test for total number of rare expansions, a functional burden test, as well as a gene set burden test, were performed. In the functional and gene set burden tests, bias in the number of rare repeat expansions per subject was further accounted for by including the number of rare repeat expansions found in intergenic regions as a covariate. For the functional burden test, the genome (RefSeq hg38) was separated into different functional elements, i.e., upstream (1 kb upstream of transcription start sites), 5′UTR, exon, core splice site, intron, 3′UTR and downstream (1 kb downstream of the transcription termination sites). The number of rare repeat expansions impacting each functional element was tested. If any rare repeat expansion impacted more than one functional element, the effects were prioritized based on their impact on the corresponding genes predicted by ANNOVAR10. We also tested these different functional elements altogether as a genic burden signal. For the gene set burden test, we obtained 33 functional gene sets previously used to study CNV and SNV enrichment in ASD, including genes relevant to neuronal functions, synaptic components, or genes with homologues in mouse genes grouped by organ system (Table 3). Finally, we estimated family-wise error rate (FWER) to adjust for multiple comparisons.
Statistical comparisons of means—We performed non-parametric Wilcoxon signed-rank tests (one-sided) to compare means between two datasets. These included testing the hypotheses of (i) shorter distances to TSS or splice junction for rare repeat expansions than for two other sets of repeats separately (known simple sequence repeats and all EHdn-detected repeats), (ii) lower phenotype-related test scores for samples with than without rare repeat expansions, and (iii) a higher number of rare repeat expansions in affected than in unaffected children. For (i), we only included the tandem repeats within 10 kb distance from TSS or splice junction in the test. The distance was calculated from the midpoint of a tandem repeat region to the nearest TSS or splice junction. For (ii), we compared the test scores of the two phenotypes (Vineland Adaptive Behavior standard score to measure adaptive function, and IQ full scale standard score to measure cognitive ability) available in MSSNG database's samples with (n=2,417) and without (n=1,927) rare repeat expansions. This is to test if a similar reduced adaptive function or cognitive ability can be found in individuals with rare repeat expansions, as we previously showed such phenomenon in the carriers of rare pathogenic SNVs or CNVs. Samples included were mutually exclusive to each other and there were no replicates (randomization not applicable).
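The distance calculation used for comparison (i) can be sketched as follows. The function names are hypothetical, feature positions are assumed to be supplied as a non-empty sorted list of TSS or splice junction coordinates on one chromosome, and the 10 kb limit is treated as inclusive.

```python
import bisect

def nearest_feature_distance(region_start, region_end, feature_positions):
    """Distance from the midpoint of a repeat region to the nearest feature.
    `feature_positions` must be a non-empty list sorted in ascending order."""
    midpoint = (region_start + region_end) / 2
    i = bisect.bisect_left(feature_positions, midpoint)
    # Only the features immediately left and right of the midpoint can be nearest
    candidates = feature_positions[max(0, i - 1): i + 1]
    return min(abs(midpoint - p) for p in candidates)

def distances_within(regions, feature_positions, max_distance=10_000):
    """Distances for regions whose midpoint lies within max_distance of a feature."""
    distances = (nearest_feature_distance(s, e, feature_positions)
                 for s, e in regions)
    return [d for d in distances if d <= max_distance]
```

The resulting distance lists for rare repeat expansions and for the comparison sets would then be passed to the chosen one-sided test.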
Detection of tandem repeats from genome sequence data—To assess the characteristics of tandem repeat expansions in the human genome, data was collected from 20,048 genomic samples sequenced on Illumina platforms with >30× coverage. These consisted of (a) 8,448 samples (2,042 families with both parents and at least one child, and 1,844 singletons) from families with ASD from the Autism Speaks MSSNG project, (b) 9,096 samples (1,941 complete quartet families) from ASD families from the Simons Simplex Collection (SSC), and (c) 2,504 samples of unrelated population controls from the 1000 Genomes Project. All genomes were aligned to the GRCh38/hg38 reference assembly. We estimated the length of tandem repeats using the ExpansionHunter Denovo (EHdn) algorithm2. EHdn detects any tandem repeat involving a motif of 2 to 20 bp, for which the total size is larger than the sequencing read length (e.g., 150 bp for Illumina HiSeq X) (
77% of the repeats detected by EHdn in the HuRef genome were validated by comparing to repeats detected by an orthogonal strategy of long-read sequencing (PacBio) data (Methods). Samples were appropriately curated (
Wide variability of repeat motif size, genic context, and fragile sites—39,078 repeat motifs in 33,083 distinct regions of the human genome were identified, revealing that a given tract could have more than one repeat, with ~1.2 different forms of repeat-sequence per region. We defined a tandem repeat-containing region as a genomic location where the repeats detected with one or more different motifs overlapped with each other by at least 1 bp (
To understand the distribution of the tandem repeat-containing regions, we correlated their presence with different genomic features (
The increased recognition of repeats in cytogenetically known fragile site locations may allow refined mapping of those that are not yet characterized at molecular resolution, and provide important information on susceptibility to genome instability. Indeed, the repeat-containing regions we identified co-localized with 9 of the 13 (69.2%) molecularly mapped rare folate-sensitive fragile sites, all at CG-containing repeats, including the cytogenetically confirmed FRA12A/DIP2B, FRAXA/FMR1, and FRAXE/AFF2. Intriguingly, 66.7% (10 of 15) of the currently molecularly unmapped fragile sites overlapped with at least one GC-rich tandem repeat-containing region detected (Table 1). One of the potentially newly mapped fragile sites was FRA19B, which overlapped with a CGG repeat detected at the 5′ untranslated region in LINGO3. Expansion in one such available sample was confirmed by repeat-primed PCR and Southern blotting (
Rare repeat expansions in ASD individuals, in genes related to nervous system, cardiovascular system and muscle—Repeat expansions that are disease-causing and functionally impactful tend to be large and rare in the general population. We applied a non-parametric approach to identify individual repeats whose tract lengths were outliers compared to those among members of the cohorts (Methods). We designated these outliers as repeat expansions. We defined these as having at least double the tract length of that in the majority of the samples (>2 times the length of the mode) and not being a member of any clusters in the size distribution (Methods). While our defined criterion of an expansion is a conservative measure, it should be noted that there are diseases whose repeat tracts incurred one or two additional repeat units that lead to disease. Such changes will be missed by EHdn. We further categorized them as rare repeat expansions when found in <0.1% of the population controls (1000 Genomes Project). This resulted in 2,483 repeat-containing regions (3,818 motifs) being categorized as rare repeat expansions (
To delineate the possible functional roles, we assessed whether the rare repeat expansions identified here contribute to the risk of ASD and its heritable features. To avoid sex bias on allele transmission, we compared their occurrence only in autosomal regions, in children with (N=5,249) and without (N=2,023) ASD from 5,262 unrelated families. We further adjusted the comparisons by adding ethnicity as covariates in the statistical tests (Methods). We found that rare repeat expansions were more prevalent in children with ASD than in the non-ASD siblings (OR=1.2, p=8×10−6). These rare repeat expansions generally represented further expansions from already-large repeat expansions from the previous generation, since the average repeat-tract length of these parents was at the 94th percentile of the length distribution (
Towards assessing possible functional effects of the rare repeat expansions, we examined their proximity to different features within genes. We found the ASD-associated rare repeat expansions to be increased in exonic (OR=2.58, p=0.0002, family-wise error rate; FWER=0.02), intronic (OR=2.21, p=0.005; FWER=0.04) and splicing (OR=1.68, p=0.007; FWER=0.05) regions (
In terms of the biological pathways associated with the genes impacted by the identified rare repeat expansions, we investigated their relevance to previously known ASD-related gene functions and pathways using the pathway enrichment test (Methods). Unlike rare SNVs or CNVs, which predominantly impact neural synaptic functions, ASD-linked genes with rare repeat expansions were predominantly involved in nervous system (OR=1.76, p=0.002; FWER=0.06), and cardiovascular system or muscle (OR=1.55, p=0.005; FWER=0.16) (
From the gene sets that were enriched in the pathway enrichment test, we selected 12 other examples (repeats at FGF14, CACNB1, FXN, CDON, MYOCD, WWOX, PARD3, IGF1, FOXJ3, ABCC4, RICTOR and ARID1B) (OR>1.5 each with at least 4 unrelated ASD carriers) as top candidates to be ASD-relevant repeat-containing regions when expanded (Table 2). We further genotyped each of these 13 genes with another tandem repeat detection algorithm, ExpansionHunter, to confirm the expansions. Although not included in the prior statistical comparisons, known ASD-risk regions such as the CGG repeats at the 5′ untranslated regions of FMR1 and AFF2 were among the top loci, based on the same criteria (Table 2). Due to their rarity, none of these regions considered individually was statistically increased in ASD subjects; however, rare repeat expansions in 176 loci within these enriched gene sets collectively accounted for 5.5% (288 of 5,249) of the ASD cases in the cohorts (OR=1.49, p=0.0014).
Individuals with rare repeat expansions correlate with ASD-related phenotypes—Towards correlating the genetic findings herein with the phenotypes in the MSSNG cohort, we note that all 4 males with clinical information available in the database with CGG repeat expansions in FMR1 were indicated as having fragile X syndrome (
As with the carriers of de novo loss-of-function SNVs or CNVs, we found rare repeat expansions in the enriched gene sets more often in females than in males (OR=1.46; p=0.01) (
¹ Loci on the X chromosome were not included in the overall statistical comparisons for functional analysis. They were added here only for reference.
It is demonstrated that large-scale profiling of repeat expansions from genome sequence data can delineate an unprecedented variability of tandem repeats in the human genome. Specifically, we found 176 tandem repeat loci to be expanded in 288 of 5,249 individuals (5.5%) with ASD, and propose that such expansions may be relevant to ASD. Our findings represent a significant advancement in ASD genetics, as we discovered many genes involved in the repeat expansions that had not been previously identified using conventional genomic analyses (Table 2). Beyond implications for ASD, we have revealed far more extensive variability among such sequences than previously recognized in the human genome, with 7.7% of the repeats interrogated having more than one motif detected. This suggests that some genes may be prone to expansions with different repeat motifs.
Coupling repeat identification with an outlier detection method, we were able to identify 2,483 repeat-containing regions that, when expanded, occur in genes involved in biological functions and pathways, such as those involved in nervous system, cardiovascular system and muscle. The repeat expansions also correlated with cognitive and behavioral phenotypes in ASD. For example, DMPK, in which rare SNVs and CNVs were found in individuals with ASD, had not been conclusively linked to ASD previously, because the majority of ASD-relevant alterations were not detected until the expanded repeats were analyzed in our study. Notably, many of the ASD-relevant repeat expansions we discovered are in the non-coding regions of genes, and for some of them the mechanisms of gene regulation and aberrant splicing are well established (e.g., DMPK and FXN).
While allowing sensitive and accurate detection of the expanded repeat sequence, the method we developed here provides an estimated relative aggregated length of the repeat tracts.
Access to the MSSNG and SSC genome sequencing data can be obtained by completing data access agreements (https://research.mss.ng and https://www.sfari.org/resource/sfari-base, respectively). The 1000G genome sequencing data are publicly available via Amazon Web Services (s3://1000genomes/1000G_2504_high_coverage/data).
The methods described in Example were employed to identify outlier tandem repeat sequences in an epilepsy population.
As shown in
A method of operating a gene sequencer is described for the identification of outlier tandem repeat sequences in a population of interest. Reference is made to
Initially, a physical nucleic acid sample 1504 is obtained, for example from individuals in a population of interest. The individuals may be human individuals, or non-human individuals. The physical nucleic acid sample 1504 may, for example and without limitation, come from a blood sample, tissue or other sample as described herein. Next, the nucleic acid samples 1504 are prepared 1506 for the genetic sequencer 1502 to obtain prepared nucleic acid samples 1508. The precise mode of preparation 1506 will depend on the type and model of the genetic sequencer 1502 and will be within the capability of one of ordinary skill in the art, now informed by the present disclosure. The prepared nucleic acid samples 1508 are then input 1510 into the genetic sequencer 1502, which sequences the prepared nucleic acid samples 1508 to obtain nucleic acid sequence information 1512 for the population of interest from the prepared nucleic acid samples 1508.
The method 1500 analyzes the nucleic acid sequence information using analysis logic 1514 to detect in the nucleic acid sequence information 1512 the presence of one or more outlier sample tandem repeat sequences 1516, as described further below.
In one embodiment, the analysis of the nucleic acid sequence information 1512 is carried out solely by the genetic sequencer 1502; for example, one or more onboard processors of the genetic sequencer 1502 may execute the analysis logic 1514, which may reside in storage media of the genetic sequencer 1502. In another embodiment, the analysis of the nucleic acid sequence information 1512 is carried out by the genetic sequencer 1502 in conjunction with at least one external computer system 1518 communicatively coupled 1520 to the genetic sequencer 1502. For example, the analysis logic 1514 may be executed in part by one or more onboard processors of the genetic sequencer 1502 and in part by one or more processors of the external computer system(s) 1518. In still another embodiment, the analysis of the nucleic acid sequence information 1512 is carried out solely by the external computer system(s) 1518, which receives the nucleic acid sequence information 1512 obtained by the genetic sequencer 1502.
The analysis logic 1514 may have a number of implementations.
In one such implementation, the analysis logic 1514 detects, in the nucleic acid sequence information, sample tandem repeat sequences comprising a repeated motif sequence. These sample tandem repeat sequences have a length that permits a tandem repeat calling algorithm (e.g. within the analysis logic 1514) to detect and/or genotype the tandem repeat sequences. If no sample tandem repeat sequences are detected initially, the analysis logic 1514 may simulate the length distribution of the tandem repeat sequences in the population of interest as a normal distribution. The analysis logic 1514 then detects one or more outlier sample tandem repeat sequences 1516, and may also determine the disease with which the outlier tandem repeat sequence is associated based on the location of the tandem repeat sequence within the genome, including the gene in which it is present. An outlier sample tandem repeat sequence has a length that is greater than that of 90% of the population of interest tandem repeat sequences detected in the population of interest and occurs at a frequency of less than about 1% of control population tandem repeat sequences detected in a control population.
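The two outlier criteria stated above (a length greater than that of 90% of the population of interest, and a frequency of less than about 1% in the control population) can be sketched for a single tandem repeat locus as follows. This is an illustrative sketch under stated assumptions rather than the analysis logic 1514 itself: the function names, the nearest-rank percentile, and the reading of "frequency" as the fraction of control samples whose tract is at least as long as the candidate are all hypothetical choices.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no lengths supplied")
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def find_outlier_samples(cohort_lengths, control_lengths,
                         length_pct=90, max_control_freq=0.01):
    """Return sorted sample IDs whose tract length at one locus meets both criteria.

    cohort_lengths: dict mapping sample ID -> tract length in the population of interest.
    control_lengths: list of tract lengths at the same locus in the control population.
    """
    if not control_lengths:
        raise ValueError("control population is empty")
    threshold = nearest_rank_percentile(cohort_lengths.values(), length_pct)
    outliers = []
    for sample, length in cohort_lengths.items():
        if length <= threshold:
            continue  # not longer than 90% of the population of interest
        # Assumption: control frequency = share of controls at least this long.
        control_freq = sum(1 for c in control_lengths if c >= length) / len(control_lengths)
        if control_freq < max_control_freq:
            outliers.append(sample)
    return sorted(outliers)


if __name__ == "__main__":
    # Made-up tract lengths, for illustration only.
    cohort = {f"s{i}": n for i, n in
              enumerate([10, 10, 11, 11, 12, 12, 12, 13, 13, 60], start=1)}
    controls = [12] * 199 + [70]
    print(find_outlier_samples(cohort, controls))
```

In practice the same check would be repeated for every detected tandem repeat locus, and the genomic location of each outlier could then be used to look up an associated disease as described above.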
In another implementation, the analysis logic 1514 detects in the nucleic acid sequence of an individual the presence of one or more outlier sample tandem repeat sequences 1516 determined to be prominent for a disease and, responsive to detecting the presence of the outlier sample tandem repeat sequence(s), determines that the individual has the disease. Responsive to determining that the individual has the disease, the method 1500 may emit a signal 1522 that the individual has the disease. The signal 1522 may be one or more of an audible signal, a visual signal (e.g. a flashing light or on-screen display), and an electronic communication (e.g. an e-mail message, text message, SMS message, iMessage, or the like). These are merely non-limiting illustrative examples of a signal 1522.
The foregoing are merely illustrative examples of analysis logic 1514 and are not intended to be limiting; the analysis logic 1514 may incorporate other aspects of the present disclosure as well.
As can be seen from the above description, the method 1500 described above represents significantly more than merely using categories to organize, store and transmit information and organizing information through mathematical correlations. The method 1500 is in fact an improvement to the technology of genetic sequencing and genetic analysis, as it provides for detection of outlier sample tandem repeat sequences within a nucleic acid sequence, which may facilitate disease detection. Moreover, the method 1500 is applied by using a particular machine, namely a genetic sequencer. As such, the method 1500 is confined to genetic sequencing applications.
Aspects of the present disclosure may be embodied within a system, a method, a computer program product or any combination thereof. The computer program product may include a computer readable storage medium or media having computer readable program instructions thereon for causing a processor to carry out aspects of the present technology. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present technology may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language or a conventional procedural programming language. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to implement aspects of the present technology.
An illustrative computer system in respect of which the technology herein described may be implemented is presented as a block diagram in
The computer 1606 may contain one or more processors or microprocessors, such as a central processing unit (CPU) 1610. The CPU 1610 performs arithmetic calculations and control functions to execute software stored in an internal memory 1612, preferably random access memory (RAM) and/or read only memory (ROM), and possibly additional memory 1614. The additional memory 1614 may include, for example, mass memory storage, hard disk drives, optical disk drives (including CD and DVD drives), magnetic disk drives, magnetic tape drives (including LTO, DLT, DAT and DCC), flash drives, program cartridges and cartridge interfaces such as those found in video game devices, removable memory chips such as EPROM or PROM, emerging storage media, such as holographic storage, or similar storage media as known in the art. This additional memory 1614 may be physically internal to the computer 1606, or external as shown in
The computer system 1600 may also include other similar means for allowing computer programs or other instructions to be loaded. Such means can include, for example, a communications interface 1616 which allows software and data to be transferred between the computer system 1600 and external systems and networks. Examples of communications interface 1616 can include a modem, a network interface such as an Ethernet card, a wireless communication interface, or a serial or parallel communications port. Software and data transferred via communications interface 1616 are in the form of signals which can be electronic, acoustic, electromagnetic, optical or other signals capable of being received by communications interface 1616. Multiple interfaces, of course, can be provided on a single computer system 1600.
Input and output to and from the computer 1606 are administered by the input/output (I/O) interface 1618. This I/O interface 1618 administers control of the display 1602, keyboard 1604A, external devices 1608 and other such components of the computer system 1600. The computer 1606 also includes a graphical processing unit (GPU) 1620. The latter may also be used for computational purposes as an adjunct to, or instead of, the CPU 1610, for mathematical calculations.
The various components of the computer system 1600 are coupled to one another either directly or by coupling to suitable buses. The terms “computer system”, “data processing system” and related terms, as used herein, are not limited to any particular type of computer system and encompass servers, desktop computers, laptop computers, networked mobile wireless telecommunication computing devices such as smartphones, tablet computers, as well as other types of computer systems. Thus, computer readable program code for implementing aspects of the technology described herein may be contained or stored in the memory 1612 of the computer 1606, or on a computer usable or computer readable medium external to the computer 1606, or on any combination thereof.
Finally, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed.
The description has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the claims. The embodiment was chosen and described in order to best explain the principles of the technology and the practical application, and to enable others of ordinary skill in the art to understand the technology for various embodiments with various modifications as are suited to the particular use contemplated.
One or more currently preferred embodiments have been described by way of example. It will be apparent to persons skilled in the art that a number of variations and modifications can be made without departing from the scope of the claims. In construing the claims, it is to be understood that the use of a computer (including an onboard computer of a genetic sequencer) to implement certain of the embodiments described herein is essential.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CA2020/051762 | 12/18/2020 | WO |

Number | Date | Country
---|---|---
62951671 | Dec 2019 | US