Described are methods for determining a genetic identity of a cell, tissue, organ, or organism, based on type, position, and size of every occurrence of at least one repetitive element in the genome of the cell, tissue, organ, or organism. The methods can include using a computer to generate a graphical representation of the genetic identity of the cell, tissue, organ, or organism, and comparing genetic identity at different times/spaces. Also described herein is a computer implemented Universal Genome Information System, which serves as a genome-RE/TRE information management and analysis platform.
The vast majority of core concepts and relevant methodologies for modern studies of both normal and disease biology are stringently tethered to the function and polymorphism of “conventional” genes. Conventional gene sequences are reported to be shared, with a homology of greater than 80%, among a wide range of species, ranging from rodents to humans (Consortium, 2002; Guenet, 2005). A careful examination of the data obtained from recent biomedical investigations, which focus on the function and polymorphism of conventional genes, indicates that the ratio of tangible and/or helpful returns is very low, in consideration of the enormous investments (Padyukov, 2013; Seok et al., 2013; Shastry, 2002; Takao and Miyakawa, 2015a; Takao and Miyakawa, 2015b).
The announcements in 2001 and 2002 that the human and mouse reference genomes, respectively, were “completely” sequenced were followed by numerous publications reporting “whole” genome sequences of a wide range of species (Consortium, 2004; Consortium, 2002; Herrero-Medrano et al., 2014; Jun et al., 2014; Lander et al., 2001; Mullikin et al., 2010; Venter et al., 2001). These whole genome projects were apparently executed on the premise of a static genome within an individual (Fujimoto et al., 2010; Kim et al., 2009; Zhang et al., 2014). However, it is interesting to note that the human and mouse reference genomes had not been fully decoded as of May 2015 (National Center for Biotechnology Information [NCBI], National Institutes of Health). For instance, less than half of the human chromosome Y has been decoded.
It is estimated that the sum of all conventional gene sequences (exons) represents less than ~1.2% of the human and mouse genomes, which have not been completely sequenced yet. Currently, genetics surveillance protocols for humans, animals, and plants primarily focus on polymorphisms in small sets of conventional gene and/or microsatellite sequences. The limited information obtained from conventional gene and/or microsatellite polymorphism analyses is apparently inadequate for precise genetic surveillance/identification. In fact, the results from the present studies, in which our novel Repetitive Element (RE; or repetitive genetic element)-based genome-landscaping technologies were tested with genomic DNAs from different humans and mouse strains, demonstrated that the current conventional gene/microsatellite-based protocols provide insufficient data for the correct identification of individual genomes. Described herein are REome and Transposable Repetitive Element-ome (TREome) analysis-based systems and methods that enable high-resolution and tunable genetics surveillance/identification systems for both static and dynamic (temporal and spatial) genomes of all life forms. These genetics surveillance/identification systems are applicable to a wide range of species (e.g., humans, animals, plants, and microbes) and fields, such as justice forensics, animal breeding, plant breeding, pharmacogenomics, monitoring of radiation therapy, cell/tissue typing, diagnostics-marker discovery, fundamental cell biology and genetics.
Thus, in a first aspect, the invention provides methods of determining a genetic identity for a cell or organism. The methods can include determining type, position, and size of every occurrence of at least one repetitive element (both transposable and non-transposable) in the genome of the cell or organism, thereby determining the genetic identity of the cell or organism.
Also provided herein is a computer-implemented method of generating a graphical representation, e.g., an RE array, of the genetic identity of a cell or organism. The method includes receiving electronic information regarding the type, position, and size of every occurrence of at least one repetitive element in the genome of the cell or organism; and using a processor to generate a graphical representation of the electronic information, e.g., by self-alignment of a query sequence to determine direct (illustrated by blue angles on
In some embodiments, the cell is from an animal, e.g., a mammal, bird, fish, or reptile; plant; fungus; or bacterium.
In some embodiments, the assay comprises using PCR and/or inverse-PCR (I-PCR) to determine position and sequencing to determine type, size, and/or copy number.
In some embodiments, the electronic information was obtained using PCR and/or inverse-PCR (I-PCR) to determine position and sequencing to determine type, size, and/or copy number.
In some embodiments, the repetitive element is a Transposable Repetitive Element (TRE). In some embodiments, the TRE is an endogenous retrovirus (ERV), long interspersed nuclear element (LINE), short interspersed nuclear element (SINE), or DNA transposon. In some embodiments, the repetitive element is a non-transposable repetitive element.
In some embodiments, the type is based on primary sequence; the position is relative to a reference genome; and/or the size refers to the length or number of repeats.
In some embodiments, using a processor to generate a graphical representation of the electronic information comprises unbiased self-alignment and dot-matrix plot visualization.
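By way of non-limiting illustration, the self-alignment and dot-matrix visualization can be sketched in Python as follows. The function names, the word length, and the text rendering are hypothetical choices made for this sketch and are not taken from the disclosure; detection of inverted matches by reverse complement is likewise an assumption of the sketch.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def self_dot_matrix(seq, word=4):
    """Align a sequence against itself using exact word matches.

    Returns two sorted lists of (i, j) start positions with i < j:
    direct   - the word at i equals the word at j
    inverted - the word at i equals the reverse complement of the word at j
    """
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - word + 1):
        index[seq[i:i + word]].append(i)

    direct, inverted = [], []
    for kmer, starts in index.items():
        # Every pair of occurrences of the same word is a direct-repeat dot.
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                direct.append((starts[a], starts[b]))
        # Occurrences of the reverse-complement word give inverted-repeat dots.
        for j in index.get(reverse_complement(kmer), []):
            for i in starts:
                if i < j:
                    inverted.append((i, j))
    return sorted(direct), sorted(inverted)


def render(seq, pairs, word=4):
    """Draw the upper triangle of the dot matrix as text (hypothetical layout)."""
    n = len(seq) - word + 1
    grid = [["." for _ in range(n)] for _ in range(n)]
    for i in range(n):
        grid[i][i] = "\\"  # main diagonal: the sequence aligned to itself
    for i, j in pairs:
        grid[i][j] = "x"
    return "\n".join("".join(row) for row in grid)
```

In such a plot, a run of dots parallel to the main diagonal corresponds to a direct repeat, so the type, position, and size of a repetitive element can be read from where the run begins and how long it is.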
In some embodiments, the method includes displaying the graphical representation (e.g., RE array) electronically on a display device to provide a visible image.
In some embodiments, the genetic identity is determined at a specific time or space.
In some embodiments, the genetic identity is determined at a first time or space, and the method further comprises determining genetic identity at a second time or space, and comparing the genetic identity at the first and second time or space to detect changes in the genetic identity of the cell or organism. For example, the methods can be used to monitor a change of state (e.g., progression of disease or temporal surveillance), to identify risk factors, or in prognostic applications.
In some embodiments, the second time is later than the first time.
In some embodiments, the second space is obtained from a different cell, tissue, or organ within the same organism.
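As a minimal sketch of such a comparison, assuming that each occurrence of a repetitive element is recorded as a type, a chromosome, a position, and a size, the two genetic identities can be compared record by record. The record layout and the names used below are hypothetical and serve only to illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatOccurrence:
    """One occurrence of a repetitive element (hypothetical record layout)."""
    re_type: str   # e.g., "ERV", "LINE", "SINE"
    chrom: str     # chromosome name in the reference frame
    position: int  # start position relative to the reference genome
    size: int      # length in nucleotides (or number of repeats)


def compare_identities(first, second):
    """Report occurrences lost, gained, or resized between two samples."""
    def key(r):
        return (r.re_type, r.chrom, r.position)

    a = {key(r): r.size for r in first}
    b = {key(r): r.size for r in second}
    return {
        "lost": sorted(k for k in a if k not in b),
        "gained": sorted(k for k in b if k not in a),
        "resized": sorted((k, a[k], b[k]) for k in a if k in b and a[k] != b[k]),
    }
```

The same comparison applies whether the two samples differ in time (e.g., before and after a stress event) or in space (e.g., two tissues of the same organism).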
Also provided herein is a computer-implemented method for determining genetic identity of a cell, tissue, organ, or organism, comprising:
accessing, by one or more processing devices, a database (e.g., a Genome Information System as described herein) to obtain data elements comprising genomic sequence information, gene information, genetic variation information, and repetitive element information for a cell, tissue, organ, or organism at a selected time and/or space; computing a genetic identity for the cell, tissue, organ, or organism at the selected time and/or space, wherein the genetic identity is computed based on the data elements; and storing, at a storage location, a representation of the genetic identity.
Also provided herein is a computer-implemented method, comprising:
accessing, by one or more processing devices, a database (e.g., a Genome Information System as described herein) to obtain data elements comprising genomic sequence information, gene information, genetic variation information, and repetitive element information for a cell, tissue, organ, or organism at a selected time and/or space;
obtaining additional information relating to genomic sequence information, gene information, genetic variation information, and repetitive element information in the cell, tissue, organ, or organism, wherein the additional information is associated with a predetermined time and/or space, e.g., aging, stress, and/or disease; and updating the data elements.
In some embodiments, the methods include computing a genetic identity for the cell, tissue, organ, or organism, wherein the genetic identity is computed based on the data elements; and
storing, at a storage location, a representation of the genetic identity.
Also provided herein is a computer-implemented system (e.g., a Genome Information System as described herein) for storing genomic information, comprising: memory storing computer-readable instructions,
one or more processing devices configured to execute the computer-readable instructions to perform operations comprising:
accessing a database to obtain data elements comprising genomic sequence information, gene information, genetic variation information, and repetitive element information for a cell, tissue, organ, or organism at a selected time and/or space;
computing a genetic identity for the cell, tissue, organ, or organism at the selected time and/or space, wherein the genetic identity is computed based on the data elements; and storing, at a storage location, a representation of the genetic identity.
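The access, compute, and store operations recited above can be illustrated with the following sketch, which uses an in-process SQLite database. The table names, columns, and the use of a SHA-256 digest as the stored representation are assumptions made for illustration only. The sketch is also limited to repetitive element information, whereas the data elements described above further include genomic sequence, gene, and genetic variation information.

```python
import hashlib
import json
import sqlite3


def create_tables(conn):
    """Create a hypothetical two-table schema for the sketch."""
    conn.execute(
        "CREATE TABLE repeat_element ("
        "sample TEXT, re_type TEXT, chrom TEXT, position INTEGER, size INTEGER)"
    )
    conn.execute(
        "CREATE TABLE genetic_identity ("
        "sample TEXT PRIMARY KEY, n_elements INTEGER, digest TEXT, profile TEXT)"
    )


def compute_and_store_identity(conn, sample):
    """Access the data elements, compute an identity, and store it."""
    # Access: read every repetitive element occurrence for the sample,
    # in a fixed order so that the result does not depend on insertion order.
    rows = conn.execute(
        "SELECT re_type, chrom, position, size FROM repeat_element "
        "WHERE sample = ? ORDER BY chrom, position, re_type, size",
        (sample,),
    ).fetchall()
    # Compute: the ordered profile itself plus a compact digest of it.
    profile = json.dumps(rows)
    digest = hashlib.sha256(profile.encode("utf-8")).hexdigest()
    # Store: keep one representation per sample.
    conn.execute(
        "INSERT OR REPLACE INTO genetic_identity VALUES (?, ?, ?, ?)",
        (sample, len(rows), digest, profile),
    )
    conn.commit()
    return digest
```

Two samples that share the same type, position, and size for every occurrence yield the same digest under this sketch; the stored profile remains available for generating an image of the genetic identity.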
In some embodiments, the selected time and/or space is different from the predetermined time and/or space; e.g., they can be the first and second time/space as described herein (in either order). In some embodiments, the selected and predetermined time/space are the same.
In some embodiments, the representation of the genetic identity is usable for generating an image of the genetic identity.
In some embodiments, the method includes presenting the image of the genetic identity on a display device.
In some embodiments, the selected time and/or space relates to changes associated with aging, stress, and/or disease.
The present disclosure also provides methods of determining origin of a test subject. The methods include the steps of determining type, position, and size of every occurrence of at least one repetitive element in the genome of the test subject; comparing the type, position, and size of every occurrence of the repetitive element of the test subject to the type, position, and size of every occurrence of the repetitive element of a reference subject; determining that the type, position, and size of every occurrence of the repetitive element of the test subject and the type, position, and size of every occurrence of the repetitive element of the reference subject are not statistically different; and identifying the test subject as having the same origin as the reference subject. In some embodiments, the test subject is a human, a plant, or an animal.
In one aspect, the disclosure relates to methods of sub-classifying a disease of humans, plants, and animals. The methods include the steps of determining type, position, and size of every occurrence of at least one repetitive element in the genome of a group of subjects with a disease; applying a clustering algorithm to the type, position, and size of every occurrence of the repetitive element in the genome of the group of subjects; and identifying a sub-group of subjects as having a sub-group disease.
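The disclosure does not name a particular clustering algorithm. As one hypothetical possibility, each subject's repetitive element profile can be treated as a set of (type, position, size) records, and subjects can be grouped by single-linkage clustering on the Jaccard distance between sets. The function names and the distance threshold below are assumptions of this sketch.

```python
def jaccard_distance(a, b):
    """1 minus (shared records / all records) for two sets of records."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage_clusters(profiles, max_distance):
    """Group subjects whose profiles are linked by distances <= max_distance.

    profiles: dict mapping subject name -> set of (type, position, size).
    Returns a sorted list of clusters, each a sorted list of subject names.
    """
    names = sorted(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard_distance(profiles[a], profiles[b]) <= max_distance:
                parent[find(b)] = find(a)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())
```

Each resulting cluster is a candidate sub-group of subjects; whether a cluster corresponds to a sub-group disease would still need to be established against clinical data.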
In another aspect, the disclosure relates to methods of determining whether a test cell belongs to a reference cell line. The methods include the steps of determining type, position, and size of every occurrence of at least one repetitive element in the genome of the test cell; comparing type, position, and size of every occurrence of the repetitive element in the genome of the test cell to type, position, and size of every occurrence of the repetitive element in the genome of a reference cell from the reference cell line; determining that the type, position, and size of every occurrence of the repetitive element of the test cell are not statistically different from the type, position, and size of every occurrence of the repetitive element of the reference cell; and identifying the test cell as belonging to the reference cell line.
The present disclosure also provides methods of identifying a locus associated with a disease. The methods include the steps of determining type, position, and size of every occurrence of at least one repetitive element in the genome of a first sibling with the disease; comparing type, position, and size of every occurrence of the repetitive element in the genome of the first sibling to type, position, and size of every occurrence of the repetitive element in the genome of a second sibling, wherein the second sibling does not have the disease; and identifying the locus associated with the disease. In some embodiments, the first sibling and the second sibling are of the same sex.
As used herein, the term “significant” or “significantly” refers to statistical significance; a statistically significant result is attained when a p-value is less than the significance level (denoted α, alpha). The p-value is the probability of obtaining results at least as extreme as those observed, given that the null hypothesis is true, whereas the significance level α is the probability of rejecting the null hypothesis given that it is true. In some embodiments, the significance level is 0.05, 0.01, 0.005, 0.001, 0.0001, or 0.00001, etc. In some embodiments, “significantly different” means that the difference between two groups has attained statistical significance.
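To make the definition concrete, the following sketch computes a p-value by an exact permutation test and compares it to a chosen significance level. The choice of a permutation test, the test statistic (absolute difference of group means), and the example measurements are assumptions of this sketch; the disclosure does not prescribe a particular statistical test.

```python
from itertools import combinations


def exact_permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def mean_gap(idx_a):
        chosen = set(idx_a)
        a = [pooled[i] for i in sorted(chosen)]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_gap(range(n_a))
    splits = list(combinations(range(len(pooled)), n_a))
    # A small tolerance guards against floating-point ties.
    extreme = sum(1 for s in splits if mean_gap(s) >= observed - 1e-12)
    return extreme / len(splits)


def is_significant(p_value, alpha=0.05):
    """A result is significant when the p-value is below the level alpha."""
    return p_value < alpha
```

For example, with hypothetical copy-number counts of 10, 11, 12, and 13 in one group and 1, 2, 3, and 4 in the other, only 2 of the 70 possible splits are as extreme as the observed one, giving p = 2/70 (about 0.029), which is significant at a level of 0.05 but not at 0.01.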
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Methods and materials are described herein for use in the present invention; other, suitable methods and materials known in the art can also be used. The materials, methods, and examples are illustrative only and not intended to be limiting. All publications, patent applications, patents, sequences, database entries, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control.
Other features and advantages of the invention will be apparent from the following detailed description and figures, and from the claims.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Transposable repetitive elements (TREs) make up about 45% of the human genome (
Publications and databases commonly define the sizes of the genomes and/or chromosomes of various species, such as human and mouse reference genomes at NCBI, based on an assumption that their configurations are fully characterized and rather static (Church et al., 2015; Rosenfeld et al., 2012). The finding that the reference mouse chromosome Y in NCBI's Annotation Release 105 (~92 Mb) is almost six times larger than that in build 37.2 (~16 Mb) indicates that the current size estimates of genomes and/or chromosomes of various species need to be reevaluated (Lee et al., 2013). A recent report that the size and structure of the C57BL/6J inbred mouse genomes change temporally and spatially in conjunction with differential TREome activities suggests that: 1) it is impractical to identify a single representative genome for an individual mouse or human from a pool of variant genomes and 2) genome dynamicity is linked to a range of biological processes, such as differentiation, stress response, and aging (Lee et al., 2012; Lee et al., 2015).
It is unlikely that the limited library of proteins derived from the current repertoire of conventional genes, in conjunction with “non-coding” RNA species, is sufficient to explain the enormous extent of phenotypic polymorphisms observed in both normal and disease states. Thus far, the majority of the efforts for understanding a host of normal and disease phenotypes, based solely on the knowledge derived from studies of the function and polymorphism of conventional genes, have been inefficient. As a new functional layer of the genome that contributes to the development of disparate normal and disease phenotypes in humans and other species, we introduce the inherent diversity and acquired activity of repetitive elements, or REs, and transposable repetitive elements, or TREs. In contrast to the common polymorphisms of conventional genes observed in a population, variations in the genomic RE/TRE landscapes should be directly linked to the differential shaping of the genomes of somatic cells within an individual. The Universal Genome Information System (described below) can help dynamically manage the TREome as well as other genomic data and introduce new insights into the variable, often individual-specific, biological mechanisms (both normal and disease) and identify novel RE/TRE loci as markers for specific phenotypes.
As described herein, it would be logical to explain aging-related phenotypic changes in the context of a dynamic genome rather than a static one. From the perspective of precision biology and medicine, some of the normal and disease phenotypes, which have been explained by functions of conventional genes, need to be re-evaluated to account for the effects of both the inherent diversity and acquired activity of the RE/TREs and the associated dynamic nature of the genome. It is certain that normal and disease phenotypes, which sequentially or randomly appear during the lifetime of an individual, are not statically forged by only the standard exome (from conventional genes), but by the entire genome and its innate temporal and spatial dynamics. This method of collecting and interpreting data from the entire genome information system enables precision decoding of biology and medicine; see, e.g.,
Genetics Surveillance System (GSS)
Currently, both normal and disease biology is explained in the context of the function and polymorphism of conventional genes. However, it is clear that certain phenotypes cannot be explained solely on the basis of conventional gene functions. For example, the genomes of mice and humans share a high degree of homology (~85%) in their standard exome sequences (conventional genes). Hence, the evident phenotypic differences between these species cannot be explained by exome functions alone. The standard exome comprises only ~1.2% of our genome, and the vast majority of the residual genome is occupied by a plethora of repetitive elements (REs), often called “junk” DNA. In particular, transposable repetitive elements (TREs), constituting at least 45% of the human genome, have the potential to dynamically shape the genomes' configuration through “copy and paste” and “cut and paste” functions. There are a myriad of heterologous TRE families in the human and mouse genomes, which include endogenous retroviruses (ERVs), long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), DNA transposons, and unknowns.
We demonstrated that human and mouse REs, e.g., TREs (Human ERVs [HERVs] and mouse ERVs), are inherently diverse and that injury-elicited stressors activate certain HERVs in an individual- and disease course-specific manner. The expression of injury-associated HERV polypeptides differentially induced inflammatory mediators (e.g., IL-6, IL-1β). In addition, the genomic landscapes of hypertrophic burn scars, in the context of type and position of HERVs, were altered compared to the matching control skin, suggesting a role for HERVs in hypertrophic scar development. Our recent studies further showed that the genomic TRE/ERV landscapes are altered in various human and mouse tumors in comparison to their matching controls (unpublished). In particular, following micro-dissection of a paraffin section from a patient's histologically normal, precancer, and cancer breast biopsy, it was determined that each of the regions has a unique genomic landscape of TREs/HERVs.
Described herein is a high-resolution genetics surveillance system (GSS) that uses the platform of REome/TREome landscapes and TREome genes in conjunction with current conventional gene-based monitoring systems. For example, genetic uniformity of established laboratory mouse strains, both conventional and genetically engineered, could be evaluated by the high-resolution GSS for initial confirmation and maintenance of each strain. Depending on the GSS data collected, additional breeding designs and/or surveillance protocols can be implemented to obtain acceptable levels of genetic uniformity for each strain. When new mouse strains are developed, the GSS serves as a high-resolution tool for dynamic monitoring of genetic crossing/backcrossing. The GSS is applicable across species and organisms (e.g., humans, animals, plants, as well as cells therefrom) and fields (e.g., agriculture, forensics). Identifying and accounting for variations in the REome/TREome landscapes and TREome genes can be used to establish a genetically uniform laboratory mouse strain as well as to decode normal and disease biology in numerous organisms.
The genome information is expected to be variable depending on each individual organism, organ, cell type, stress environment, and age, resulting in non-uniform genome sizes unique for the individual genomic DNA sources. For efficient storage, normalization, and computation of the information residing in these non-uniform and dynamic (temporal and spatial) genomes, a novel genome management system, named the “Universal Genome Information System”, can be used.
Although the Universal Genome Information System is applicable to the genomes of all living organisms, an exemplary design and construction is described herein based on mouse and human genomes. The Universal Genome Information System has the following three key characteristics: 1) it is designed to accommodate and manage the critical variations in genomes' structure, sequence, and size due to inherent diversity and acquired activity of REs/TREs depending on organ, cell type, stress environment, and age; 2) it is able to handle all genome variations (structure, sequence, and size) and is expandable to annotate newly obtained genetic information (e.g., REs, TREs, TRE genes, conventional genes) to their normalized relative positions; and 3) it permits efficient annotation and rapid analyses of conventional genes, TREs, and other genetic elements residing on the genomes of variable structures and sizes, since the positions of the individual elements derived from different genomes are normalized into a single dynamically synchronized and universal frame.
The information in the NCBI's human genome database indicates that the reference genome is incomplete and estimates the complete length to be approximately 3 Gb. Initially, the REs, TREs, TRE genes, conventional genes, and other information, which are obtained from the reference genome and/or RE/TRE databases, can be assembled into the dynamic Universal Genome Information System scaffold. The nucleotide position information of the NCBI's human reference genome will serve as the founding frame for the human Universal Genome Information System. Similar information from additional well-established human genome databases, either fully or partially assembled, will be compared to the NCBI's reference genome frame. Any new information (e.g., nucleotide insertions/deletions-related changes in positions as well as single nucleotide polymorphisms, REs, TREs, TRE genes, conventional genes) can be added to the original frame so that the nucleotide sequence and position information from all available human genome resources are represented in the human Universal Genome Information System. Thus, the genetic information (e.g., REs, TREs, TRE genes, conventional genes) that is annotated in the individual human genomes of different structure, composition, and size can be consolidated and normalized into the dynamically synchronized frame of the human Universal Genome Information System, enabling efficient storage, management, and comparison/computation analyses of REs, TREs, conventional genes, and other genetic elements. For instance, alignment analyses of multiple whole chromosomes, which contain a plethora of conventional genes and various RE/TRE families, can be accomplished with minimal computation time since all the nucleotide positions are normalized within one synchronized-normalized frame. This Universal Genome platform can be applied to the genomes of any species.
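The disclosure does not specify how positions are normalized into the founding frame. One hypothetical approach, sketched below, describes each individual genome by a list of aligned blocks (individual start, reference start, length) and translates a position in the individual genome into the reference frame by locating the block that contains it. The block representation and the function name are assumptions of this sketch.

```python
import bisect


def to_reference_frame(blocks, pos):
    """Translate a 0-based individual-genome position into the reference frame.

    blocks: list of (individual_start, reference_start, length) tuples,
            sorted by individual_start and non-overlapping.
    Returns the reference position, or None when the position lies in
    sequence that has no counterpart in the reference (e.g., an insertion).
    """
    starts = [b[0] for b in blocks]
    k = bisect.bisect_right(starts, pos) - 1  # last block starting at or before pos
    if k < 0:
        return None
    ind_start, ref_start, length = blocks[k]
    if pos < ind_start + length:
        return ref_start + (pos - ind_start)
    return None


# Hypothetical individual chromosome: a 300-nt insertion after reference
# position 1000, and a deletion of reference positions 5000-5099.
EXAMPLE_BLOCKS = [
    (0, 0, 1000),         # unchanged up to the insertion
    (1300, 1000, 4000),   # shifted by +300 after the insertion
    (5300, 5100, 10000),  # shifted by +200 after the deletion
]
```

With every annotated element expressed in reference coordinates in this way, elements from genomes of different sizes can be compared position by position without realigning whole chromosomes.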
The platform for the Universal Genome Information System can be built with standardized and open-source software as much as possible to leverage the existing advancements in the field.
Dynamic Genetics Surveillance Systems (DGSS)
The DGSS produce genetics surveillance data by interrogating the individual genomes for information regarding the RE/TREs' inherent diversity and acquired activity (
The multi-dimensional RE/TRE landscape data for each genome can be collected and recorded, using methods known in the art, e.g., by: 1) TRE amplicon banding pattern on a polyacrylamide gel or an electropherogram and 2) annotated multi-dimensional information of REs and/or TREs, with regard to their type, copy number, position, and time/space, on a “Universal Genome” scaffold, which can be designed and developed specifically for each species/individual; these can be represented as the RE arrays described herein. TRE/RE copy number and position information for each type of TRE/RE will depend on the specific primer sets used. Comparisons between samples, e.g., samples that differ in time and/or space, can be done for each single primer set or specific combinatorial primer sets. The results from one primer set can be confirmed by the data obtained from another set.
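A comparison of amplicon banding patterns between two samples, primer set by primer set, might be sketched as follows. The representation of a banding pattern as a list of band sizes, the size tolerance, and the function names are hypothetical and chosen for illustration.

```python
def match_bands(bands_a, bands_b, tolerance_bp=5):
    """Pair up bands from two samples whose sizes agree within a tolerance.

    Returns (shared, only_a, only_b), where shared is a list of
    (size_in_a, size_in_b) pairs. Matching is greedy over sorted sizes.
    """
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = 0
    shared, only_a, only_b = [], [], []
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance_bp:
            shared.append((a[i], b[j]))
            i += 1
            j += 1
        elif a[i] < b[j]:
            only_a.append(a[i])
            i += 1
        else:
            only_b.append(b[j])
            j += 1
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return shared, only_a, only_b


def compare_samples(sample_a, sample_b, tolerance_bp=5):
    """Compare two samples for every primer set present in both.

    sample_a, sample_b: dict mapping primer-set name -> list of band sizes.
    """
    return {
        primer: match_bands(sample_a[primer], sample_b[primer], tolerance_bp)
        for primer in sorted(set(sample_a) & set(sample_b))
    }
```

Because the result is reported per primer set, a difference seen with one primer set can be checked against the result from another, in line with the confirmation step described above.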
Following sequence analysis, the TRE amplicon-band information can also be annotated within the Universal Genome scaffold. For most DGSS applications, either RE/TRE landscape data type (banding pattern or annotated information) would be sufficient for a high-resolution genetic surveillance/identification of specific genomes; however, a combinatorial interpretation of the two different RE/TRE landscape data types would be helpful for a final confirmation of the critical surveillance data sets (e.g., forensic identification for justice system). Interrogation of the RE/TRE landscape data, which are collected from individual genomes, via multi-dimensional (type, copy number, position, and time/space of REs/TREs) pattern-computation would allow for precise and dynamic (temporal and spatial) genetics surveillance and/or identification of all life forms.
Within each species, the genome-wide RE/TRE landscape information (type, copy number, position, and time/space) is expected to be variable depending on a host of factors (e.g., individual, organ, cell type, stress environment, and age), resulting in non-uniform genome sizes unique for the individual genomic DNA sources. For efficient storage, normalization, and computation of the multi-dimensional RE/TRE landscape information, which resides in these non-uniform and dynamic (temporal and spatial) genome platforms, we designed a novel genome-RE/TRE management system, tailored to each species/individual and named the Universal Genome. The Universal Genome of each species/individual is able to encompass all genome variations (e.g., structure, sequence, and size) and is expandable to accommodate newly obtained genetic information (e.g., REs, TREs, “TRE genes”, conventional genes) at their normalized relative positions. The Universal Genome permits efficient storage and annotation as well as rapid analyses of REs/TREs and other genetic elements (e.g., conventional genes, small RNAs) on the genomes of variable structures and sizes, since the positions of each element are normalized into a single dynamically synchronized universal frame. For each species/individual, the DGSS will be built and operated on this dynamically adaptable and normalized Universal Genome scaffold.
The following highlight unique features of an exemplary DGSS:
Applications of DGSS
Applications of the DGSS technologies described herein, including the RE Arrays, include the following:
1. Introduction of the dynamic (temporal and spatial) and multi-dimensional RE/TRE landscape data (type, copy number, position, and optionally time/space of REs/TREs), which are directly linked to RE/TREs' inherent diversity and acquired activity, as critical elements for the development of a highly tunable precision genetics surveillance/identification for individual humans (including monozygotic twins), animals, plants, cells (e.g., cultured cells) and microbes.
2. Introduction of the dynamic (temporal and spatial) landscape data of repetitive element arrays (RE arrays), which are directly linked to REs/TREs' inherent diversity and acquired activity, as core elements for the development of precision genetics surveillance/identification for individual humans, animals, and plants.
3. Incorporation of both inherent and acquired RE/TRE landscape information into a life-long periodic or sporadic (incident-specific) genetics/health surveillance system using an individual's (e.g., human, animal, plant) personalized Universal Genome.
4. Identification and monitoring of cell types (e.g., cultured or primary cells) based on inherent and acquired RE/TRE landscape information on a species-specific Universal Genome scaffold.
5. Monitoring of genomic stability of cells grown in culture based on inherent and acquired TRE landscape information on a species-specific Universal Genome scaffold.
6. Genomic monitoring/surveillance of cell differentiation processes (e.g., stem cells) based on inherent and acquired RE/TRE landscape information on a species-specific Universal Genome scaffold.
7. Monitoring and confirmation of crossing-over events between two different individuals/strains of humans, animals, or plants by examining crossing-over maps based on the inherent and acquired RE/TRE landscape information of parental strains and offspring on a species-specific Universal Genome scaffold. This would be especially important to monitor plant cross-breeding at early life stages.
8. Establishment of a genetics surveillance system for conventional inbred and genetically engineered laboratory mammals or cells, e.g., mouse strains (e.g., CRISPR-CAS9-edits, transgenics, knock-outs), based on inherent and acquired RE/TRE landscape information of parental strains and offspring on a species-specific Universal Genome scaffold.
9. Establishment of a genetics surveillance system for genetically engineered/modified/edited plants (e.g., CRISPR-CAS9-edits, transgenics, knock-outs) based on the inherent and acquired RE/TRE landscape information of parental strains and offspring on a species-specific Universal Genome scaffold.
10. Monitoring and confirmation of stability and compatibility of the CRISPR-CAS9-edited cells (derived from humans, animals, and plants) by surveying the RE/TRE landscape profile on a species-specific Universal Genome scaffold.
11. Monitoring of the genomic stability of laboratory animals which are subjected to pharmacogenomics studies by examining changes in the RE/TRE landscape profile on a species-specific Universal Genome scaffold.
12. Identification and development of pathologic markers for studying various disease processes (e.g., cancer, aging-related disorders) by tracking the inherent diversity and acquired transposition activity of RE/TREs on a species- or individual (for temporal and spatial genomic changes within an individual)-specific Universal Genome scaffold.
13. Identification and development of diagnostic markers for diseases with unknown causative agents (e.g., cerebral palsy, autism spectrum disorder, allergy, susceptibility/resistance to specific diseases) or without any tangible diagnostic markers by tracking the inherent diversity and acquired transposition activity of RE/TREs on a species- or individual (for temporal and spatial genomic changes within an individual)-specific Universal Genome scaffold.
14. Development of non-conventional diagnostics systems by identifying genomic risk factors for a host of relatively well-characterized diseases (e.g., neonatal trisomy test), especially the ones without a reliable and/or efficient diagnostic tool available, such as celiac disease, Crohn's disease, and multiple sclerosis, by focusing on the inherent diversity and acquired activity of RE/TREs on a Universal Genome scaffold.
15. Identification and development of prognostic genomic signatures for a range of cancer types which predict differential cancer progression patterns, such as “DCIS (ductal carcinoma in situ)-forever” vs. “DCIS to breast tumor” based on the inherent and acquired RE/TRE landscape information on a species-specific Universal Genome scaffold.
16. Identification and development of prognostic genomic signatures for a range of aging-related disorders based on the inherent and acquired RE/TRE landscape information on a species-specific Universal Genome scaffold.
17. Temporal surveillance of the genome stability of a patient undergoing radiation therapy or chemotherapy by examination of changes in the RE/TRE landscape profiles and affected genomic regions within an individual-specific Universal Genome scaffold.
18. Surveillance of the effects of drugs and compounds on genome stability of a range of cultured cell types by examination of changes in the RE/TRE landscape profiles and affected genomic regions on a species-specific Universal Genome scaffold.
19. Surveillance of the effects of drugs and compounds on genome stability of experimental animals and human patients by examination of changes in the RE/TRE landscape profiles and affected genomic regions on a species-specific Universal Genome scaffold.
20. Temporal surveillance of the genome stability/variation of a patient who undergoes a series of acute disease episodes (e.g., trauma, infection) by examination of changes in the RE/TRE landscape profiles and affected genomic regions within an individual-specific Universal Genome scaffold.
21. Temporal surveillance of the genome stability and/or clonality of cancer patients (e.g. leukemia) undergoing treatment by examination of changes in RE/TRE landscape profiles within an individual-specific Universal Genome scaffold.
22. Development of the “RE/TRE landscape” biochip systems seeded with species/strain/cell type-specific multi-dimensional RE/TRE landscape information (type, copy number, position, and time/space of RE/TREs) annotated on a relevant Universal Genome scaffold for efficient surveillance of genome identity and/or stability.
23. Development of disease diagnostic systems based on the RE/TRE landscape biochip systems seeded with disease-specific multi-dimensional RE/TRE landscape information (type, copy number, position, and time/space of RE/TREs) annotated on the relevant species' Universal Genome scaffold.
24. Development of disease (e.g., inflammation) diagnostic systems based on the “TRE gene” biochip systems seeded with disease-specific RE/TRE gene sequences annotated on the relevant species' Universal Genome scaffold.
25. Establishment of an individual/strain/species-specific genome management and application systems (GMAS) to organize the constantly expandable DGSS and accompanying components (e.g., species-specific RE/TRE libraries) on the Universal Genome scaffold.
26. Establishment of disease-specific GMAS-DGSS which are enabled to learn and utilize newly annotated RE/TRE landscape and other relevant information.
27. Development and incorporation of genome modeling technologies within the GMAS-DGSS for efficient monitoring and determination of genomic phenotypes by multi-dimensional computation of RE/TRE landscape information (type, copy number, position, and space/time of RE/TREs).
28. The DGSS can also be used to determine the origin of a subject (e.g., where a migrating animal or a plant comes from). This is because the types, copy numbers, and positions of RE/TREs can also be influenced by various environmental factors. These factors include climate (e.g., temperature and amount of sunlight), food, and pollution. Thus, in one aspect, the present disclosure provides methods of identifying the origin of a test subject. In some embodiments, the methods involve determining the type, position, and size of at least one repetitive element family in the genome of the test subject, and comparing the type, position, and size of the repetitive elements of the test subject to the type, position, and size of the repetitive elements of a reference subject with known origin. If the type, position, and size of the repetitive elements are statistically different, it can be determined that the test subject and the reference subject have different origins. If the type, position, and size of the repetitive elements are not statistically different, it can be determined that the test subject and the reference subject have the same origin.
29. The present disclosure also provides methods for clustering/sub-classifying a wide range of diseases (e.g., breast cancer, autism spectrum disorder). As used herein, the subject can be a human, an animal, or a plant. Various clustering algorithms can be used. Based on the clustering results, a person skilled in the art can identify a pattern that is unique to an individual cluster/subtype of the disease. This pattern can be used to determine whether a test subject has the specific disease subtype.
30. The present disclosure also provides methods for cell line authentication with regard to identity, divergence, and contamination. Cultured cells are important for research (e.g., human cells, cancer cells). When it comes to interpreting results, knowing the origin of a cell line is imperative. However, cell lines can be mislabeled for various reasons, e.g., mix-up by accident. The present disclosure can be used to determine whether a test cell line belongs to a cell line of interest. In addition, multiple passages in a culture setting can lead to genome/DNA rearrangement. Thus, it is important to measure divergence of the cell lines' genomes in a culture setting. In some embodiments, the methods can also be used to determine whether the cell is from a male subject (e.g., a man, or a male animal) or from a female subject (e.g., a woman, or a female animal).
31. The methods described herein can also be used to determine whether a cell culture is contaminated by microorganisms (e.g., bacterium, virus, fungus). In these cases, the probes, which are designed to target the genome of microorganisms, can be mixed with the repetitive element probes employed for cell line authentication.
32. The methods described herein can be used in crossing-over mapping in plants, animals, and humans. In these cases, the type and position information of the repetitive elements from gender-matching siblings/littermates can be compared against each other, optionally using their parents' genomes as a reference; the parents' genomes are not required for the crossing-over mapping. In some embodiments, the siblings are of the same sex (e.g., they are all brothers, or they are all sisters). In some embodiments, one sibling has the phenotype of interest (e.g., a disease), while the other does not. In some embodiments, the phenotype of interest is autism spectrum disorder, bipolar disorder, schizophrenia, or any disease without known causative agents and/or genetic risk factors.
Other applications are also within the scope of the present disclosure.
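By way of illustration only, the comparison of type, position, and size information recited in item 28 above could be sketched as follows. This is a minimal sketch under stated assumptions: the record layout (`RepeatElement`), the use of a Jaccard distance, and the example coordinates are all hypothetical and are not taken from the present disclosure, which does not prescribe a particular distance measure or statistical test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatElement:
    """One occurrence of a repetitive element (hypothetical record layout)."""
    re_type: str     # type/family label of the element
    chromosome: str  # chromosome on the reference scaffold
    position: int    # start coordinate on the reference scaffold
    size: int        # length in nucleotides


def landscape(elements):
    """Reduce a list of occurrences to a set of comparable keys."""
    return {(e.re_type, e.chromosome, e.position, e.size) for e in elements}


def jaccard_distance(a, b):
    """Return 0.0 for identical landscapes and 1.0 for disjoint landscapes.

    Assumption: an exact match on type, chromosome, position, and size is
    required; a real comparison would likely tolerate small coordinate shifts.
    """
    sa, sb = landscape(a), landscape(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


# Hypothetical example values, for illustration only.
reference = [
    RepeatElement("MLV-ERV", "chr1", 1000, 8500),
    RepeatElement("MLV-ERV", "chr4", 52000, 8400),
]
test_subject = [
    RepeatElement("MLV-ERV", "chr1", 1000, 8500),
    RepeatElement("MLV-ERV", "chr7", 91000, 8600),
]

print(jaccard_distance(reference, test_subject))
```

A distance of this kind only summarizes how much two landscapes overlap; deciding whether two subjects are "statistically different" would additionally require a reference distribution of distances among subjects of known common origin.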
The present methods can include the generation and use of graphical representations of Repeat Element Arrays, which are species specific and ordered genome units. The arrays can be generated using unbiased self-alignment and dot-matrix plot visualization of the type (based on primary sequence), position (relative to the NCBI reference genome, for example), and size (e.g., length or number of repeats) of the REs or TREs. Each array may be representative of a specific time/space (e.g., a specific age of the cell or organism or a specific time in culture, or a specific tissue, organ, or organism source, or specific conditions under which the sample was originally obtained). For example, as shown in
The RE Arrays have a number of uses, as described herein; for example, the (known or unexplored) polymorphisms in species-unique RE arrays can serve as novel identifiers of genomes from a cell or organism, with extraordinary levels of resolution and precision. In addition, within a species, functional variations in RE array configurations could be directly applied to diagnostics as well as to the general studies of normal and disease biology. Furthermore, digital forms of RE arrays can be used as a “RE array code” to identify individual humans, animals, and plants.
The inherent diversity of RE arrays, and their responsiveness to acquired RE activity, makes them particularly useful for: 1) genome identification, 2) diagnostics, 3) studies of normal and disease biology, and 4) development of digital RE array ID.
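The self-alignment and dot-matrix plot step mentioned above can be illustrated with a toy example. The following is a minimal sketch only: it uses exact matching of fixed-length words, whereas the alignment software and parameters actually used to generate RE arrays are not specified here, and the function names, the `window` parameter, and the toy sequence are assumptions made for illustration.

```python
def dot_matrix(seq, window=3):
    """Return the set of (i, j) pairs whose window-length words are identical.

    This is a word-match self-comparison: position i is compared with every
    position j of the same sequence. Repeats appear as off-diagonal dots.
    """
    n = len(seq) - window + 1
    index = {}
    for i in range(n):
        index.setdefault(seq[i:i + window], []).append(i)
    dots = set()
    for positions in index.values():
        for i in positions:
            for j in positions:
                dots.add((i, j))
    return dots


def render(seq, window=3):
    """Render the dot matrix as text: '*' for a match, '.' otherwise."""
    n = len(seq) - window + 1
    dots = dot_matrix(seq, window)
    rows = []
    for i in range(n):
        rows.append("".join("*" if (i, j) in dots else "." for j in range(n)))
    return "\n".join(rows)


# Toy sequence containing a repeated "ACG" word (hypothetical data).
print(render("ACGTTTACGAACGT"))
```

In such a plot the main diagonal is always filled, because every word matches itself; the positions and lengths of the off-diagonal runs are what carry the repeat information.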
The RE array can be stored, e.g., in electronic media such as a flash drive as well as on paper or other media. The RE array can also be represented electronically on a monitor or screen, such as on a computer monitor, a mobile telephone screen, or on a personal digital assistant (PDA) screen. The RE array can be further subjected to visual or optical analysis and comparison, e.g., with a laser scanner or image capture device, such as a charge-coupled device (CCD). Images on paper or other non-electronic media can be scanned, e.g., digitally, and then compared by machine. For example, these images can then be compared using standard pattern recognition software, such as fingerprint matching or facial recognition programs. Alternatively, the RE Arrays can also be analyzed and compared by computer in digital, electronic form without the need for a tangible printout or image represented on a computer or other screen or monitor.
The RE arrays can be generated using a computer system, e.g., as described in WO 2011/146263 and
The memory 1020 stores information within the system 1000. In some implementations, the memory 1020 is a computer-readable medium. The memory 1020 can include volatile memory and/or non-volatile memory.
The storage device 1030 is capable of providing mass storage for the system 1000. In one implementation, the storage device 1030 is a computer-readable medium. In various different implementations, the storage device 1030 may be a disk device, e.g., a hard disk device or an optical disk device, or a tape device.
The input/output device 1040 provides input/output operations for the system 1000. In some implementations, the input/output device 1040 includes a keyboard and/or pointing device. In some implementations, the input/output device 1040 includes a display device for displaying graphical user interfaces.
The features described can be implemented in digital electronic circuitry, or in computer hardware, software, firmware, or in combinations of them. The features can be implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine-readable storage device, for execution by a programmable processor; and features can be performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output. The described features can be implemented in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program includes a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Computers include a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer will also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
To provide for interaction with a user, the features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.
The features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them.
The components of the system can be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, e.g., a LAN, a WAN, and computers and networks that form the Internet.
The computer system can include clients and servers. A client and server are generally remote from each other and typically interact through a network, such as the described one. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The processor 1010 carries out instructions related to a computer program. The processor 1010 may include hardware such as logic gates, adders, multipliers and counters. The processor 1010 may further include a separate arithmetic logic unit (ALU) that performs arithmetic and logical operations.
The invention is further described in the following examples, which do not limit the scope of the invention described in the claims.
In almost all biomedical research fields involving animals, it is critical to clearly define and confirm the genetic constancy of animals employed in a wide range of experimental models for studying gene function, toxicity of candidate compounds, diseases, and others (Austin et al., 2004; Maronpot, 2013). The definition and degree of the genetic constancy of animals, which are subjected to various experiments, are tunable depending on the specific aim(s) of individual studies. For example, when a candidate compound for therapeutic drug development is evaluated for its side effects and/or toxicities, the genetic constancy of the experimental animals does not have to be highly stringent. Conversely, the majority of studies, which focus on understanding functions of genes using a pair of defective/mutated (conventional or targeted) and its matching control animals, rely on stringent genetic constancy for a proper evaluation of the experimental outcomes.
It is not uncommon to encounter variations in morphologic phenotypes among a population of a specific inbred mouse strain, such as C57BL/6J (Niu and Liang, 2009). Some of these phenotypic variations within individual inbred mouse populations are explained primarily by irreversible genetic drift events due to the genetic fixation of accumulated mutations, which are often discovered serendipitously or as outcomes of troubleshooting experiments (Taft et al., 2006). Taft et al. stated that the current repertoire of gene SNPs and other DNA markers (e.g., microsatellite elements) is not sufficient for screening genetic drift in mice (Taft et al., 2006). To circumvent the detrimental effects of cumulative genetic drift over time, key research mouse producers implemented control programs, such as the Genetic Stability Program (Jackson Laboratory) and the Genetic Monitoring Program (Taconic Biosciences, Inc). One of the key shared features of these programs is the cryopreservation of embryos as a future replacement of foundation mice. Jackson Laboratory reported that “Inbred strains within this program effectively remain genetically unchanged for at least the period of the program (projected 25 years)” (Taft et al., 2006).
As described herein, the reference mouse genome sequence (derived from C57BL/6J inbred mice) housed at the NCBI database has not yet been completely sequenced (Consortium, 2002). In addition, the NCBI's reference mouse chromosome Y, which was estimated to be ~16 million nucleotides in length up until early 2013, is now annotated to have ~92 million nucleotides. These findings illustrate the difficulties which are inherent to understanding genome/chromosome biology as well as decoding/sequencing the entire genetic information system of humans and animals. In addition to the estimated 20,000 conventional genes annotated in the reference mouse genome, the vast majority of the mouse and human genomes are occupied by a plethora of TREome members. In contrast to conventional genes, the TREome is inherently and highly diverse within the mouse as well as human populations (Batzer et al., 1996; Bennett et al., 2004; Boissinot et al., 2004; unpublished). In addition, it has been well-documented that some members of the TREome have “gene” sequences which code for functional proteins, such as the superantigen of mouse mammary tumor virus type-ERVs and human endogenous retroviruses (Holder et al., 2012; Lee et al., 2014; Schmitt et al., 2015). Furthermore, certain TREome members of mice and humans respond to a range of stressors, leading to an increase in their activity (Antony et al., 2011; Cho et al., 2008b). Acquired TREome (e.g., MLV-ERVs) activities during the life course of an inbred mouse could involve five critical processes: 1) DNA-dependent RNA polymerization (transcription), 2) protein synthesis from TREome genes, 3) RNA-dependent DNA polymerization (reverse transcription) in the cytoplasm, 4) virion assembly, and 5) “random” integration of a DNA copy into the genome. One of the critical impacts imposed by the accumulation of acquired TREome activities would be alterations in the TREome landscapes in the affected genomes.
In this study, we examined the extent of variation in the configurations among the inherent germ-line and acquired (temporally and spatially) somatic genomes of C57BL/6J inbred mice using murine leukemia virus-type ERV (MLV-ERV) sequences as a TREome landscaping probe. The findings from this study provide evidence that: 1) with regard to the TREome landscapes, inherent diversity is visible among the population of C57BL/6J inbred mice, as evidenced by the variations in the TREome landscapes of germ-line DNAs, 2) there are spatial variations in the TREome landscapes among different organs/tissues within the individual mice, probably due to the dynamic accumulation of acquired activity of certain TREome members, 3) in particular, there are more copies of MLV-ERV in the kidney genomes of 19-month old C57BL/6J male mice compared to the liver genomes of the same mice, 4) there are gender-specific TREome landscapes, suggesting the TREome's association with gender-specific phenotypes, and 5) distinct patterns of variations in the TREome landscape exist among the population of C57BL/6J inbred mice. One can assume that all, or at least the majority, of the information embedded in the genome of humans and mice participates in determining phenotypic details. In that case, surveillance of the inherent diversity and acquired activity of the TREome in the dynamic genomes of C57BL/6J inbred mice, as demonstrated in this study, would provide critical and valuable information for understanding the relationships between the genetic characteristics and phenotypes of inbred mice or research animals in general. We suggest that a mouse genetics surveillance system be established for a range of laboratory mouse strains which focuses on the inherent diversity and acquired activity of the TREome in conjunction with temporal and spatial variations in TREome landscapes.
This somewhat unbiased TREome-based genetics surveillance system would serve as a synergistic tool for the current monitoring systems (e.g., Genetic Stability Program, Genetic Monitoring Program) which primarily rely on the cryopreservation of embryos and survey for polymorphisms of conventional genes and microsatellites in the absence of complete reference mouse genome sequences. Once successfully developed and implemented, the TREome-based mouse genetic surveillance system could be applied to high-resolution genetic identification and monitoring of wide-ranging species, such as plants and their products, animals and their products, and humans, all of which harbor their own TREomes.
Materials and Methods
Animal Experiments
C57BL/6J inbred mice (females and males) of varying ages were purchased from the Jackson Laboratory (Bar Harbor, Me.; West Sacramento, Calif.) or obtained from Dr. David Pleasure at the University of California, Davis (UC Davis). All animals were provided with water and food ad libitum during their housing at a UC Davis facility where some of them were aged for an extended period of time. The animal experiment protocol was approved by the Animal Use and Care Administrative Advisory Committee of UC Davis. Animals were sacrificed by CO2 inhalation to collect sperm and/or tissues followed by snap-freezing in liquid nitrogen.
Genomic DNA Isolation and TREome (MLV-ERV) Landscaping by Inverse-PCR (I-PCR) Analyses
Snap-frozen sperm and somatic tissue samples were subjected to genomic DNA isolation using a DNeasy Tissue kit (Qiagen, Valencia, Calif.) and DNA samples were normalized to 20 ng/μl. As an initial step for the I-PCR analyses (
Real-Time Genomic DNA PCR Analyses of MLV-ERV Copy Numbers
For the kidney and liver genomic DNAs isolated from six 19-month old mice (3 females and 3 males), real-time genomic DNA PCR was performed in triplicate using a MX3005P instrument (Stratagene, Santa Clara, Calif.) with a reagent kit (Brilliant SYBR Green QPCR Master Mix) from Agilent (Santa Clara, Calif.) and 25 ng of each genomic DNA per reaction. Details for the primers and PCR conditions are listed in Table 1.
Copy Number Calculation and Statistical Analysis
The results from quantitative real-time DNA PCR analyses of MLV-ERVs were calculated as a relative copy number per single copy of the hypoxanthine phosphoribosyl transferase (HPRT) gene using a modified delta-delta CT method (2^(CT(HPRT) − CT(MLV-ERV))) (Livak and Schmittgen, 2001). A one-way ANOVA was used to determine the significance of differences in relative MLV-ERV copy number values between individual pairs of groups. Statistical significance was indicated when the P value was less than 0.05.
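The copy number calculation described above can be sketched in a few lines. The following is an illustrative sketch only: the function names, the choice to average replicate CT values before exponentiation, and all numeric values are assumptions and are not data from this study. The ANOVA helper returns only the F statistic; obtaining a P value would additionally require the F distribution, which is omitted here.

```python
from statistics import mean


def relative_copy_number(ct_hprt, ct_mlv_erv):
    """Relative MLV-ERV copy number per single copy of HPRT.

    Implements 2^(CT(HPRT) - CT(MLV-ERV)). Assumption: replicate CT values
    are averaged before the exponent is taken.
    """
    return 2 ** (mean(ct_hprt) - mean(ct_mlv_erv))


def one_way_anova_f(*groups):
    """F statistic of a one-way ANOVA (between-group MS / within-group MS).

    Raises ZeroDivisionError if there is no within-group variation.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical triplicate CT values for one genomic DNA sample.
ct_hprt = [25.0, 25.0, 25.0]
ct_mlv_erv = [20.0, 20.0, 20.0]
print(relative_copy_number(ct_hprt, ct_mlv_erv))

# Hypothetical relative copy numbers for two groups of three animals.
kidney = [1.0, 2.0, 3.0]
liver = [2.0, 3.0, 4.0]
print(one_way_anova_f(kidney, liver))
```

Because a lower CT value corresponds to more starting template, a target that crosses the threshold five cycles earlier than HPRT is estimated at 2^5 copies per HPRT copy, assuming equal and ideal amplification efficiency for both primer sets.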
Results
Germ-Line Variations in the Genome-Wide TREome Landscapes Among C57BL/6J Inbred Mice
Current understanding of inbred mouse genetics projects that the genomic configuration of germ-line cells from mice of an inbred strain is virtually identical (Beck et al., 2000). Snapshots of the TREome landscapes of the germ-line (sperm) genomic DNA samples isolated from three age groups (8-, 12-, and 20-weeks) of C57BL/6J inbred mice were taken using a conserved region of MLV-ERVs as a probe (
Spatial Variations in the TREome Landscapes Among Germ-Line and Somatic Genomes in a Single C57BL/6J Inbred Mouse
It is widely accepted that there are no significant changes in genome configuration, primarily in regard to the number and position information of nucleotides, during development and/or differentiation of an individual mouse or human (Giachino et al., 2013; Walsh et al., 1998). In this experiment, we investigated whether the structural configuration of the germ-line genome of an individual diversifies during development and/or differentiation, resulting in a pool of disparate TREome landscapes within somatic genomes. Genome-wide TREome landscapes of a set of 18 different somatic organs (13 non-lymphoid and 5 lymphoid) and sperm collected from a single male C57BL/6J inbred mouse (one of the 12-week olds above) were analyzed to examine spatial genomic variations within an individual. The TREome landscapes of the 13 non-lymphoid organs were highly variable and they were also different from the pattern of sperm although the profile of about a dozen I-PCR amplicons was shared among all of them (
Polymorphic TREome Landscapes in Different Brain Compartments of C57BL/6J Inbred Mice
To examine whether there are spatial variations in TREome landscapes in the brain, genomic DNA isolated from six different brain compartments (brain stem, cerebral cortex, corpus callosum, cerebellar hemisphere, hippocampus, and olfactory bulb) of C57BL/6J female mice (5-week old) was subjected to the I-PCR landscaping analysis (
Temporal Variations in TREome Landscapes in the Primary and Secondary Immune Organs of C57BL/6J Inbred Mice
The cells in immune organs, such as thymus (primary) and spleen (secondary), are constantly subjected to a wide range of intrinsic and external stressors, some of which may have the potential for stimulating TREome activity. Using the I-PCR protocol, we examined whether TREome landscapes of the thymus and spleen are temporally altered with specific patterns in four age groups (5-, 8-, 12-, and 20-weeks) of C57BL/6J male mice. Substantial variations were observed in the TREome landscapes in both types of immune organs among all four age groups of mice; however, no age group-specific or immune organ-specific I-PCR amplicon patterns were discernible (
Interestingly, the thymic TREome landscapes of the third mouse of the 20-week olds contained one unusually strong I-PCR amplicon band (
Variations in TREome Landscapes of Spatially Separated Organ Sets in C57BL/6J Inbred Mice
Certain types of organs (e.g., lymph nodes, mammary glands) in humans and mice are found in more than one location (e.g., left side, right side). In this experiment, we examined variations in TREome landscapes in a set of three lymph nodes (thoracic mammary, inguinal mammary, and mesenteric), a pair of mammary fat pads (#2-right and #4-right), and a pair of bone marrow samples (derived from left and right femurs) isolated from a 27-month old female mouse. Examination of I-PCR amplicon profiles revealed substantial variations in TREome landscapes among the individual sets of three lymph nodes and two mammary fat pads (
Gender- and Individual-Specific Variations in TREome Landscapes in C57BL/6J Inbred Mice
It has been reported that chromosome Y of C57BL/6J male mice is densely populated with a plethora of repetitive elements (Lee et al., 2013). In this experiment, we examined the differences in TREome landscapes between three females and three males using the kidney and liver samples from 19-month old C57BL/6J mice. Among other variations, there was one distinct I-PCR amplicon band which is found only in male mice in both tissues (
Currently, the normal and disease biology of laboratory mice and humans is explained primarily in the context of the function and polymorphism of conventional genes. Thus far, the majority of conventional gene-based attempts to decode the tangible mechanisms of normal and disease states and to identify diagnostic markers have been inconclusive or unsuccessful (Padyukov, 2013; Seok et al., 2013; Takao and Miyakawa, 2015a; Takao and Miyakawa, 2015b). Reportedly, laboratory mice and humans share ~80% of conventional gene sequences (Consortium, 2002; Guenet, 2005); this is inconsistent with the notion that phenotypic details are primarily determined by conventional genes when dramatic phenotypic distances exist between the two species. Considering the limited understanding of the complete genome information system of mice and humans, conventional gene-focused approaches would be insufficient for decoding the enormous scope of phenotypic details and their variations. The inherent TREome diversity of individual laboratory mouse strains is directly associated with the polymorphic protein coding potentials for TREome genes. In addition, the acquired activity of the inherently diverse TREome may play a role in fine-tuning the function of TREome genes as well as conventional genes, which reside near the new TREome positions, through their networks of transcription regulatory elements, contributing to strain-specific phenotypic details (Amid et al., 2009; Giardine et al., 2007).
The intricate and unexplored variations in the genomes of laboratory mouse strains are directly associated with two distinct, but interrelated, characteristics of the TREome. First, the information embedded in the TREome landscape of a mouse strain, which is defined by TRE information regarding type, copy number, and position, can be examined to understand how conventional gene(s) neighboring a genomic position of a specific TRE type are regulated. The finding from this study that TREome landscape patterns are different between the C3H/HeJ strain and its TLR4 wildtype control strain (C3H/HeOuJ) needs to be investigated further to determine whether the differences are linked to the expression of certain gene(s) other than TLR4 (Kamath et al., 2003). Similarly, confirmation of the impact of differences in the TREome landscapes between the CD14 knock-out and its backcross-control strain (C57BL/6J) on the expression of genes outside of the knock-out locus is deemed to be necessary. It is likely that some other knock-out and/or transgenic mouse strains need to be subjected to similar scrutiny of their genomic configurations with regard to the TREome landscape, in comparison to their control strains. Second, there are numerous and highly polymorphic TREome genes in the genomes of laboratory mouse strains which have not been fully identified, accounted for, or understood. Tangible coding potentials of TREome genes could be either presumed full-length or variable in length due to introduction of mutations over time. It has been demonstrated that certain TREome genes, such as the envelope genes of MLV-ERVs and MMTV-ERV SAg genes, play functional roles in biological processes (Bentvelzen, 1992; Huber et al., 1994; Kotzin et al., 1993). In addition, TREome (human endogenous retroviruses) gene isoforms isolated from a human burn patient's genomic DNA demonstrated differential potentials for regulating inflammatory mediators, such as IL-6 and IL-1β (Lee et al., 2014).
Despite the previous studies which reported a range of functionality of TREome genes in both mice and humans, unfortunately, they are often called long non-coding RNAs in current literature (Geisler and Coller, 2013; Gibb et al., 2015). In this study, polymorphisms in TREome genes in laboratory mouse strains are reflected in the identification of 183 isoforms of the MMTV-ERV SAg gene which was reported to play a critical role in shaping the systemic immune cell profile (Acha-Orbea and MacDonald, 1995; Kotzin et al., 1993; Tomonari et al., 1993). The differences in the profile of MMTV-ERV SAg gene isoforms between the C3H/HeJ strain and its TLR4 control (C3H/HeOuJ) suggest that a specific immune system would be developed within each strain due to the activity of a unique set of MMTV-ERV SAg genes, especially during T lymphocyte selection events. In order to confirm the data obtained from the TLR4 studies using C3H/HeJ and its control (C3H/HeOuJ) strains, the potential impacts of the differential MMTV-ERV SAg activities on immune function should be examined.
Despite the absence of a single, completely sequenced reference mouse genome, it is often stated that the population of laboratory mouse strains shares a high level of genome sequence similarity (Frazer et al., 2007; Kirby et al., 2010; Mekada et al., 2009). In addition, a recent change in the putative size of the NCBI's reference mouse chromosome Y from ~16 Mb (Build 37.2) to ~92 Mb (Annotation Release 105) suggests that more time is needed to confirm the size of each chromosome within individual laboratory mouse strains (Lee et al., 2013). In spite of the lack of the full sequence information from a single reference strain, current genetic monitoring systems for laboratory mice rely primarily on polymorphism data derived from limited sets of conventional genes and microsatellites to determine genetic uniformity/status/identity of a specific strain/substrain. The finding that the TREome (MLV-ERVs) sequences are more variable among 12 laboratory mouse strains, in comparison to conventional gene sequences, suggests that the TREome landscape contributes to the formation of unique phenotypic characteristics embedded in each strain. Furthermore, the discrepancy in TREome landscapes between the CD14 knock-out and its backcross-control strain (C57BL/6J) indicates that all genetically engineered mouse strains may need to be examined to confirm their genetic uniformity with their matching controls, outside of the individual targeted loci. In addition, the unexplained/unexpected phenotypic variations, which are frequently encountered in genetically engineered mouse strains, such as runt or normal weight of STAT-1−/− mice (Bona and Revillard, 2001; Kim et al., 2003), could be explained by checking the genomic configuration of the engineered mouse population. In certain circumstances, confirmation of uniformity within the entire genomes (minus targeted loci) may be necessary to validate the results collected from studies involving genetically engineered mouse strains.
Materials and Methods
Animal Experiments
The following mouse strains were purchased from the Jackson Laboratory: female C57BL/6J, C3H/HeJ, C3H/HeOuJ, and CD14 knock-out (B6.129S4-Cd14tm/frm/J). All animals were provided with water and food ad libitum at a UC Davis facility and some of the mice were aged for a period of time. The experimental protocol was approved by the Animal Use and Care Administrative Advisory Committee of UC Davis. Animals were sacrificed to collect tissues, which were then snap-frozen in liquid nitrogen.
Genomic DNA of Various Laboratory Mouse Strains
Snap-frozen tissue samples were subjected to genomic DNA isolation using a DNeasy Tissue kit (Qiagen, Valencia, Calif.) and the DNA samples were normalized to 20 ng/μl. In addition, genomic DNA from 63 laboratory mouse strains, which include nine 129 substrains, was purchased from the Jackson Laboratory (Bar Harbor, Me.). Genomic DNA from a C57BL/6J×129S1/SvImJF2/J (B6129SF2/J) mouse, an F2 hybrid from F1×F1 whose parents were C57BL/6J (female) and 129S1/SvImJ (male), was also obtained from the Jackson Laboratory. According to the information from the Jackson Laboratory's web site, the genomic DNA was isolated from either the brain or spleen of respective mouse strains. Gender identity of each DNA sample was confirmed by amplifying a region specific for mouse chromosome Y by PCR using a pair of primers (Table 2) followed by agarose gel electrophoresis.
Polymorphism Analysis of Genomic TREome (Murine Leukemia Virus-Type Endogenous Retrovirus [MLV-ERV]) Long Terminal Repeats (LTRs)
The polymorphic regions of the MLV-ERV LTRs were identified from the genomic DNAs of 12 laboratory mouse strains (Jackson Laboratory) by PCR using a set of primer pairs (Table 2) which were designed from a well-conserved region. Following ligation into a TA vector (Promega, Madison, Wis.), 24 colonies were picked from the MLV-ERV amplicons of each strain and plasmid DNAs were prepared using a QIAprep spin Miniprep kit (Qiagen) before sequencing (Molecular Cloning Laboratories, South San Francisco, Calif.). A set of unique MLV-ERV LTR sequences was compiled for each mouse strain by multiple alignment analysis using the Vector NTI program (Invitrogen, Carlsbad, Calif.). Within a set of unique MLV-ERV LTR sequences for each mouse strain, the occurrence frequency of each of the 256 possible four-nucleotide “words” (a word being a nucleotide sequence of specific length) at all four possible reading frames was counted using a program we developed. Within each strain, the occurrence frequency data for the individual words were normalized and converted into probability distribution function (PDF) values. For each word, the average and standard deviation of the PDF values from all 12 strains were calculated using Excel (Microsoft, Redmond, Wash.). Based on the assumption that a higher standard deviation for a word indicates more variation in that word, the extent of variation in each four-nucleotide word within the 12-strain population was visualized with a scale of gray shades (white: lowest variation; black: highest variation) on a 16×16 (=256) matrix. To examine/simulate diversity in conventional gene sequences in comparison to the MLV-ERV LTR sequences, the single nucleotide polymorphism (SNP) data for the GAPDH gene (~4.7 kb) among 19 laboratory mouse strains (A/J, C57BL/6J, 129X1/SvJ, AKR/J, BALB/cByJ, C3H/HeJ, CAST/EiJ, DBA/2J, FVB/NJ, MOLF/EiJ, NOD/ShiLtJ, SM/J, BTBR T<+> Itpr3<tf>/J, KK/H1J, LG/J, NZW/LacJ, PWD/PhJ, WSB/EiJ, and 129S1/SvImJ) were subjected to the same PDF analysis as above.
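The word-frequency analysis described above can be illustrated with a minimal Python sketch. It is not the program used in the study. It assumes that counting "at all four possible reading frames" is equivalent to sliding a four-nucleotide window one base at a time, that words containing ambiguous bases are skipped, and that the population standard deviation is used (the study does not state which Excel function was applied). The function names `word_pdf` and `variation_matrix` are hypothetical.

```python
from itertools import product
from statistics import pstdev

BASES = "ACGT"
# All 4^4 = 256 possible four-nucleotide words, in a fixed order.
WORDS = ["".join(p) for p in product(BASES, repeat=4)]


def word_pdf(sequences):
    """Return {word: probability} for one strain's set of unique sequences.

    Assumption: every overlapping 4-mer is counted, which covers all four
    reading frames of a four-nucleotide word.
    """
    counts = dict.fromkeys(WORDS, 0)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 3):
            word = seq[i:i + 4]
            if word in counts:  # skip words with ambiguous bases such as N
                counts[word] += 1
    total = sum(counts.values())
    return {w: (c / total if total else 0.0) for w, c in counts.items()}


def variation_matrix(strain_sequences):
    """Map {strain: [sequence, ...]} to a 16x16 grid of per-word standard deviations.

    Assumption: population standard deviation across strains.
    """
    pdfs = [word_pdf(seqs) for seqs in strain_sequences.values()]
    sd = [pstdev([pdf[w] for pdf in pdfs]) for w in WORDS]
    return [sd[row * 16:(row + 1) * 16] for row in range(16)]
```

Each cell of the returned grid could then be mapped to a gray shade, with larger values drawn darker, to reproduce the kind of 16×16 matrix described above.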
TREome Landscaping of Mouse Genomes Using MLV-ERV Sequences as a Probe
Genomic DNA (20 ng) was cut with Nco-I (New England Biolabs, Ipswich, Mass.) at 37° C. for 4 hours followed by self-ligation of the cut fragments using T4 ligase (Promega) overnight at 4° C. The TREome landscape data were collected by inverse PCR (I-PCR) amplification of the junctions spanning putative MLV-ERV integration loci using 2 μl of the ligation products, Taq polymerase (Qiagen), and a pair of inverse primers designed from the conserved MLV-ERV sequences. The primer sequences and PCR conditions are listed in Table 2. I-PCR amplicons were resolved in a 7.5% polyacrylamide gel for visualization.
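For illustration only, the restriction step can be mimicked in silico. The Python sketch below cuts a linear DNA sequence at every Nco-I recognition site (CCATGG, cut after the first C) and returns the resulting fragments. The function name `ncoi_digest` is hypothetical, the sketch works on one strand only, and it does not model self-ligation or amplification.

```python
import re

NCOI_SITE = "CCATGG"  # Nco-I recognition sequence
CUT_OFFSET = 1        # Nco-I cuts after the first C (C^CATGG)


def ncoi_digest(seq):
    """Return the fragments produced by cutting a linear sequence at Nco-I sites."""
    seq = seq.upper()
    # Cut position for each recognition site found in the sequence.
    cuts = [m.start() + CUT_OFFSET for m in re.finditer(NCOI_SITE, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


fragments = ncoi_digest("AAACCATGGTTTCCATGGAA")
# fragments == ["AAAC", "CATGGTTTC", "CATGGAA"]
```

In the I-PCR setting, a fragment that contains an MLV-ERV junction circularizes on self-ligation, so the size of its amplicon depends on the distance from the conserved primer sites to the nearest flanking Nco-I site. Integration loci with different flanking sequences are therefore expected to yield bands of different sizes.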
Polymorphism Analysis of Genomic TREome (Mouse Mammary Tumor Virus-Type Endogenous Retrovirus [MMTV-ERV]) Superantigen (SAg) Genes
The MMTV-ERV SAg coding sequences were PCR amplified from the genomic DNA (46 of 57 mouse strains) obtained from the Jackson Laboratory using a set of primers (Table 2). Following cloning of the SAg amplicons using a pGEM-T Easy kit from Promega, plasmid DNAs were prepared for 12 colonies picked from each strain using a QIAprep spin miniprep kit and sequenced (Molecular Cloning Laboratories). Eleven mouse strains had no visible SAg coding sequences amplified (C57L/J, CASA/Rk, CAST/EiJ, CZECHII/EiJ, Mus caroli/EiJ, Mus Pahari/Ei, PANCEVO/Ei, PERA/EiJ, PERC/Ei, SKIVE/Ei, and TIRANO/Ei). Within each mouse strain, following identification of a set of unique MMTV-LTR sequences by multiple alignment analyses using Vector NTI (Invitrogen), MMTV-ERVs' SAg open reading frames were examined and translated in silico. Polymorphisms in the putative SAg proteins were visualized using a function in the Excel program (Microsoft).
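The two computational steps above, compiling unique sequences and translating open reading frames in silico, can be sketched in Python as follows. This is a simplified illustration, not the procedure used in the study: unique sequences are found by exact string comparison rather than by multiple alignment in Vector NTI, and the open reading frame is assumed to start at the first ATG and end at the first in-frame stop codon. The function names `unique_sequences` and `translate_orf` are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def unique_sequences(clones):
    """Return distinct clone sequences in first-seen order (exact matches only)."""
    seen = []
    for seq in clones:
        seq = seq.upper()
        if seq not in seen:
            seen.append(seq)
    return seen


def translate_orf(seq):
    """Translate from the first ATG to the first in-frame stop codon.

    Codons containing ambiguous bases are reported as "X".
    Returns an empty string if no ATG is present.
    """
    seq = seq.upper()
    start = seq.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate_orf("CCATGAAATGGTAAGG")` reads ATG, AAA, and TGG before the TAA stop codon and returns `"MKW"`. Translated isoforms from different strains could then be compared position by position to tabulate polymorphisms in the putative SAg proteins.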
Results
Diversity in TREome (MLV-ERV) Profiles Among 12 Laboratory Mouse Strains
Similarity among the genomes of a range of laboratory mouse strains has been examined primarily based on SNPs (Frazer et al., 2007; Kirby et al., 2010; Mekada et al., 2009). However, it needs to be noted that the genome similarity data were derived mostly from the sequences of conventional genes. To evaluate the extent of TREome diversity among different laboratory mouse strains in comparison to conventional genes, genomic profiles of MLV-ERVs, a mouse TREome family, were examined in 12 laboratory mouse strains (129P1/ReJ, 129X1/SvJ, A/HeJ, A/J, AKR/J, ALR/Lt, BALB/cJ, BDP/J, BPH/2J, BUB/BnJ, C3H/HeJ, and C3H/HeOuJ). The MLV-ERV LTR sequences were isolated from the genomic DNA of each strain and subjected to a probability distribution function analysis for the entire set of all possible four-nucleotide words in order to compute and visualize the variation levels of individual words on a 16×16 (=256) matrix. In contrast to the overall low variation matrix of GAPDH genes derived from 20 laboratory mouse strains including one reference, there were relatively high variations in the vast majority of the words from the MLV-ERV LTR sequences from the 12 inbred mouse strains (
Polymorphic TREome Landscapes Among 56 Laboratory Mouse Strains
To further study genomic diversity with regard to TREome profiles among laboratory mouse strains, TREome landscapes were visualized from the genomic DNA of 56 laboratory mouse strains using MLV-ERVs as a probe. At first glance, none of the 56 strains share the same TREome landscape patterns (
Dissimilar TREome Landscapes Between the C3H/HeJ Strain and its TLR4 Wildtype Control, C3H/HeOuJ
To confirm the initial finding (
Un-Identical TREome Landscapes Between the CD14 Knock-Out Strain (CD14−/−) and its Backcross-Control, C57BL/6J (CD14+/+) Strain
Genetically engineered mouse strains (transgenic or knock-out for a specific target gene) have served as critical and popular components of modern biomedical research efforts (Fox et al., 2006; Houdebine, 2007; Pearson et al., 2008). Typically, the inbred mouse strain, which is introduced during the backcrossing process of generating a genetically engineered strain, is chosen to control the modified/mutated target gene (Seong et al., 2004). In this study, we examined whether there are distinct variations in TREome landscapes between the genomes of a CD14 knock-out strain (12-week-old female) and its backcross-control strain (C57BL/6J; 12-week-old female) (Haziot et al., 1996; Poltorak et al., 1998). Within the genomes of all six organs from each strain, two unique TREome/MLV-ERV amplicon bands were visible only in the CD14 knock-out strain, while two other TREome/MLV-ERV amplicon bands were found only in the C57BL/6J backcross-control strain (
Variations in TREome Landscapes in 129 Mouse Substrains
For genetic engineering of transgenic and knock-out mouse models, such as the CD14 knock-out strain described above, embryos of various 129 mouse substrains have been extensively used as a target for the initial manipulation of the genome (Threadgill et al., 1997). Substantial levels of genetic variations among the 129 mouse substrains were reported to be linked to either accidental or intentional outcrossing(s) (Simpson et al., 1997). In this study, genomic DNA from nine 129 substrains (129S1/SvImJ, 129/Sv-Lyntm1Sor/J, 129S1/Sv-Oca2+ Tyr+ KitlSl-J/J, 129S4/SvJae-Inhbbtm1Jae/J, 129S4/SvJae-Pparatm1Gonz/J, 129S4/SvJaeSor-Gt(ROSA)26Sortm1(FLP1)Dym/J, 129S6/SvEv-Mostm1Ev/J, 129P1/ReJ, and 129X1/SvJ) was examined to survey variations in the TREome landscapes using an MLV-ERV probe. Overall, all nine 129 substrains shared a common TREome landscape pattern (
TREome Landscape-Based Surveillance of Genome-Crossing Between Two Mouse Strains
It has been a common practice to backcross chimeric mice, which were derived from genetically targeted embryonic genomes of a 129 substrain, with the C57BL/6J strain to establish a stable strain (Hedrich, 2004; Threadgill et al., 1997). In this study, we examined whether the genome-crossing events between two mouse strains (C57BL/6J×129S1/SvImJ) are reflected in the TREome landscapes of the hybrid offspring. With regard to the TREome landscape, an F2 hybrid mouse (male), which was derived from an initial crossing of C57BL/6J (female) and 129S1/SvImJ (male), was compared to a C57BL/6J mouse (male) and a 129S1/SvImJ mouse (male). Although for the most part, the pattern of the F2 hybrid TREome landscape displayed the bands from both C57BL/6J and 129S1/SvImJ genomes, it lacked certain bands which were specific only for the individual parental strains (
Polymorphisms in a “TREome Gene” (MMTV-ERV SAg) Among Laboratory Mouse Strains
Much of the focus of recent advancements in the genomics and bioinformatics fields hinges on the notion that sequence polymorphisms, including small RNAs, and relevant functions of conventional genes are responsible for phenotypic variations in both normal and disease biology (Mardis, 2008; Mu and Zhang, 2012). To evaluate whether polymorphisms in “TREome genes” could contribute to variable phenotypes among the mouse population, we examined sequence diversity in MMTV-ERV SAg genes, a well-studied immune-regulatory TREome gene (Lee et al., 2011; Peters et al., 1983), among 57 laboratory mouse strains. Only 46 of the 57 mouse strains yielded visible MMTV-ERV LTR amplicons which are presumed to harbor the SAg gene open reading frame (ORF). A high level of SAg gene polymorphism was indicated by the finding that at least one (up to eight) unique SAg coding sequence was identified within the genomes of the 46 individual mouse strains (
The TREome landscapes of varying areas (normal, precancer, and tumor) in a biopsy sample from a subject diagnosed with breast cancer were evaluated using I-PCR as described above (
As shown in
The methods described above were used to evaluate the effects of injury stress on TREome landscape using HERV sequences as probes in a series of blood samples from a subject after a burn injury. As shown in
The methods described above were used to analyse the TREome in skin and brain of C57BL/6 inbred mice as they aged. The results, shown in
It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.
This invention was made with Government support under Grant No. GM071360 awarded by the National Institutes of Health. The Government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2016/057390 | 10/17/2016 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62242858 | Oct 2015 | US |