This specification describes technologies generally relating to modeling enhancers, and in particular enhancer selectivity, using machine learning.
The instant application contains a Sequence Listing that has been submitted electronically in XML format and is hereby incorporated by reference in its entirety. Said XML file, created on Oct. 3, 2023, is named 128604-5008-PR.xml and is 9,574 bytes in size.
Enhancers are DNA regulatory elements that stimulate transcription, somewhat independent of their position and orientation with respect to the transcriptional initiation site. See, Banerji et al., "Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences," Cell 27, pp. 299-308. Typical enhancers span 200-1,000 base pairs and bind to dozens of sequence-specific proteins including transcription factors. See, Carroll et al., 2001, "From DNA to diversity: Molecular genetics and the evolution of animal design," Malden: Blackwell Science, 192 pages; and Davidson, 2001, Genomic regulatory systems: Development and evolution, San Diego: Academic Press, 261 pages.
To measure enhancer activity, assays have been developed to study the activity of yeast, Drosophila and human gene regulatory elements on a genome-wide scale. However, discovery of sequence determinants of enhancer mediated gene expression, including tissue specific enhancer mediated gene expression, using only genomic sequences is made difficult by the fact that the genome is repetitive and has evolved to perform multiple functions. Furthermore, the human genome is too short to encode even all combinations, orientations and spacings of the approximately 1,639 human transcription factors in multiple independent sequence contexts. Thus, despite the information generated by genome-scale experiments, most sequence determinants that drive the activity of enhancers, including tissue specific activity, remain unknown. Accordingly, learning the rules by which enhancers help to determine where and when genes are expressed has proven difficult, despite the availability of full genome sequences of several mammals, extensive maps of genomic features and genome-scale data about transcription factor protein expression levels and transcription factor DNA binding in vitro. See Sahu et al., 2022, "Sequence determinants of human gene regulatory elements," Nature Genetics 54, pp. 283-294.
Discovery of sequence determinants of enhancer mediated gene expression, including tissue specific enhancer mediated gene expression, is further complicated by the intricacy of the binding site (e.g., transcription factor binding site) grammar of individual enhancers. Enhancers typically have clusters of such binding sites, the presence and arrangement of which is defined by a grammar that affects the overall ability of a given enhancer to promote gene expression and, in some instances, the tissue specificity of such gene expression. Models for this grammar range from a simple linear "billboard" model to an "enhanceosome" model. In the billboard model, each binding site within an enhancer is flexibly disposed: the proteins bound to the enhancer do not operate as a single unit with a strictly defined overall architecture, but rather as an ensemble of separately acting factors, or small groups of factors, that independently interact with their targets. In the enhanceosome model, there is a high degree of cooperativity between enhancer-bound proteins, such that alterations in individual binding sites can have drastic effects on enhancer output. The enhanceosome and billboard models represent two extremes of a continuum that probably describes most cellular enhancers. Sub-elements of a particular regulatory region might exhibit critical cooperative interactions between some of the factors, while other portions of the regulatory region are loosely structured. See, Arnosti and Kulkarni, 2005, "Transcriptional Enhancers: Intelligent Enhanceosomes or Flexible Billboards?," Journal of Cellular Biochemistry 94, pp. 890-898, which is hereby incorporated by reference. Thus, to improve tissue specific gene expression, it is necessary to learn the grammar of enhancers, including the complex grammar that arises when such enhancers include or rely on binding components that are best described by the enhanceosome model.
Given the above background, systems and methods for modeling enhancer selectivity are needed in the art.
The present disclosure addresses the above-identified shortcomings by providing systems and methods for modeling enhancer activity using a training dataset comprising training enhancer sequences. For each respective training enhancer sequence, a corresponding measured activity of the respective sequence in each of one or more in vivo states is obtained. In instances where activity in more than one in vivo state is acquired, measured selectivity can be determined between any combination of measured states. For instance, if activity is measured for two in vivo states, selectivity between the two in vivo states can be determined by comparing the measured activity for the two in vivo states. In some embodiments, the activity for a sequence is a log ratio of RNA counts to DNA counts for the given sequence. Advantageously, each training enhancer sequence is a tandem n-mer repeat. By using tandem n-mer repeats, a very large library of training enhancer sequences can be studied in order to identify the binding components within enhancers and their grammar. This library is studied in the present disclosure using models that comprise a plurality of parameters. Such a model is trained by inputting each training enhancer sequence into the model thereby obtaining as output from the model, for each sequence, a predicted amount of activity (e.g., transcriptional activity) of the respective sequence in each of the one or more in vivo states. The parameters of the model are trained (refined) based on differentials between the corresponding measured amounts of activity and the corresponding predicted amounts of activity (e.g., in one of the one or more in vivo states) for the training enhancer sequences.
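The log ratio of RNA counts to DNA counts mentioned above can be sketched as follows. The log base and the pseudocount below are illustrative assumptions, not requirements of the disclosure:

```python
import math

def enhancer_activity(rna_count: int, dna_count: int, pseudocount: float = 1.0) -> float:
    """Activity of one enhancer sequence as the log ratio of RNA to DNA counts.

    The pseudocount (an assumption of this sketch) guards against division
    by zero for sequences with very low counts.
    """
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))

# A sequence transcribed roughly 8x more often than it is represented
# in the input DNA pool has an activity of about 3 (log base 2):
act = enhancer_activity(rna_count=800, dna_count=100)
```

A sequence whose RNA output simply mirrors its DNA representation scores near zero under this measure, so positive values indicate enhancement.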
In accordance with the present disclosure, test enhancer sequences are generated using the trained model and their activity is assayed in an in vivo assay thereby screening for improved enhancer activity. By optimizing the model for activity in a particular in vivo state, such as in a particular tissue, it is possible to produce enhancers that are specific to particular in vivo states such that they exhibit high transcriptional activity in one in vivo state (e.g., a first cell line, a first tissue) and low transcriptional activity in another in vivo state (e.g., a second cell line, a second tissue).
In further detail, an aspect of the present disclosure provides a method of screening for improved enhancer activity. A first training dataset is obtained in electronic form. The first training dataset comprises a first plurality of training enhancer sequences and, for each respective training enhancer sequence in the first plurality of training enhancer sequences, a corresponding measured amount of activity of the respective training enhancer sequence in each of one or more in vivo states (e.g., 1, 2, 3, 4, 5, 6, 7, 8 or more in vivo states). Each training enhancer sequence in the first plurality of training enhancer sequences is a tandem n-mer repeat sequence.
In some embodiments, the first plurality of training enhancer sequences comprises 10,000 or more training enhancer sequences, 20,000 or more training enhancer sequences, 30,000 or more training enhancer sequences, 40,000 or more training enhancer sequences, 50,000 or more training enhancer sequences, 60,000 or more training enhancer sequences, 70,000 or more training enhancer sequences, 80,000 or more training enhancer sequences, 90,000 or more training enhancer sequences, or 100,000 or more training enhancer sequences.
In some embodiments, each training enhancer sequence in the first plurality of training enhancer sequences comprises 5 to 20 contiguous repeats of a different n-mer, where n is a positive integer between 4 and 20.
In some embodiments, each training enhancer sequence in the first plurality of training enhancer sequences comprises 5 to 20 contiguous repeats of a different 10-mer.
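The construction of such a tandem-repeat training library can be sketched as follows. The repeat count of 12, the library size, and random sampling of 10-mers are illustrative assumptions for this sketch:

```python
import random

def tandem_repeat(n_mer: str, copies: int) -> str:
    """Build a training enhancer sequence as contiguous repeats of one n-mer."""
    return n_mer * copies

def random_nmer_library(n: int = 10, copies: int = 12, size: int = 5, seed: int = 0) -> dict:
    """Sample `size` distinct random n-mers and expand each into a tandem repeat.

    Returns a mapping from each n-mer to its tandem-repeat enhancer sequence.
    """
    rng = random.Random(seed)
    nmers = set()
    while len(nmers) < size:
        nmers.add("".join(rng.choice("ACGT") for _ in range(n)))
    return {m: tandem_repeat(m, copies) for m in sorted(nmers)}

library = random_nmer_library()
```

Because each sequence is one n-mer repeated back-to-back, every binding-site candidate appears in a fixed, known spacing, which is what makes these libraries informative about binding components and their grammar.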
In some embodiments, each training enhancer sequence in the first plurality of training enhancer sequences consists of between 15 residues and 200 residues.
In some embodiments, each respective training enhancer sequence in the first plurality of training enhancer sequences is operably connected to a common promoter and a common payload. Each respective in vivo state in the one or more in vivo states is an amount of abundance of the common payload (e.g., transcriptional expression, translation abundance, etc.) in a corresponding tissue. In some such embodiments, the tissue is brain tissue. In some embodiments the tissue is liver tissue. In some such embodiments, the tissue is blood, brain, colon, diaphragm, gastrocnemius muscle, heart, inguinal adipose, kidney, lung, lymph node, skin, small intestine, spinal cord, spleen, trachea, liver, stomach, large intestine, pancreas, gallbladder, bladder, eyes, thymus, or adrenal gland.
In some embodiments, each respective training enhancer sequence in the first plurality of training enhancer sequences is operably connected to a common promoter and a common payload under a particular transcription factor. Each respective in vivo state in the one or more in vivo states is an amount of expression of the common payload in a corresponding cell line originating from a particular tissue in conjunction with the particular transcription factor. In some embodiments the cell line is mouse primary neurons, HepG2, H4, or K-562. In some embodiments the particular transcription factor is FOS, FOSB::JUNB, FOSL2::JUNB, FOSL1::JUNB, FOS::JUN, FOSL1::JUN, BATF3, BNC2, ZBTB12, HSF4, HSF1, HSF2, REL, RELA, TEAD3, TEAD1, TP53, TP63, TP73, SMAD5, SMAD2, ZNF416, GRHL2, TFCP2, MLX, ARNT::HIF1A, or HIF1A.
In some embodiments, the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of abundance of the respective training enhancer sequence in each of the one or more in vivo states comprises a log fold change between (i) an abundance of the respective training enhancer sequence measured in the first in vivo state and (ii) an abundance of the respective training enhancer sequence measured in the second in vivo state.
In some embodiments, the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of abundance of the respective training enhancer sequence in each of the one or more in vivo states comprises a binary indication of measured activity (e.g., transcriptional activity) between the first in vivo state and the second in vivo state. Thus, if more activity is observed in the first in vivo state the binary indication has a first value (e.g., “1”) and if more activity is observed in the second in vivo state, the binary indication has a second value (e.g., “0”).
In some embodiments, the corresponding measured amount of abundance of the respective training enhancer sequence in each of the one or more in vivo states comprises a corresponding measured abundance of the respective training enhancer sequence in each of the one or more in vivo states.
In some embodiments, the corresponding measured abundance of the respective training enhancer sequence in an in vivo state in the one or more in vivo states is a corresponding averaged or summed count of a number of barcoded instances of the respective training enhancer sequence found in a corresponding plurality of sequence reads of RNA harvested from each replica in a plurality of replicas of a plurality of cells that are representative of the in vivo state and that are exposed to the respective training enhancer sequence. In some such embodiments, the corresponding averaged or summed count is log normalized by an amount of the respective training enhancer sequence exposed to each replica of the plurality of cells.
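A minimal sketch of the summed, log-normalized barcode counting described above. The pseudocount of 1 and the log base 2 are illustrative assumptions of this sketch:

```python
import math
from statistics import mean

def measured_abundance(barcode_counts: dict, input_dna: float) -> float:
    """Summed barcode-level RNA counts for one enhancer in one replica,
    log-normalized by the amount of input enhancer DNA exposed to the cells."""
    total = sum(barcode_counts.values())
    return math.log2((total + 1) / (input_dna + 1))

def replica_average(per_replica_abundances: list) -> float:
    """Average the normalized abundance across replicas of the same in vivo state."""
    return mean(per_replica_abundances)

# Two barcodes assigned to the same enhancer, counted in one replica:
abundance = measured_abundance({"bc1": 30, "bc2": 33}, input_dna=31)
```

Averaging across replicas dampens replica-to-replica noise before the value is used as a training label.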
In some embodiments, each respective training enhancer sequence in the plurality of respective training enhancer sequences is uniquely assigned a corresponding plurality of barcodes comprising 5 or more, 10 or more, 20 or more, 50 or more, 75 or more, or 100 or more barcodes. Each replica of the plurality of cells representative of the in vivo state exposed to the respective training enhancer sequence is, for each respective barcode uniquely assigned to the respective training enhancer sequence, one or more copies of the respective training enhancer sequence tagged with the respective barcode. In some such embodiments, each barcode in the corresponding plurality of barcodes encodes a unique predetermined value selected from the set {1, . . . , 1024}, {1, . . . , 4096}, {1, . . . , 16384}, {1, . . . , 65536}, {1, . . . , 262144}, {1, . . . , 1048576}, {1, . . . , 4194304}, {1, . . . , 16777216}, {1, . . . , 67108864}, or {1, . . . , 1×10¹²}.
In some embodiments, the corresponding plurality of sequence reads of RNA harvested from each replica in the plurality of replicas of a plurality of cells comprises at least 1000, at least 2000, at least 3000, at least 4000, at least 5000, at least 6000, at least 7000, at least 8000, at least 9000, at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1 million, at least 2 million, at least 3 million, at least 4 million, at least 5 million, at least 6 million, at least 7 million, at least 8 million, at least 9 million, or more sequence reads.
In some embodiments, the corresponding plurality of replicas of the plurality of cells comprises 2, 3, 4, 5, 6, 7, 8, 9, or 10 replicas.
In some embodiments, the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of abundance of the respective training enhancer sequence in each of the one or more in vivo states comprises a scalar indication of measured activity (e.g., transcriptional activity) between the first in vivo state and the second in vivo state. Thus, if 2.5 times more activity is observed in the first in vivo state than in the second in vivo state, the scalar indication has a value of 2.5.
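The two-state comparisons described in the embodiments above (log fold change, binary indication, and scalar indication) can be sketched as:

```python
import math

def log_fold_change(abundance_state1: float, abundance_state2: float) -> float:
    """Log2 fold change between abundances measured in two in vivo states."""
    return math.log2(abundance_state1 / abundance_state2)

def binary_label(activity_state1: float, activity_state2: float) -> int:
    """1 if more activity is observed in the first state, 0 otherwise."""
    return 1 if activity_state1 > activity_state2 else 0

def scalar_label(activity_state1: float, activity_state2: float) -> float:
    """Ratio of activity in the first state to the second (e.g., 2.5)."""
    return activity_state1 / activity_state2
```

The three labelings trade off information content against robustness: the binary label is the least sensitive to measurement noise, while the scalar and log fold change labels preserve the magnitude of the selectivity.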
A model comprising a plurality of parameters is trained by a procedure comprising inputting a sequence of each respective training enhancer sequence in the first plurality of training enhancer sequences into the model. The model applies the plurality of parameters to the sequence of each respective training enhancer sequence to generate as output from the model, for each respective training enhancer sequence in the first plurality of training enhancer sequences, a corresponding predicted amount of activity of the respective training enhancer sequence in at least a first in vivo state in the one or more in vivo states. The procedure further refines the plurality of parameters based on a differential between the corresponding measured amount of activity and the corresponding predicted amount of activity (in at least the first in vivo state) for each respective training enhancer sequence in the first plurality of training enhancer sequences.
In some such embodiments, the model is a support vector machine. In some such embodiments, the support vector machine comprises a gapped k-mer kernel.
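A gapped k-mer kernel of the kind used with such support vector machines (e.g., in gkm-SVM) can be sketched as follows. The parameter choice l = 6, k = 4 is an illustrative assumption:

```python
from collections import Counter
from itertools import combinations

def gapped_kmer_features(seq: str, l: int = 6, k: int = 4) -> Counter:
    """Count gapped k-mers: for every length-l window, every pattern of k
    informative positions (the remaining l - k positions are wildcards)."""
    feats = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in combinations(range(l), k):
            # Feature key: which relative offsets are kept, and their bases.
            feats[tuple((i, window[i]) for i in keep)] += 1
    return feats

def gkm_kernel(a: str, b: str, l: int = 6, k: int = 4) -> int:
    """Kernel value between two sequences: dot product of gapped k-mer counts."""
    fa = gapped_kmer_features(a, l, k)
    fb = gapped_kmer_features(b, l, k)
    return sum(count * fb[key] for key, count in fa.items())
```

Tolerating l - k mismatched positions per window is what lets the kernel recognize degenerate transcription factor binding sites that exact k-mer counting would miss.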
In some such embodiments, the model is a convolutional neural network.
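The training procedure described above can be sketched with a deliberately simplified surrogate: a linear model over one-hot-encoded sequences, refined by gradient descent on the differential between measured and predicted activity. The disclosure contemplates support vector machines and convolutional neural networks; this sketch only illustrates the refinement loop, and the toy sequences and learning rate are assumptions:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Flattened one-hot encoding of a DNA sequence (4 channels per position)."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x.ravel()

def train(seqs, measured, lr=0.1, epochs=200, seed=0):
    """Refine model parameters on the measured-versus-predicted differential
    (here, gradient descent on squared error for a linear surrogate model)."""
    rng = np.random.default_rng(seed)
    X = np.stack([one_hot(s) for s in seqs])
    y = np.asarray(measured, dtype=float)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        pred = X @ w                        # predicted activity per sequence
        grad = X.T @ (pred - y) / len(y)    # the differential drives the update
        w -= lr * grad
    return w

seqs = ["ACGTACGT", "TTTTAAAA", "GGGGCCCC", "ACACACAC"]
measured = [2.0, -1.0, 0.5, 1.0]
w = train(seqs, measured)
preds = np.stack([one_hot(s) for s in seqs]) @ w
```

A convolutional network replaces the single weight vector with learned sequence filters, but the loop (predict, compare with the measurement, update parameters) is the same.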
In some embodiments, the plurality of parameters comprises at least 100 parameters, at least 1000 parameters, at least 5000 parameters, at least 10,000 parameters, at least 100,000 parameters, at least 250,000 parameters, at least 500,000 parameters, or at least 1,000,000 parameters.
In some embodiments, the method further comprises obtaining a second training dataset, in electronic form. The second training dataset comprises a second plurality of training enhancer sequences and, for each respective training enhancer sequence in the second plurality of training enhancer sequences, a corresponding measured amount of activity (e.g., transcriptional activity, translation activity, etc.) of the respective training enhancer sequence in each of the one or more in vivo states. Each training enhancer sequence in the second plurality of training enhancer sequences is an endogenous enhancer (e.g., has at least 60 percent, at least 70 percent, at least 80 percent, at least 90 percent, or at least 99 percent sequence identity to the sequence of a naturally occurring enhancer). The procedure further comprises, after the inputting and refining described above, inputting a sequence of each respective training enhancer sequence in the second plurality of training enhancer sequences into the model thereby obtaining as output from the model, for each respective training enhancer sequence in the second plurality of training enhancer sequences, a corresponding predicted amount of activity of the respective training enhancer in each of the one or more in vivo states. The plurality of parameters is then refined based on a differential between the corresponding measured amount of activity and the corresponding predicted amount of activity for each respective training enhancer sequence in the second plurality of training enhancer sequences (in at least one of the one or more in vivo states).
In some embodiments, the second plurality of training enhancer sequences comprises 10,000 or more training enhancer sequences, 20,000 or more training enhancer sequences, 30,000 or more training enhancer sequences, 40,000 or more training enhancer sequences, 50,000 or more training enhancer sequences, 60,000 or more training enhancer sequences, 70,000 or more training enhancer sequences, 80,000 or more training enhancer sequences, 90,000 or more training enhancer sequences, or 100,000 or more training enhancer sequences.
A plurality of test enhancer sequences is generated using the trained model. In some such embodiments, the plurality of test enhancer sequences comprises 10 or more, 100 or more, 1000 or more, or 10,000 or more test enhancer sequences.
In some such embodiments, a feature attribution method is applied to a test nucleic acid sequence (e.g., an endogenous enhancer) using the trained model to obtain a predictive weight matrix that provides a predictive contribution of every residue position in the test nucleic acid sequence to the plurality of in vivo states. In such embodiments, the generating the plurality of test enhancer sequences is in accordance with at least a portion of the predictive weight matrix. In some such embodiments, the model is a convolutional neural network and the feature attribution method is an integrated gradients feature attribution method.
In some embodiments, a test enhancer sequence in the plurality of test enhancer sequences comprises a plurality of residue positions. Each respective residue position in the plurality of residue positions is populated with one of four naturally occurring nucleotides based on a corresponding probability for the respective residue position given by the predictive weight matrix.
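Populating residue positions according to such a predictive weight matrix can be sketched as follows. The example matrix below is hypothetical; in practice each row would come from, e.g., normalized per-position attributions produced by the feature attribution method:

```python
import random

BASES = "ACGT"

def sample_sequence(weight_matrix, seed: int = 0) -> str:
    """Populate each residue position with one of the four nucleotides, drawn
    according to that position's probabilities in the predictive weight matrix.

    `weight_matrix` is one row per residue position, each row giving the
    probabilities (or relative weights) for A, C, G, T at that position.
    """
    rng = random.Random(seed)
    return "".join(rng.choices(BASES, weights=row)[0] for row in weight_matrix)

# Hypothetical 4-position matrix strongly favoring "ACGT":
pwm = [(0.97, 0.01, 0.01, 0.01),
       (0.01, 0.97, 0.01, 0.01),
       (0.01, 0.01, 0.97, 0.01),
       (0.01, 0.01, 0.01, 0.97)]
seq = sample_sequence(pwm)
```

Sampling many sequences from the same matrix yields a library of test enhancer candidates concentrated around the residues the model found most predictive, while still exploring nearby variants.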
In some embodiments, the trained model is used to generate the plurality of test enhancer sequences using an input optimization algorithm.
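One simple form of input optimization is greedy hill climbing over single-nucleotide substitutions scored by the trained model. The sketch below substitutes a toy scoring function for the trained model; the sequence length, iteration count, and scoring rule are assumptions:

```python
import random

BASES = "ACGT"

def optimize_sequence(score, length: int = 12, iters: int = 300, seed: int = 0) -> str:
    """Input optimization sketch: start from a random sequence and accept a
    single-nucleotide substitution whenever it does not lower the model's
    predicted activity (`score` stands in for the trained model)."""
    rng = random.Random(seed)
    seq = [rng.choice(BASES) for _ in range(length)]
    best = score("".join(seq))
    for _ in range(iters):
        i = rng.randrange(length)
        old = seq[i]
        seq[i] = rng.choice(BASES)
        candidate = score("".join(seq))
        if candidate >= best:
            best = candidate      # keep the substitution
        else:
            seq[i] = old          # revert the substitution
    return "".join(seq)

# Toy surrogate for the trained model: reward "CG" dinucleotides.
toy_score = lambda s: sum(s[i:i + 2] == "CG" for i in range(len(s) - 1))
designed = optimize_sequence(toy_score)
```

Gradient-based input optimization works on the same principle but updates a continuous relaxation of the one-hot sequence using the model's gradients rather than discrete substitutions.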
In some embodiments, the trained model is used to generate the plurality of test enhancer sequences using a bit diffusion algorithm.
An activity (e.g., transcriptional activity, translation activity, etc.) of each respective test enhancer sequence in the plurality of test enhancer sequences in an in vivo assay is determined thereby screening for improved enhancer activity.
In some such embodiments, the determining the activity of each respective test enhancer sequence in the plurality of test enhancer sequences in an in vivo assay comprises exposing the respective test enhancer sequence operably connected to the common promoter and the common payload to each in vivo state in the one or more in vivo states and measuring an abundance of the common payload in each respective in vivo state in the one or more in vivo states.
Another aspect of the present disclosure provides a system comprising a processor and a memory storing instructions, which when executed by the processor, cause the processor to perform steps comprising any of the methods described in the present disclosure.
Still another aspect of the present disclosure provides a non-transitory computer-readable medium storing computer code comprising instructions that, when executed by one or more processors, cause the processors to perform any of the methods described in the present disclosure.
The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “figure” and “FIG.” herein), of which:
Given the above background, the present disclosure provides systems and methods for screening for enhancer activity using a training dataset comprising training enhancer sequences and, for each respective sequence, a corresponding measured activity of the respective sequence in each of one or more in vivo states. Each sequence is a tandem n-mer repeat. A model comprising a plurality of parameters is trained by inputting each sequence into the model thereby obtaining as output from the model, for each sequence, a predicted amount of activity of the respective sequence in each of the one or more in vivo states. The procedure refines the parameters based on differentials between the corresponding measured activity and the corresponding predicted activity for the training enhancer sequences. A plurality of test enhancer sequences is generated using the trained model. An activity of each respective test enhancer sequence in the plurality of test enhancer sequences is determined in an in vivo assay thereby screening for improved enhancer activity.
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other forms of functionality are envisioned and may fall within the scope of the implementation(s). In general, structures and functionality presented as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the implementation(s).
It will also be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first dataset could be termed a second dataset, and, similarly, a second dataset could be termed a first dataset, without departing from the scope of the present invention. The first dataset and the second dataset are both datasets, but they are not the same dataset.
The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the claims. As used in the description of the implementations and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in accordance with a determination” or “in response to detecting,” that a stated condition precedent is true, depending on the context. Similarly, the phrase “if it is determined (that a stated condition precedent is true)” or “if (a stated condition precedent is true)” or “when (a stated condition precedent is true)” may be construed to mean “upon determining” or “in response to determining” or “in accordance with a determination” or “upon detecting” or “in response to detecting” that the stated condition precedent is true, depending on the context.
Furthermore, when a reference number is given an “ith” denotation, the reference number refers to a generic component, set, or embodiment. For instance, a cellular-component termed “cellular-component i” refers to the ith cellular-component in a plurality of cellular-components.
In the interest of clarity, not all of the routine features of the implementations described herein are shown and described. It will be appreciated that, in the development of any such actual implementation, numerous implementation-specific decisions are made in order to achieve the designer's specific goals, such as compliance with use case- and business-related constraints, and that these specific goals will vary from one implementation to another and from one designer to another. Moreover, it will be appreciated that such a design effort might be complex and time-consuming, but nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of the present disclosure.
Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like.
The language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention.
In general, terms used in the claims and the specification are intended to be construed as having the plain meaning understood by a person of ordinary skill in the art. Certain terms are defined below to provide additional clarity. In case of conflict between the plain meaning and the provided definitions, the provided definitions are to be used.
Any terms not directly defined herein shall be understood to have the meanings commonly associated with them as understood within the art of the invention. Certain terms are discussed herein to provide additional guidance to the practitioner in describing the compositions, devices, methods and the like of aspects of the invention, and how to make or use them. It will be appreciated that the same thing may be said in more than one way. Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein. No significance is to be placed upon whether a term is elaborated or discussed herein. Some synonyms or substitutable methods, materials and the like are provided. Recital of one or a few synonyms or equivalents does not exclude use of other synonyms or equivalents, unless it is explicitly stated. Use of examples, including examples of terms, is for illustrative purposes only and does not limit the scope and meaning of the aspects of the invention herein.
Unless defined otherwise, all technical and scientific terms used herein have the meaning commonly understood by one of ordinary skill in the art to which the invention pertains.
As used herein, the term “endogenous enhancer” refers to an enhancer that is either naturally occurring or that has at least 60 percent, at least 70 percent, at least 80 percent, at least 90 percent, or at least 99 percent sequence identity to a naturally occurring enhancer.
As used herein, the terms “abundance,” “abundance level,” or “expression level” refers to an amount of a cellular constituent (e.g., a gene product such as an RNA species, e.g., mRNA or miRNA, or a protein molecule) present in one or more cells, or an average amount of a cellular constituent present across multiple cells. When referring to mRNA or protein expression, the term generally refers to the amount of any RNA or protein species corresponding to a particular genomic locus, e.g., a particular gene. However, in some embodiments, an abundance can refer to the amount of a particular isoform of an mRNA or protein corresponding to a particular gene that gives rise to multiple mRNA or protein isoforms. The genomic locus can be identified using a gene name, a chromosomal location, or any other genetic mapping metric.
As used interchangeably herein, a “cell state” or “biological state” refers to a state or phenotype of a cell or a population of cells. For example, a cell state can be healthy or diseased. A cell state can be one of a plurality of diseases. A cell state can be a response to a compound treatment and/or a differentiated cell lineage. A cell state can be characterized by a measure (e.g., an activation, expression, and/or measure of abundance) of one or more cellular constituents, including but not limited to one or more genes, one or more proteins, and/or one or more biological pathways.
As used herein, “messenger RNA” or “mRNA” are RNA molecules comprising a sequence that encodes a polypeptide or protein. In general, RNA can be transcribed from DNA. In some cases, precursor mRNA containing non-protein coding regions in the sequence can be transcribed from DNA and then processed to remove all or a portion of the non-coding regions (introns) to produce mature mRNA. As used herein, the term “pre-mRNA” can refer to the RNA molecule transcribed from DNA before undergoing processing to remove the non-protein coding regions.
As used herein, unless otherwise dictated by context, “nucleotide” or “nt” refers to a ribonucleotide.
As used herein, the terms “patient” and “subject” are used interchangeably, and may be taken to mean any living organism which may be treated with compounds of the present invention. As such, the terms “patient” and “subject” include, but are not limited to, any non-human mammal, primate and human.
The terms “sequence reads” or “reads,” used interchangeably herein, refer to nucleotide sequences produced by any sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”), and sometimes are generated from both ends of nucleic acids (e.g., paired-end reads, double-end reads). The length of the sequence read is often associated with the particular sequencing technology. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). In some embodiments, the sequence reads are of a mean, median or average length of about 15 bp to 900 bp long (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp). In some embodiments, the sequence reads are of a mean, median or average length of about 1000 bp or more. Nanopore sequencing, for example, can provide sequence reads that vary in size from tens to hundreds to thousands of base pairs. Illumina parallel sequencing can provide sequence reads that vary to a lesser extent (e.g., where most sequence reads are of a length of about 200 bp or less). A sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides). For example, a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment.
A sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes (e.g., in hybridization arrays or capture probes) or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
As disclosed herein, the terms “sequencing,” “sequence determination,” and the like refer generally to any and all biochemical processes that may be used to determine the order of biological macromolecules such as nucleic acids or proteins. For example, sequencing data can include all or a portion of the nucleotide bases in a nucleic acid molecule such as a DNA fragment.
As used herein, the term “tissue” corresponds to a group of cells that group together as a functional unit. More than one type of cell can be found in a single tissue. Different types of tissue may consist of different types of cells (e.g., hepatocytes, alveolar cells or blood cells), but also can correspond to tissue from different organisms (mother vs. fetus) or to healthy cells vs. tumor cells. The term “tissue” can generally refer to any group of cells found in the human body (e.g., heart tissue, lung tissue, kidney tissue, nasopharyngeal tissue, oropharyngeal tissue). In some aspects, the term “tissue” or “tissue type” can be used to refer to a tissue from which a cell-free nucleic acid originates. In one example, viral nucleic acid fragments can be derived from blood tissue. In another example, viral nucleic acid fragments can be derived from tumor tissue.
As used interchangeably herein, the terms “algorithm”, “model”, “regressor”, and/or “classifier” refer to a machine learning model and/or artificial intelligence model. In some embodiments, a model makes use of supervised machine learning. Nonlimiting examples of supervised learning models include, but are not limited to, logistic regression, neural networks, support vector machines, Naive Bayes models, nearest neighbor models, random forest models, decision tree models, boosted trees models, multinomial logistic regression models, linear models, linear regression, Gradient Boosting, mixture models, Gaussian NB model, linear discriminant analysis, diffusion models, or any combinations thereof. In some embodiments, a model is a multinomial classifier. In some embodiments, a model is a 2-stage stochastic gradient descent (SGD) model. In some embodiments, a model is a deep neural network (e.g., a deep-and-wide sample-level model).
Neural networks. In some embodiments, the model is a neural network (e.g., a convolutional neural network and/or a residual neural network). Neural networks, also known as artificial neural networks (ANNs or NNs), include convolutional and/or residual neural networks (deep learning models). In some embodiments, neural networks are trained to map an input dataset to an output dataset, where the neural network includes an interconnected group of nodes organized into multiple layers of nodes. For example, in some embodiments, the neural network includes at least an input layer, one or more hidden layers, and an output layer. In some embodiments, the neural network includes any total number of layers, and any number of hidden layers, where the hidden layers function as trainable feature extractors that allow mapping of a set of input data to an output value or set of output values. In some embodiments, a deep learning model is a neural network including a plurality of hidden layers, e.g., two or more hidden layers. In some instances, each layer of the neural network includes a number of nodes (or “neurons”). In some embodiments, a node receives input that comes either directly from the input data or the output of nodes in previous layers, and performs a specific operation, e.g., a summation operation. In some embodiments, a connection from an input to a node is associated with a parameter (e.g., a weight and/or weighting factor). In some embodiments, the node sums up the products of all pairs of inputs, xi, and their associated parameters. In some embodiments, the weighted sum is offset with a bias, b. In some embodiments, the output of a node or neuron is gated using a threshold or activation function, f, which, in some instances, is a linear or non-linear function.
In some embodiments, the activation function is, for example, a rectified linear unit (ReLU) activation function, a Leaky ReLU activation function, or other function such as a saturating hyperbolic tangent, identity, binary step, logistic, arcTan, softsign, parametric rectified linear unit, exponential linear unit, softPlus, bent identity, softExponential, Sinusoid, Sine, Gaussian, or sigmoid function, or any combination thereof.
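For illustration, the node computation described above (a weighted sum of the inputs and their associated parameters, offset by a bias, and gated by an activation function such as ReLU) can be sketched as follows. The function names are illustrative only and are not part of any disclosed implementation.

```python
# Minimal sketch of a single neural-network node: the products of all
# input/parameter pairs are summed, the bias b is added, and the result
# is gated by a rectified linear unit (ReLU) activation function.
def relu(z):
    return max(0.0, z)

def node_output(inputs, weights, bias):
    # Weighted sum of inputs x_i and their associated parameters w_i.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)
```

A node with inputs [1.0, 2.0], weights [0.5, -0.25], and bias 0.1 computes a pre-activation of 0.1, which ReLU passes through unchanged; a negative pre-activation is gated to zero.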
In some embodiments, the weighting factors, bias values, and threshold values, or other computational parameters of the neural network, are “taught” or “learned” in a training phase using one or more sets of training data. For example, in some implementations, the parameters are trained using the input data from a training dataset and a gradient descent or backward propagation method so that the output value(s) that the neural network computes are consistent with the examples included in the training dataset. In some embodiments, the parameters are obtained from a back propagation neural network training process.
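The gradient descent training described above can be sketched for a single parameter; the one-parameter model and squared-error loss here are illustrative assumptions, not the disclosed training procedure.

```python
# Illustrative single gradient-descent step for one parameter w of a
# toy model y = w * x with squared-error loss L = (w*x - t)**2, where
# t is the target value from the training dataset.
def gradient_step(w, x, t, lr=0.1):
    grad = 2.0 * (w * x - t) * x  # dL/dw by the chain rule
    return w - lr * grad          # move w against the gradient
```

Starting from w = 0 with x = 1 and target t = 1, one step with learning rate 0.1 moves w to 0.2, reducing the loss, which is the sense in which the computed outputs become consistent with the training examples.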
Any of a variety of neural networks are suitable for use in accordance with the present disclosure. Examples include, but are not limited to, feedforward neural networks, radial basis function networks, recurrent neural networks, residual neural networks, convolutional neural networks, residual convolutional neural networks, and the like, or any combination thereof. In some embodiments, the machine learning makes use of a pre-trained and/or transfer-learned neural network or deep learning architecture. In some implementations, convolutional and/or residual neural networks are used, in accordance with the present disclosure.
For instance, a deep neural network model includes an input layer, a plurality of individually parameterized (e.g., weighted) convolutional layers, and an output scorer. The parameters (e.g., weights) of each of the convolutional layers as well as the input layer contribute to the plurality of parameters (e.g., weights) associated with the deep neural network model. In some embodiments, at least 50 parameters, at least 100 parameters, at least 1000 parameters, at least 2000 parameters or at least 5000 parameters are associated with the deep neural network model. As such, a deep neural network model requires a computer to be used because it cannot be mentally solved. In other words, given an input to the model, the model output needs to be determined using a computer rather than mentally in such embodiments. See, for example, Krizhevsky et al., 2012, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 2, Pereira, Burges, Bottou, Weinberger, eds., pp. 1097-1105, Curran Associates, Inc.; Zeiler, 2012 “ADADELTA: an adaptive learning rate method,” CoRR, vol. abs/1212.5701; and Rumelhart et al., 1988, “Neurocomputing: Foundations of research,” ch. Learning Representations by Back-propagating Errors, pp. 696-699, Cambridge, MA, USA: MIT Press, each of which is hereby incorporated by reference.
Neural network models, including convolutional neural networks, suitable for use as models are disclosed in, for example, Vincent et al., 2010, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J Mach Learn Res 11, pp. 3371-3408; Larochelle et al., 2009, “Exploring strategies for training deep neural networks,” J Mach Learn Res 10, pp. 1-40; and Hassoun, 1995, Fundamentals of Artificial Neural Networks, Massachusetts Institute of Technology, each of which is hereby incorporated by reference. Additional example neural networks suitable for use as models are disclosed in Duda et al., 2001, Pattern Classification, Second Edition, John Wiley & Sons, Inc., New York; and Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York, each of which is hereby incorporated by reference in its entirety. Additional example neural networks suitable for use as models in the present disclosure are described in Draghici, 2003, Data Analysis Tools for DNA Microarrays, Chapman & Hall/CRC; and Mount, 2001, Bioinformatics: sequence and genome analysis, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, each of which is hereby incorporated by reference in its entirety.
Support vector machines. In some embodiments, the model is a support vector machine (SVM). SVMs suitable for use as models are described in, for example, Cristianini and Shawe-Taylor, 2000, “An Introduction to Support Vector Machines,” Cambridge University Press, Cambridge; Boser et al., 1992, “A training algorithm for optimal margin classifiers,” in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, Pittsburgh, Pa., pp. 142-152; Vapnik, 1998, Statistical Learning Theory, Wiley, New York; Mount, 2001, Bioinformatics: sequence and genome analysis, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.; Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc., pp. 259, 262-265; Hastie, 2001, The Elements of Statistical Learning, Springer, New York; and Furey et al., 2000, Bioinformatics 16, 906-914, each of which is hereby incorporated by reference in its entirety. When used for classification, SVMs separate a given set of binary labeled data with a hyper-plane that is maximally distant from the labeled data. For certain cases in which no linear separation is possible, SVMs work in combination with the technique of ‘kernels,’ which automatically realizes a non-linear mapping to a feature space. The hyper-plane found by the SVM in feature space corresponds, in some instances, to a non-linear decision boundary in the input space. In some embodiments, the plurality of parameters (e.g., weights) associated with the SVM define the hyper-plane. In some embodiments, the hyper-plane is defined by at least 10, at least 20, at least 50, or at least 100 parameters and the SVM model requires a computer to calculate because it cannot be mentally solved.
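For illustration, classification with an already-trained linear SVM can be sketched as follows: the hyper-plane is defined by parameters (weights w and an offset b), and a point is labeled according to which side of the hyper-plane it falls on. The function name and the two-dimensional example are illustrative assumptions.

```python
# Sketch of prediction with a trained linear SVM: the learned parameters
# (w, b) define the separating hyper-plane w.x + b = 0; a point is
# classified by the sign of its signed distance from that hyper-plane.
def svm_predict(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

With w = [1, 0] and b = -1, the hyper-plane is the vertical line x1 = 1; points to its right receive label +1 and points to its left receive label -1.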
Naïve Bayes models. In some embodiments, the model is a Naive Bayes model. Naïve Bayes models suitable for use as models are disclosed, for example, in Ng et al., 2002, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” Advances in Neural Information Processing Systems, 14, which is hereby incorporated by reference. A Naive Bayes model is any model in a family of “probabilistic models” based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. In some embodiments, they are coupled with Kernel density estimation. See, for example, Hastie et al., 2001, The elements of statistical learning: data mining, inference, and prediction, eds. Tibshirani and Friedman, Springer, New York, which is hereby incorporated by reference.
Nearest neighbor models. In some embodiments, a model is a nearest neighbor model. In some implementations, nearest neighbor models are memory-based and include no model to be fit. For nearest neighbors, given a query point x0 (a test subject), the k training points x(r), r=1, . . . , k (here the training subjects) closest in distance to x0 are identified and then the point x0 is classified using the k nearest neighbors. In some embodiments, Euclidean distance in feature space is used to determine distance as d(i)=∥x(i)−x(0)∥. Typically, when the nearest neighbor model is used, the abundance data used to compute the linear discriminant is standardized to have mean zero and variance 1. In some embodiments, the nearest neighbor rule is refined to address issues of unequal class priors, differential misclassification costs, and feature selection. Many of these refinements involve some form of weighted voting for the neighbors. For more information on nearest neighbor analysis, see Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc.; and Hastie, 2001, The Elements of Statistical Learning, Springer, New York, each of which is hereby incorporated by reference.
A k-nearest neighbor model is a non-parametric machine learning method in which the input consists of the k closest training examples in feature space. The output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k=1, then the object is simply assigned to the class of that single nearest neighbor. See, Duda et al., 2001, Pattern Classification, Second Edition, John Wiley & Sons, which is hereby incorporated by reference. In some embodiments, the number of distance calculations needed to solve the k-nearest neighbor model is such that a computer is used to solve the model for a given input because it cannot be mentally performed.
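The k-nearest neighbor classification described above (Euclidean distances d(i)=∥x(i)−x0∥ to the query point, followed by a plurality vote among the k closest training points) can be sketched as follows; the function name is illustrative.

```python
import math
from collections import Counter

# Sketch of k-nearest-neighbor classification: compute the Euclidean
# distance from the query point x0 to every training point, take the
# k closest, and assign x0 the class most common among those neighbors.
def knn_classify(x0, training_points, labels, k):
    dists = [math.dist(x0, x) for x in training_points]
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

With k = 1 the query is simply assigned the class of its single nearest training point, as stated above.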
Random forest, decision tree, and boosted tree models. In some embodiments, the model is a decision tree. Decision trees suitable for use as models are described generally by Duda, 2001, Pattern Classification, John Wiley & Sons, Inc., New York, pp. 395-396, which is hereby incorporated by reference. Tree-based methods partition the feature space into a set of rectangles, and then fit a model (like a constant) in each one. In some embodiments, the decision tree is random forest regression. For example, one specific model is a classification and regression tree (CART). Other specific decision tree models include, but are not limited to, ID3, C4.5, MART, and Random Forests. CART, ID3, and C4.5 are described in Duda, 2001, Pattern Classification, John Wiley & Sons, Inc., New York, pp. 396-408 and pp. 411-412, which is hereby incorporated by reference. CART, MART, and C4.5 are described in Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York, Chapter 9, which is hereby incorporated by reference in its entirety. Random Forests are described in Breiman, 1999, “Random Forests—Random Features,” Technical Report 567, Statistics Department, U.C. Berkeley, September 1999, which is hereby incorporated by reference in its entirety. In some embodiments, the decision tree model includes at least 10, at least 20, at least 50, or at least 100 parameters (e.g., weights and/or decisions) and requires a computer to calculate because it cannot be mentally solved.
Regression. In some embodiments, the model uses any type of regression. For example, in some embodiments, the regression is logistic regression. In some embodiments, the regression is logistic regression with lasso, L2 or elastic net regularization. In some embodiments, those extracted features that have a corresponding regression coefficient that fails to satisfy a threshold value are pruned (removed) from consideration. In some embodiments, a generalization of the logistic regression model that handles multicategory responses is used as the model. Logistic regression is disclosed in Agresti, An Introduction to Categorical Data Analysis, 1996, Chapter 5, pp. 103-144, John Wiley & Sons, New York, which is hereby incorporated by reference. In some embodiments, the model makes use of a regression model disclosed in Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York. In some embodiments, the logistic regression model includes at least 10, at least 20, at least 50, at least 100, or at least 1000 parameters (e.g., weights) and requires a computer to calculate because it cannot be mentally solved.
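For illustration, a logistic-regression prediction and the coefficient-threshold feature pruning described above can be sketched as follows; the function names and the magnitude-based threshold are illustrative assumptions.

```python
import math

# Sketch of a logistic-regression prediction: the probability is the
# sigmoid of a linear combination of the features and their coefficients.
def logistic_predict(x, coeffs, intercept):
    z = intercept + sum(c * xi for c, xi in zip(coeffs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Sketch of pruning: keep only features whose regression coefficient
# magnitude satisfies the threshold; the rest are removed from
# consideration.
def prune_features(coeffs, threshold):
    return [i for i, c in enumerate(coeffs) if abs(c) >= threshold]
```

A feature whose coefficient is near zero contributes little to the linear combination, which is the intuition behind pruning it.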
Linear discriminant analysis. In some embodiments, the model is linear discriminant analysis (LDA), normal discriminant analysis (NDA), or a discriminant function. These are a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination is used as the model (a linear model) in some embodiments of the present disclosure.
Ensembles of models and boosting. In some embodiments, an ensemble (two or more) of models is used. In some embodiments, a boosting technique such as AdaBoost is used in conjunction with many other types of models to improve the performance of the model. In this approach, the output of any of the models disclosed herein, or their equivalents, is combined into a weighted sum that represents the final output of the boosted model. In some embodiments, the plurality of outputs from the models is combined using any measure of central tendency known in the art, including but not limited to a mean, median, mode, a weighted mean, weighted median, weighted mode, etc. In some embodiments, the plurality of outputs is combined using a voting method. In some embodiments, a respective model in the ensemble of models is weighted or unweighted.
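The ensemble combination described above (a weighted sum of the individual model outputs, or a measure of central tendency such as the median) can be sketched as follows; the function names are illustrative.

```python
from statistics import median

# Sketch of combining an ensemble's outputs into a weighted sum that
# represents the final output of the boosted model.
def weighted_sum(outputs, weights):
    total = sum(weights)
    return sum(o * w for o, w in zip(outputs, weights)) / total

# Sketch of combining outputs by a measure of central tendency (median).
def combine_by_median(outputs):
    return median(outputs)
```

An unweighted combination is the special case in which all weights are equal.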
As used herein, the term “parameter” refers to any coefficient or, similarly, any value of an internal or external element (e.g., a weight and/or a hyperparameter) in an algorithm, model, regressor, and/or classifier that can affect (e.g., modify, tailor, and/or adjust) one or more inputs, outputs, and/or functions in the algorithm, model, regressor and/or classifier. For example, in some embodiments, a parameter refers to any coefficient, weight, and/or hyperparameter that can be used to control, modify, tailor, and/or adjust the behavior, learning, and/or performance of an algorithm, model, regressor, and/or classifier. In some instances, a parameter is used to increase or decrease the influence of an input (e.g., a feature) to an algorithm, model, regressor, and/or classifier. As a nonlimiting example, in some embodiments, a parameter is used to increase or decrease the influence of a node (e.g., of a neural network), where the node includes one or more activation functions. Assignment of parameters to specific inputs, outputs, and/or functions is not limited to any one paradigm for a given algorithm, model, regressor, and/or classifier but can be used in any suitable algorithm, model, regressor, and/or classifier architecture for a desired performance. In some embodiments, a parameter has a fixed value. In some embodiments, a value of a parameter is manually and/or automatically adjustable. In some embodiments, a value of a parameter is modified by a validation and/or training process for an algorithm, model, regressor, and/or classifier (e.g., by error minimization and/or backpropagation methods). In some embodiments, an algorithm, model, regressor, and/or classifier of the present disclosure includes a plurality of parameters. 
In some embodiments, the plurality of parameters is n parameters, where: n≥2; n≥5; n≥10; n≥25; n≥40; n≥50; n≥75; n≥100; n≥125; n≥150; n≥200; n≥225; n≥250; n≥350; n≥500; n≥600; n≥750; n≥1,000; n≥2,000; n≥4,000; n≥5,000; n≥7,500; n≥10,000; n≥20,000; n≥40,000; n≥75,000; n≥100,000; n≥200,000; n≥500,000; n≥1×10^6; n≥5×10^6; or n≥1×10^7. As such, the algorithms, models, regressors, and/or classifiers of the present disclosure cannot be mentally performed. In some embodiments, n is between 10,000 and 1×10^7, between 100,000 and 5×10^6, or between 500,000 and 1×10^6. In some embodiments, the algorithms, models, regressors, and/or classifiers of the present disclosure operate in a k-dimensional space, where k is a positive integer of 5 or greater (e.g., 5, 6, 7, 8, 9, 10, etc.). As such, the algorithms, models, regressors, and/or classifiers of the present disclosure cannot be mentally performed.
As used herein, the term “untrained model” refers to a model that has not been trained on a target dataset. In some embodiments, “training a model” (e.g., “training a neural network”) refers to the process of training an untrained or partially trained model (e.g., “an untrained or partially trained neural network”). Moreover, it will be appreciated that the term “untrained model” does not exclude the possibility that transfer learning techniques are used in such training of the untrained or partially trained model. For instance, Fernandes et al., 2017, “Transfer Learning with Partial Observability Applied to Cervical Cancer Screening,” Pattern Recognition and Image Analysis: 8th Iberian Conference Proceedings, 243-250, which is hereby incorporated by reference, provides non-limiting examples of such transfer learning. In instances where transfer learning is used, the untrained model described above is provided with additional data over and beyond that of the primary training dataset. Typically, this additional data is in the form of parameters (e.g., coefficients, weights, and/or hyperparameters) that were learned from another, auxiliary training dataset. Moreover, while a description of a single auxiliary training dataset has been disclosed, it will be appreciated that there is no limit on the number of auxiliary training datasets that can be used to complement the primary training dataset in training the untrained model in the present disclosure. For instance, in some embodiments, two or more auxiliary training datasets, three or more auxiliary training datasets, four or more auxiliary training datasets or five or more auxiliary training datasets are used to complement the primary training dataset through transfer learning, where each such auxiliary dataset is different than the primary training dataset. Any manner of transfer learning is used, in some such embodiments. 
For instance, consider the case where there is a first auxiliary training dataset and a second auxiliary training dataset in addition to the primary training dataset. In such a case, the parameters learned from the first auxiliary training dataset (by application of a first model to the first auxiliary training dataset) are applied to the second auxiliary training dataset using transfer learning techniques (e.g., a second model that is the same or different from the first model), which in turn results in a trained intermediate model whose parameters are then applied to the primary training dataset and this, in conjunction with the primary training dataset itself, is applied to the untrained model. Alternatively, in another example embodiment, a first set of parameters learned from the first auxiliary training dataset (by application of a first model to the first auxiliary training dataset) and a second set of parameters learned from the second auxiliary training dataset (by application of a second model that is the same or different from the first model to the second auxiliary training dataset) are each individually applied to a separate instance of the primary training dataset (e.g., by separate independent matrix multiplications) and both such applications of the parameters to separate instances of the primary training dataset in conjunction with the primary training dataset itself (or some reduced form of the primary training dataset such as principal components or regression coefficients learned from the primary training set) are then applied to the untrained model in order to train the untrained model.
Now that an overview of some aspects of the present disclosure and some definitions used in the present disclosure have been provided, details of an exemplary system are described in conjunction with
In typical embodiments, computer system 100 comprises one or more computers. For purposes of illustration in
Turning to
The memory 92 of the computer system 100 stores:
In some embodiments, one or more of the above identified data elements or modules of the computer system 100 are stored in one or more of the previously mentioned memory devices, and correspond to a set of instructions for performing a function described above. The above identified data, modules or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, the memory 92 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments the memory 92 stores additional modules and data structures not described above.
Referring to block 200 of
Referring to block 202, a first training dataset 102 is obtained in electronic form. The first training dataset comprises a first plurality of training enhancer sequences and, for each respective training enhancer sequence 108 in the first plurality of training enhancer sequences, a corresponding measured amount of activity 110 of the respective training enhancer sequence in each of one or more in vivo states.
In some embodiments, each training enhancer sequence in the first plurality of training enhancer sequences is a tandem n-mer repeat sequence. In some embodiments, at least 10 percent, at least 20 percent, at least 30 percent, at least 40 percent, at least 50 percent, at least 60 percent, at least 70 percent, at least 80 percent, or at least 90 percent of the training enhancer sequences in the first plurality of training enhancer sequences are tandem n-mer repeat sequences, while the remaining training enhancer sequences in the first plurality of training enhancer sequences are not tandem n-mer repeat sequences.
The tandem n-mer sampling, illustrated in
Regardless of which model best describes the training enhancer sequences, the sampling of
Referring to block 204, in some embodiments, the first plurality of training enhancer sequences comprises 10,000 or more training enhancer sequences, 20,000 or more training enhancer sequences, 30,000 or more training enhancer sequences, 40,000 or more training enhancer sequences, 50,000 or more training enhancer sequences, 60,000 or more training enhancer sequences, 70,000 or more training enhancer sequences, 80,000 or more training enhancer sequences, 90,000 or more training enhancer sequences, or 100,000 or more training enhancer sequences.
Referring to block 206, in some embodiments, each training enhancer sequence 108 in the first plurality of training enhancer sequences comprises 5 to 20 contiguous repeats of a different n-mer, where n is a positive integer between 4 and 20.
Referring to block 208, in some embodiments, each training enhancer sequence 108 in the first plurality of training enhancer sequences comprises 5 to 20 contiguous repeats of a different 10-mer. Example 1 provides justification for the use of a 10-mer for the tandem repeat training enhancer sequences. In some embodiments, each training enhancer sequence 108 in the first plurality of training enhancer sequences comprises 5 to 20 contiguous repeats of a different 7-mer, 8-mer, 9-mer, 10-mer, 11-mer, 12-mer, 13-mer, 14-mer, or 15-mer.
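For illustration, construction of a tandem n-mer training enhancer sequence as described in blocks 206-208 can be sketched as follows; the specific 10-mer used here is an arbitrary example, not a sequence from the disclosure or the Sequence Listing.

```python
# Sketch of building a tandem n-mer repeat training enhancer sequence:
# a chosen n-mer (here a 10-mer) is repeated contiguously between 5 and
# 20 times, per blocks 206-208.
def tandem_repeat(n_mer, repeats):
    assert 5 <= repeats <= 20, "blocks 206-208 describe 5 to 20 repeats"
    return n_mer * repeats

# Example: 12 contiguous repeats of an arbitrary 10-mer (120 residues).
seq = tandem_repeat("ACGTTGCAAT", 12)
```

Twelve repeats of a 10-mer yield a 120-residue sequence, within the 100- to 1000-residue ranges recited in block 210.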
Referring to block 210, in some embodiments, each training enhancer sequence 108 in the first plurality of training enhancer sequences consists of between 15 residues and 200 residues. In some embodiments, each training enhancer sequence 108 in the first plurality of training enhancer sequences consists of between 30 residues and 300 residues. In some embodiments, each training enhancer sequence 108 in the first plurality of training enhancer sequences consists of between 40 residues and 500 residues. In some embodiments, each training enhancer sequence 108 in the first plurality of training enhancer sequences consists of between 50 residues and 500 residues, 70 residues and 500 residues, 90 residues and 500 residues, 110 residues and 500 residues, or 130 residues and 500 residues. In some embodiments, each training enhancer sequence 108 in the first plurality of training enhancer sequences consists of between 100 residues and 1000 residues, 120 residues and 1000 residues, 130 residues and 1000 residues, 140 residues and 1000 residues, or 150 residues and 1000 residues.
Referring to block 212, in some embodiments, each respective training enhancer sequence 108 in the first plurality of training enhancer sequences is operably connected to a common promoter and a common payload. Each respective in vivo state in the one or more in vivo states is an amount of expression of the common payload in a corresponding tissue.
Referring to block 214, in some embodiments, the tissue represented by an in vivo state in the one or more in vivo states is brain tissue. Referring to block 216, in some embodiments, the tissue represented by an in vivo state in the one or more in vivo states is blood, brain, colon, diaphragm, gastrocnemius muscle, heart, inguinal adipose, kidney, lung, lymph node, skin, small intestine, spinal cord, spleen, trachea, liver, stomach, large intestine, pancreas, gallbladder, bladder, eyes, thymus, or adrenal gland.
Referring to block 218, in some embodiments, each respective training enhancer sequence 108 in the first plurality of training enhancer sequences is operably connected to a common promoter and a common payload under a particular transcription factor. Each respective in vivo state in the one or more in vivo states is an amount of expression of the common payload in a corresponding cell line in a corresponding one or more different cell lines. Each such cell line in the one or more cell lines originates from a particular tissue in a corresponding one or more tissues.
Referring to block 220, in some embodiments, one such cell line is mouse primary neurons, HepG2, H4, or K562. In some embodiments, the transcription factor is FOSB::JUNB, FOSL2::JUNB, FOSL1::JUNB, FOS::JUN, FOSL1::JUN, BATF3, BNC2, ZBTB12, HSF4, HSF1, HSF2, REL, RELA, TEAD3, TEAD1, TP53, TP63, TP73, SMAD5, SMAD2, ZNF416, GRHL2, TFCP2, MLX, ARNT::HIF1A, or HIF1A.
Referring to block 222, in some embodiments, the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of activity (e.g., transcriptional activity) of the respective training enhancer sequence 108 in each of the one or more in vivo states comprises a log fold change between (i) an activity of the respective training enhancer sequence measured in the first in vivo state and (ii) an activity of the respective training enhancer sequence measured in the second in vivo state.
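For illustration, the log fold change of block 222 can be sketched as follows; the base-2 logarithm and the pseudocount guarding against zero activity are illustrative assumptions, not requirements of the disclosure.

```python
import math

# Sketch of the log fold change between (i) the activity measured in a
# first in vivo state and (ii) the activity measured in a second in vivo
# state. A pseudocount is added as an illustrative guard against zero
# measured activity.
def log_fold_change(activity_state1, activity_state2, pseudocount=1.0):
    return math.log2((activity_state1 + pseudocount) /
                     (activity_state2 + pseudocount))
```

A positive value indicates more activity in the first in vivo state; a negative value indicates more activity in the second.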
Referring to block 224, in some embodiments, the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of activity of the respective training enhancer sequence in each of the one or more in vivo states comprises a binary indication of measured selectivity between the first in vivo state and the second in vivo state. Thus, if more activity is observed in the first in vivo state, the binary indication of measured selectivity has a first value (e.g., “1”), and if more activity is observed in the second in vivo state, the binary indication of measured selectivity has a second value (e.g., “0”).
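The binary indication of measured selectivity of block 224 can be sketched as follows; the tie-handling convention (a tie mapping to the second value) is an illustrative assumption.

```python
# Sketch of block 224: a binary indication of measured selectivity that
# takes a first value ("1") if more activity is observed in the first
# in vivo state and a second value ("0") otherwise.
def selectivity_indicator(activity_state1, activity_state2):
    return 1 if activity_state1 > activity_state2 else 0
```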
Referring to block 226, in some embodiments, the corresponding measured amount of activity of the respective training enhancer sequence in each of the one or more in vivo states comprises a corresponding measured abundance of the respective training enhancer sequence in each of the in vivo states. In some such embodiments, the corresponding measured abundance of the respective training enhancer sequence is determined by measuring transcription of a barcoded payload associated with the respective training enhancer sequence in each of the one or more in vivo states in accordance with the construct illustrated in
Referring to block 228, in some embodiments, the corresponding measured abundance of the respective training enhancer sequence 108 in an in vivo state in the one or more in vivo states is a corresponding averaged or summed count of a number of barcoded instances of the respective training enhancer sequence found in a corresponding plurality of sequence reads of RNA harvested from each replica in a plurality of replicas of a plurality of cells, representative of the in vivo state exposed to the respective training enhancer sequence. In some embodiments, these barcoded instances of the respective training enhancer sequence are determined by measuring a transcription of a barcoded payload associated with the respective training enhancer sequence in the in vivo state in accordance with the construct illustrated in
Referring to block 230, in some embodiments, the corresponding averaged or summed count for the given training enhancer sequence is log normalized by an amount of the respective training enhancer sequence exposed to (e.g., by transfection, by adeno-associated virus (AAV) vector-mediated gene delivery, etc.) each replica of the plurality of cells (for a particular in vivo state). This is further illustrated in
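By way of non-limiting illustration, the averaging and log normalization of block 230 can be sketched as follows; the function name, pseudocount, and example counts are illustrative assumptions and not part of the disclosed assay:

```python
import math

def normalized_activity(replica_barcode_counts, input_amount, pseudocount=1.0):
    """Average the summed barcode read counts across replicas, then
    log-normalize by the amount of enhancer sequence delivered to each
    replica (e.g., by transfection or AAV vector-mediated delivery)."""
    per_replica = [sum(counts.values()) for counts in replica_barcode_counts]
    averaged = sum(per_replica) / len(per_replica)
    # Log-normalize output RNA counts by the input amount.
    return math.log2((averaged + pseudocount) / (input_amount + pseudocount))

# Two replicas of cells, each sequenced for three barcodes uniquely
# assigned to one training enhancer sequence.
replicas = [{"ACGTA": 120, "TTGCA": 95, "GGATC": 110},
            {"ACGTA": 130, "TTGCA": 90, "GGATC": 105}]
activity = normalized_activity(replicas, input_amount=100.0)
```

A positive value indicates more payload RNA was recovered than enhancer DNA was delivered, on a log2 scale.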
Referring to block 232, in some embodiments, each respective training enhancer sequence 108 in the plurality of respective training enhancer sequences is uniquely assigned a corresponding plurality of barcodes comprising 5 or more, 10 or more, 20 or more, 50 or more, 75 or more, or 100 or more barcodes. Each replica of the plurality of cells representative of the in vivo state is exposed to (e.g., by transfection, by adeno-associated virus (AAV) vector-mediated gene delivery, etc.), for each respective barcode uniquely assigned to the respective training enhancer sequence, one or more copies of the respective training enhancer sequence tagged with the respective barcode.
Referring to block 234, in some embodiments, each barcode in the corresponding plurality of barcodes encodes a unique predetermined value selected from the set {1, . . . , 1024}, {1, . . . , 4096}, {1, . . . , 16384}, {1, . . . , 65536}, {1, . . . , 262144}, {1, . . . , 1048576}, {1, . . . , 4194304}, {1, . . . , 16777216}, {1, . . . , 67108864}, or {1, . . . , 1×1012}.
Referring to block 236, in some embodiments, the corresponding plurality of sequence reads of RNA harvested from each replica in the plurality of replicas of a plurality of cells comprises at least 1000, at least 2000, at least 3000, at least 4000, at least 5000, at least 6000, at least 7000, at least 8000, at least 9000, at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1 million, at least 2 million, at least 3 million, at least 4 million, at least 5 million, at least 6 million, at least 7 million, at least 8 million, at least 9 million, or more sequence reads.
For instance, consider the case in which the barcode is represented by a set of five nucleotide positions. In this instance, each nucleotide position contributes four possibilities (A, T, C or G), giving rise, when all five positions are considered, to 4×4×4×4×4=1024 possibilities. As such, the five nucleotide positions form the basis of the set {1, . . . , 1024}. Thus, when the barcode sequence is a 5-mer, the barcode encodes a unique predetermined value selected from the set {1, . . . , 1024}. Likewise, when the barcode sequence is represented by a set of six nucleotide positions, the six nucleotide positions collectively contribute 4×4×4×4×4×4=4096 possibilities. As such, the six nucleotide positions form the basis of the set {1, . . . , 4096}. In other words, when the barcode sequence is a 6-mer, the barcode encodes a unique predetermined value selected from the set {1, . . . , 4096}.
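The relationship between barcode length and the sets recited in block 234 can be verified computationally; the function name is illustrative:

```python
def barcode_space_size(n):
    """Number of distinct barcodes encodable by n nucleotide positions,
    each position contributing four possibilities (A, T, C, or G)."""
    return 4 ** n

# A 5-mer barcode forms the basis of the set {1, ..., 1024}; a 6-mer,
# the set {1, ..., 4096}; and so on up the sets recited in block 234.
sizes = {n: barcode_space_size(n) for n in range(4, 14)}
```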
In some embodiments, the barcode is a contiguous set of nucleotides. In one such exemplary embodiment, the contiguous set of nucleotides is an n-mer, where n is an integer selected from the set {4, . . . , 20}. In other words, in some embodiments, the barcode is a contiguous set of nucleotide positions (e.g., 4 contiguous nucleotide positions, 5 contiguous nucleotide positions, 6 contiguous nucleotide positions, 7 contiguous nucleotide positions, 8 contiguous nucleotide positions, 9 contiguous nucleotide positions, 10 contiguous nucleotide positions, 11 contiguous nucleotide positions, 12 contiguous nucleotide positions, 13 contiguous nucleotide positions, 14 contiguous nucleotide positions, 15 contiguous nucleotide positions, 16 contiguous nucleotide positions, 17 contiguous nucleotide positions, 18 contiguous nucleotide positions, 19 contiguous nucleotide positions, or 20 contiguous nucleotide positions).
Referring to block 238, in some embodiments, the corresponding plurality of replicas of the plurality of cells comprises 2, 3, 4, 5, 6, 7, 8, 9, or 10 replicas.
Referring to block 240, in some embodiments, the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of activity of the respective training enhancer sequence in each of the one or more in vivo states comprises a scalar indication of measured selectivity between the first in vivo state and the second in vivo state. Thus, if 2.5 times more activity is observed in the first in vivo state than in the second in vivo state, the scalar indication has a value of 2.5. Likewise, if half as much activity is observed in the first in vivo state as in the second in vivo state, the scalar indication has a value of 0.5.
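The binary indication of block 224 and the scalar indication of block 240 can be sketched as follows (function names are illustrative):

```python
def binary_selectivity(activity_first, activity_second):
    """Binary indication per block 224: a first value ("1") if more
    activity is observed in the first in vivo state, else a second
    value ("0")."""
    return 1 if activity_first > activity_second else 0

def scalar_selectivity(activity_first, activity_second):
    """Scalar indication per block 240: the fold-ratio of activity in
    the first in vivo state to activity in the second in vivo state."""
    return activity_first / activity_second
```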
Referring to block 242, in some embodiments, a model 114 is trained. The model 114 comprises a plurality of parameters 116. The model is trained by a training procedure. In the training procedure, a sequence of each respective training enhancer sequence 108 in the first plurality of training enhancer sequences is inputted into the model 114. The model 114 applies the plurality of parameters 116 to the sequence of each respective training enhancer sequence to generate, as output from the model, for each respective training enhancer sequence in the first plurality of training enhancer sequences, a corresponding predicted amount of activity of the respective training enhancer sequence 108 in an in vivo state in the plurality of in vivo states. In some embodiments, there is a different model for each in vivo state in the plurality of in vivo states. Further in the training procedure, the plurality of parameters 116 are refined based on a differential between the corresponding measured amount of activity and the corresponding predicted amount of activity for each respective training enhancer sequence in the first plurality of training enhancer sequences of panel 602 of
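The inputting-and-refining loop of block 242 can be illustrated with a minimal sketch in which a simple linear model stands in for model 114; the one-hot encoding, learning rate, and toy activity values are assumptions for illustration only:

```python
def one_hot(seq):
    """Encode an enhancer sequence as a flat one-hot vector."""
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    vec = [0.0] * (4 * len(seq))
    for i, base in enumerate(seq):
        vec[4 * i + table[base]] = 1.0
    return vec

def train(sequences, measured, epochs=200, lr=0.05):
    """Refine a plurality of parameters (here, a linear model's weights)
    based on the differential between measured and predicted activity.
    Assumes all sequences have equal length."""
    n = 4 * len(sequences[0])
    params = [0.0] * n
    for _ in range(epochs):
        for seq, y in zip(sequences, measured):
            x = one_hot(seq)
            pred = sum(w * xi for w, xi in zip(params, x))
            diff = pred - y          # differential: predicted vs. measured
            for j in range(n):       # gradient step on squared error
                params[j] -= lr * diff * x[j]
    return params

seqs = ["ACGT", "TTAA", "GGCC", "ACGG"]
acts = [2.0, 0.5, 1.0, 1.8]
params = train(seqs, acts)
preds = [sum(w * xi for w, xi in zip(params, one_hot(s))) for s in seqs]
```

After training, the predicted activities closely track the measured activities, which is the convergence behavior the refining step of block 242 seeks.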
Next, the trained model 114 was asked to predict the H4/HepG2 activity of each enhancer in a library of approximately 12,000 naturally occurring enhancers delivered to replicas of H4 cells and HepG2 cells using an adeno-associated virus (AAV) vector. The model 114 was not trained on the measurement of the payload expression of these deliveries. All that the model 114 received was the sequences of the naturally occurring enhancers.
Bar 1502 of
Bar 1506 of
A model 114, in some embodiments, is associated with an objective function, which generates a value that describes the objective goal of the training process. For example, in some embodiments, the training intends to reduce the error rate of the model 114 in generating a prediction of the activity of an enhancer. In such a case, the objective function monitors the error rate of the machine learning model. Such an objective function, in some embodiments, is called a loss function. In some embodiments, other forms of objective functions are also used. In some embodiments, the loss function determines the difference between the model output (predicted activity) and the measured activity, and the gradient of the loss with respect to the model parameters is calculated by backpropagation and used to update the model parameters. In various embodiments, the error rate is measured as cross-entropy loss, L1 loss (e.g., the sum of absolute differences between the predicted values and the actual values), or L2 loss (e.g., the sum of squared distances between the predicted values and the actual values).
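For illustration, the L1, L2, and cross-entropy error measures can be sketched as follows (function names are illustrative):

```python
import math

def l1_loss(predicted, measured):
    """Sum of absolute differences between predicted and measured values."""
    return sum(abs(p - m) for p, m in zip(predicted, measured))

def l2_loss(predicted, measured):
    """Sum of squared distances between predicted and measured values."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured))

def cross_entropy_loss(predicted_probs, labels):
    """Mean binary cross-entropy, e.g., for probabilistic predictions of
    the binary selectivity indication of block 224."""
    eps = 1e-12  # guards against log(0)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(predicted_probs, labels)) / len(labels)

l1 = l1_loss([1.0, 2.0], [0.5, 2.5])
l2 = l2_loss([1.0, 2.0], [0.5, 2.5])
ce = cross_entropy_loss([0.9, 0.1], [1, 0])
```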
Referring to block 244, in some embodiments, the model 114 is a support vector machine. Referring to block 246, in some such embodiments, the support vector machine comprises a gapped k-mer kernel. Non-limiting examples of support vector machines are provided in the definitions section above.
Referring to block 248, in some embodiments, the model 114 is a convolutional neural network (CNN). In some embodiments, the convolutional neural network includes different kinds of layers, such as convolutional layers, pooling layers, recurrent layers, fully connected layers, and custom layers. A convolutional layer convolves the input of the layer (e.g., a nucleic acid sequence) with one or more weight kernels, each of which filters the input to generate a feature sequence. Each convolution result, in some embodiments, is associated with an activation function. A convolutional layer, in some embodiments, is followed by a pooling layer that selects the maximum value (max pooling) or average value (average pooling) from the portion of the input covered by the kernel size. The pooling layer reduces the spatial size of the extracted features. Optionally, in some embodiments, one or more convolutional layers and pooling layers are followed by an optional recurrent layer that includes one or more feedback loops. The feedback, in some embodiments, is used to account for spatial relationships of the features in an image or temporal relationships in sequences. In some embodiments, the above-described layers are followed by multiple fully connected layers that have nodes connected to each other. The fully connected layers are, in some embodiments, used for classification and regression. In various embodiments, a CNN includes one or more convolutional layers but does not include any pooling layer or recurrent layer. In various embodiments, a CNN includes one or more convolutional layers, one or more pooling layers, one or more recurrent layers, or any combination thereof. If a pooling layer is present, not all convolutional layers are always followed by a pooling layer. A recurrent layer, in some embodiments, is also positioned at other locations of the CNN. For each convolutional layer, the sizes of the kernels (e.g., 1×3, 1×5, 1×7, etc.) and the number of kernels allowed to be learned, in some embodiments, are different from those of other convolutional layers.
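The convolution and pooling operations described above can be sketched as follows; the motif kernel and sequences are illustrative, and a real model 114 would learn its kernels during training:

```python
def conv1d(one_hot_seq, kernel):
    """Convolve a one-hot encoded sequence (list of 4-element columns)
    with a kernel (list of 4-element weight columns) to produce a
    feature sequence; acts as a position-independent motif detector."""
    k = len(kernel)
    out = []
    for i in range(len(one_hot_seq) - k + 1):
        score = sum(one_hot_seq[i + j][b] * kernel[j][b]
                    for j in range(k) for b in range(4))
        out.append(max(score, 0.0))  # ReLU activation on each result
    return out

def max_pool(features, size):
    """Select the maximum value from each window of the feature
    sequence, reducing its spatial size."""
    return [max(features[i:i + size]) for i in range(0, len(features), size)]

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}
def encode(seq):
    return [[1.0 if BASE[b] == i else 0.0 for i in range(4)] for b in seq]

# A kernel matching the motif "TGA" fires wherever the motif occurs,
# independent of its position within the sequence.
kernel = encode("TGA")
features = conv1d(encode("AATGACCTGA"), kernel)
pooled = max_pool(features, 4)
```

The motif "TGA" occurs at positions 2 and 7 of the example sequence, and the feature sequence scores exactly those positions, illustrating why CNNs can model binding motifs independent of position.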
In some embodiments, a model includes certain layers, nodes, kernels and/or coefficients. Training of a model, such as a CNN, in some embodiments, in accordance with block 242 includes multiple iterations of forward propagation and backpropagation. In embodiments where the model is a CNN, each layer in the neural network, in some embodiments, includes one or more nodes that are fully or partially connected to other nodes in adjacent layers. In forward propagation, the neural network performs the computation in the forward direction based on the outputs of a preceding layer. The operation of a node, in some embodiments, is defined by one or more functions. The functions that define the operation of a node, in some embodiments, include various computation operations such as convolution of data with one or more kernels, pooling, a recurrent loop in an RNN, various gates in an LSTM, etc. The functions, in some embodiments, also include an activation function that adjusts the weight of the output of the node. Nodes in different layers, in some embodiments, are associated with different functions.
Each of the functions in the neural network, in some embodiments, is associated with different parameters (e.g., weights and kernel coefficients) that are adjustable during training. In addition, some of the nodes in a neural network, in some embodiments, are also associated with an activation function that decides the weight of the output of the node in forward propagation. Common activation functions include, but are not limited to, step functions, linear functions, sigmoid functions, hyperbolic tangent functions (tanh), and rectified linear unit functions (ReLU). After an input is provided into the neural network and passes through a neural network in the forward direction, in some implementations, the results are compared to the training labels or other values in the training set to determine the neural network's performance. In some embodiments, the process of prediction is repeated for other inputs in the training sets to compute the value of the objective function in a particular training round. In turn, the neural network performs backpropagation by using gradient descent such as stochastic gradient descent (SGD) to adjust the coefficients in various functions to improve the value of the objective function.
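The common activation functions listed above can be expressed compactly; these are standard definitions rather than any particular embodiment:

```python
import math

def step(x):
    """Step function: 0 below the threshold, 1 at or above it."""
    return 1.0 if x >= 0 else 0.0

def linear(x):
    """Linear (identity) activation."""
    return x

def sigmoid(x):
    """Sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_act(x):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: passes positives, zeroes negatives."""
    return max(0.0, x)
```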
In some embodiments, multiple iterations of forward propagation and backpropagation are performed and the model is iteratively trained. In some embodiments, training is completed when the objective function has become sufficiently stable (e.g., the machine has converged) or after a predetermined number of rounds for a particular set of training enhancer sequences. In some embodiments, the trained model is used for performing various tasks as discussed in this disclosure.
Models 114 that are convolutional neural networks (CNNs) have unique advantages over other models because CNNs can model binding motifs within enhancer sequences independent of their positions within the enhancer sequences.
In various embodiments, a wide variety of models 114 are applicable for performing the methods disclosed herein. Non-limiting examples include different forms of supervised learning models such as decision trees, support vector machines (SVMs), linear regression, logistic regression, Bayesian networks, and boosted gradient models. Deep learning models such as neural networks, including convolutional neural networks (CNN), recurrent neural networks (RNN), and attention-based models (such as Transformers), are also contemplated. In some embodiments ensembles of models are used.
In some embodiments, the model 114 is a convolutional neural network, a recurrent neural network, a multilayer perceptron, XGBoost (e.g., eXtreme Gradient Boosting), a transformer model, and/or a generative model.
As another example, in some embodiments, the model 114 makes use of a bagging architecture (e.g., random forest, extra tree algorithms) and/or a boosting architecture (e.g., gradient boosting, XGBoost, etc.). In some embodiments, the model 114 is an extreme gradient boost (XGBoost) model. Description of XGBoost models is found, for example, in Chen et al., 2016, “XGBoost: A Scalable Tree Boosting System,” arXiv: 1603.02754v3 [cs.LG], the disclosure of which is hereby incorporated by reference, in its entirety, for all purposes, and specifically for its teaching of training and using XGBoost models.
In some embodiments boosting is used. In some embodiments, a boosting technique such as AdaBoost is used in conjunction with many other types of learning algorithms to improve the performance of the model 114. In this approach, the output of any of the models 114 disclosed herein, or their equivalents, is combined into a weighted sum that represents the final output of the boosted model. In some embodiments, the plurality of outputs from the models is combined using any measure of central tendency known in the art, including but not limited to a mean, median, mode, a weighted mean, weighted median, weighted mode, etc. In some embodiments, the plurality of outputs is combined using a voting method. In some embodiments, a respective model in the ensemble of models is weighted or unweighted.
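The combining of ensemble outputs by weighted sum, measure of central tendency, or voting can be sketched as follows (function names and example weights are illustrative):

```python
def combine_ensemble(outputs, weights=None, method="weighted_mean"):
    """Combine a plurality of model outputs into a single ensemble
    prediction via a weighted mean or a median."""
    if weights is None:
        weights = [1.0] * len(outputs)  # unweighted models
    if method == "weighted_mean":
        return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)
    if method == "median":
        ranked = sorted(outputs)
        mid = len(ranked) // 2
        if len(ranked) % 2:
            return ranked[mid]
        return (ranked[mid - 1] + ranked[mid]) / 2
    raise ValueError(method)

def vote(binary_outputs):
    """Majority vote over binary selectivity calls from an ensemble."""
    return 1 if sum(binary_outputs) * 2 > len(binary_outputs) else 0
```

Note that the median is robust to a single outlier model, whereas the weighted mean lets a boosting procedure emphasize better-performing models.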
Referring to block 250, in some embodiments, the plurality of parameters 116 comprises at least 100 parameters, at least 1000 parameters, at least 5000 parameters, at least 10,000 parameters, at least 100,000 parameters, at least 250,000 parameters, at least 500,000 parameters, or at least 1,000,000 parameters.
In some embodiments, the plurality of parameters 116 is at least 1000 parameters, at least 5000 parameters, at least 10,000 parameters, at least 100,000 parameters, at least 250,000 parameters, at least 500,000 parameters, or at least 1,000,000 parameters.
In some embodiments, the plurality of parameters 116 comprises at least 200, at least 500, at least 1000, at least 5000, at least 10,000, at least 100,000, at least 250,000, at least 500,000, at least 1,000,000, or at least 5,000,000 parameters. In some embodiments, the plurality of parameters comprises no more than 10,000,000, no more than 5,000,000, no more than 1,000,000, no more than 100,000, no more than 10,000, or no more than 1000 parameters. In some embodiments, the plurality of parameters consists of from 200 to 10,000, from 1000 to 100,000, from 5000 to 200,000, from 100,000 to 500,000, from 500,000 to 1,000,000, from 1,000,000 to 5,000,000, or from 5,000,000 to 10,000,000 parameters. In some embodiments, the plurality of parameters falls within another range starting no lower than 200 parameters and ending no higher than 10,000,000 parameters.
Referring to block 252, in some embodiments, the method further comprises obtaining a second training dataset, in electronic form. The second training dataset comprises a second plurality of training enhancer sequences and, for each respective training enhancer sequence in the second plurality of training enhancer sequences, a corresponding measured amount of selectivity of the respective training enhancer sequence across the plurality of in vivo states. In practice, this involves the determination of a measured amount of activity in each in vivo state in the plurality of in vivo states. Each training enhancer sequence in the second plurality of training enhancer sequences is an endogenous enhancer. The procedure further comprises, after the inputting and refining described in block 242, inputting a sequence of each respective training enhancer sequence in the second plurality of training enhancer sequences into the model thereby obtaining as output from the model, for each respective training enhancer sequence in the second plurality of training enhancer sequences, a corresponding predicted activity of the respective training enhancer in at least a first in vivo state in the one or more in vivo states. The procedure refines the plurality of parameters based on a differential between the corresponding measured amount of activity and the corresponding predicted activity in at least the first in vivo state for each respective training enhancer sequence in the second plurality of training enhancer sequences.
Referring to block 254, in some embodiments, the second plurality of training enhancer sequences comprises 10,000 or more training enhancer sequences, 20,000 or more training enhancer sequences, 30,000 or more training enhancer sequences, 40,000 or more training enhancer sequences, 50,000 or more training enhancer sequences, 60,000 or more training enhancer sequences, 70,000 or more training enhancer sequences, 80,000 or more training enhancer sequences, 90,000 or more training enhancer sequences, or 100,000 or more training enhancer sequences.
Bar 1602 of
Bar 1604 of
As can be seen by comparison of bars 1602 and 1604 of
By comparison to bars 1602 and 1604, bar 1606 shows that a degradation in model performance is observed when the model is trained on the 50,000 tandem repeat training enhancer sequences of panel 602 of
Bars 1608, 1610 and 1612 show that the transfer learning approach improves selectivity for HepG2 activity as well.
Bar 1608 of
Bar 1610 of
As can be seen by comparison of bars 1608 and 1610 of
By comparison to bars 1608 and 1610, bar 1612 of
The Spearman agreement in fold-changes reported in bars 1614, 1616, and 1618 is consistent with the previous observations that pre-training on the 50,000 training enhancer sequences of panel 602 of
Bar 1614 of
Bar 1616 of
Bar 1618 of
Panel 1620 of
Additional datasets can be used as a second training set beyond, or instead of, the approximately 12,000 naturally occurring enhancers described above in conjunction with
Referring to block 255 of
In some implementations, the trained model is used to score randomly and/or algorithmically generated novel nucleic acid sequences of test enhancer sequences. In some such embodiments, the model scores for these test enhancer sequences are used to rank and select new test enhancer sequences for experimental testing.
Referring to block 256, in some embodiments, the plurality of test enhancer sequences comprises 10 or more, 100 or more, 1000 or more, or 10,000 or more test enhancer sequences.
Referring to block 258, in some embodiments, the method further comprises applying a feature attribution method to a test nucleic acid sequence using the trained model 114 to obtain a predictive weight matrix that provides a predictive contribution of every residue position in the test nucleic acid sequence to the plurality of in vivo states. The generating of the plurality of test enhancer sequences using the trained model is in accordance with at least a portion of the predictive weight matrix.
Referring to block 260, in some embodiments, the model is a convolutional neural network and the feature attribution method is an integrated gradients feature attribution method. One example of an integrated gradients feature attribution method is GkmExplain. See, Shrikumar et al., 2019, “GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs,” Bioinformatics 35, i173-i182, which is hereby incorporated by reference.
Referring to block 262, in some embodiments, a test enhancer sequence in the plurality of test enhancer sequences comprises a plurality of residue positions. Each respective residue position in the plurality of residue positions is populated with one of four naturally occurring nucleotides based on a corresponding probability for the respective residue position given by the predictive weight matrix. In some embodiments, the probabilities at any given residue position across the four naturally occurring nucleotides (A, T, C, and G) sum to 100 percent.
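The population of residue positions from a predictive weight matrix can be sketched as follows; the base ordering, seed, and the deterministic example matrix are illustrative assumptions:

```python
import random

def sample_sequence(weight_matrix, rng=None):
    """Populate each residue position with one of the four naturally
    occurring nucleotides according to the per-position probabilities
    of a predictive weight matrix (each row sums to 1.0)."""
    rng = rng or random.Random(0)
    bases = "ATCG"
    seq = []
    for probs in weight_matrix:
        assert abs(sum(probs) - 1.0) < 1e-6  # 100 percent per position
        seq.append(rng.choices(bases, weights=probs, k=1)[0])
    return "".join(seq)

# Deterministic positions assign probability 1.0 to a single base.
matrix = [[1.0, 0.0, 0.0, 0.0],   # A
          [0.0, 1.0, 0.0, 0.0],   # T
          [0.0, 0.0, 0.0, 1.0]]   # G
seq = sample_sequence(matrix)
```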
In some embodiments, the position weight matrix derived from the trained model 114 through feature attribution is used to identify one or more regions of low activity, such as region 504 of
In some embodiments, the position weight matrices available for known transcription factor binding motif sequences are used to generate tens, hundreds, or thousands of sequences for the one or more regions of low activity, such as region 502 of
In some embodiments, between 5 and 25 residues, between 10 and 30 residues, or between 15 and 50 residues of the test nucleic acid sequence of block 258 that have regions of low activity across the plurality of in vivo states are replaced with sequences that are either known binding sites or are generated from the position frequency matrices of such binding sites.
Referring to block 264, in some embodiments, the trained model is used to generate the plurality of test enhancer sequences using an input optimization algorithm.
In embodiments that make use of input optimization, the model comprises an input layer configured to accept a test sequence. The model is configured to output a calculated activity of the sequence responsive to inputting the test sequence. A desired activity level for the test enhancer sequence (in a particular in vivo state) is set, and a difference between this desired activity level and the calculated activity of the sequence is backpropagated through the model with the parameters fixed; the test enhancer sequence is then modified to resolve the difference between the desired activity level and the calculated activity of the sequence, thereby generating candidate sequences. Such an approach is referred to in statistics as response surface methodology. In such approaches, the backpropagation applies a gradient ascent/descent optimization function to the input test sequence, rather than to the parameters of the trained model, to find the optimal input (enhancer sequences with the desired activity). Such an approach is called “input optimization” and is used to generate the test enhancer sequences summarized in panel 2104 of
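A minimal sketch of such input optimization follows, with a toy fixed-parameter linear model standing in for the trained model 114 and a numerical gradient standing in for backpropagation; all names and values are illustrative:

```python
def input_optimize(model, seq_vec, target, steps=200, lr=0.1, eps=1e-4):
    """Hold the trained model's parameters fixed and modify the input
    test sequence by gradient steps so that the calculated activity
    moves toward the desired activity level."""
    x = list(seq_vec)
    for _ in range(steps):
        diff = model(x) - target          # desired vs. calculated activity
        for j in range(len(x)):           # numerical gradient w.r.t. input
            bumped = list(x)
            bumped[j] += eps
            grad_j = (model(bumped) - model(x)) / eps
            x[j] -= lr * diff * grad_j    # resolve the difference
    return x

# Toy fixed-parameter model: activity is a weighted sum of input features.
weights = [0.5, -1.0, 2.0]
model = lambda x: sum(w * xi for w, xi in zip(weights, x))
optimized = input_optimize(model, [0.0, 0.0, 0.0], target=3.0)
```

After optimization, the input has been moved so the model's calculated activity matches the desired level; in practice the continuous input would then be discretized back to a nucleotide sequence.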
Referring to block 266, in some embodiments, the trained model is used to generate the plurality of test enhancer sequences using a bit diffusion algorithm. In some such implementations, a generative model is first trained by randomly adding noise to a test nucleic acid sequence thereby generating a seed (forward diffusion process). The model then performs a reverse diffusion process to remove the noise from the test nucleic acid sequence conditioned on an optimal activity in the desired in vivo state. To generate novel enhancer sequences, the reverse diffusion process is run on randomly generated test enhancer sequences 120. The performance of the test enhancer sequences 120 is then experimentally tested, and such performance measures and/or generated polymer sequences can be fed back into the generative model for iterative training until a target performance is achieved. See Killoran et al., 2017, “Generating and designing DNA with deep generative models,” arXiv: 1712.06148v1, which is hereby incorporated by reference. Such an approach is called “bit diffusion” and is used to generate the test enhancer sequences summarized in panel 2102 of
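A sketch of only the bit encoding and forward noising steps of such a bit diffusion approach follows; the two-bit base encoding and the thresholding "reverse" step are illustrative simplifications of what a trained denoiser would perform:

```python
import random

BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
BASES = {v: k for k, v in BITS.items()}

def to_bits(seq):
    """Bit-encode a nucleic acid sequence (two analog bits per base)."""
    return [float(b) for base in seq for b in BITS[base]]

def forward_noise(bits, sigma, rng):
    """Forward diffusion: randomly add Gaussian noise to the bit
    encoding, generating a seed for the reverse process."""
    return [b + rng.gauss(0.0, sigma) for b in bits]

def decode(noisy_bits):
    """Threshold noisy analog bits back to the nearest base (the
    trivial step a trained denoiser would refine, conditioned on the
    desired in vivo activity)."""
    rounded = [1 if b > 0.5 else 0 for b in noisy_bits]
    return "".join(BASES[(rounded[i], rounded[i + 1])]
                   for i in range(0, len(rounded), 2))

rng = random.Random(7)
seed = forward_noise(to_bits("ACGT"), sigma=0.05, rng=rng)
recovered = decode(seed)
```

At low noise the thresholding step recovers the original sequence; a trained reverse-diffusion model is needed precisely because at high noise this trivial decoding fails.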
Referring to block 267 of
Referring to block 268, in some embodiments, the determining the activity of each respective test enhancer sequence in the plurality of test enhancer sequences in an in vivo assay comprises exposing the respective test enhancer sequence operably connected to the common promoter and the common payload to each respective in vivo state in the one or more in vivo states and measuring an abundance of the common payload in each respective in vivo state in the one or more in vivo states.
Panel 2102 shows the activity of test enhancer sequences 120 generated using a trained convolutional neural network model 114 using bit diffusion algorithm in accordance with block 266.
Panel 2104 shows the activity of test enhancer sequences 120 generated using a trained convolutional neural network model 114 using an input optimization method in accordance with block 264.
Panel 2106 shows the activity of test enhancer sequences 120 generated using a trained convolutional neural network model 114 in which the feature attribution method gkmExplain (Shrikumar et al., 2019, “GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs,” Bioinformatics 35, i173-i182, which is hereby incorporated by reference) was used to generate a predictive weight matrix that provides a predictive contribution of every residue position in the test nucleic acid sequence to the activity for mouse primary neurons or for HepG2 cells. Then, these predictive weight matrices were used to replace k-mers in enhancer sequences that the predictive weight matrix indicated had low scoring activity (for MPN or conversely for HepG2) with k-mers the predictive weight matrix indicated had a greater contribution to activity in MPN or HepG2.
Panel 2108 shows the activity of test enhancer sequences 120 generated using a trained convolutional neural network model 114 in which the feature attribution method gkmExplain (Shrikumar et al., 2019, “GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs,” Bioinformatics 35, i173-i182, which is hereby incorporated by reference), was used to generate a predictive weight matrix that provided a predictive contribution of every residue position in the test nucleic acid sequence to the activity in mouse primary neuron cells or in HepG2 cells. Then, these predictive weight matrices were used to replace k-mers within enhancer sequences that the predictive weight matrix indicated had low scoring activity (for MPN or conversely for HepG2) with k-mers the predictive weight matrix indicated had a greater contribution to activity.
Panel 2110 is a negative control.
Panel 2112 is the activity of the original test enhancer sequences prior to optimization.
Panel 2114 is a positive control.
Panel 2116 illustrates an embodiment in which regions of test nucleic acid sequences tested in accordance with block 258 that have suitable activity are combined together to produce composite test enhancer sequences from these regions.
Another aspect of the present disclosure provides a computer system comprising one or more processors and a non-transitory computer-readable medium including computer-executable instructions that, when executed by the one or more processors, cause the processors to perform any of the methods and/or embodiments disclosed herein.
Yet another aspect of the present disclosure provides a non-transitory computer-readable storage medium having stored thereon program code instructions that, when executed by a processor, cause the processor to perform any of the methods and/or embodiments disclosed herein.
Although the invention has been particularly shown and described with reference to a preferred embodiment and various alternate embodiments, it will be understood by persons skilled in the relevant art that various changes in form and details can be made therein without departing from the spirit and scope of the invention.
Referring to the left portion of
A model was trained to generate enhancer sequences conditioned on mouse primary neuron (MPN) activity, liver cancer cell (HepG2) activity, and/or a metric comprising log fold change of MPN vs HepG2 activity. The bit diffusion model included a U-Net model comprising an architecture as illustrated in FIGS. 23A-B of U.S. Provisional Patent Application No. 63/585,119, filed Sep. 25, 2023, which is hereby incorporated by reference.
The model was trained on the unbiased tandem repeat library of panel 602 of
A total of 25,000 enhancer sequences predicted to have mouse primary neuron (MPN) activity were generated from the above-described models. A total of 25,000 enhancer sequences predicted to have HepG2 activity were generated from the above-described models. A total of 25,000 enhancer sequences were generated that were predicted to have appreciable log fold MPN versus HepG2 activity using the above-described models. After subsampling the generated enhancer sequences, on an unweighted random basis, a corresponding 2000 enhancer sequences predicted to have MPN activity, 200 enhancer sequences predicted to have HepG2 activity, and 500 enhancer sequences predicted to have appreciable log fold change in MPN versus HepG2 activity remained. The generated sequences could be further synthesized for experimental validation.
A proof-of-concept assay was performed to assess the ability of the above-described models to generate enhancer sequences conditioned on a particular target property for MPN activity. Performance of the enhancer sequences generated by the above-described models was estimated using a trained CNN model that produced, as output, predicted labels according to each of three performance criteria (MPN activity, HepG2 activity, and/or log fold change of MPN vs HepG2 activity) upon receiving a generated enhancer sequence as input. The output of the bit diffusion model was further evaluated based on a comparison of the predicted labels (y-axis) with the “true” labels used for conditioning (x-axis).
In this example, the effect of using a model trained to score enhancer sequences conditioned on epithelial cell line (ARPE-19) activity using a 10-mer tandem repeat training enhancer sequence library was investigated.
To determine this effect, the baseline selectivity between HepG2 (liver) and ARPE-19 (epithelial) cells was first determined. For the column marked “endogenous” in
The column marked “Random sample” provides another basis for background enhancer selectivity between ARPE-19 and HepG2 cells. The enhancers for this column were randomly selected from the 50,000 10-mer tandem repeat training enhancer sequence library described above in conjunction with
Next, a model was trained on the unbiased tandem repeat library of panel 602 of
The tandem repeat library trained model was then used to identify positive regions (regions that contribute to activity) in endogenous enhancers using the feature attribution techniques (e.g., as disclosed above in conjunction with
The tandem repeat library trained model was also used to rank the approximately 12,000 naturally occurring enhancers described above (e.g., in conjunction with
All references cited herein are incorporated by reference to the same extent as if each individual publication, database entry (e.g., Genbank sequences or GeneID entries), patent application, or patent, was specifically and individually indicated to be incorporated by reference in its entirety, for all purposes. This statement of incorporation by reference is intended by Applicants, pursuant to 37 C.F.R. § 1.57(b)(1), to relate to each and every individual publication, database entry (e.g., Genbank sequences or GeneID entries), patent application, or patent, each of which is clearly identified in compliance with 37 C.F.R. § 1.57(b)(2), even if such citation is not immediately adjacent to a dedicated statement of incorporation by reference. The inclusion of dedicated statements of incorporation by reference, if any, within the specification does not in any way weaken this general statement of incorporation by reference. Citation of the references herein is not intended as an admission that the reference is pertinent prior art, nor does it constitute any admission as to the contents or date of these publications or documents.
Embodiment 1. A method of screening for improved enhancer activity comprising: obtaining a first training dataset, in electronic form, wherein the first training dataset comprises a first plurality of training enhancer sequences and, for each respective training enhancer sequence in the first plurality of training enhancer sequences, a corresponding measured amount of activity of the respective training enhancer sequence in each of one or more in vivo states, wherein each training enhancer sequence in the first plurality of training enhancer sequences is a tandem n-mer repeat sequence; training a model comprising a plurality of parameters by a procedure comprising: (i) inputting a sequence of each respective training enhancer sequence in the first plurality of training enhancer sequences into the model, wherein the model applies the plurality of parameters to the sequence of each respective training enhancer sequence to generate as output from the model, for each respective training enhancer sequence in the first plurality of training enhancer sequences, a corresponding predicted amount of activity of the respective training enhancer sequence for at least a first in vivo state of the one or more in vivo states, and (ii) refining the plurality of parameters based on a differential between the corresponding measured amount of activity and the corresponding predicted amount of activity in the first in vivo state for each respective training enhancer sequence in the first plurality of training enhancer sequences; generating a plurality of test enhancer sequences using the trained model; and determining an activity of each respective test enhancer sequence in the plurality of test enhancer sequences in an in vivo assay, thereby screening for improved enhancer activity.
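The train-and-refine procedure of embodiment 1 (predict activity from sequence, then refine parameters on the measured/predicted differential) can be sketched with a deliberately minimal linear model in place of the models disclosed above. All sequences and activity values below are hypothetical toy data; this is a sketch of the training loop shape, not an implementation of the disclosed models.

```python
BASES = "ACGT"

def one_hot(seq):
    # Flattened one-hot encoding of a DNA sequence (4 features per position).
    return [1.0 if b == base else 0.0 for b in seq for base in BASES]

def train(seqs, activities, epochs=200, lr=0.05):
    # Minimal linear stand-in: (i) predict activity from each sequence,
    # (ii) refine the parameters on the measured/predicted differential
    # via a squared-error gradient step.
    dim = len(one_hot(seqs[0]))
    w = [0.0] * dim
    for _ in range(epochs):
        for seq, y in zip(seqs, activities):
            x = one_hot(seq)
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y                      # the "differential"
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi           # parameter refinement
    return w

def predict(w, seq):
    return sum(wi * xi for wi, xi in zip(w, one_hot(seq)))

# Toy tandem-repeat-style training set with hypothetical measured activities.
seqs = ["ACGTACGT", "AAAAAAAA", "GGGGCCCC", "TTTTTTTT"]
acts = [0.8, 0.1, 0.5, 0.2]
w = train(seqs, acts)
assert abs(predict(w, "ACGTACGT") - 0.8) < 0.1
```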
Embodiment 2. The method of embodiment 1, wherein the model is a support vector machine.
Embodiment 3. The method of embodiment 2, wherein the support vector machine comprises a gapped k-mer kernel.
Embodiment 4. The method of embodiment 1, wherein the model is a convolutional neural network.
Embodiment 5. The method of any one of embodiments 1-4, wherein the method further comprises applying a feature attribution method to a test nucleic acid sequence using the trained model to obtain a predictive weight matrix that provides a predictive contribution of every residue position in the test nucleic acid sequence to an activity in the first in vivo state, and the generating the plurality of test enhancer sequences using the trained model is in accordance with at least a portion of the predictive weight matrix.
Embodiment 6. The method of embodiment 5, wherein the model is a convolutional neural network and the feature attribution method is an integrated gradients feature attribution method.
Embodiment 7. The method of embodiment 5, wherein a test enhancer sequence in the plurality of test enhancer sequences comprises a plurality of residue positions, and each respective residue position in the plurality of residue positions is populated with one of four naturally occurring nucleotides based on a corresponding probability for the respective residue position given by the predictive weight matrix.
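The per-position sampling of embodiment 7 can be sketched as drawing one of the four naturally occurring nucleotides at each residue position according to the probabilities in the predictive weight matrix. The matrix below is a hypothetical example, not a matrix produced by the disclosed feature attribution methods.

```python
import random

def sample_sequence(pwm, seed=0):
    # Sample one nucleotide per residue position, with per-base probabilities
    # taken from the predictive weight matrix row for that position.
    rng = random.Random(seed)
    bases = "ACGT"
    return "".join(rng.choices(bases, weights=row)[0] for row in pwm)

# Hypothetical 4-position matrix; each row gives probabilities for A, C, G, T.
pwm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
seq = sample_sequence(pwm)
assert len(seq) == 4 and set(seq) <= set("ACGT")
```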
Embodiment 8. The method of embodiment 1, wherein the trained model is used to generate the plurality of test enhancer sequences using an input optimization algorithm.
Embodiment 9. The method of embodiment 1, wherein the trained model is used to generate the plurality of test enhancer sequences using a bit diffusion algorithm.
Embodiment 10. The method of any one of embodiments 1-9, wherein the plurality of parameters is at least 100 parameters, at least 1000 parameters, at least 5000 parameters, at least 10,000 parameters, at least 100,000 parameters, at least 250,000 parameters, at least 500,000 parameters, or at least 1,000,000 parameters.
Embodiment 11. The method of any one of embodiments 1-10, wherein each training enhancer sequence in the first plurality of training enhancer sequences comprises 5 to 20 contiguous repeats of a different n-mer, wherein n is a positive integer between 4 and 20.
Embodiment 12. The method of any one of embodiments 1-10, wherein each training enhancer sequence in the first plurality of training enhancer sequences comprises 5 to 20 contiguous repeats of a different 10-mer.
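Construction of a tandem n-mer repeat training sequence, as in embodiments 11 and 12, can be sketched as follows. The repeat unit here is randomly drawn for illustration; the disclosed libraries may select units differently.

```python
import random

def tandem_repeat(n=10, repeats=12, seed=None):
    # Build one training enhancer sequence as `repeats` contiguous copies of a
    # random n-mer (embodiments 11-12: 5-20 repeats; n between 4 and 20).
    rng = random.Random(seed)
    unit = "".join(rng.choice("ACGT") for _ in range(n))
    return unit * repeats

seq = tandem_repeat(n=10, repeats=12, seed=1)
assert len(seq) == 120
assert seq[:10] == seq[10:20]  # contiguous repeats of the same 10-mer
```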
Embodiment 13. The method of any one of embodiments 1-12, wherein each training enhancer sequence in the first plurality of training enhancer sequences consists of between 15 residues and 200 residues.
Embodiment 14. The method of any one of embodiments 1-13, wherein each respective training enhancer sequence in the first plurality of training enhancer sequences is operably connected to a common promoter and a common payload; and the corresponding measured amount of activity in the first in vivo state is a transcriptional activity in the first in vivo state in a tissue.
Embodiment 15. The method of embodiment 14, wherein the tissue is brain tissue or liver tissue.
Embodiment 16. The method of embodiment 14, wherein the tissue is blood, brain, colon, diaphragm, gastrocnemius muscle, heart, inguinal adipose, kidney, lung, lymph node, skin, small intestine, spinal cord, spleen, trachea, liver, stomach, large intestine, pancreas, gallbladder, bladder, eyes, thymus, or adrenal gland.
Embodiment 17. The method of any one of embodiments 1-13, wherein
each respective training enhancer sequence in the first plurality of training enhancer sequences is operably connected to a common promoter and a common payload under a common transcription factor; and each respective in vivo state in the one or more in vivo states is an amount of expression of the common payload in a corresponding cell line originating from a particular tissue.
Embodiment 18. The method of embodiment 17, wherein the cell line is HepG2, H4, or K562.
Embodiment 19. The method of embodiment 17, wherein the common transcription factor is FOSB::JUNB; FOSL2::JUNB; FOSL1::JUNB; FOS::JUN; FOSL1::JUN; BATF3; BNC2; ZBTB12; HSF4; HSF1; HSF2; REL; RELA; TEAD3; TEAD1; TP53; TP63; TP73; SMAD5; SMAD2; ZNF416; GRHL2; TFCP2; MLX; ARNT::HIF1A; or HIF1A.
Embodiment 20. The method of any one of embodiments 1-19, wherein the corresponding measured amount of activity of the respective training enhancer sequence in the one or more in vivo states comprises a corresponding measured abundance of a payload associated with the respective training enhancer sequence in each of the one or more in vivo states.
Embodiment 21. The method of embodiment 20, wherein the corresponding measured abundance of the respective training enhancer sequence in an in vivo state in the one or more in vivo states is a corresponding averaged or summed count of a number of barcoded instances of a payload associated with the respective training enhancer sequence found in a corresponding plurality of sequence reads of RNA harvested from each replica in a plurality of replicas of a plurality of cells, representative of the in vivo state exposed to the respective training enhancer sequence.
Embodiment 22. The method of embodiment 21, wherein the corresponding averaged or summed count is log normalized by an amount of the respective training enhancer sequence used for exposing each replica of the plurality of cells.
Embodiment 23. The method of embodiment 21 or embodiment 22, wherein each respective training enhancer sequence in the plurality of respective training enhancer sequences is uniquely assigned a corresponding plurality of barcodes comprising 5 or more, 10 or more, 20 or more, 50 or more, 75 or more, or 100 or more barcodes, and each replica of the plurality of cells representative of the in vivo state exposed to the respective training enhancer sequence is, for each respective barcode uniquely assigned to the respective training enhancer sequence, one or more copies of the respective training enhancer sequence tagged with the respective barcode.
Embodiment 24. The method of embodiment 23, wherein each barcode in the corresponding plurality of barcodes encodes a unique predetermined value selected from the set {1, . . . , 1024}, {1, . . . , 4096}, {1, . . . , 16384}, {1, . . . , 65536}, {1, . . . , 262144}, {1, . . . , 1048576}, {1, . . . , 4194304}, {1, . . . , 16777216}, {1, . . . , 67108864}, or {1, . . . , 1×10^12}.
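The value sets in embodiment 24 correspond to the spaces addressable by DNA barcodes of increasing length (for example, 4^5 = 1024 for 5-mers). One way such a value-to-barcode mapping could be realized is a base-4 encoding over the four nucleotides; this scheme is illustrative and is not stated to be the assay's actual encoding.

```python
def value_to_barcode(value, length):
    # Map a value from {1, ..., 4**length} to a unique DNA barcode of the
    # given length via base-4 encoding (illustrative scheme only).
    assert 1 <= value <= 4 ** length
    digits = []
    v = value - 1
    for _ in range(length):
        digits.append("ACGT"[v % 4])
        v //= 4
    return "".join(reversed(digits))

# {1, ..., 1024} corresponds to all 5-mers: 4**5 == 1024.
assert 4 ** 5 == 1024
barcodes = {value_to_barcode(v, 5) for v in range(1, 1025)}
assert len(barcodes) == 1024  # every value maps to a distinct barcode
```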
Embodiment 25. The method of any one of embodiments 21-24, wherein the corresponding plurality of sequence reads of RNA harvested from each replica in the plurality of replicas of a plurality of cells comprises at least 1000, at least 2000, at least 3000, at least 4000, at least 5000, at least 6000, at least 7000, at least 8000, at least 9000, at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1 million, at least 2 million, at least 3 million, at least 4 million, at least 5 million, at least 6 million, at least 7 million, at least 8 million, at least 9 million, or more sequence reads.
Embodiment 26. The method of any one of embodiments 21-25, wherein the corresponding plurality of replicas of the plurality of cells comprises 2, 3, 4, 5, 6, 7, 8, 9, or 10 replicas.
Embodiment 27. The method of any one of embodiments 1-26, wherein the first plurality of training enhancer sequences comprises 10,000 or more training enhancer sequences, 20,000 or more training enhancer sequences, 30,000 or more training enhancer sequences, 40,000 or more training enhancer sequences, 50,000 or more training enhancer sequences, 60,000 or more training enhancer sequences, 70,000 or more training enhancer sequences, 80,000 or more training enhancer sequences, 90,000 or more training enhancer sequences, or 100,000 or more training enhancer sequences.
Embodiment 28. The method of any one of embodiments 1-27, wherein the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of activity of the respective training enhancer sequence in each of the one or more in vivo states comprises a log fold change in (i) an abundance of the respective training enhancer sequence measured in the first in vivo state and (ii) an abundance of the respective training enhancer sequence measured in the second in vivo state.
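The log fold change of embodiment 28 (relative abundance between two in vivo states) can be sketched as a log2 ratio. The pseudocount below is an added assumption to guard against zero abundances; the embodiment itself does not specify one.

```python
import math

def log_fold_change(abundance_a, abundance_b, pseudocount=1e-9):
    # Log2 fold change between abundances measured in a first and a second
    # in vivo state, as in embodiment 28 (pseudocount is illustrative).
    return math.log2((abundance_a + pseudocount) / (abundance_b + pseudocount))

assert abs(log_fold_change(8.0, 2.0) - 2.0) < 1e-6   # 8/2 = 4 -> log2 = 2
assert abs(log_fold_change(2.0, 8.0) + 2.0) < 1e-6   # symmetric with sign flipped
```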
Embodiment 29. The method of any one of embodiments 1-27, wherein the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of activity of the respective training enhancer sequence in each of the one or more in vivo states comprises a binary indication of measured activity between the first in vivo state and the second in vivo state.
Embodiment 30. The method of any one of embodiments 1-27, wherein the one or more in vivo states consists of a first in vivo state and a second in vivo state, and the corresponding measured amount of activity of the respective training enhancer sequence across the one or more in vivo states comprises a scalar indication of measured activity between the first in vivo state and the second in vivo state.
Embodiment 31. The method of any one of embodiments 1-30, wherein the method further comprises: obtaining a second training dataset, in electronic form, wherein the second training dataset comprises a second plurality of training enhancer sequences and, for each respective training enhancer sequence in the second plurality of training enhancer sequences, a corresponding measured amount of activity of the respective training enhancer sequence in each of the one or more in vivo states, wherein each training enhancer sequence in the second plurality of training enhancer sequences is an endogenous enhancer; and the procedure further comprises, after the (i) inputting and (ii) refining, (iii) inputting a sequence of each respective training enhancer sequence in the second plurality of training enhancer sequences into the model thereby obtaining as output from the model, for each respective training enhancer sequence in the second plurality of training enhancer sequences, a corresponding predicted amount of activity of the respective training enhancer in at least the first in vivo state, and (iv) refining the plurality of parameters based on a differential between the corresponding measured amount of activity and the corresponding predicted amount of activity in at least the first in vivo state for each respective training enhancer sequence in the second plurality of training enhancer sequences.
Embodiment 32. The method of embodiment 31, wherein the second plurality of training enhancer sequences comprises 10,000 or more training enhancer sequences, 20,000 or more training enhancer sequences, 30,000 or more training enhancer sequences, 40,000 or more training enhancer sequences, 50,000 or more training enhancer sequences, 60,000 or more training enhancer sequences, 70,000 or more training enhancer sequences, 80,000 or more training enhancer sequences, 90,000 or more training enhancer sequences, or 100,000 or more training enhancer sequences.
Embodiment 33. The method of any one of embodiments 1-32, wherein the plurality of test enhancer sequences comprises 10 or more, 100 or more, 1000 or more, or 10,000 or more test enhancer sequences.
Embodiment 34. The method of any one of embodiments 1-33, wherein the determining the activity of each respective test enhancer sequence in the plurality of test enhancer sequences in an in vivo assay comprises: exposing the respective test enhancer sequence operably connected to the common promoter and the common payload to each respective in vivo state in the one or more in vivo states; and measuring an abundance of the common payload in each respective in vivo state in the one or more in vivo states.
Embodiment 35. A computer system comprising: one or more processors; and
a non-transitory computer-readable medium including computer-executable instructions that, when executed by the one or more processors, cause the processors to perform the method according to any one of embodiments 1-34.
Embodiment 36. A non-transitory computer-readable storage medium having stored thereon program code instructions that, when executed by a processor, cause the processor to perform the method according to any one of embodiments 1-34.
The foregoing description of the embodiments has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the patent rights to the precise forms disclosed. Persons skilled in the relevant art will appreciate that many modifications and variations are possible in light of the above disclosure.
Any feature mentioned in one claim category, e.g., method, can be claimed in another claim category, e.g., computer program product, system, storage medium, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However, any subject matter resulting from a deliberate reference back to any previous claims (in particular multiple dependencies) can be claimed as well, so that any combination of claims and the features thereof is disclosed and can be claimed regardless of the dependencies chosen in the attached claims. The subject matter, in some embodiments, includes not only the combinations of features as set out in the disclosed embodiments but also any other combination of features from different embodiments. Various features mentioned in the different embodiments can be combined with explicit mentioning of such combination or arrangement in an example embodiment or without any explicit mentioning. Furthermore, any of the embodiments and features described or depicted herein, in some embodiments, are claimed in a separate claim and/or in any combination with any embodiment or feature described or depicted herein or with any of the features.
Some portions of this description describe the embodiments in terms of algorithms and symbolic representations of operations on information. These operations and algorithmic descriptions, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as engines, without loss of generality. The described operations and their associated engines are, in some embodiments, embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein, in some embodiments, are performed or implemented with one or more hardware or software engines, alone or in combination with other devices. In one embodiment, a software engine is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described. The term “steps” does not mandate or imply a particular order. For example, while this disclosure describes, in some embodiments, a process that includes multiple steps sequentially with arrows present in a flowchart, the steps in the process do not need to be performed by the specific order claimed or described in the disclosure. In some implementations, some steps are performed before others even though the other steps are claimed or described first in this disclosure. Likewise, any use of (i), (ii), (iii), etc., or (a), (b), (c), etc. in the specification or in the claims, unless specified, is used to better enumerate items or steps and also does not mandate a particular order.
This application claims priority to U.S. Provisional Patent Application No. 63/511,138 entitled “GENERATIVE SEQUENCE SCREENING WITH CONDITIONAL GANS, DIFFUSION MODELS, AND DENOISING DIFFUSION CONDITIONAL GANS,” filed Jun. 29, 2023, which is hereby incorporated by reference. This application also claims priority to U.S. Provisional Patent Application No. 63/585,119 entitled “GENERATIVE SEQUENCE SCREENING WITH CONDITIONAL GANS, DIFFUSION MODELS, AND DENOISING DIFFUSION CONDITIONAL GANS,” filed Sep. 25, 2023, which is hereby incorporated by reference. This application also claims priority to U.S. Provisional Patent Application No. 63/587,665 entitled “REGULATING ENHANCER ACTIVITY USING MACHINE-LEARNING,” filed Oct. 3, 2023, which is hereby incorporated by reference.
| Number | Date | Country |
|---|---|---|
| 63587665 | Oct 2023 | US |
| 63511138 | Jun 2023 | US |
| 63585119 | Sep 2023 | US |