The present invention relates to transcription factors. More specifically, the present invention relates to systems and methods implementing machine learning models to identify activation domain regions of transcription factors.
Transcription factors (TFs) play a key role in eukaryotic cell pathways by modulating the expression of genes in response to a signal. They are thus involved in all central processes such as growth, stress response, and development, and when mis-regulated or mutated can lead to many human diseases. Each TF includes a family-determining DNA-binding domain (DBD) and an effector domain that regulates nearby gene transcription. Activation domains (ADs)—effector domains that increase transcription—have long been of particular interest due to their roles as oncogenic drivers and use as scientific tools.
ADs were discovered as regions which could independently stimulate transcription when ectopically recruited to a gene promoter. Early experiments showed that ADs were unlike structured domains because progressive truncations showed graded, proportional reductions in activity. Subsequent studies showed that ADs were high in predicted disorder and had few similarities in their primary sequence. Instead, ADs were classified based on their enrichment of certain residues, whether acidic, glutamine, proline, or other residues.
Acidic ADs are the most common and best characterized. Acidic ADs retain activity when transferred between yeast and humans, pointing to a conserved eukaryotic mechanism. While some have found that acidic residues are necessary for activation, others have found that they are dispensable. Besides their negative charge, acidic ADs are abundant in bulky hydrophobic residues. Mutating these hydrophobic residues reduces activation, often in proportion to the number mutated.
Because AD sequences are highly diverse and poorly conserved, only a small fraction of all ADs have likely been annotated. Sequence motifs have been proposed based on analysis of select ADs but have not been used for large-scale prediction. Screens of random sequences identified numerous, heterogeneous activating sequences that represented as much as 1-5% of elements tested. However, wild-type protein sequences and structures are highly non-random, so predictions based on random sequence may not generalize. On the other hand, direct screening of wild-type sequences has identified only modest numbers of ADs at coarse resolution. As yet, methods are lacking to experimentally detect or computationally predict the full diversity of wild-type ADs.
This summary is meant to provide some examples and is not intended to be limiting of the scope of the invention in any way. For example, any feature included in an example of this summary is not required by the claims, unless the claims explicitly recite the features. Various features and steps as described elsewhere in this disclosure may be included in the examples summarized here, and the features and steps described here and elsewhere can be combined in a variety of ways.
In one embodiment, a method includes obtaining a convolutional neural network (CNN), where the CNN is trained with functional activation domain data from a first organism and identifying an activation domain in a second organism using the CNN.
In a further embodiment, the functional activation domains are described by at least one of: peptide sequence, predicted secondary structure, actual secondary structure, predicted disorder, and activity of the functional activation domain.
In another embodiment, the method further includes obtaining the functional activation domain data and training the CNN with the functional activation domain data.
In a still further embodiment, obtaining functional activation domain data includes obtaining a library of nucleic acid molecules, where each molecule in the library of nucleic acid molecules encodes a peptide including a DNA-binding domain and a potential activation domain, providing the library to a collection of cells, where each cell in the collection of cells includes a target gene operatively coupled to a promoter region to which the DNA-binding region binds, screening the collection of cells for a cell that expresses the target gene, which indicates a functional activation domain within the molecule from nucleic acid molecules provided to the cell, and identifying the functional activation domain in the library.
In still another embodiment, identifying the functional activation domain includes sequencing the nucleic acid molecule introduced into the cell.
In a yet further embodiment, the method further includes screening the collection of cells for a cell that expresses the molecule from the library of nucleic acid molecules.
In yet another embodiment, the peptide further includes a reporter domain.
In a further embodiment again, the reporter domain is selected from mCherry, GFP, YFP, RFP, DsRed, mStrawberry, mOrange, and dTomato.
In another embodiment again, screening the collection of cells for a cell that expresses the target gene simultaneously screens the collection of cells for a cell that expresses the molecule from the library of nucleic acid molecules.
In a further additional embodiment, identifying the functional activation domain includes sequencing the molecule from the library of nucleic acid molecules introduced to the cell.
In another additional embodiment, the peptide further includes an inducer domain, and the method further includes inducing expression of the target gene.
In a still yet further embodiment, inducing expression includes providing an exogenous chemical to the collection of cells.
In still yet another embodiment, the inducer is an estrogen inducer, and the exogenous chemical is estrogen.
In a still further embodiment again, the collection of cells are selected from bacteria, yeast cells, plant cells, and mammalian cells.
In still another embodiment again, the collection of cells are yeast cells.
In a still further additional embodiment, the CNN possesses 1-20 hidden layers.
In still another additional embodiment, the CNN possesses 3-9 convolutional layers.
In a yet further embodiment again, the CNN possesses 9 convolutional layers of kernel size 10 and channel width 30.
In yet another embodiment again, the first organism and the second organism are different species.
In a yet further additional embodiment, the method further includes obtaining the functional activation domain data and training the CNN with the functional activation domain data, where the CNN possesses 1-20 hidden layers and 3-9 convolutional layers, where the convolutional layers possess kernel size 10 and channel width 30 and the functional activation domains are described by at least one of: peptide sequence, predicted secondary structure, actual secondary structure, predicted disorder, and activity of the functional activation domain.
Other features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings which illustrate, by way of example, the principles of the invention.
The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.
Turning now to the drawings, systems and methods to identify transcription factor activation domains and uses thereof are provided. Many embodiments utilize machine learning methodologies to identify peptide sequences, domains, motifs, or regions capable of activating gene transcription. Additional embodiments design and/or construct peptides to activate gene transcription in vivo and/or in vitro. Certain embodiments design nucleic acid sequences (e.g., DNA and/or RNA) that encode such custom peptides. Further embodiments treat an organism, such as a human, with one or more of a nucleic acid sequence and a peptide encoding for a custom transcription factor, such that it activates or increases expression of one or more genes within the organism.
Transcription factor proteins play a key role in cell pathways by modulating the expression of genes in response to a signal. They are thus involved in all central processes such as growth, stress response, and development, and when mis-regulated or mutated can lead to many human diseases. Transcription factors are also used by viruses to hijack and direct a host cell's processes towards amplifying the virus. Furthermore, transcription factors are used extensively as research tools and in engineering cellular circuits.
Activation domains are the regions of transcription factors that are responsible for increasing expression of a gene. However, very few activation domains in human or virus genomes are annotated, because they are poorly conserved in sequence even though they are highly conserved in function. Many embodiments describe a computational tool to predict locations and strengths of activation domains from protein sequence to enable rapid functional characterization of transcription factors across all newly sequenced genomes and of genetic mutants, whether due to natural genetic variation in humans or viruses or arising in diseases such as cancer. Such embodiments also enable rational design of activation domains with varying strengths for use as research tools and for cellular engineering.
Many embodiments herein utilize a dataset of in vivo activation and/or in vitro coactivator-binding domains to train a machine learning model, such as a convolutional neural network, to identify activation domains and strengths of these domains in organisms (e.g., eukaryotes) and/or tissues (e.g., cancers). In certain embodiments, the in vivo activation and/or in vitro coactivator-binding domains arise from experimental work identifying activation domains in one or more species, such as a model species (e.g., yeast, Arabidopsis spp., mouse, rat, etc.).
Many embodiments are capable of identifying the core regions of ADs responsible for activation and of dissecting the sequence and mechanistic determinants of binding to coactivators (e.g., Mediator).
It should be noted that within the context of this description the term “in vivo” refers to “in an organism” and/or “in a cell line” (e.g., tissue culture), and the term “in vitro” refers to “in a solution” and/or “in a test tube.”
Turning to
At 102, many embodiments obtain a library of molecules. Various embodiments obtain a library of proteins or peptides comprising one or more functional domains, and certain embodiments obtain a library of nucleic acid molecules (e.g., RNA and/or DNA) that encode for a peptide with one or more functional domains. In many embodiments, each molecule in the library comprises a DNA binding domain and a potential activation domain. Additional embodiments include additional elements for identification and/or induction of gene activation—for example, some embodiments include a reporter peptide (e.g., mCherry, GFP, YFP, RFP, DsRed, mStrawberry, mOrange, dTomato, and/or any other known reporter molecule), and additional embodiments include an inducer, such as an estrogen response domain and/or any other domain that can localize the molecule to the nucleus and/or induce gene activation via an exogenous treatment.
In many embodiments the potential activation domain is a segment of a known transcription factor protein. Additional embodiments possess segments of other known proteins, and further embodiments possess random amino acid segments. In some embodiments using segments of known proteins, the potential activation domain represents a tile from a sliding window of a protein's amino acid sequence—e.g., amino acids 1-25, amino acids 2-26, amino acids 3-27, etc.
In certain embodiments, the potential activation domain is a segment of a constant amino acid length (e.g., each molecule possesses a potential activation domain with the same segment length, such as 53 amino acids). In some embodiments, the potential activation domain is a segment of a variable amino acid length (e.g., molecules within the library possess a potential activation domain of varying sizes, such as segments ranging between 15-60 amino acids).
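For embodiments tiling a known protein, the sliding-window scheme above can be sketched as follows. This is a minimal illustration; the tile length and step size are example parameters, not values required by the method.

```python
def tile_protein(seq, tile_len=53, step=1):
    """Slide a fixed-length window across a protein sequence.

    Yields (start, tile) pairs with 0-based start positions. A step of 1
    reproduces the densest tiling described above (residues 1-53, 2-54, ...).
    """
    if len(seq) < tile_len:
        return
    for start in range(0, len(seq) - tile_len + 1, step):
        yield start, seq[start:start + tile_len]

# A 60-residue sequence yields 8 overlapping 53-residue tiles at step 1.
tiles = list(tile_protein("A" * 60, tile_len=53, step=1))
```

A larger step trades resolution for library size; variable-length tiling, as in some embodiments, would vary tile_len across the library.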
As noted above, certain embodiments obtain a library of nucleic acid molecules that encode for a protein or peptide, such as described. As such, each nucleic acid molecule encodes one or more of a DNA binding domain and a potential activation domain, a reporter peptide, an inducer, such as described herein. Additional embodiments of nucleic acid molecules further encode for one or more functional regions on the molecule, such as a promoter, a terminator, an origin of replication, a splice site, a start codon, a stop codon, a poly-A tail, and/or any other nucleic acid sequence that can aid in replication or expression of the molecule. In some such embodiments, the molecules in the nucleic acid library are circular (e.g. plasmids, BACs, etc.), while in some embodiments, the nucleic acid is linear.
Turning to
Returning to
At 106, various embodiments screen for aTF presence. In embodiments using cells, the cells can be screened or sorted (e.g., via flow cytometry) to select for cells that include a molecule from the library of molecules, such as embodiments which encode a reporter moiety on the aTF (e.g.,
At 108, some embodiments induce gene expression. Induction can include translocating an aTF to the nucleus or encouraging DNA binding of a DNA-binding domain. Various gene induction systems are known in the art, such as an estrogen induction system, where introduction of exogenous estrogen leads to expression of a target gene. Some embodiments do not require exogenous induction, such that target gene expression proceeds without exogenous control.
Further embodiments screen for activation domain activity at 110. Expression within a cell or aliquot is indicative of effective aTFs (e.g., aTFs possessing activation domains with activation activity). In many embodiments, the target gene is a reporter gene, such as a gene encoding a fluorescent molecule. As such, many embodiments screen cells via flow cytometry, a plate reader, and/or any other method to identify target gene expression. In various embodiments, expressing cells and/or aliquots are isolated from non-expressing cells and/or aliquots. Some embodiments provide a quantitative measurement of activation by quantifying target gene expression, such as through fluorescence (if the target gene is a fluorescent reporter gene) or another methodology to quantify expression, such as qPCR.
At 112, many embodiments identify a functional activation domain from the expressing cells and/or aliquots. In many of these embodiments, the activation domains are sequenced via protein sequencing and/or nucleic acid sequencing. For example, in embodiments where the library of molecules are nucleic acids, the nucleic acid sequence can identify the underlying amino acid sequence based on codon usage.
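For libraries of nucleic acid molecules, recovering the encoded peptide from a sequenced insert is a standard genetic code translation. A minimal sketch follows, assuming an in-frame, intron-free insert; error handling and ambiguity codes are omitted.

```python
# Standard genetic code as a compact lookup: codon index = 16*i(b1) +
# 4*i(b2) + i(b3), with bases ordered T=0, C=1, A=2, G=3. '*' marks stop.
_BASES = "TCAG"
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def translate(dna):
    """Translate an in-frame DNA sequence to a peptide string.

    A minimal sketch: a real pipeline would also validate the reading
    frame and typically truncate at the first stop codon.
    """
    dna = dna.upper().replace("U", "T")
    peptide = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        idx = (16 * _BASES.index(dna[i])
               + 4 * _BASES.index(dna[i + 1])
               + _BASES.index(dna[i + 2]))
        peptide.append(_AA[idx])
    return "".join(peptide)
```

For example, translate("ATGGAA") yields "ME", linking the sequenced DNA back to the screened peptide.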
Various embodiments perform features of method 100 in a different order, omit certain features, and/or repeat certain features. For example, if no induction is necessary, inducing gene expression at 108 may be omitted, while screening steps, such as screening for expressing cells at 106 and 110, can be repeated, such as to set an expression threshold and/or limit sequencing costs and/or sequencing resources. Additionally, some embodiments quantitatively screen for expression of an aTF and expression of the target gene at the same time through a methodology that can both quantify and identify multiple targets, such as multi-channel fluorescence, qPCR, nucleic acid sequencing, or any other suitable method to quantify expression.
Further embodiments repeat the entire process with one or more different sizes of potential activation domains to identify the smallest unit with activation activity, while some embodiments additionally or alternatively repeat the entire process with one or more alterations in the amino acid sequence (either through alteration of the nucleic acid sequence or direct alteration of the amino acid sequence) to identify a key residue or residues that contribute to activation activity.
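The alteration strategy above can be sketched as a single-residue substitution scan. Alanine is used here purely as an illustrative replacement residue; any substitution scheme the screen supports would work the same way.

```python
def single_residue_scan(seq, replacement="A"):
    """Generate all single-residue substitution variants of a peptide.

    Substituting each position in turn (here with alanine, an illustrative
    choice) lets a downstream activation assay reveal which residues
    contribute to activity. Returns (position, original_residue, variant).
    """
    variants = []
    for i, aa in enumerate(seq):
        if aa == replacement:
            continue  # skip positions that already match the replacement
        variants.append((i, aa, seq[:i] + replacement + seq[i + 1:]))
    return variants

muts = single_residue_scan("MFDL")
```

Each variant would then be re-screened for activation, and positions whose substitution reduces activity are candidate key residues.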
As some genetic diseases can be resolved by regulating a gene (either up-regulating or down-regulating), identifying domains that activate gene transcription can give rise to therapeutics to control genes. As such, certain embodiments utilize a machine learning model, such as a neural network, to identify activation domains in proteins, organisms, and/or species. In many of these embodiments, a convolutional neural network (CNN) is trained based on known activation domains. In some of these embodiments, the identities of activation domains are obtained by methods described herein. Some embodiments further include quantitative measurements of activation activity for the activation domains. Some of these embodiments train the neural network on the amino acid composition of the known activation domains, and certain embodiments further train the neural network based on one or more of the following predicted secondary structure, actual secondary structure, and predicted disorder of the activation domain (see e.g., Oates, M. E., et al. (2012). D2P2: database of disordered protein predictions. Nucleic Acids Research 41, D508-D516; the disclosure of which is hereby incorporated by reference herein in its entirety.)
CNNs in accordance with many embodiments evaluate sequences by hierarchically integrating matches to a diverse suite of learned motifs. Such CNNs have recently found great success in many genomic prediction tasks. Based on such methodologies, many embodiments explain greater than 80% of observed variation in data withheld from training, markedly better than an amino acid composition-based predictor. Many embodiments accurately predicted the activation strength of (1) new ADs within TFs omitted from training (
Turning to
Turning to
Some embodiments are able to identify the minimal region for activation.
Many embodiments incorporate machine learning models into processes to engineer organisms, including cell cultures, for research or industrial production. Such embodiments identify and characterize ADs in one or more systems of interest or design novel activation domains with specific activities for a particular use.
Further embodiments implement machine learning methodologies for genetic counseling by identifying impacts of naturally occurring or disease-related mutations, where the mutations affect the activation domain activity. As such, these embodiments could identify a disease or disease risk based on functional impact of a mutation.
Additional embodiments implement machine learning methodologies, such as those described herein, to determine clinical outcomes of mutations in cancers. Genetic fusions with activation domains are driving events in many cancers; however, cancer genome sequences differ greatly between patients, so these embodiments can be used to interpret the functional effects of mutations in sequenced tumor genomes.
Further embodiments implement such machine learning methodologies to determine clinical outcomes of mutations in viruses. Sequences of activation domains evolve quite quickly, and mutations can affect their activity in subtle ways. Thus, some embodiments predict how specific mutations in virus activation domains affect their function in the host cell.
Experiments were conducted to demonstrate the capabilities of the assays and methods in accordance with embodiments. These results and discussion are not meant to be limiting, but merely to provide examples of operative systems and their features.
A peptide library was designed based on all 162 yeast (Saccharomyces cerevisiae S288C) transcription factors annotated with the Gene Ontology term GO:003700 (DNA-binding transcription factor activity), plus MET4 and HAP4, two yeast proteins known to be involved in transcription. Each transcription factor protein sequence was fragmented into 53-amino acid tiles with at least 40 amino acid overlap (overlap was adjusted based on the protein length to make the tiling as evenly distributed as possible across any given protein), yielding 7,457 unique peptides (sublibrary 1A-yeast-tiles). A set of 50 random synthetic (i.e., not derived from a known protein) sequences with the same amino acid frequencies as the overall set of yeast transcription factors (sublibrary 1A-random-control), as well as a set of 50 peptides derived from non-nuclear proteins (proteins not annotated with the GO term GO:0005634 (nucleus)) were used as controls (sublibrary 1A-yeast-nonnuclear). Note that there are no true negative controls; the random and non-nuclear sequences serve as references for comparison purposes, but it is unknown a priori whether they are strong or weak binders. Additionally, 10 positive control sequences previously reported to activate and bind Med15 were included (sublibrary 1A-yeast-controls). This gave a total of 7,567 unique 53-amino acid sequences.
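The tiling scheme described (53-amino acid tiles with at least 40 amino acids of overlap, distributed as evenly as possible across each protein) can be sketched as below. The exact rounding used to space the tile starts is an assumption; the original implementation may differ.

```python
import math

def even_tiles(seq, tile_len=53, min_overlap=40):
    """Tile a protein into fixed-length windows spread evenly.

    Chooses the number of tiles so that no step between adjacent tile
    starts exceeds tile_len - min_overlap, then spaces the start
    positions evenly. The rounding scheme here is an illustrative
    assumption, not the published implementation.
    """
    if len(seq) <= tile_len:
        return [seq]
    max_step = tile_len - min_overlap  # step <= 13 keeps >= 40 aa overlap
    span = len(seq) - tile_len         # distance from first to last start
    n_tiles = math.ceil(span / max_step) + 1
    starts = [round(i * span / (n_tiles - 1)) for i in range(n_tiles)]
    return [seq[s:s + tile_len] for s in starts]

tiles = even_tiles("A" * 100)  # 5 evenly spaced 53-aa tiles
```

Because the step adapts to protein length, the first and last tiles always cover the protein termini while every adjacent pair keeps at least the minimum overlap.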
Pairwise edit distance analysis demonstrates the diversity of peptide sequences in the library.
Across this library, 451 tiles showed significant activation (P<0.0001,
These 150 ADs included many (approximately 75%) of previously reported ADs, showing this measurement was highly sensitive. A large number (112) of the ADs were previously unknown. Furthermore, the 53-amino acid tile length was not limiting, since ADs were identified in over 85% of full-length TFs that activated in a one-hybrid screen. In fact, AD-containing TFs upregulated a higher proportion of downstream genes than AD-less TFs, suggesting that these ADs are functional within the full-length protein. Together, these results show that this screen is both accurate and comprehensive and has yielded the first complete annotation of ADs in any eukaryotic genome.
In the reverse translation design process, the aim was to optimize library DNA fragments for compatibility and consistency with an in vitro assay, standard RNA-seq protocols, and built-in redundancy for error-correcting reads. In particular, the Python package dnachisel 1.4.1 was used to optimize the following objectives:
An edit distance of 6 among the first 48 bases and last 48 bases of any two sequences in the same sublibrary was enforced. This was performed in a randomized, brute-force, iterative approach:
Finally, it was verified that across each set of sequences sharing the same primer (e.g., all sequences in library 1A, or all sequences in library 1B), the total paired-end edit distance (sum of edit distance of 5′-most 50 bases and edit distance of 3′-most 50 bases) was at least 6.
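The paired-end edit distance described here can be computed with a small Levenshtein routine. The pipeline used the editdistance package; a pure-Python equivalent is sketched for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def paired_end_distance(seq1, seq2, read_len=50):
    """Sum of edit distances over the 5'-most and 3'-most read_len bases."""
    return (levenshtein(seq1[:read_len], seq2[:read_len])
            + levenshtein(seq1[-read_len:], seq2[-read_len:]))
```

Enforcing paired_end_distance >= 6 between any two library members sharing a primer is what later allows error-tolerant read mapping.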
The use of sequencing primers unique to each sublibrary enabled submission of samples for sequencing in multiplexed format and accurate computational assignment of reads to the correct sublibrary. The edit distance margin built into the library was further leveraged to enable mapping of sequencing reads containing a small number of errors.
Sequencing read alignment was performed using a custom bash script built on top of existing tools and additional custom scripts. It takes as input arguments the UMI length, the sublibrary sequencing primer, the edit distance threshold for that sublibrary, and raw FASTQ files. Unique molecular identifiers (UMIs) were extracted from reads and appended to the read names using umi_tools 1.0.0. cutadapt 1.18 was used to discard reads without matching paired-end sublibrary sequencing primers and to trim the primers in reads with matching primers; the default error tolerance was used. bwa-mem 0.7.17-r1188 was used to perform a first-pass alignment of reads to the DNA fragment library. Imperfectly mapped read pairs (i.e., those without paired read SAM flags of 99 and 147) were re-mapped to the library sequence with minimal edit distance. This was necessary because bwa-mem did not always correctly map paired reads as a pair, a problem most evident in the mutant library with many similar sequences. Pairwise Levenshtein edit distance was computed using the Python package editdistance 0.5.3. Paired reads exceeding the edit distance threshold were discarded using reformat.sh from BBTools 38.61. Duplicate reads were identified and removed using umi_tools 1.0.0. Finally, reads mapped to each DNA library fragment sequence were counted.
Twenty percent (20%) of the total library sequences were reserved as a held-out test set, and the remaining library was split into 10 folds for cross-validation.
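This split can be sketched in pure Python; the seed and shuffling strategy shown are illustrative assumptions, not the published procedure.

```python
import random

def split_library(n_sequences, test_frac=0.2, n_splits=10, seed=0):
    """Hold out test_frac of indices, split the rest into n_splits folds.

    Returns (test_idx, folds) where folds is a list of n_splits disjoint
    index lists covering the non-test sequences.
    """
    rng = random.Random(seed)
    idx = list(range(n_sequences))
    rng.shuffle(idx)
    n_test = int(n_sequences * test_frac)
    test_idx = idx[:n_test]
    train_idx = idx[n_test:]
    folds = [train_idx[i::n_splits] for i in range(n_splits)]
    return test_idx, folds

test_idx, folds = split_library(1000)  # 200 held out, 10 folds of 80
```

During cross-validation, each fold in turn serves as the validation set while the remaining nine train the model; the held-out 20% is touched only for final evaluation.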
Two broad ways of encoding library peptide sequences were considered. For each encoding method, deep learning regression models were developed to predict activation from those encodings. All models were trained using mean squared error as the loss function.
The first category of encodings did not retain explicit position-specific information. As a specific method, each peptide was encoded as a 20-dimensional vector giving the proportions of each of the 20 amino acids. Using scikit-learn 0.22.1, fully-connected neural networks were trained with 1 to 20 hidden layers of width 20 and ReLU activation using either the Adam or SGD optimizer and the default initial learning rate of 0.001 and L2-penalty of 0.0001.
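The composition encoding can be sketched as follows: each peptide becomes a 20-dimensional vector of amino acid proportions, which the fully-connected networks then consume.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition_vector(peptide):
    """Encode a peptide as the proportions of the 20 amino acids.

    This representation deliberately discards all position information;
    it served as the baseline for the fully-connected networks.
    """
    total = len(peptide)
    return [peptide.count(aa) / total for aa in AMINO_ACIDS]

vec = composition_vector("MDEEQ" * 10)  # 40% E, 20% each of M, D, Q
```

The resulting vectors would be passed to a regressor such as scikit-learn's MLPRegressor, configured with the hidden-layer widths and optimizers described above.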
The second category of encodings explicitly retained position-specific information. For a library peptide sequence of length L, these encoding methods produced an L-by-d matrix, where d depended on the specific encoding method. The most expressive encoding was a one-hot encoding with d=20. Amino acid embeddings learned by the UniRep model (d=10) were also tried, as was a d=2 embedding capturing the charge (1 for lysine and arginine, −1 for aspartic acid and glutamic acid, 0 otherwise) and the Wimley-White interfacial hydrophobicity of each amino acid. Using TensorFlow 2.2, convolutional neural networks were trained with 3-9 convolutional layers followed by max-pooling along either the channels or sequence-length dimension and 2 fully-connected layers of width 20. Kernel sizes between 5 and 20, channel widths (i.e., number of filters at each convolutional layer) between 10 and 20, L2 weight penalties between 1e-3 and 1e-5, and initial learning rates between 1e-2 and 1e-4 were tried experimentally. Batch norm and Swish or ReLU activations were applied to each convolutional and fully-connected layer (except the last layer). Models were trained using the Adam optimizer for up to 500 epochs with two scheduling callbacks: reduction of the learning rate by 5-fold if training loss did not improve for 20 epochs, and early stopping if no improvement on the validation loss was observed for 50 epochs. Models were trained on Stanford's Sherlock computing cluster using CPUs only. The final model had 9 convolutional layers of kernel size 10 and channel width 30 followed by max-pooling along the sequence dimension; Swish activation was applied to each layer (except the last). This model was trained using the Adam optimizer with batch normalization, an L2 weight penalty of 1e-3, a dropout of 0.1, and an initial learning rate of 1e-3.
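The one-hot and charge components of these position-specific encodings can be sketched as below. Wimley-White interfacial hydrophobicity values are published constants that would be looked up from a table; they are omitted here rather than reproduced from memory.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(peptide):
    """Encode a peptide as an L-by-20 one-hot matrix (list of lists)."""
    return [[1 if aa == ref else 0 for ref in AMINO_ACIDS] for aa in peptide]

def charge_track(peptide):
    """Per-residue charge channel: +1 for K/R, -1 for D/E, 0 otherwise.

    In the d=2 encoding this channel is paired with a Wimley-White
    interfacial hydrophobicity channel taken from published tables.
    """
    return [1 if aa in "KR" else -1 if aa in "DE" else 0 for aa in peptide]

m = one_hot("KD")  # 2-by-20 matrix, one row per residue
```

The L-by-d matrices produced this way are what the convolutional layers scan with their learned motif filters.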
Having described several embodiments, it will be recognized by those skilled in the art that various modifications, alternative constructions, and equivalents may be used without departing from the spirit of the invention. Additionally, a number of well-known processes and elements have not been described in order to avoid unnecessarily obscuring the present invention. Accordingly, the above description should not be taken as limiting the scope of the invention.
Those skilled in the art will appreciate that the foregoing examples and descriptions of various preferred embodiments of the present invention are merely illustrative of the invention as a whole, and that variations in the components or steps of the present invention may be made within the spirit and scope of the invention. Accordingly, the present invention is not limited to the specific embodiments described herein, but, rather, is defined by the scope of the appended claims.
The current application claims priority to U.S. Provisional Patent Application No. 63/125,836, entitled “Systems and Methods to Identify Transcription Factor Activation Domains and Uses Thereof” by Roger D. Kornberg et al., filed Dec. 15, 2020; the disclosure of which is hereby incorporated by reference in its entirety.
This invention was made with Government support under Contract Nos. 133097 and 11696 awarded by the National Institutes of Health. The government has certain rights in the invention.