The disclosed subject matter relates to methods, systems, and media for identifying transcription factor binding sites.
The dynamic process of gene regulation is essential for embryonic development and cellular function. Gene regulation is primarily mediated by the combinatorial effects of transcription factors interacting with cis-regulatory elements such as promoters and enhancers. Therefore, accurate identification of transcription factor binding sites within the genome is necessary to understand a wide range of cellular processes from cell differentiation to homeostasis to cancer. However, identifying these sites within the genome remains a complex biological and computational question.
One of the challenges in predicting transcription factor binding sites is that identification of the strongest binding sequence, or consensus site, is not sufficient. Research analyzing genome-wide transcription factor occupancy has shown that low-affinity binding sites are also significantly occupied in both yeast and Drosophila. Furthermore, transcription factors from the same family have been shown to bind identical high-affinity sites, but distinct low-affinity sites. Therefore, identification of both high- and low-affinity sites will aid in fully understanding transcription factor specificity within the genome.
Nkx2.2 is a homeodomain transcription factor expressed in the ventral neural tube and the pancreas during development. A consensus sequence (T(t/c)AAGT(a/g)(c/g)TT) has been identified by SELEX and a corresponding position weight matrix (PWM) was generated and deposited in the TRANSFAC database. However, the predictive power of this PWM is low. More recently, a PWM for Nkx2.2 was generated using protein binding microarray technology. Protein Binding Microarrays use a mathematically constructed set of oligos to quantitatively measure protein-DNA binding for all possible octamers.
The identification of transcription factor binding sites is an important biological question. To date, the majority of methods to detect these sites have focused on creating statistical models, such as position weight matrices, of transcription factor specificities. However, these models are limited because they must make generalized assumptions about transcription factor binding properties that are not completely understood. Alternatively, technologies such as ChIP-seq have recently been developed to examine genomic transcription factor occupancy. However, these technologies are technically difficult and limited by the lack of high-quality antibodies for many transcription factors.
Accordingly, new mechanisms for identifying transcription factor binding sites are needed.
Methods, systems, and media for identifying transcription factor binding sites in accordance with some embodiments are provided. In accordance with some embodiments, systems for identifying transcription factor binding sites are provided, the systems comprising at least one processor that: receives chromosome sequence data; selects a first plurality of overlapping octamers from the chromosome sequence data; assigns an enrichment score to each of the first plurality of overlapping octamers to produce a first set of enrichment scores; calculates a first average of the first set of enrichment scores; determines whether the first average is above a threshold; selects a second plurality of overlapping octamers from the chromosome sequence data; assigns an enrichment score to each of the second plurality of overlapping octamers to produce a second set of enrichment scores; calculates a second average of the second set of enrichment scores; determines whether the second average is above the threshold; and outputs data that indicates that a transcription factor binding site has been identified in connection with at least one of the first plurality of octamers and the second plurality of octamers.
In accordance with some embodiments, methods for identifying transcription factor binding sites are provided, the methods comprising: receiving chromosome sequence data; selecting a first plurality of overlapping octamers from the chromosome sequence data; assigning an enrichment score to each of the first plurality of overlapping octamers to produce a first set of enrichment scores; calculating a first average of the first set of enrichment scores; determining whether the first average is above a threshold; selecting a second plurality of overlapping octamers from the chromosome sequence data; assigning an enrichment score to each of the second plurality of overlapping octamers to produce a second set of enrichment scores; calculating a second average of the second set of enrichment scores; determining whether the second average is above the threshold; and outputting data that indicates that a transcription factor binding site has been identified in connection with at least one of the first plurality of octamers and the second plurality of octamers.
In accordance with some embodiments, computer readable media containing computer-executable instructions that, when executed by a processor, cause the processor to perform a method for identifying transcription factor binding sites are provided, the method comprising: receiving chromosome sequence data; selecting a first plurality of overlapping octamers from the chromosome sequence data; assigning an enrichment score to each of the first plurality of overlapping octamers to produce a first set of enrichment scores; calculating a first average of the first set of enrichment scores; determining whether the first average is above a threshold; selecting a second plurality of overlapping octamers from the chromosome sequence data; assigning an enrichment score to each of the second plurality of overlapping octamers to produce a second set of enrichment scores; calculating a second average of the second set of enrichment scores; determining whether the second average is above the threshold; and outputting data that indicates that a transcription factor binding site has been identified in connection with at least one of the first plurality of octamers and the second plurality of octamers.
As is known in the art, the transcription factor Nkx2.2 binds a 10 base-pair sequence that was thought to contain an invariable “AAGT” core sequence. In accordance with some embodiments, a mechanism for identifying an alternative core sequence for a transcription factor (such as Nkx2.2) is provided. Using this mechanism, an alternative low-affinity core sequence with a wobble in the first position that contains “GAGT” has been identified.
Berger M F, et al., “Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences,” Cell 133(7):1266-1276, 2008, which is hereby incorporated by reference herein in its entirety, published a protein binding microarray (PBM) analyzing the binding affinity of the Nkx2.2 homeodomain transcription factor. PBMs generate an enrichment score (E-score) with a range from −0.5 to 0.5 for every possible eight-base combination based on the relative intensity readouts from the microarray data.
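By way of illustration only, the set of "every possible eight-base combination" referred to above can be enumerated as in the following minimal Python sketch. The function names are hypothetical and are not taken from any PBM software. Whether a particular E-score table lists an octamer and its reverse complement as separate entries or as a single entry is an assumption that should be checked against the data file being used.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def all_kmers(k: int = 8) -> list:
    """Enumerate every possible DNA sequence of length k."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


def canonical(kmer: str) -> str:
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


octamers = all_kmers(8)
canonical_octamers = {canonical(o) for o in octamers}

print(len(octamers))            # 65536 (4 to the 8th power)
print(len(canonical_octamers))  # 32896 when strands are merged
```

An E-score lookup keyed by octamer, as assumed in later sketches, would hold one value in the range −0.5 to 0.5 for each such key.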
In accordance with some embodiments, a mechanism for identifying an alternative core sequence for a transcription factor can operate as follows: First, all octamers with an E-score greater than 0.45 can be selected. As shown in the last row of column 104 of
As can be seen, the two identified core sequence motifs differ only in the first position. In order to determine whether significant enrichment can be seen with the other two possible first bases (e.g., TAGT and CAGT), a histogram 110 of the number of occurrences of each possible base in the first position (i.e., AAGT, GAGT, TAGT and CAGT) for all E-scores can be plotted as shown in
In order to experimentally test the alternative GAGT binding site, Electrophoretic Mobility Shift Assay (EMSA) experiments were performed as shown in
The EMSA experiments were performed as follows: First, in vitro synthesized Nkx2.2 protein was made using the TNT Coupled Reticulocyte Lysate System (available from Promega Corporation). Probes were next prepared containing each of the predicted core sequences analyzed or a deleted core sequence. The sequences of each of the probes are listed in Table 1 of Appendix I.
The probe containing the Nkx2.2 consensus sequence was prepared as described in Watada H, Mirmira R G, Kalamaras J, & German M S, “Intramolecular control of transcriptional activity by the NK2-specific domain in NK-2 homeodomain proteins,” Proc Natl Acad Sci USA, 97(17):9443-9448, 2000, and Anderson K R, et al., “Cooperative transcriptional regulation of the essential pancreatic islet gene NeuroD1 (beta2) by Nkx2.2 and neurogenin 3,” J Biol Chem 284(45):31236-31248, 2009, which are hereby incorporated by reference herein in their entireties.
Binding of each of the probes to the in vitro synthesized Nkx2.2 (Myc-Nkx2.2 TNT Protein) or alphaTC1 nuclear extract with or without transfected Myc-Nkx2.2 was measured as follows.
Probes were labeled by filling in 5′ overhangs with 32P-dCTP. The binding buffer included 100 mM Tris HCl pH 7.5, 500 mM NaCl, 5 mM EDTA, 10 mM MgCl2, 40% glycerol, 5 mM DTT, 10×BSA, and 0.1 μg/μl of poly(dI-dC). Binding reactions were incubated on ice for 45 minutes with 5 μl of in vitro synthesized protein and 25,000 CPM, corresponding to approximately 1 fmol, of labeled probe. Samples were run on 5% non-denaturing polyacrylamide gels at 180 V for 1.5 hours in 1×TGE buffer (250 mM Tris base, 1.9 M glycine, and 10 mM EDTA).
Bands were quantified using the integrated mean of a fixed window for each of the shifts using Photoshop Extended CS3 (available from Adobe Systems Inc.). Values were normalized to total probe (shifted probe+free probe).
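The normalization described above can be expressed as a short calculation. The following Python sketch is illustrative only; the function name is hypothetical, and the intensities in the usage line are placeholder numbers, not measured values.

```python
def fraction_bound(shifted: float, free: float) -> float:
    """Normalize a shifted-band intensity to total probe.

    Total probe is taken as shifted probe + free probe, as stated in
    the text. Both inputs are integrated band intensities measured
    with the same fixed window.
    """
    total = shifted + free
    if total <= 0:
        raise ValueError("total probe signal must be positive")
    return shifted / total


# Placeholder intensities for illustration only (not experimental data).
print(fraction_bound(shifted=1.0, free=3.0))  # 0.25
```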
Binding of each probe was next compared to both the original consensus probe and a probe with a deleted core sequence. The GAGT containing probe showed significant binding with in vitro translated Nkx2.2 (TNT Nkx2.2) or nuclear extract from alphaTC1 cells with or without transfected Nkx2.2, although binding was weaker than the AAGT containing probe.
Taken together, these experiments show that GAGT represents an alternative core sequence for Nkx2.2 binding sites, although its relative binding affinity is lower than the canonical AAGT core sequence.
In accordance with some embodiments, protein binding microarray data can be mapped directly to the genome to identify putative binding sites, such as Nkx2.2 binding sites.
The enrichment score (E-score) generated from the protein binding microarray can represent a semi-quantitative estimate of transcription factor binding affinity. In accordance with some embodiments, the E-score for each octamer can be mapped to the genome to predict Nkx2.2 binding sites. This mapping can be referred to as PBM-mapping.
In accordance with some embodiments, single octamers with an E-score greater than 0.4 (or any other suitable threshold) can be mapped.
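As a non-limiting illustration, mapping of single octamers could be sketched in Python as follows. The function name, the dictionary `escores` (octamer to E-score), and the toy scores are hypothetical placeholders rather than published PBM values. Octamers absent from the dictionary (for example, those containing an ambiguous base) are simply skipped, which is an assumption of this sketch.

```python
def map_single_octamers(sequence, escores, threshold=0.4, k=8):
    """Return (start, octamer, E-score) for every overlapping octamer
    whose E-score is greater than the threshold.

    `start` is a 0-based offset into `sequence`.
    """
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        score = escores.get(kmer)  # None if the octamer has no score
        if score is not None and score > threshold:
            hits.append((start, kmer, score))
    return hits


# Toy scores for illustration only (not real PBM data).
toy_escores = {"TTAAGTGC": 0.48, "TAAGTGCT": 0.41, "AAGTGCTT": 0.30}
hits = map_single_octamers("GGTTAAGTGCTTCC", toy_escores)
print(hits)  # [(2, 'TTAAGTGC', 0.48), (3, 'TAAGTGCT', 0.41)]
```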
In accordance with other embodiments, a moving average over seven (or any other suitable number of) overlapping octamers can be mapped to predict binding affinity with greater accuracy. Sequences with a moving average greater than a given threshold can then be deposited into a database and can be output to a display if desired. The threshold can be set to approximately 0.37 (or any other suitable value).
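One way the moving-average variant could be sketched, for illustration only, is shown below in Python. The function name and the `escores` dictionary are hypothetical. The sketch assumes that the octamers in a window are consecutive, overlapping by seven bases, so that seven octamers span 14 bases, and that a window containing an unscored octamer is skipped; neither detail is fixed by the description above.

```python
def moving_average_sites(sequence, escores, window=7, threshold=0.37, k=8):
    """Return (start, end, mean E-score) for each run of `window`
    consecutive overlapping k-mers whose mean E-score exceeds
    `threshold`. `start` is 0-based and `end` is exclusive.
    """
    sequence = sequence.upper()
    n_kmers = len(sequence) - k + 1
    # E-score of the k-mer starting at each position (None if unscored).
    scores = [escores.get(sequence[i:i + k]) for i in range(n_kmers)]
    sites = []
    for start in range(n_kmers - window + 1):
        block = scores[start:start + window]
        if any(s is None for s in block):
            continue  # assumption: skip windows with unscored k-mers
        mean = sum(block) / window
        if mean > threshold:
            # `window` overlapping k-mers cover window + k - 1 bases.
            sites.append((start, start + window + k - 1, mean))
    return sites


# Toy example: every octamer in a 14-base sequence gets a placeholder
# score of 0.4 (illustrative value, not real PBM data).
demo_seq = "ACGTTAAGTGCTTA"
demo_scores = {demo_seq[i:i + 8]: 0.4 for i in range(len(demo_seq) - 7)}
print(moving_average_sites(demo_seq, demo_scores))
```

Exposing `window`, `threshold`, and `k` as parameters reflects the statement that any other suitable number of octamers or threshold value can be used.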
A PBM-mapping process 200 that can be used in accordance with some embodiments is illustrated in
Using this technique, complete analysis of the genome resulted in 3×10^6 predicted sites, which falls within the range of the number of transcription factor binding sites expected in the genome. In order to investigate sites that are most likely to be biologically relevant, the search for sites was limited to bound promoters (from 2.5 kb upstream to 1 kb downstream) of genes with expression levels significantly changed (e.g., more than two-fold) in Nkx2.2 null mice at e12.5 or e13.5, and one hundred and eleven novel Nkx2.2 binding sites were found.
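The restriction of predicted sites to promoter windows of genes with changed expression could be sketched, by way of example only, as follows. The `Gene` record, its field names, and the definition of `fold_change` as the ratio of null to wild-type expression are assumptions made for this sketch; the gene names and coordinates in the usage example are invented placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    tss: int            # transcription start site coordinate
    strand: str         # "+" or "-"
    fold_change: float  # assumed: null / wild-type expression ratio


def promoter_window(gene, upstream=2500, downstream=1000):
    """Return (start, end) covering 2.5 kb upstream to 1 kb downstream
    of the TSS, oriented according to the gene's strand."""
    if gene.strand == "+":
        start, end = gene.tss - upstream, gene.tss + downstream
    elif gene.strand == "-":
        start, end = gene.tss - downstream, gene.tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(0, start), end


def is_changed(gene, min_fold=2.0):
    """True if expression changed more than min_fold in either direction."""
    if gene.fold_change <= 0:
        raise ValueError("fold_change must be positive")
    return gene.fold_change > min_fold or gene.fold_change < 1.0 / min_fold


def sites_in_changed_promoters(sites, genes, min_fold=2.0):
    """Keep (chrom, position) sites inside promoters of changed genes."""
    windows = [(g, promoter_window(g)) for g in genes if is_changed(g, min_fold)]
    kept = []
    for chrom, pos in sites:
        for gene, (start, end) in windows:
            if chrom == gene.chrom and start <= pos <= end:
                kept.append((gene.name, chrom, pos))
    return kept


# Placeholder genes and sites for illustration only.
genes = [
    Gene("geneA", "chr1", 10000, "+", 3.0),
    Gene("geneB", "chr1", 50000, "-", 0.4),
    Gene("geneC", "chr2", 10000, "+", 1.2),
]
sites = [("chr1", 8000), ("chr1", 12000), ("chr1", 52000), ("chr2", 10000)]
print(sites_in_changed_promoters(sites, genes))
```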
The results of sites within these promoters can be found in Table 2 of Appendix II. Binding sites were found in seven out of eight genes with increased expression and 24 out of 27 genes with decreased expression in the Nkx2.2 null pancreas. GAGT containing sites were highly represented in the predicted sites, confirming the ability of the technique to predict alternate sites. Twenty-three sites, including six GAGT containing sites, were confirmed using EMSA analysis as shown in
EMSA analysis of selected predicted sites was performed as described above except that probes spanning approximately 50-60 base pairs surrounding the predicted site were incubated with in vitro synthesized Nkx2.2, and the Nkx2.2 consensus probe and the consensus probe with the core sequence deleted were used as positive and negative controls, respectively.
Confirmation of in vivo promoter occupancy at predicted sites by ChIP was performed using the Active Motif ChIP-IT Express kit (available from Active Motif, Inc.). BetaTC6 cells were used for chromatin input and Nkx2.2 mouse monoclonal antibody was used for precipitations. BetaTC6 cells were grown in DMEM supplemented with 15% FBS. Approximately 1.5×10^7 cells were crosslinked in 1% paraformaldehyde for five minutes at room temperature. Chromatin was then extracted and sheared by sonication using a Diagenode Bioruptor (8 min-30 sec ON/OFF), resulting in chromatin fragments from 200-800 base pairs long. The sheared chromatin was divided into six reactions and run independently. Pulldowns were done with 3 μg mouse anti-Nkx2.2 monoclonal antibody (available from Developmental Studies Hybridoma Bank). Normal mouse IgG (available from Millipore Corporation) was used as a negative control, and enrichment is shown as fold change over IgG. Occupancy of the predicted sites was tested by SYBR Green qPCR (primers are listed in Table 3 of Appendix III).
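The description above does not state how fold change over IgG was computed from the qPCR data. One common convention, shown here purely as an assumed illustration, derives it from threshold cycle (Ct) values under the assumption of perfect amplification efficiency (a doubling of product per cycle). The function name and the Ct values are hypothetical.

```python
def fold_over_igg(ct_antibody: float, ct_igg: float, efficiency: float = 2.0) -> float:
    """Fold enrichment of the specific pulldown over the IgG control.

    Assumes equal chromatin input for both reactions and a per-cycle
    amplification factor of `efficiency` (2.0 means perfect doubling).
    A lower Ct in the antibody reaction means more template was
    recovered, hence a fold change greater than 1.
    """
    if efficiency <= 1.0:
        raise ValueError("efficiency must be greater than 1")
    return efficiency ** (ct_igg - ct_antibody)


# Placeholder Ct values for illustration only (not experimental data).
print(fold_over_igg(ct_antibody=25.0, ct_igg=28.0))  # 8.0
```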
All predicted sites tested by ChIP were significantly enriched over the IgG control. The housekeeping gene Gapdh was used as a negative control and was not significantly enriched. Nkx6.2 −1441, Nkx6.2 +669, Irs4 +1495 and Tm4sf4 +912 were not tested in ChIP for technical reasons.
Tested sites were randomly selected from putative sites in bound promoter regions. In addition to the randomly selected sites, the following sites were also included: a site predicted by the PBM-mapping mechanism described herein that is located in the Region IV enhancer of the Pdx1 promoter, an additional Irs4 site downstream of the bound region (Irs4 +1495), and a previously published Nkx2.2 binding site in the insulin promoter that was the only published site not predicted by the PBM-mapping mechanism described herein.
Of the 28 sites tested by EMSA, only the insulin promoter site, the Nkx6.2 +669 site, and the glucagon −1080 site did not show detectable binding. Glucagon −1080 and Nkx6.2 +669 had an average E-score of 0.347 and 0.364, respectively, and represented the lowest scores of any predicted site tested. The Ins2 −144 site was below an original threshold with an average E-score of 0.233.
In order to test whether the E-score is correlated with relative Nkx2.2 binding affinity, the relative binding affinity of Nkx2.2 binding in the EMSA experiments was quantified and graphed against the TRANSFAC PWM score, the PBM seed and wobble matrix score, and the E-score. The TRANSFAC PWM was developed from alignment of 23 sequences enriched using SELEX experiments. The PBM-PWM was based on microarray experiments, which provide data for all possible octamers. Numerous statistical corrections to the PWM model were not part of this study.
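The description above does not name the statistic used to assess correlation. As one illustrative possibility, a Pearson correlation coefficient between a predictive score (for example, an average E-score) and quantified relative binding could be computed as in the following Python sketch. The function name is hypothetical and the numbers in the usage example are placeholders, not experimental results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Placeholder values for illustration only (not experimental data):
# one predictive score and one relative binding value per probe.
scores = [0.35, 0.38, 0.41, 0.44, 0.47]
relative_binding = [0.10, 0.22, 0.35, 0.50, 0.71]
print(round(pearson_r(scores, relative_binding), 3))
```

The same function could be applied to each candidate score in turn (PWM score, seed and wobble matrix score, E-score) to compare how well each tracks the measured binding.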
As shown in
Single E-scores for the highest octamer and averages of three, five, six, seven, and eight octamers were tested as shown in
Although the above-described mechanism for determining transcription factor binding sites has been illustrated for Nkx2.2, this mechanism can additionally or alternatively be applied to other transcription factor binding sites to create composite transcription factor binding site maps across the entire genome. Generation of such a map can greatly aid work to identify cis-regulatory elements and understand gene regulation. PBM data is available for at least 391 non-redundant proteins from several species, as described in Newburger D E & Bulyk M L, “UniPROBE: an online database of protein binding microarray data on protein-DNA interactions,” Nucleic Acids Res 37(Database issue):D77-82, 2009, which is hereby incorporated by reference herein in its entirety. However, adjustments to the mechanism may need to be made to account for different profiles of different classes of proteins.
Although there is overlap between PWM based predictions and PBM mapping, two examples of promoters where the predictions are significantly different have been identified: NeuroD and Insulin. The functional control of the NeuroD promoter by Nkx2.2 is described in Anderson KR, et al., “Cooperative transcriptional regulation of the essential pancreatic islet gene NeuroD1 (beta2) by Nkx2.2 and neurogenin 3,” J Biol Chem 284(45):31236-31248, 2009, which is hereby incorporated by reference herein in its entirety. In the NeuroD promoter, the TRANSFAC-PWM for Nkx2.2 predicted two sites while PBM mapping predicted a novel site upstream of the two TRANSFAC predicted sites that were not bound in vitro or in vivo as illustrated in
As shown in
The PBM mapping site is unique because it is predicted to consist of two adjacent binding sites separated by four base pairs as illustrated in the schematic representation of the NeuroD promoter shown in
An Nkx2.2 binding site in the insulin promoter (Ins2 −144) was previously published in Watada H, Mirmira R G, Kalamaras J, & German M S, “Intramolecular control of transcriptional activity by the NK2-specific domain in NK-2 homeodomain proteins,” Proc Natl Acad Sci USA, 97(17):9443-9448, 2000, which is hereby incorporated by reference herein in its entirety. This site is the only published Nkx2.2 binding site not predicted by the process illustrated in
Insulin expression is lost in the Nkx2.2 null mouse. However, mutation of the Ins2 −144 site resulted in a paradoxical increase in insulin expression. Therefore, luciferase assays were performed to assess Nkx2.2 function through the upstream Nkx2.2 binding site. Luciferase constructs were created to contain the 586 bases upstream of the Ins2 promoter.
The insulin promoter from −585 to +2 was cloned into the pGL4.17 luciferase plasmid (available from Promega Corporation). Mutagenesis of the previously published and predicted Nkx2.2 binding sites was done using the QuikChange II mutagenesis kit (available from Agilent Technologies Inc., formerly Stratagene) with the following primers and their respective reverse complement sequences:
GGAGGAGGGACCATTGCCTTGCTGCCTGAATTC (Ins2 −144) and GACCTAGCACCAGGGGTTTGGAAACTGCAGC (Ins2 −477). The pGL4:ins2 promoter and pRL-null plasmids were transfected at a ratio of 10:1 (500 ng/50 ng) using Fugene 6 transfection reagent (available from F. Hoffmann-La Roche Ltd.) into 5×10^5 betaTC6 cells. After 48 hours, cells were harvested and assayed for luciferase activity using the dual luciferase assay kit (available from Promega Corporation). At least three independent experiments were performed in triplicate, and the unpaired Student's t-test was used to measure the significance of changes between sample conditions.
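For illustration, the reverse complement of each mutagenesis primer listed above can be derived programmatically. The following Python sketch uses hypothetical names; the two primer sequences are copied from the text above, and the dictionary keys are labels chosen for this sketch.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence contains characters other than A, C, G, T")
    return seq.translate(COMPLEMENT)[::-1]


# Forward primers as listed in the text.
primers = {
    "Ins2 -144": "GGAGGAGGGACCATTGCCTTGCTGCCTGAATTC",
    "Ins2 -477": "GACCTAGCACCAGGGGTTTGGAAACTGCAGC",
}

for label, forward in primers.items():
    print(label, "forward:", forward)
    print(label, "reverse:", reverse_complement(forward))
```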
Basal activity of the promoter was very high in BetaTC6 cells. Mutation of the upstream Nkx2.2 binding site resulted in a 50% reduction in activity, indicating that Nkx2.2 increases the rate of insulin production, but is not necessary for insulin expression. Mutation of the downstream site also resulted in a decrease in luciferase levels, contrary to what was previously published. These experiments show that Nkx2.2 activates the insulin promoter through both binding sites, but binds more strongly to the Ins2 −477 site.
In accordance with some embodiments, the techniques described herein can be implemented at least in part in one or more computer systems. These computer systems can be any of a general purpose device such as a computer or a special purpose device such as a client, a server, etc. Any of these general or special purpose devices can include any suitable components such as a processor (which can be a microprocessor, digital signal processor, a controller, etc.), memory, communication interfaces, display controllers, input devices, etc.
In some embodiments, any suitable computer readable media can be used for storing instructions for performing the processes described herein. For example, in some embodiments, computer readable media can be transitory or non-transitory. For example, non-transitory computer readable media can include media such as magnetic media (such as hard disks, floppy disks, etc.), optical media (such as compact discs, digital video discs, Blu-ray discs, etc.), semiconductor media (such as flash memory, electrically programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), etc.), any suitable media that is not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable tangible media. As another example, transitory computer readable media can include signals on networks, in wires, conductors, optical fibers, circuits, any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable intangible media.
Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention. For example, in some embodiments, rather than operating on octamers (which include 8 base pairs), a suitable portion of a DNA strand including any suitable number of base pairs (e.g., 10) can be used. Features of the disclosed embodiments can be combined and rearranged in various ways.
This application claims the benefit of U.S. Provisional Patent Application No. 61/349,131, filed May 27, 2010, which is hereby incorporated by reference herein in its entirety.
This invention was made with government support under Grants U01 DK072504 and R01 DK082590 awarded by the National Institutes of Health. The government has certain rights in the invention.
Number | Date | Country
---|---|---
61349131 | May 2010 | US