Mapping transcription factors (TFs) to their target genes is fundamental to understanding how gene networks function. However, for most organisms no TF-gene interaction information is available. DNA affinity purification sequencing (DAP-seq) combines advantages of in vivo and in vitro assays. DAP-seq directly measures TF binding in a native local genomic context in an in vitro TF-DNA binding assay that allows rapid generation of genome-wide binding site maps for large numbers of TFs, while capturing genomic DNA properties that impact binding in vivo (e.g., DNA methylation). The primary technical bottleneck to acquiring such information is the significant effort involved in cloning and expressing the tagged TFs used in in vitro assays.
In one aspect, the present disclosure provides a method of affinity-labeling a polypeptide, e.g., a transcription factor, in an in vitro transcription and translation reaction to evaluate interactions of the polypeptide with other molecules of interest, such as cellular nucleic acids or proteins. The method employs a tRNA having an affinity-labeled amino acid, e.g., a lysine tRNA in which the lysine is linked to an affinity label, such as biotin. In some embodiments, a nucleic acid encoding a polypeptide of interest is amplified and transcribed in an in vitro transcription reaction. Labeled polypeptide is obtained by providing the tRNA loaded with labeled amino acid in an in vitro translation reaction. In some embodiments, the affinity-labeled polypeptide is a transcription factor. In some embodiments, an affinity-labeled transcription factor is employed to evaluate transcription factor binding sites. In some embodiments, an affinity-labeled transcription factor, e.g., a biotin-labeled transcription factor labeled at a subset of lysine residues with biotin, is used in DAP-seq (referred to herein as “biotin-DAP-seq” for convenience) to evaluate TF-genomic DNA binding interactions. In some embodiments, the binding moiety is biotin. One of skill understands that alternative binding moieties can also be used. The same approach used to immobilize TFs for DAP-seq can also be used for a low-cost high-throughput protein isolation for downstream characterization of interactions of the labeled protein with other molecules, e.g., protein-protein interactions, ligand binding, and/or structural analysis. In some embodiments, an affinity-labeled polypeptide generated as described herein is used in conjunction with other massively parallel sequence-based analyses. In some embodiments, an affinity-labeled polypeptide is used to evaluate binding of the polypeptide to RNA. 
In some embodiments, an affinity-labeled polypeptide is used to evaluate binding to synthetic nucleic acids, e.g., in a Systematic Evolution of Ligands by Exponential Enrichment (SELEX) method, e.g., for the identification of aptamers that bind the protein of interest.
In one embodiment, the disclosure provides a method, e.g., biotin-DAP-seq, in which labeled TF proteins are expressed from templates that are PCR amplified directly from genomic DNA or cDNA. Thus, for example, primers flanking a transcription factor gene are used to amplify the gene. Such amplification primers are designed to contain an appropriate promoter, such as a T7 promoter, and other required components for expression in an in vitro coupled transcription and translation reaction mixture. An affinity agent, such as biotin, is introduced into the TF polypeptide encoded by the TF gene during translation by including a tRNA loaded with affinity-labeled lysine, e.g., biotinylated lysine, in the translation reaction mixture. This results in incorporation of affinity moieties, e.g., biotin moieties, at a random subset of lysine codons within the protein sequence. The affinity moiety, e.g., biotin, allows for downstream affinity capture of TFs along with bound DNA sequences using a biotin binding agent, e.g., streptavidin, as a capture agent, e.g., using streptavidin-coated magnetic beads. Methods used to immobilize TFs for DAP-seq can also be used for low-cost, high-throughput protein isolation for downstream characterization of any kind of protein-protein interactions, ligand binding interactions, structural analysis, and other methods that evaluate interactions of a protein with another molecule. In some embodiments, the method described herein, in which a charged tRNA is used to affinity label a polypeptide in an in vitro translation reaction, can be used in conjunction with massively parallel sequence analysis, including, for example, massively parallel RNA sequencing and synthetic DNA sequencing.
In one aspect, the disclosure provides a method of incorporating an affinity moiety into a polypeptide, e.g., a transcription factor, during an in vitro transcription-translation reaction in which RNA is transcribed from a template in vitro, e.g., from a template from an amplification reaction such as PCR, and translated in a reaction in which a tRNA loaded with an amino acid coupled to the affinity moiety is included for incorporation into the polypeptide, thereby producing an affinity-labeled polypeptide. In some embodiments, the template for transcription is a plasmid DNA, a viral nucleic acid, or other template provided in sufficient quantity to obtain sufficient affinity-labeled polypeptide for analysis of binding of the labeled polypeptide to nucleic acids, ligands, or other binding molecules. The binding moiety of the affinity-labeled polypeptide specifically binds to a binding partner, e.g., one immobilized on a solid support such as a bead. The affinity-labeled polypeptide can be used to evaluate binding interactions of the polypeptide with various molecules of interest, including nucleic acids, such as DNA, RNA, and synthetic DNA generated by chemical synthesis; polypeptides; and ligands. In some embodiments, an affinity-labeled polypeptide is used in conjunction with a massively parallel sequencing analysis to identify polynucleotides that bind the captured affinity-labeled polypeptide.
In some embodiments, the method comprises coupled in vitro transcription and in vitro translation reactions. A nucleic acid for use as a template for transcription can be any nucleic acid that encodes a polypeptide of interest, including genomic DNA (e.g., from an intronless gene), cDNA, chemically synthesized DNA, or a plasmid DNA template. In some embodiments, the template is an RNA strand, in which case an RNA-dependent RNA polymerase is employed to generate the corresponding RNA strand that is translated. In some embodiments, the template is generated using an amplification reaction. As used herein, “amplification” of a nucleic acid sequence has its usual meaning, and refers to in vitro techniques for enzymatically increasing the number of copies of a target sequence. The term refers to both linear and exponential amplification. Amplification methods include both asymmetric methods in which the predominant product is single-stranded and conventional methods in which the predominant product is double-stranded. In typical embodiments, amplification comprises a PCR to obtain amplified products to serve as the template for transcription. Primers are typically designed to include an RNA polymerase binding site, such as an SP6, T7 or T3 binding site.
In some embodiments, the template is a plasmid DNA, which is provided in an amount sufficient to generate RNA that in turn is translated in vitro as described herein to generate affinity-labeled polypeptide. Thus, for example, in some embodiments, the template is a plasmid that comprises a promoter for transcription of RNA in vitro for translation in an in vitro translation reaction.
Translation systems for in vitro translation are known. In some embodiments, in vitro translation is coupled with an in vitro transcription reaction. Translation systems include cell lysate translation systems, such as bacterial cell lysates, wheat germ lysates, and rabbit reticulocyte lysate translation systems. Such systems can be supplemented with additional components such as ATP, protease inhibitors, RNA polymerases, etc. In some embodiments, the in vitro translation system is reconstituted from individual purified or partially purified components (e.g., a reconstituted E. coli translation system using recombinant components as described in Shimizu et al. (2001) Nature Biotechnology 19, 751-755).
Amino Acid-Loaded tRNA
Any amino acid can be coupled to an affinity label, e.g., biotin, and loaded onto a corresponding tRNA. As used in this context, a “loaded” tRNA is used interchangeably with “precharged” tRNA to indicate that an amino acid is bound to its corresponding tRNA. In some embodiments, the amino acid coupled to the affinity label, e.g., biotin, has a reactive side chain. In some embodiments, the amino acid has a hydrophilic or charged side chain. In some embodiments, the amino acid is lysine, arginine, tyrosine, glutamate, aspartate or cysteine. In some embodiments, affinity-modified lysine is used to charge the corresponding tRNA. In some embodiments, the affinity label is biotin. In some embodiments, a precharged, affinity-labeled tRNA is provided in the in vitro translation reaction in an amount such that a desired percentage of labeled amino acid residues is incorporated into the product of the translation reaction. For example, in some embodiments, over 20%, e.g., from 20% to 35%, or 20% to 50%, of the residues of the amino acid selected for labeling are affinity-labeled in the translation product. For example, where biotin-lysine is employed as the affinity label, the proportion of biotinylated-lysine tRNA added to the translation reaction can be adjusted to obtain a translation product in which 20% or greater of the lysine residues, or 30% or greater of the lysine residues, are affinity labeled.
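As a rough illustration of how the labeled fraction of residues relates to capture, the following sketch estimates the fraction of polypeptide molecules that carry at least one affinity label. It assumes that every lysine position is labeled independently and with the same probability; that simplification is not stated in the disclosure. The function name and the example lysine count are hypothetical.

```python
def fraction_with_label(n_residues: int, p_labeled: float) -> float:
    """Estimate the fraction of molecules carrying at least one labeled residue.

    Assumption (not from the disclosure): each of the n_residues positions is
    labeled independently with probability p_labeled.
    """
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    if not 0.0 <= p_labeled <= 1.0:
        raise ValueError("p_labeled must be between 0 and 1")
    # P(at least one label) = 1 - P(no position is labeled)
    return 1.0 - (1.0 - p_labeled) ** n_residues


# Hypothetical polypeptide with 10 lysines, 20% of lysine residues labeled.
estimate = fraction_with_label(10, 0.20)
```

Under this simplified model, even the lower end of the 20% to 50% range leaves most molecules of a lysine-rich polypeptide with at least one biotin available for capture, whereas a polypeptide with very few lysines would be captured less efficiently.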
In some embodiments, the tRNA anticodon site can target to a different codon than to the corresponding amino acid with which it is loaded. For example, a lysine-biotin loaded tRNA can encode an anticodon that targets a stop codon or a four-base non-natural codon to allow for incorporation of the lysine-biotin at a unique and specific location in the polypeptide.
Transfer RNAs charged with affinity-labeled amino acids, e.g., biotinylated lysine tRNA, are known in the art. For example, biotinylated lysine tRNA is available from Promega. In some instances, an affinity label is conjugated to an amino acid and subsequently loaded onto the corresponding tRNA. Such conjugation reactions are well known in the art. Examples include, but are not limited to, amide coupling reactions, Michael addition reactions, hydrazone formation reactions and click chemistry cycloaddition reactions.
Affinity moieties and corresponding binding partners are well known in the art. In some embodiments, the affinity binding partner is immobilized on a solid support. The solid support can be any solid substrate, such as a well or other compartment, or a bead. In some embodiments the solid support is a bead, such as a magnetic bead. In some embodiments, the affinity agent is an aptamer, a hapten, a ligand, a dye, or a biotin binding moiety. Thus, for example, in some embodiments, the binding moiety is biotin and the affinity binding partner is streptavidin or avidin. In some embodiments, the binding moiety/affinity binding partner is dithiobiotin/avidin, iminobiotin/avidin, dithiobiotin/succinylated avidin, iminobiotin/succinylated avidin, or biotin/succinylated avidin. In some embodiments, the binding moiety/binding partner comprise FITC/anti-FITC antibody, digoxigenin/anti-digoxigenin antibody, or a hapten or epitope and antibody that binds the hapten or epitope, dithiobiotin-avidin, iminobiotin-avidin, biotin-avidin, dithiobiotin-succinylated avidin, iminobiotin-succinylated avidin, biotin-streptavidin, and biotin-succinylated avidin. In some embodiments, the
As explained above, in vitro transcription coupled with in vitro translation is employed to generate an affinity-labeled polypeptide. Such a labeled polypeptide can be used for any application to evaluate interactions of the polypeptide with other molecules, for evaluation of protein-protein interactions, protein-nucleic acid interactions, structure evaluation, and the like. In typical embodiments, affinity-labeled polypeptide is used in conjunction with a massively parallel sequencing methodology, for example to evaluate changes in polypeptide interactions with other molecules in different types of cells or cells subjected to different environmental conditions.
In some embodiments, the methods described herein are employed to generate an affinity-labeled transcription factor for evaluation of transcription factor binding to genomic DNA. In some embodiments, the affinity-labeled TF is employed in DAP-seq (O'Malley et al., Cell 165:1280-1292, 2016, which is incorporated by reference). In brief, affinity-labeled TF is incubated with genomic DNA obtained from an organism of interest. Fragmented genomic DNA is incubated with the affinity-labeled TF, and the affinity label is used to bind to its binding partner that is attached to a solid support. Genomic DNA bound to the TF can then be processed for sequencing, e.g., using any massively parallel sequencing technique, to determine TF binding motifs. The affinity-labeled TF can be incubated with genomic DNA either before or after the TF binds to its binding partner on the solid support. In DAP-seq, genomic DNA fragments are typically amplified to incorporate an adaptor sequence. As used herein, an “adaptor sequence” comprises one or more functional sequences used for amplification, sequencing, identification of the genomic sample used in the analysis, and/or quantification. Thus, for example, an adaptor sequence may comprise one or more of the following: a universal sequence employed for sequencing, a primer sequence, a cellular identification sequence, a unique molecular identifier (UMI) sequence, a sample identification sequence, and combinations thereof. An adaptor sequence may comprise double-stranded regions, single-stranded regions, or both. This disclosure is not limited to the type of adaptor sequences which could be used and a skilled artisan will recognize additional sequences which may be of use for library preparation and next generation sequencing.
In some embodiments, DAP-seq is employed in conjunction with other sequence-based analyses. Exemplary assays include RNA sequencing, including, but not limited to, sequencing of mRNA, and other RNA populations of interest such as miRNA, snRNA, lncRNA and the like. In some embodiments, DAP-seq may be performed in conjunction with DNA bisulfite sequencing, e.g., methyl-seq to analyze methylation status, or ATAC-seq.
In some embodiments, DAP-seq can be employed in multiplex reactions in which affinity-labeled TFs are incubated with genomic DNA from a diversity of samples to identify TF binding motifs, e.g., TF binding motifs present in different organisms.
In some embodiments, the methods described herein are employed to generate an affinity-labeled protein for evaluation of binding to RNA. Thus, for example, affinity-labeled polypeptides can be localized to beads and incubated with RNA obtained from cells. RNA bound to the immobilized polypeptide can then be processed for sequencing, e.g., using any massively parallel sequencing technique. In some embodiments, the affinity-labeled polypeptide is incubated with RNA after the polypeptide is captured on the solid support. The RNA can then be processed for sequencing, e.g., by a reverse transcriptase reaction to generate DNA, with adaptor sequences incorporated in a subsequent amplification reaction.
In some embodiments, an affinity-labeled polypeptide generated as described herein is incubated with a population of synthetic nucleic acid molecules, e.g., a library of sequences, to identify an aptamer that binds to the polypeptide. Nucleic acid aptamers are a class of small nucleic acid ligands that are composed of RNA or single-stranded DNA oligonucleotides folded into a three-dimensional structure and that have high specificity and affinity for their targets. For example, SELEX technology can be used to obtain aptamers specific to the affinity-labeled polypeptide. Nucleic acid aptamers can be produced by chemical synthesis or, for RNA aptamers, by in vitro transcription. Nucleic acid aptamers include DNA aptamers, RNA aptamers, XNA aptamers (nucleic acid aptamers comprising xeno nucleotides) and L-RNA aptamers.
As appreciated by one of skill in the art, any of the foregoing methods can be performed in a multiplex reaction analyzing different populations of nucleic acid molecules, e.g., nucleic acid samples obtained from different types of cells, or nucleic acid samples from single cells for analysis of nucleic acid profiles of single cells from a population of cells.
As noted above and as appreciated by one of skill in the art, an affinity-labeled polypeptide can also be used in various analyses for identification and characterization of polypeptide interactions with any number of molecules, including other polypeptides, polynucleotides, carbohydrates, glycolipids, and any other molecule of interest. Such analyses can be conducted in combination with other high-throughput, massively parallel sequencing analyses, as illustrated above.
In a further aspect, the disclosure provides kits and reagents for analyzing binding interactions of an affinity-labeled polypeptide with other molecules. In some embodiments, a kit can comprise sequencing adaptors, reagents for in vitro translation to label the polypeptide with an affinity moiety, and optionally, other reagents to perform massively parallel sequencing.
Understanding the interactions between TFs, their binding sites, and the collection of target genes they regulate is key to our ability to model transcriptional programs and ultimately engineer them. However, large-scale decoding of these interactions is currently limited to a small set of model organisms, in part because of the limitations posed by existing technologies. In vivo methods such as ChIP-seq1-4 can capture TF binding in a physiologically relevant state, but are difficult to scale up to match the hundreds to thousands of TFs found in a single organism. In contrast, in vitro methods such as protein binding microarrays (PBMs)5 and systematic evolution of ligands by exponential enrichment (SELEX)6-8 can be leveraged at large scales. However, most in vitro methods rely on indirect characterization of binding sites by identifying TF binding motifs using synthetic short DNA sequence pools, followed by scanning for these motifs in the reference genome to predict TF binding sites. As a result, these in vitro assays are unable to capture effects of native genomic context including DNA shape, chemical modifications, and conserved local cis-element architecture that can have a large impact on TF binding specificity.
DNA affinity purification sequencing (DAP-seq), the method we developed in 2016,9 uniquely combines the advantages of in vivo and in vitro assays. Similar to ChIP-seq, DAP-seq directly measures TF binding in native local genomic contexts, and can be scaled up to comprehensively assay all TFs within a species, as demonstrated previously with Arabidopsis. To achieve this, DAP-seq leverages in vitro expressed and affinity-purified TFs to capture binding events with fragmented native genomic DNA (gDNA), followed by high-throughput sequencing.10 DAP-seq has proven to be an effective method to study TF binding sites in a variety of model organisms11-13 and the resulting large-scale datasets have been central to a variety of approaches for understanding gene regulation.14-16
One limitation to DAP-seq, as well as all other existing TF binding assays, is the significant upfront investment required to purify each TF of interest. This is the major bottleneck for all high-throughput TF DNA binding techniques, and the primary restriction on the total number of TFs that can be assayed. In the original DAP-seq method, TF proteins are expressed in vitro from E. coli plasmid templates, which allows fusion of the TF coding sequence with an affinity tag that is required for the pulldown of the expressed TF and the DNA sequences it binds to. This limits widespread application to non-model organisms for which pre-existing TF plasmid collections are not available, and in particular to microbial studies where short generation times and high mutation rates have generated a diversity of TFs too vast to be practically surveyed using a plasmid-based approach. In addition, the original DAP-seq method only enables mapping gDNA binding properties in a single genome at a time. The relationships between TFs, their binding sites, and target genes are known to be conserved sometimes over incredibly long periods of time, and have been shown to be a predictor of conserved biological functions.17,18 Therefore, a broader understanding of how TF binding sites and target genes evolve across phylogenetically relevant sets of species will be of great value to reveal the conservation, evolution, and the function of TF-target gene pathways, of which our current understanding is very limited.
This example illustrates the production of biotin-labeled TFs using biotinylated lysine tRNA. A schematic is provided in
Primers specific to each transcription factor were designed against the first and last 20-24 bases of the corresponding coding sequence. All non-standard start codons were switched to ATG. In each forward primer, a 5′ constant region was introduced immediately upstream of the sequence annealing to the start of the coding sequence, containing a T7 polymerase promoter and Kozak sequence. In each reverse primer, a 5′ sequence of 30×T was introduced to mimic a poly-A tail and facilitate protein expression in eukaryotic in vitro systems. These primers were used to amplify transcription factor coding sequences directly from the genomic DNA using KAPA HiFi 2× PCR master mix with the following conditions: Ta=60° C., 2 minute extension time at 72° C., total reaction volume=50 μL, for 24 cycles of PCR. PCR products were checked for amplification specificity using an Agilent 2200 TapeStation instrument. PCR products were purified using Omega Mag-Bind TotalPure NGS SPRI beads and eluted in 12 μL Tris-HCl buffer pH=8, yielding a DNA concentration of approximately 20-200 ng/μL.
TF proteins were expressed in vitro in 96-well microtiter plates using Promega TnT T7 Quick for PCR DNA following the manufacturer's protocol. For each 50 μL reaction, we used 5 μL of purified TF PCR product for a total of 250-1000 ng template. Negative control wells were included containing mock PCR product, where the PCR was performed with water in place of primers. In order to produce biotin-tagged TF proteins that can later be purified using streptavidin-coated beads, we also spiked 4 μL of Promega Transcend tRNA into each 50 μL reaction. After combining all components at 4° C., the mixture was incubated at 30° C. overnight (12-18 hours).
The following example illustrates the use of the biotin-labeled TF proteins in DAP-seq.
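The primer layout described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the primer design code used in the example. The disclosure does not give the sequence of the 5′ constant region, so the `T7_PROMOTER` and `KOZAK` sequences below are commonly used versions supplied as placeholders, and `design_primers` and its `anneal_len` parameter are hypothetical names.

```python
# Assumed placeholder sequences; the constant region is not specified in the text.
T7_PROMOTER = "TAATACGACTCACTATAGGG"  # commonly used T7 promoter sequence
KOZAK = "GCCACC"                      # commonly used Kozak context preceding ATG
POLY_T = "T" * 30                     # 5' 30xT tail on the reverse primer

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def design_primers(cds: str, anneal_len: int = 22) -> tuple[str, str]:
    """Build forward and reverse primers (5'->3') for one coding sequence."""
    cds = cds.upper()
    if not 20 <= anneal_len <= 24:
        raise ValueError("annealing region should be 20-24 bases")
    if len(cds) < anneal_len or set(cds) - set("ACGT"):
        raise ValueError("cds must be an ACGT sequence at least anneal_len long")
    # Forward annealing region: first bases of the CDS, with any
    # non-standard start codon (e.g., GTG or TTG) switched to ATG.
    fwd_anneal = "ATG" + cds[3:anneal_len]
    # Reverse annealing region: reverse complement of the last bases of the CDS.
    rev_anneal = reverse_complement(cds[-anneal_len:])
    forward = T7_PROMOTER + KOZAK + fwd_anneal
    reverse = POLY_T + rev_anneal
    return forward, reverse
```

Because the 30×T stretch sits at the 5′ end of the reverse primer, the sense strand of the PCR product ends in 30 A residues after the stop codon, which is what mimics the poly-A tail in the transcript.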
We investigated 354 transcription factors in 48 bacterial genomes, and generated 17,000 high quality TF binding site maps. This unprecedentedly rich dataset revealed themes of ancient conservation as well as rapid evolution of gene regulatory modules. We observed various patterns of evolution and regulatory rewiring, where the TF's sensing and regulatory role is maintained while the arrangement and identity of target genes diverge. Such regulatory rewirings execute analogous functions in some cases, while in others they appear to have been repurposed for entirely new functions. We also integrated existing phenotypic information, established novel functional regulatory modules, and defined new pathways. Finally, we identified 242 new TF DNA binding motifs, yielding a 70% increase of characterized TF motifs in Escherichia coli, and annotations of TF motifs in Pseudomonas simiae for the first time. Integrative analyses of the TF DNA motifs across bacterial genomes revealed deep conservation in gene promoter architecture. Our methods are highly versatile for rapid characterization of gene pathways across any organism, enabling direct annotation and dissection of regulatory pathways and laying the foundation for modeling and designing synthetic regulatory networks.
We validated this new streamlined TF expression approach (Example 1) using a test set of 216 known Escherichia coli TFs and observed one or more putative binding sites in at least one of two trials for 125 TFs (58% successful of 216 total). We then compared our results to previously described TF binding sites published in RegulonDB20 (
The streamlined biotin-DAP-seq is particularly suited to studying non-model organisms. We demonstrated this by mapping TF binding sites in Pseudomonas simiae,21 an emerging model for plant-commensal microbes that currently has no available TF binding site annotations.22,23 We compiled a comprehensive set of 567 putative P. simiae TFs by combining three different predicted gene annotations from GenBank24, RefSeq25, and IMG.26 We initially screened the entire set of 567 TFs in two replicate DAP-seq experiments, of which 138 (24%) were successful as defined by at least one peak observed in both replicates. The lower overall success rate compared to the well characterized E. coli TFs is not surprising, as we screened any gene with predicted DNA-binding activity, many of which may not be functional TFs. We chose to use this set of 138 P. simiae TFs for further characterization.
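The success criterion used for the P. simiae screen (at least one peak observed in both replicates) can be expressed as a short sketch. The per-replicate peak counts are assumed to be available as dictionaries keyed by TF name; that data layout and the function names are hypothetical.

```python
def successful_tfs(peaks_rep1: dict[str, int], peaks_rep2: dict[str, int]) -> set[str]:
    """Return TFs with at least one peak in both replicates."""
    return {
        tf
        for tf, n_peaks in peaks_rep1.items()
        if n_peaks >= 1 and peaks_rep2.get(tf, 0) >= 1
    }


def success_rate(n_success: int, n_total: int) -> float:
    """Percentage of screened TFs that met the success criterion."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_success / n_total


# Counts reported above for the P. simiae screen: 138 of 567 TFs.
p_simiae_rate = success_rate(138, 567)  # about 24%
```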
Multiplexed TF Mapping: multiDAP
In parallel, we developed multiDAP, a method that allows mapping TF binding sites in multiple genomes simultaneously. The central concept of multiDAP is to leverage the fact that in the original DAP-seq assay the immobilized TF binding sites were not saturated, and by using a pool of gDNA samples from different species or strains we can directly map TF binding sites across a diverse array of organisms.
For our TF set we used a total of 354 TFs, including the 138 P. simiae TFs that were successful in the preceding biotin DAP-seq screen, and the entire 216 E. coli TF set regardless of success or failure in the preceding screen (
Based on the combination of molecular barcodes from each sequencing read, the dataset was computationally de-multiplexed to yield the equivalent of one DAP-seq dataset per TF per organism. After alignment to the corresponding genomes, regions that contain TF binding sites were apparent as peaks, resulting from the pileup of DNA fragments that are bound by the TF. By mapping the binding of the 354 TFs from E. coli and P. simiae across the set of 48 bacterial genomes, we produced a combinatorial dataset equivalent to 17,000 DAP-seq experiments. This dataset allowed direct comparison across divergent bacterial species to reveal conserved patterns and evolution of TF binding at orthologous genes (
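The de-multiplexing step can be sketched as follows. The text states only that the combination of molecular barcodes on each read identifies the dataset; the sketch assumes, for illustration, that each read has already been parsed into a TF barcode, a genome barcode, and the insert sequence. The tuple layout, lookup tables, and function name are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, tf_by_barcode, genome_by_barcode):
    """Group reads into one dataset per (TF, genome) pair.

    reads: iterable of (tf_barcode, genome_barcode, sequence) tuples
           (an assumed, pre-parsed layout).
    """
    datasets = defaultdict(list)
    for tf_barcode, genome_barcode, sequence in reads:
        tf = tf_by_barcode.get(tf_barcode)
        genome = genome_by_barcode.get(genome_barcode)
        if tf is None or genome is None:
            continue  # skip reads with an unrecognized barcode combination
        datasets[(tf, genome)].append(sequence)
    return dict(datasets)


# Size of the combinatorial design described above.
n_datasets = 354 * 48  # 16,992, i.e. roughly 17,000 TF-by-genome datasets
```

Each resulting list of reads would then be aligned to the genome named in its key before peak calling.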
Using the resulting multiDAP dataset, we quantified the degree of TF target conservation across the 48 bacterial strains and species. Given that transcription factors often bind in the promoter region directly upstream of genes in bacteria,27 we assigned each peak to the predicted operon(s) that are directly adjacent to and oriented away from the peak. Thus for a given TF, we compiled a set of genes that we predict may be regulated by the TF in each of the organisms. We then calculated a target gene similarity score by comparing the sets of target genes across organisms. We first grouped all protein-coding genes from all 48 species into groups of putative orthologs (orthogroups).28 Next, we quantified TF target conservation by comparing the set of orthogroups targeted in the species from where the TF itself originated (either E. coli or P. simiae) with those targeted in each of the remaining 47 organisms. The results of this analysis give a global view of TF target gene similarity in divergent bacteria for both TFs from E. coli (
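The two steps of this analysis, assigning peaks to adjacent operons oriented away from the peak and comparing the resulting target sets between organisms, can be sketched as follows. The sketch makes several simplifying assumptions: the genome is treated as linear, peaks are represented by a single coordinate lying between operons, and the Jaccard index is used as a stand-in similarity measure because the text does not specify how the target gene similarity score is computed. The `Operon` type and function names are hypothetical.

```python
from typing import NamedTuple


class Operon(NamedTuple):
    name: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" (transcribed left to right) or "-" (right to left)


def assign_peak(peak_pos: int, operons: list[Operon]) -> list[str]:
    """Return operons directly adjacent to the peak and oriented away from it."""
    targets = []
    left = [o for o in operons if o.end < peak_pos]
    right = [o for o in operons if o.start > peak_pos]
    if left:
        nearest_left = max(left, key=lambda o: o.end)
        if nearest_left.strand == "-":  # transcribed leftward, away from the peak
            targets.append(nearest_left.name)
    if right:
        nearest_right = min(right, key=lambda o: o.start)
        if nearest_right.strand == "+":  # transcribed rightward, away from the peak
            targets.append(nearest_right.name)
    return targets


def jaccard(orthogroups_a, orthogroups_b) -> float:
    """Assumed similarity measure: shared orthogroups over all orthogroups."""
    a, b = set(orthogroups_a), set(orthogroups_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For a given TF, the orthogroups of the genes assigned in the organism of origin would be compared against those assigned in each of the other 47 organisms, giving one score per TF per organism.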
We observed that while some TFs and their targets appear to be confined to a small subset of species, others are highly conserved across large evolutionary distances. As may be expected, there appears to be a general trend where the majority of TF-target relationships from E. coli are well conserved within the closely related Enterobacteria clade. A similar degree of conservation is apparent when considering TFs from P. simiae within the Pseudomonas clade. One striking feature is the high degree of conservation of several TF targets across clades that diverged long ago. For example, the most highly conserved TF targets appear to be those of the MraZ transcriptional repressor from E. coli which regulates its own expression as well as genes involved in cell division and cell wall synthesis (
In contrast to these highly conserved features, we also observe evidence of regulatory changes at the sub-species level. To test the ability to accurately discriminate small genetic differences in gene regulation, we included two very closely related strains of E. coli (
A third category of features appears to be less conserved even in closely related species, yet is scattered across larger evolutionary distances. For example, the E. coli MqsA regulator of the mqsA/mqsR toxin/antitoxin system is found sporadically throughout the phylum Proteobacteria: E. coli and Pseudomonas putida in the class Gammaproteobacteria, as well as Ralstonia sp. and Herbaspirillum seropedicae in the class Betaproteobacteria (
While the global analysis gives general insights into conserved binding features, closer inspection of specific TF targets offers examples of evolution within target operon structure. The E. coli autoregulator MraZ shows a strongly conserved operon structure, with only small differences in the gene content and their arrangement in operons in even the most distantly related species (
An extreme example of divergence is seen in the case of the E. coli arsenic resistance regulator, ArsR, which is limited to bacteria sampled from the class Gammaproteobacteria (
Having observed evidence of rewiring in TF targets, we next examined the dataset for examples of TFs that had diverged to take on entirely new functions within different bacterial clades. In order to identify clusters of conservation, we analyzed each TF individually and compared target gene sets of the 48 species to each other. This revealed clusters of conserved target genes and operons that are not found in the species from which the TFs originate (either E. coli or P. simiae). For example in E. coli, the TF AscG regulates genes involved in β-glucoside sugar and propionate utilization.35,36 While the majority of this E. coli regulon appears to be conserved throughout the Enterobacteria and in a few scattered organisms outside this clade, a second separate cluster extends across the genus Pseudomonas and into the class β-Proteobacteria (
Our multiDAP species target set overlapped with 33 species utilized in a previous study designed to measure the fitness costs of gene knockouts on a range of conditional challenges including limiting carbon and nitrogen sources.23 To investigate how the multiDAP and phenotype datasets can complement each other we initially identified a simple and well-characterized example from E. coli, FucR. In response to environmental sources of fucose, E. coli FucR activates genes involved in fucose import and degradation, as well as as the expression of FucR itself (i.e. autoregulation).39 Disruption of fucR or other genes in the fuc operon resulted in a similar growth deficit. Similarly, in Klebsiella oxytoca when the ortholog of fucR or genes in its operon were knocked out, a fucose-dependent growth defect was observed. In both E. coli and K. oxytoca the binding sites predicted by the E. coli FucR multiDAP experiment correctly identified the TF and target genes for fucose sensing and metabolism (
We then investigated the non-model species P. simiae, where we used the multiDAP data in conjunction with phenotype information to establish functional relationships when transcription factors and target genes are at distant locations in the genomes. For example, in the multiDAP data we observed that the P. simiae TF Ps109 appears to regulate genes at two distantly located promoters. While the TF knockout confers a growth advantage when 2′-deoxyinosine is the sole carbon source, knockouts of all four regulated genes show a growth disadvantage. The multiDAP binding information allows bundling of this phenotypic information to establish a functional regulatory model, with TF Ps109 acting as a transcriptional repressor at two distant operons involved in 2′-deoxyinosine utilization (
A third example, TF Ps17, shows how multiDAP allows bundling of phenotype information both across distant genome locations as well as across species, indicating Ps17's conserved function in succinate utilization (
One challenge when studying bacterial transcription factor binding sequence motifs is that many TFs only bind strongly to a small number of sites in an entire genome, which can make it difficult to confidently identify a binding sequence motif. However, by assaying 48 microbial genomes in a single multiDAP experiment, the total number of binding sites for each TF in this dataset is multiplied by the number of species containing TF binding sites. We were able to call a high quality motif for 124 TFs from E. coli, 66 of which are not represented in RegulonDB (
We applied these motifs to explore conservation and variation in TF binding site architecture in the promoters of orthologous genes. We mapped motifs back to promoter sequences to identify the exact location and orientation of binding sites in promoters across species. Auto-regulating TFs serve as a particularly tractable set, because there is less ambiguity in identifying the corresponding promoters to compare from each genome. We observe a variety of patterns, some of which are well conserved across divergent species. For some TFs such as MraZ, we observe closely spaced clusters of multiple motifs with variability in the number and strength of motifs (
Beyond revealing conserved promoter architecture in known gene targets, TF binding sequence motifs can also aid in identifying previously unknown regulatory targets. We expanded our analysis beyond the 48 bacterial species by searching for TF orthologs in all metagenome assembled genomes in the Integrated Microbial Genomes (IMG) database,26 based on amino acid sequence identity. We identified approximately 1.25M possible orthologs, of which >170 k showed evidence of conserved auto-regulation where TF motifs are enriched in their respective promoters (
In non-model organisms and metagenomes, a genome sequence provides a wealth of information about gene content and allows prediction of gene function based on similarity to known proteins; however, the function of intergenic sequences remains difficult to annotate. In this work, we used multiDAP to identify TF binding sites in 48 diverse bacterial species and to define 242 high quality binding site motifs for TFs from E. coli and P. simiae, most of which have not been previously described. This multiDAP dataset illustrates patterns of evolutionary rewiring and TF repurposing and defines new gene regulatory modules that are conserved across multiple species. The motifs described here can also be valuable in studying promoter architecture, functionally annotating metagenomic sequences, and designing novel synthetic promoters with desired regulatory properties. Beyond serving as a starting point for future characterization, these results also provide a blueprint for further multiDAP experiments. The new biotin DAP-seq approach facilitates rapid and inexpensive production of expressed TFs, while multiDAP allows analysis of many genomes simultaneously, thereby enriching the biological information extracted from each experiment. These two new techniques can be applied independently or in conjunction for large-scale studies, to begin mapping transcriptional regulatory networks and annotating functional gene regulatory modules across all kingdoms of life.
Genomic DNA from each organism was first sheared by ultrasonication (Covaris LE220-plus) using the following settings: peak power=450 W, duty factor=30%, cycles/burst=200. DNA was sheared to an average size of 75 bp in Tris-HCl buffer (pH=8); shearing was applied in multiple cycles of 30 minutes each for a total of 60-90 minutes, allowing time for the water bath to cool between cycles such that the maximum temperature of the samples did not exceed 15° C. After shearing, up to 1 μg of each genomic DNA sample was used to prepare fragment libraries using the KAPA HyperPrep kit and the standard manufacturer's protocol. During the adapter ligation step, custom annealed Y-adapters were introduced at a concentration of 15 pM (5 μL adapters in a reaction volume of 110 μL, final adapter concentration=0.7 pM). These custom adapters were prepared by annealing a full-length i5 index adapter with a stub i7 adapter. Ligated libraries were amplified for 8-10 cycles using primers P1 and P2 stub. Oligonucleotides and barcodes, as well as the strains used in this work, are detailed in the supplementary information.
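The final adapter concentration quoted above follows from a standard dilution calculation (stock concentration multiplied by the added volume and divided by the total reaction volume). The short Python sketch below reproduces that arithmetic; the function name is illustrative only and is not part of the disclosed method.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Concentration after diluting stock_volume_ul of stock into total_volume_ul.

    The result carries the same concentration unit as stock_conc.
    """
    if total_volume_ul <= 0:
        raise ValueError("total volume must be positive")
    return stock_conc * stock_volume_ul / total_volume_ul


# Values from the text: 5 uL of adapters at 15 pM in a 110 uL reaction.
adapter_final = final_concentration(15.0, 5.0, 110.0)
print(f"{adapter_final:.2f}")  # 0.68, which rounds to the 0.7 quoted in the text
```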
ThermoFisher Dynabeads MyOne Streptavidin T1 were pelleted on a magnetic rack, washed 4× in PBS pH=7.4+0.1% v/v Tween20 and resuspended in an equal volume of this buffer. For each reaction, the following were combined in a mastermix (volumes given are per well/reaction): 15 μL resuspended beads, 1 μg salmon sperm DNA, and 1 ng amplified DNA fragment library from each organism, and topped off with PBS pH=7.4+0.1% v/v Tween20 to a final volume of 50 μL. Mastermix volume was scaled up for 384 samples. Subsequent steps were carried out in 96-well plates using a Hamilton Vantage liquid handler.
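As a rough illustration of the scale-up, the per-reaction quantities listed above can be multiplied out for a full batch. The sketch below is a bookkeeping aid only; the dictionary keys, the function name, and the optional `overage` parameter are assumptions introduced here (the text does not specify any pipetting excess).

```python
# Per-reaction recipe taken from the text; key names are illustrative.
PER_REACTION = {
    "beads_ul": 15.0,                # resuspended streptavidin beads
    "salmon_sperm_dna_ug": 1.0,      # blocking DNA
    "library_ng_per_organism": 1.0,  # amplified fragment library, per organism
    "final_volume_ul": 50.0,         # after topping off with PBS + 0.1% Tween20
}


def scale_mastermix(n_reactions, overage=0.0):
    """Scale every per-reaction quantity to n_reactions.

    overage is a hypothetical fractional excess (e.g. 0.1 for 10% extra)
    to cover pipetting loss; the source does not specify one.
    """
    if n_reactions <= 0:
        raise ValueError("n_reactions must be positive")
    factor = n_reactions * (1.0 + overage)
    return {name: amount * factor for name, amount in PER_REACTION.items()}


batch = scale_mastermix(384)
print(batch["beads_ul"])         # 5760.0
print(batch["final_volume_ul"])  # 19200.0
```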
The bead+library master mix was aliquoted into each well of a 96-well plate, topped off with 50 μL PBS pH=7.4 and stamped into the plates containing the expressed TF proteins. Plates were incubated for 1 hour at room temperature, with gentle pipet mixing every 2 minutes to keep beads from settling.
After incubation, beads were pelleted and washed 4× with PBS pH=7.4+0.1% v/v Tween20, then resuspended in 10 μL i7 index primers (see supplementary table) diluted to a final concentration of 1 μM each in Tris-HCl pH=8. An additional 10 μL KAPA HiFi 2× PCR master mix was added to each well. Plates were sealed, vortexed, centrifuged, and placed directly onto a thermocycler running the following program: an initial elution/denaturation step of 98° C. for 10 min, followed by 10 cycles of 98° C. for 10 sec, 60° C. for 30 sec, and 72° C. for 30 sec, with a final extension at 72° C. for 1 min, then a hold at 4° C. Finished PCRs were pooled across each 96-well plate, using 10 μL from each well, and purified using a 1.4× Ampure bead ratio, followed by elution in 30 μL Tris-HCl pH=8. In subsequent experiments we found that including an additional gel purification step to remove primer and adapter dimer carry-over helps reduce issues related to index hopping on Illumina sequencers.
Pooled sequencing libraries were quantified by qPCR and sequenced on a NovaSeq 6000 S4 flow cell to target ˜1 M reads for each of the 36,846 barcode pairs (384 TFs/wells×48 genomes). Libraries were de-multiplexed, adapter trimmed, and quality filtered using BBTools.41
Analysis scripts are described in brief below. Code is available upon request. Each library was subsampled to at most 1 M fragments, aligned against the corresponding reference genome using Bowtie2,42 and quality filtered with samtools.43 Coverage plots were generated using deeptools.44 Peaks were called using MACS2.45 We observed a significant degree of index hopping, which resulted in cases of leak-through of signal between i7 barcodes. This was addressed using a custom script to identify and filter out cases of overlapping peaks for libraries that had been loaded on the same NovaSeq flow cell lane. This approach was also used to remove any peaks that are present at similar strengths in negative control wells (i.e. wells with mock expressed TF proteins). Target gene assignment for each peak was done using the reference annotation (gff3) and bedtools.46
Gene orthology and phylogeny were assigned using Orthofinder2.28 Phylogenetic trees were visualized using iTOL.47 For TF target gene comparisons, we used a custom Python script. We only considered intergenic peaks, and limited the analysis to at most the top 10 target promoters in each organism. We filtered for peaks with a fold-change >=5, a p-score >=60, and a location <500 bases from the start codon, where p-score is the value assigned by MACS2 equal to −log10(peak p-value). To avoid matches based on weak binding sites, we filtered out any peaks with a fold-change of less than 5% of the tallest intergenic peak in the same library. We also excluded any TFs that did not perform well, as judged by examining the corresponding peaks in the species from which they originate (E. coli or P. simiae). We defined good performance as having at least one intergenic peak with a fold-change >=15 and p-score >=180. For comparison of target gene similarity between species, we only considered matches in organisms that have at least one putative ortholog of the TF in their own genome. Since in some cases a single organism contributes multiple genes to a single orthogroup, we adjusted for the uniqueness of each target gene by weighting each match based on the number of genes in the corresponding orthogroup. For each target gene set comparison we calculated a p-score by running the same analysis on an equal sized set of randomly selected genes for 10,000 iterations.
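The peak thresholds described above can be expressed compactly in code. The following Python sketch is one possible reading of those filters and is not the custom script referred to in the text. The `Peak` record, the function names, and the choice to rank the "top 10" by fold-change are assumptions made for illustration; the sketch also assumes one peak per target promoter.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    target_gene: str
    fold_change: float
    p_score: float      # MACS2 -log10(peak p-value)
    dist_to_start: int  # bases between the peak and the start codon
    intergenic: bool


def select_target_peaks(peaks, min_fc=5.0, min_p_score=60.0,
                        max_dist=500, rel_height=0.05, top_n=10):
    """Apply the thresholds quoted in the text to the peaks of one library."""
    intergenic = [p for p in peaks if p.intergenic]
    if not intergenic:
        return []
    tallest = max(p.fold_change for p in intergenic)
    kept = [
        p for p in intergenic
        if p.fold_change >= min_fc
        and p.p_score >= min_p_score
        and p.dist_to_start < max_dist
        and p.fold_change >= rel_height * tallest  # drop weak sites
    ]
    # Assumption: "top 10" is ranked by fold-change.
    kept.sort(key=lambda p: p.fold_change, reverse=True)
    return kept[:top_n]


def tf_performed_well(origin_peaks, min_fc=15.0, min_p_score=180.0):
    """True if the TF has a strong intergenic peak in its species of origin."""
    return any(
        p.intergenic and p.fold_change >= min_fc and p.p_score >= min_p_score
        for p in origin_peaks
    )


# Invented toy peaks, for demonstration only.
peaks = [
    Peak("geneA", 120.0, 400.0, 80, True),
    Peak("geneB", 5.5, 70.0, 120, True),    # below 5% of the tallest peak
    Peak("geneC", 40.0, 200.0, 650, True),  # too far from the start codon
    Peak("geneD", 90.0, 300.0, 40, False),  # not intergenic
]
selected = [p.target_gene for p in select_target_peaks(peaks)]
print(selected)  # ['geneA']
```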
We downloaded the phenotype dataset for each relevant species from the Fitness Browser web site at http fit.genomic.lbl.gov. We only considered phenotype measurements that were scored as both significant and specific (“specific phenotypes”). From these datasets, we identified cases where the same conditional challenge yielded a specific phenotype assignment for both the TF itself and the TF target gene(s) as predicted by multiDAP.
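The matching step described above amounts to intersecting the set of conditions that give a specific phenotype for the TF with the corresponding sets for its predicted targets. A minimal Python sketch follows, assuming the phenotype data have already been reduced to a mapping from gene identifier to a set of condition names; the function name, the data layout, and the toy identifiers are hypothetical and do not reflect the Fitness Browser file format.

```python
def shared_specific_conditions(specific_phenotypes, tf_gene, target_genes):
    """Return {condition: [target genes]} for conditions shared with the TF.

    specific_phenotypes maps a gene ID to the set of conditions in which
    that gene has a specific phenotype.
    """
    tf_conditions = specific_phenotypes.get(tf_gene, set())
    shared = {}
    for gene in target_genes:
        for condition in tf_conditions & specific_phenotypes.get(gene, set()):
            shared.setdefault(condition, []).append(gene)
    return shared


# Invented toy data with placeholder identifiers.
toy = {
    "tf_1": {"carbon_source_x", "stress_y"},
    "gene_a": {"carbon_source_x"},
    "gene_b": {"nitrogen_source_z"},
}
matches = shared_specific_conditions(toy, "tf_1", ["gene_a", "gene_b"])
print(matches)  # {'carbon_source_x': ['gene_a']}
```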
Motifs were called using MEME.48 The input sequences used were those flanking the summit position +/−30 bases. For each TF, we used only the top 30 summits (scored by fold-change over background) in the dataset. Significant motifs (E-value <0.05) were manually inspected for quality to exclude motifs that were not found enriched near the center of strong peaks and those that had low total information content. Motifs were mapped against promoter sequences using FIMO49 with default options, and only motifs with scores >0 were considered.
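Preparing the motif-discovery input as described above can be sketched as follows. This is an illustrative Python example, not the authors' script: it assumes a single contig held as a string, 0-based summit positions, and that a window includes the summit base itself (61 bases in total). Skipping summits too close to a contig end is also an assumption.

```python
def summit_windows(genome_seq, summits, flank=30, top_n=30):
    """Return the sequence spanning summit +/- flank for the top_n summits.

    summits is an iterable of (position, fold_change) pairs with 0-based
    positions on genome_seq. Summits are ranked by fold-change; windows
    that would run off either end of the sequence are skipped.
    """
    ranked = sorted(summits, key=lambda s: s[1], reverse=True)[:top_n]
    windows = []
    for pos, _fold_change in ranked:
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(genome_seq):
            continue
        windows.append(genome_seq[start:end])
    return windows


# Invented toy input, for demonstration only.
genome = "ACGT" * 50  # 200 bases
toy_summits = [(100, 12.0), (10, 50.0), (150, 8.0)]
windows = summit_windows(genome, toy_summits)
print(len(windows), len(windows[0]))  # 2 61
```

The resulting windows would then be written to a FASTA file and passed to the motif caller.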
We used Tomtom50 to compare the E. coli motifs from this study to the motifs published in RegulonDB.26 Motifs were considered to be in agreement if their comparison produced a score with p-value <0.01.
We used 113,676 annotated metagenomic datasets from the Integrated Microbial Genomes (IMG)26 database to extract homologs of E. coli TFs and their corresponding promoter sequences. First, for each E. coli TF, we found the corresponding orthologs in 48 selected bacterial species based on bidirectional best BLAST hits and tabulated each TF orthogroup with the conserved Pfam domains found in its members. We searched E. coli proteins against predicted genes in metagenomes using MMseqs2 with an E-value of 1e-5 and selected all hits that have a start codon (starting with Met) and at least 100 bp of sequence upstream of the gene. Corresponding promoters were extracted as regions (−100 to +10) around the start codon. Selected orthologs were further filtered to keep only those that have the same Pfam domain(s) and a length within the range of protein lengths of the corresponding orthogroup. To remove redundant sequences, for each TF, all of its metagenome homologs were clustered using UCLUST52 at a percent identity cutoff of 80%, and only one TF and its corresponding promoter were kept for each cluster. Motifs in promoter sequences were predicted using FIMO49 with default options.
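The promoter extraction step can be illustrated with a strand-aware helper. The Python sketch below is an assumption-laden illustration rather than the pipeline actually used: it takes 0-based, half-open gene coordinates, treats +1 as the first base of the start codon (so the −100 to +10 region is 110 bases long), and returns `None` when fewer than 100 upstream bases are available, mirroring the filter described above. All names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_promoter(contig_seq, gene_start, gene_end, strand,
                     upstream=100, downstream=10):
    """Return the (-upstream, +downstream) region around the start codon.

    gene_start and gene_end are 0-based, half-open contig coordinates with
    gene_start < gene_end regardless of strand. The region is returned in
    the orientation of the gene, or None if it does not fit on the contig.
    """
    if strand == "+":
        lo, hi = gene_start - upstream, gene_start + downstream
    elif strand == "-":
        lo, hi = gene_end - downstream, gene_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    if lo < 0 or hi > len(contig_seq):
        return None
    region = contig_seq[lo:hi]
    return region if strand == "+" else reverse_complement(region)


# Invented toy contig: 100 upstream bases, a short gene, 20 trailing bases.
contig = "T" * 100 + "ATGAAACCCGGG" + "T" * 20
promoter = extract_promoter(contig, 100, 112, "+")
print(len(promoter), promoter[100:103])  # 110 ATG
```

Applying the same function to the reverse-complemented contig with the gene annotated on the "-" strand yields the identical promoter sequence, which is a convenient sanity check for the coordinate handling.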
It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims.
All publications, patents, and patent applications cited herein are hereby incorporated by reference with respect to the material for which they are expressly cited.
This application claims priority benefit of U.S. Provisional Application No. 63/191,553, filed May 21, 2021, which is incorporated by reference for all purposes.
This invention was made with government support under Contract No. DE-AC02-05CH11231 awarded by the U.S. Department of Energy. The Government has certain rights in this invention.
Number | Date | Country
---|---|---
63191553 | May 2021 | US