Provided herein are methods of analyzing genomic and metabolomic data from fungi to identify relationships between biosynthetic gene clusters and mass spectrometric features of metabolites.
Metabolites from fungi have historically been an invaluable source of therapeutics, including compounds such as penicillin, lovastatin, and cyclosporine. Advances in genome sequencing have revealed that a wealth of new compounds awaits discovery in fungal genomes. Despite the vast potential of fungi for therapeutic development, there is a lack of tools that combine advances in big data analytics, “-omics” biology, and artificial intelligence for large-scale discovery. Standard approaches rely on a “bioactivity-guided” approach that typically results in rediscovery of known compounds.
The present platform combines genomics, metabolomics, and machine learning for systematic discovery of new therapeutics from microbes (e.g., fungi). We have previously derisked the Metabologenomics process in actinobacteria (MicroMGx). Systems and methods herein find use in drug discovery, agrochemicals and agricultural biocontrol, fungal pathogen identification and characterization, etc. The present approach instead relies on genomics, metabolomics, and machine learning. Others have used synthetic biology approaches involving extensive manipulations of DNA that are expensive, not scalable, and are challenging to implement in unstudied fungal species. The present approach relies on native producers of natural products and requires no DNA manipulations.
The natural world has provided humanity with a plethora of molecules that have allowed major advances in modern medicine and agriculture. Fungi are one of the most prolific providers of these chemicals, yet remain understudied compared to bacteria. With often over 50 natural product biosynthetic gene clusters (BGCs) per strain, fungi contain a potential wealth of new molecules ready to exploit in research. Provided herein is a scalable platform to identify fungal natural products through a fruitful union of bioinformatics, genomics and metabolomics. Provided herein is a “metabologenomics” platform, applied to strain collections of >1000 strains of Actinomycete bacteria, that involves prediction of BGCs from genome sequence data, clustering into gene cluster families (GCFs), collection of large-scale metabolomics data, and correlation of gene cluster families to metabolites. Additionally, in some embodiments the platforms herein utilize machine learning algorithms employing custom Hidden Markov Models and random forest classifiers to improve the precision of bioinformatic tools for BGC and GCF annotation, thereby creating a custom fungal-informatic ecosystem that is portable to any strain collection. Experiments were conducted during development of embodiments herein to demonstrate the feasibility of the pipeline herein through a study on nearly 100 sequenced and unsequenced fungal strains. Experiments establish the background library of fungal biosynthetic potential through the meta-analysis of 1,000 publicly available sequenced fungal genomes and then use this library to correlate metabolites to gene clusters for 75 sequenced fungal strains. In some embodiments, provided herein are tools for prioritization of fungal strains for sequencing and application of the pipeline to the metabolites produced by 12 unsequenced strains, sequencing the five most biosynthetically diverse.
The technology utilizes a large-scale correlative approach for connecting biosynthetic pathways encoded in fungal genomes with the metabolites that these pathways produce. The input to the platform is a fungal strain collection. These strains are subjected to broad metabolomics analysis by liquid chromatography-mass spectrometry and whole genome sequencing (if their genomic sequences are unavailable). The pipeline involves a series of informatics steps.
In some embodiments, provided herein are methods and systems utilizing biosynthetic networking and machine learning predictions to analyze fungal genomic sequences to identify BGCs, perform pairwise comparisons of structural and sequence characteristics of BGCs, group BGCs into GCFs, predict molecular substrates for enzymes produced by GCFs and/or BGCs, and/or link GCFs and/or BGCs with product metabolites and/or mass spectrometric features. In some embodiments, a series of bioinformatics algorithms organize predicted biosynthetic pathways into a graph structure based on their similarity. In some embodiments, a machine learning model is used to predict the substrates of enzymes within these pathways, allowing for prediction of metabolite structure.
In some embodiments, provided herein are systems and methods utilizing metabolomics networking and machine learning predictions to analyze mass spectra of fungal metabolite extracts, perform pairwise comparisons of mass spectral features between mass spectra, group mass spectrometric features into molecular families (MFs), group metabolites into MFs, etc. In some embodiments, the metabolomics approach uses algorithms for organizing mass spectrometry spectral data into a graph structure based on their similarity. These clustered spectra are input into a machine learning model that predicts metabolite structural features.
In some embodiments, provided herein are methods and systems for connecting biosynthetic pathways to metabolites. In some embodiments, a whole-library approach is used for correlating clusters of biosynthetic pathways with spectral nodes in a metabolomics network. In some embodiments, methods and systems herein identify causal relationships between biosynthetic pathways and metabolites, allowing for their targeted discovery for downstream commercial applications including small molecule discovery for both pharmaceutical (human, veterinary) and agrochemical purposes.
In some embodiments, provided herein are methods of combined genomic and metabolomic analysis comprising: (a) analyzing genomic sequences from multiple strains of fungi to generate a network of biosynthetic gene clusters (BGCs); (b) analyzing mass spectra of extracts from multiple strains of fungi to generate a network of metabolite features; and (c) comparing the network of BGCs and network of metabolites to link particular mass spectrometric features with the BGCs responsible for the synthesis of metabolites that correspond to the particular mass spectrometric features.
In some embodiments, the genomic sequences from multiple strains of fungi comprise 100 or more (e.g., 100, 200, 500, 1000, 1500, 2000, 5000, or more, or ranges therebetween) full or partial genomic sequences. In some embodiments, the genomic sequences from multiple strains of fungi comprise full or partial genomic sequences from 100 or more (e.g., 100, 200, 500, 1000, 1500, 2000, 5000, or more, or ranges therebetween) different strains and/or species of fungi. In some embodiments, the genomic sequences from multiple strains of fungi comprise full or partial genomic sequences from 10 or more (e.g., 10, 20, 50, 100, 150, 200, 500, or more, or ranges therebetween) different genera and/or families of fungi. In some embodiments, analyzing genomic sequences from multiple strains of fungi comprises identifying BGCs within the genomic sequences. In some embodiments, analyzing genomic sequences from multiple strains of fungi comprises grouping BGCs within the genomic sequences into gene cluster families (GCFs). In some embodiments, analyzing genomic sequences from multiple strains of fungi is based on pairwise comparisons of sequence and/or predicted structural features of the BGCs.
In some embodiments, the mass spectra of extracts from multiple strains of fungi comprise 100 or more (e.g., 100, 200, 500, 1000, 1500, 2000, 5000, or more, or ranges therebetween) mass spectra. In some embodiments, the mass spectra of extracts from multiple strains of fungi comprise mass spectra from 100 or more (e.g., 100, 200, 500, 1000, 1500, 2000, 5000, or more, or ranges therebetween) strains or species of fungi. In some embodiments, the mass spectra of extracts from multiple strains of fungi comprise mass spectra from 10 or more (e.g., 10, 20, 50, 100, 150, 200, 500, or more, or ranges therebetween) genera or families of fungi. In some embodiments, analyzing mass spectra of extracts from multiple strains of fungi comprises identifying mass spectrometric features within the mass spectra. In some embodiments, analyzing mass spectra of extracts from multiple strains of fungi comprises grouping mass spectrometric features within the mass spectra into molecular families (MFs). In some embodiments, analyzing mass spectra of extracts from multiple strains of fungi is based on pairwise comparisons of mass spectrometric features of the mass spectra.
In some embodiments, comparing the network of BGCs and network of metabolite features comprises comparing the pairwise distances of BGCs or GCFs within the BGC network with the pairwise distances of metabolite features or MFs within the metabolite feature network to identify correlations that indicate that a BGC or GCF is responsible for the synthesis of a metabolite feature or MF. In some embodiments, comparing the network of BGCs and network of metabolite features comprises comparing the frequency of BGCs or GCFs within the BGC network with the frequency of metabolite features or MFs within the metabolite feature network to identify correlations that indicate that a BGC or GCF is responsible for the synthesis of a metabolite feature or MF.
In some embodiments, provided herein are networks linking metabolite features from 100 or more (e.g., 100, 200, 500, 1000, 1500, 2000, 5000, or more, or ranges therebetween) mass spectra of extracts from multiple strains of fungi with BGCs from 100 or more (e.g., 100, 200, 500, 1000, 1500, 2000, 5000, or more, or ranges therebetween) genomic sequences from multiple strains of fungi, wherein linking of a mass spectrometric feature with a BGC indicates that the BGC is involved in the synthesis of a metabolite that produced the mass spectrometric feature.
In some embodiments, provided herein are methods of fungal genomic analysis comprising: (a) identifying biosynthetic gene clusters (BGCs) within genomic sequences from multiple strains of fungi; (b) identifying sequence characteristics and predicted structural domains within the BGCs; and (c) comparing the sequence characteristics and predicted structural domains between multiple pairs of BGCs to determine the degree of relatedness between the pairs of BGCs. In some embodiments, methods further comprise generating a network of BGCs based on the degree of relatedness between the pairs of BGCs. In some embodiments, methods further comprise grouping the BGCs into gene cluster families based on the degree of relatedness between the pairs of BGCs.
In some embodiments, provided herein are methods of fungal metabolomic analysis comprising: (a) identifying mass spectrometric features within mass spectra of extracts from multiple strains of fungi; (b) comparing characteristics of the mass spectrometric features between multiple pairs of mass spectrometric features to determine the degree of relatedness between the pairs of mass spectrometric features; and (c) generating a network of mass spectrometric features based on the degree of relatedness between the pairs of mass spectrometric features. In some embodiments, methods further comprise grouping the mass spectrometric features into molecular families based on the degree of relatedness between the pairs of mass spectrometric features.
As used herein the term “biosynthetic gene cluster” (“BGC”) refers to a set of several genes that direct the synthesis of a particular metabolite (e.g., a secondary metabolite). The genes are typically located on the same stretch of a genome, often within a few thousand bases of each other. Genes of a BGC may encode proteins which are similar or unrelated in structure and/or function. The encoded proteins are typically either (i) enzymes involved in the biosynthesis of metabolites or metabolite precursors and/or (ii) are involved inter alia in regulation or transport of metabolites or metabolite precursors. Together, the genes of the BGC encode proteins that serve the purpose of the biosynthesis of the metabolite. The term “putative biosynthetic gene cluster” (“pBGC”) refers to a segment of a genome that is suspected of being a BGC or is to be tested for being a BGC. A pBGC may be identified by computational genomic analysis, functional analysis of the genes in a stretch of a genome, other techniques, or combinations thereof.
As used herein, the term “gene cluster family” (“GCF”) refers to a set of two or more biosynthetic gene clusters from one or more genomic sequences (e.g., from the same or different strain, species, genus, etc.) that bear sufficiently similar sequence or structural features (e.g., predicted structural features) to indicate that the BGCs within the GCF are involved in or responsible for the synthesis of related metabolites.
As used herein, the term “metabolite” refers to a molecule that is an intermediate or an end product of a metabolic process.
As used herein, the term “primary metabolite” refers to a molecule that is directly involved in normal growth, development, and reproduction of an organism, and is present across the spectrum of cell and organism types. Common examples of primary metabolites include, but are not limited to ethanol, lactic acid, and certain amino acids.
As used herein, the term “secondary metabolite” refers to a molecule that is typically not directly involved in processes central to growth, development, and reproduction of an organism, and is present in a taxonomically restricted set of organisms or cells (e.g., plants, fungi, bacteria, or specific species or genera thereof). Examples of secondary metabolites include ergot alkaloids, antibiotics, naphthalenes, nucleosides, phenazines, quinolines, terpenoids, peptides, and growth factors.
As used herein, the term “small molecule” refers to organic or inorganic molecular species either synthesized or found in nature, generally having a molecular weight less than 10,000 grams per mole, optionally less than 5,000 grams per mole, and optionally less than 2,000 grams per mole.
As used herein, the term “molecular family” (“MF”) refers to a set of two or more mass spectrometric features from one or more mass spectra (e.g., from the same or different strain, species, genus, etc.), or a set of two or more metabolites from one or more metabolite extracts (e.g., from the same or different strain, species, genus, etc.), that bear sufficiently similar mass spectrometric or structural features (e.g., predicted structural features) to indicate that the mass spectrometric features and/or metabolites within the MF are related or produced by related metabolites.
As used herein, the term “network” refers to a group of nodes (e.g., BGCs, GCFs, MS features, MFs, metabolites, etc.) linked and/or arranged according to the degree of relatedness of the nodes.
Provided herein are methods of analyzing genomic and metabolomic data from fungi to identify relationships between biosynthetic gene clusters and mass spectrometric features of metabolites. In some embodiments, provided herein are networks and methods of generating networks of genomic and/or metabolomic analyses.
In some embodiments, provided herein are systems and methods utilizing biosynthetic networking and machine learning predictions, for example, to generate networks of BGCs and GCFs. In some embodiments, fungal genomes are obtained either by whole genome sequencing or through a public database such as GenBank or the Joint Genome Institute's Genome Portal. In some embodiments, biosynthetic gene clusters are identified within these genomes using computational methods (e.g., antiSMASH, an open-source Python program). In some embodiments, a distance metric is applied to pairs of BGCs (e.g., all combination pairs of BGCs in the genome sequences) to construct a biosynthetic network of related gene clusters. In some embodiments, pairs of BGCs with more related sequence and/or predicted structural features (e.g., secondary structures, domains, etc.) receive a small distance score and are closer together within the network. In some embodiments, a distance metric is calculated between every BGC pair in a set of genomic sequences. In some embodiments, a distance metric is calculated based on one or more sub-metrics, such as:
In some embodiments, for each non-ribosomal peptide synthetase gene cluster node in the biosynthetic graph, a random forest classifier is used to predict its amino acid substrates. Experiments were conducted during development of embodiments herein to train this model on 1200 adenylation domain sequences with known substrate specificities.
In some embodiments, provided herein are systems and methods utilizing metabolomics networking and machine learning predictions, for example, to generate networks of mass spectrometric features, predicted metabolites, and molecular families of metabolites and/or MS features. In some embodiments, metabolomics data is collected using liquid chromatography-mass spectrometry on a high-resolution instrument. Fragmentation spectra are extracted from mass spectrometry files. In some embodiments, for metabolomics network creation, consensus spectra are generated from spectra arising from identical metabolites. In some embodiments, spectra with similar precursor m/z values (e.g., within 20 ppm, within 15 ppm, within 10 ppm, within 5 ppm, within 2 ppm, within 1 ppm) of each other and a cosine similarity of at least 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, etc. (e.g., at least 0.6) are summed to create a consensus spectrum with a much higher signal-to-noise ratio than the original spectra. In some embodiments, a distance matrix is calculated for all consensus spectra. In some embodiments, spectra are binned into fixed-dimension vectors and a cosine similarity matrix is calculated. In some embodiments, distances within this matrix that meet a threshold requirement are added as edges to a graph. In some embodiments, a pruning step trims each subgraph in the graph to a threshold subgraph size parameter. In some embodiments, provided herein are methods of producing a graphical representation of a network where each node represents a metabolite consensus spectrum, edges represent similarity between spectra, and subgraphs represent clusters of structurally and biosynthetically-related metabolites.
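The binning, cosine-similarity, and edge-thresholding steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. The function names, the 1 Da bin width, the 2000 m/z upper limit, the 10 ppm tolerance, and the 0.6 cosine threshold are assumptions chosen for illustration from the ranges recited above, and the pruning step is omitted.

```python
import math
from itertools import combinations


def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Bin (m/z, intensity) peaks into a fixed-length intensity vector.

    bin_width and max_mz are illustrative assumptions.
    """
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def within_ppm(mz_a, mz_b, tol_ppm=10.0):
    """True if two precursor m/z values agree within tol_ppm parts per million."""
    return abs(mz_a - mz_b) / mz_a * 1e6 <= tol_ppm


def build_edges(spectra, min_cosine=0.6):
    """Return (name_a, name_b, score) for every spectrum pair meeting the threshold.

    spectra maps a hypothetical spectrum name to its list of (m/z, intensity) peaks.
    """
    vectors = {name: bin_spectrum(peaks) for name, peaks in spectra.items()}
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= min_cosine:
            edges.append((a, b, score))
    return edges


if __name__ == "__main__":
    demo = {
        "s1": [(100.1, 10.0), (250.2, 5.0)],
        "s2": [(100.3, 8.0), (250.4, 6.0)],
        "s3": [(500.0, 7.0)],
    }
    print(build_edges(demo))
```

In this sketch, `within_ppm` would be used to decide which raw spectra are candidates for summation into a consensus spectrum, and `build_edges` would then be applied to the consensus spectra.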
In some embodiments, following metabolomic network creation, a neural network model is used to predict substructural features from each node in the network. In experiments conducted during development of embodiments herein, a neural network was trained using ~24,000 publicly-available reference spectra. Each spectrum is binned and encoded as a 2000-dimensional vector. Each reference spectrum has an associated chemical structure, which is encoded as a vector of substructures and chemical features determined using the tool ClassyFire. The neural network model, trained using these 24,000 spectra, is composed of a single hidden layer with 1024 nodes, ReLU activation functions for the hidden layer, and an output layer computing a sigmoid activation function for each chemical feature. This neural network model thus enables structural predictions for spectral nodes within the metabolomics network.
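The forward pass of a network with the architecture described above (2000-dimensional input, one 1024-node ReLU hidden layer, one sigmoid output per chemical feature) can be sketched in plain Python as follows. This is an illustration only: the weights are random rather than trained, the number of output features (`n_out`) is not stated in the text and is an assumed placeholder, and training is not shown.

```python
import math
import random


def init_model(n_in=2000, n_hidden=1024, n_out=50, seed=0):
    """Create randomly initialized weights (untrained, for illustration only).

    n_out, the number of chemical features, is an assumption.
    """
    rng = random.Random(seed)

    def matrix(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(n_hidden, n_in),
        "b1": [0.0] * n_hidden,
        "W2": matrix(n_out, n_hidden),
        "b2": [0.0] * n_out,
    }


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(model, x):
    """Return one independent probability per chemical feature for binned spectrum x."""
    # Hidden layer: affine transform followed by ReLU.
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(model["W1"], model["b1"])
    ]
    # Output layer: affine transform followed by a per-feature sigmoid.
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(model["W2"], model["b2"])
    ]
    return [sigmoid(z) for z in logits]


if __name__ == "__main__":
    small = init_model(n_in=8, n_hidden=4, n_out=3)
    print(forward(small, [1.0] * 8))
```

Because each output uses its own sigmoid rather than a softmax, a single spectrum can be assigned several substructural features at once, which matches the multi-label encoding of structures described above.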
In some embodiments, provided herein are methods and systems for connecting biosynthetic pathways to metabolites. In some embodiments, correlative statistics are employed for connecting biosynthetic pathways with metabolites. In some embodiments, a correlation matrix is constructed using statistical analysis, for example, a chi-squared test comparing pairwise frequencies of gene cluster family subgraphs from the biosynthetic network with spectral nodes from the metabolomics network. In some embodiments, a Bonferroni correction is used to account for multiple hypothesis testing. In some embodiments, methods provided herein result in a score (e.g., −log10[pvalue]) for each metabolite node-gene cluster family pair, with high scores indicating strong associations. In some embodiments, biosynthetic and metabolomic machine learning predictions are used to identify causal metabolite-gene cluster family pairs.
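The scoring described above can be sketched as follows for a single gene cluster family and a single metabolite node. The sketch assumes presence/absence data per strain, a 2x2 contingency table, a Pearson chi-squared test with one degree of freedom and no continuity correction, and a Bonferroni factor equal to the number of pairs tested. The function names and the example strain sets are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table (no continuity correction).

    a: strains with both the GCF and the metabolite
    b: strains with the GCF only
    c: strains with the metabolite only
    d: strains with neither
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def p_value_1dof(chi2):
    """Upper-tail p-value of a chi-squared statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def association_score(gcf_strains, metabolite_strains, all_strains, n_tests):
    """Return -log10 of the Bonferroni-corrected p-value for one GCF/metabolite pair."""
    a = len(gcf_strains & metabolite_strains)
    b = len(gcf_strains - metabolite_strains)
    c = len(metabolite_strains - gcf_strains)
    d = len(all_strains) - a - b - c
    p = p_value_1dof(chi2_2x2(a, b, c, d))
    p_adj = min(1.0, p * n_tests)  # Bonferroni correction, capped at 1
    return -math.log10(p_adj) if p_adj > 0 else float("inf")


if __name__ == "__main__":
    strains = set(range(20))
    gcf = {0, 1, 2, 3, 4}
    metabolite = {0, 1, 2, 3, 4}
    print(association_score(gcf, metabolite, strains, n_tests=100))
```

Note that a chi-squared statistic is large for both co-occurrence and mutual exclusion, so a practical pipeline would presumably also check the direction of the association before treating a high score as a candidate link.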
In some embodiments, a network (e.g., web portal) is utilized to share and/or analyze data produced by the methods herein among researchers (e.g., non-local researchers; at distant locations, etc.).
Prior work has utilized bioactivity-guided fractionation for natural products discovery, rather than a metabolomics, genomics, and machine learning approach. Researchers have focused on synthetic biology and heterologous expression, in contrast to an approach which does not require DNA manipulations. Tools have been developed for clustering metabolomics spectra and performing metabolite machine learning predictions. These tools use different machine learning models and are not integrated into larger genomics workflows. Tools have been developed for predicting adenylation domain substrates and for creating biosynthetic networks from gene clusters; however, these tools are ineffective for fungal genomes. An integrated genomics-metabolomics platform has been developed for natural products discovery; however, this platform is not applicable to fungal genomes.
Systems and methods for untargeted metabolomic screening are described, for example, in U.S. Pat. No. 10,808,256, which is herein incorporated by reference in its entirety.
Fungal natural products (secondary metabolites) are an invaluable source for pharmaceuticals that act against myriad conditions, including infectious diseases, cancer, and hyperlipidemia (Refs A1-A4; incorporated by reference in their entireties). Indeed, the antibiotics penicillin and cephalosporin, the cholesterol-lowering lovastatin, and the immunosuppressant cyclosporine are derived from fungi (Refs. A5, A6; incorporated by reference in their entireties), and the reservoir of novel scaffolds continues to grow each year (Ref. 7; incorporated by reference in its entirety). Although numerous fungi-derived drugs exist on the market today, genome sequencing has revealed that fungi possess the biosynthetic capacity to produce a far greater number of secondary metabolites than currently accessed (Ref. 8; incorporated by reference in its entirety). Recent studies spanning nearly 600 fungal genomes suggest that a mere 3% of molecules encoded by fungal biosynthetic gene clusters (BGCs) have been explored (Ref. 8; incorporated by reference in its entirety).
Provided herein are methods comprising a discovery pipeline recently developed to systematically annotate the biosynthetic abilities of fungi using comparative metabolomics and heterologous gene expression (Refs. A9-A12; incorporated by reference in their entireties). With this platform, fungal genomic DNA fragments containing intact BGCs are inserted into fungal artificial chromosomes (FACs) and transformed into a fungal host to discover new chemical scaffolds (Refs. A10-A12; incorporated by reference in their entireties). The pipeline uses a metabolite scoring (MS) system to identify heterologously-expressed metabolites from the thousands of signals originating from the host. By enabling facile linkage between secondary metabolites and their corresponding BGCs, the FAC-MS pipeline facilitates prioritization of target compounds most likely to contain novel scaffolds. Using structural clues provided by BGC data, compounds originating from BGCs containing unusual biosynthetic machinery are targeted (
Aromatic amino acids are fundamental for growth and development across phylogenetic kingdoms. Additionally, catabolism of aromatic amino acids leads to the production of non-proteinogenic amino acids, such as the tryptophan-derived kynurenine, which regulates inflammation and immune responses (Refs. A13, A14; incorporated by reference in their entireties). Kynurenine and its derivatives are biosynthetic intermediates of numerous secondary metabolites, including sibiromycin (Ref. A15; incorporated by reference in its entirety), mycemycin C (Ref. A16; incorporated by reference in its entirety), nidulanin A (Ref. A17; incorporated by reference in its entirety), nidulanin B and nidulanin D (Ref A18; incorporated by reference in its entirety), daptomycin (Ref. A19; incorporated by reference in its entirety), and quinomycin peptide antibiotics (Ref. A20; incorporated by reference in its entirety). Incorporation of kynurenine into secondary metabolites enables differential specificity towards enzyme receptors and targets (Ref. A21; incorporated by reference in its entirety). Daptomycin, for example, shows decreased antimicrobial efficacy when kynurenine is mutated to tryptophan (Refs. A22-A23; incorporated by reference in their entireties). One tactic for creating secondary metabolites with novel scaffolds is to recruit primary metabolic enzymes that modify common precursors into non-proteinogenic precursors into BGCs (Ref. A20; incorporated by reference in its entirety). For example, a tryptophan 2,3-dioxygenase (TDO) located adjacent to the daptomycin-producing non-ribosomal peptide synthase (NRPS) supplies the kynurenine for daptomycin synthesis. This TDO diverges from related proteins in the same genus (29% sequence identity), suggesting it is a paralogous enzyme dedicated to secondary metabolite biosynthesis (Ref. A19; incorporated by reference in its entirety).
In a large-scale analysis of 56 FACs, an unknown metabolite from heterologous expression of a BGC from Aspergillus terreus ATCC 20542 (located on the FAC AtFAC7O19,
To determine the structure of the target compound, ~1.5 mg of material was purified from FAC-transformed A. nidulans extracts and subjected to MS2 analysis, 1H and 13C NMR spectroscopy, and two-dimensional correlation approaches including COSY, HSQC, and HMBC (Table 2 and
To probe terreazepine's biosynthesis, A. terreus (ATCC 20542) was grown using media containing isotopically labeled biosynthetic precursors. Labeling with 13C6-anthranilate resulted in an m/z shift of +6 Da (
Homology-based annotation of the FAC-encoded NRPS revealed a domain structure consisting of two adenylation (A), two condensation (C), and three thiolation (T) domains, giving the domain sequence A1-T1-C1-A2-T2-C2-T3. To investigate the function of the seemingly extraneous T3 domain, FAC truncation mutants were constructed lacking either the C2T3 domains (ΔC2T3) or only the T3 domain (ΔT3). These constructs were transformed into A. nidulans, and the extracted metabolites were subjected to LC-MS analysis. A very small amount of the target compound was detected in ΔC2T3 extracts (5000-fold lower than control), indicating that terreazepine formation occurs slowly without catalysis. No offloaded intermediates were detected. ΔT3 extracts contained terreazepine levels close to that of the intact NRPS (
Using heterologous expression, stable isotope feeding studies, and NRPS-backbone deletions, a biosynthetic scheme for terreazepine was determined (
TzpA, a two-module NRPS, utilizes anthranilate and kynurenine to assemble terreazepine. The first adenylation domain (TzpA-A1) loads anthranilate onto the T1 domain, while TzpA-A2 loads kynurenine, generated through spontaneous non-enzymatic deformylation of the TzpB-supplied N-formyl-kynurenine. The substrate-binding residues of TzpA-A1 resemble those of other fungal adenylation domains which recognize anthranilate (Table 3). TzpA-A2, responsible for incorporating kynurenine, has a new pocket code quite dissimilar from other kynurenine-binding A-domains (Table 3). However, this disparity may be attributable to evolutionary distance between source organisms and the unstudied nature of kynurenine incorporation into fungal secondary metabolites. Given that the isolated terreazepine was a 2:1 mixture of S:R enantiomers, TzpA-A2 may accept both (D) and (L) forms of kynurenine. The peptide bond formation between the tethered amino acids is catalyzed by the first condensation domain, TzpA-C1, between anthranilate's carbonyl carbon and kynurenine's aliphatic primary amine. The second C domain (TzpA-C2) catalyzes the final cyclization event between the aromatic amine of kynurenine and the tethered carbonyl carbon, yielding the final terreazepine product.
While the role of the terminal TzpA-T3 domain remains uncertain, insights are available by looking at related NRPSs. For example, the unusual NRPS domain structure of TzpA mirrors that of GliP, the NRPS involved in gliotoxin biosynthesis (Refs. A29-A30; incorporated by reference in their entireties). When studied in vitro, GliP mutants show behavior mirroring that of TzpA deletants: truncated GliP ΔT3 mutants retain dipeptide synthetase activity, while ΔC2T3 mutants show reduced activity (Refs. A29-A30; incorporated by reference in their entireties). However, in vivo, GliP ΔT3 loses activity, indicating that the in vivo pathway involves transfer of the dipeptidyl-S intermediate from T2 to T3 (Ref. 29; incorporated by reference in its entirety). In light of these two possible pathways of cyclization from T2 and T3, as well as a slow reported rate of approximately one per hour, it has been suggested that T3 facilitates interaction with downstream tailoring enzymes (Refs. A29-A30; incorporated by reference in their entireties). Given the lack of downstream tailoring enzymes in the terreazepine pathway, both cyclization pathways may exist. Like the T domains of GliP, TzpA-T2 and T3 possess the predicted active site residue (S1937 and S2473, respectively), indicating that they are both functional (Table 3). Similarly, TzpA-C2 possesses the purported catalytic histidine at position H2137. However, the adjacent residue sequence diverges from the conserved SHXXXDXXS/T (SEQ ID NO: 23) sequence shared by diketopiperazine-forming NRPSs such as GliP and HasD (29), and slightly from the SHXXXD (SEQ ID NO: 24) sequence of NanA (Ref. A26; incorporated by reference in its entirety), indicating it may have different cyclization requirements (Table 3).
The discovery of terreazepine and its BGC revealed that fungal IDOs can play a role in secondary metabolite biosynthesis and that kynurenine incorporation into secondary metabolites can yield novel chemical scaffolds. This indicates that targeted efforts to characterize fungal BGCs containing IDOs may facilitate the discovery of completely new molecules with unique chemical scaffolds and their derivatives. Experiments were conducted during development of embodiments herein to search sequences of 1037 fungal genomes from GenBank and the Joint Genome Institute and locate BGCs containing IDOs. Of the ~38,000 BGCs contained within these genomes, 118 contain an IDO. IDO-containing BGCs were grouped into gene cluster families (GCFs) based on sequence identity and the fraction of protein domains shared between BGC pairs, anticipating that a single GCF groups BGCs that produce similar metabolites. Of the 118 IDO-containing BGCs, 68 were sorted into 16 GCFs. The remaining 50 BGCs represent singletons that had no similar BGC pairs (
Many BGCs originate from phylogenetically diverse Aspergilli, an NRPS-containing subset of which are illustrated in
The discovery of terreazepine provides another example of how fungi repurpose primary metabolism genes for secondary metabolism. Based on this and other examples, two major strategies fungi employ for such repurposing are proposed: Type I repurposing into biosynthetic enzymes and Type II repurposing into resistance genes (
In addition to repurposing duplicated primary metabolism genes to have a biosynthetic role, fungi also utilize duplicated genes from primary metabolism as a form of self-resistance (Refs. A34, A35; incorporated by reference in their entireties). This Type II repurposing represents a particularly attractive avenue for drug discovery, as the duplicated gene will often provide insight into the mechanism of action of the encoded secondary metabolite. Several examples of such Type II repurposing have been discovered by targeting clusters with duplicate resistance targets. The proteasome inhibitor fellutamide B, for example, was discovered due to the presence of a duplicated proteasome subunit within its BGC (Ref. A36; incorporated by reference in its entirety). Similarly, the BGC encoding the methionine aminopeptidase inhibitor fumagillin contains both type I and type II methionine aminopeptidase genes in the gene cluster (
The concept of a gene cluster family (GCF) has emerged as an approach for large-scale analysis of BGCs (Refs. B5-B8; incorporated by reference in their entireties). The GCF approach involves comparing BGCs using a series of pairwise distance metrics, then creating families of BGCs by setting an appropriate similarity threshold. This results in a network structure that dramatically reduces the complexity of BGC datasets and enables automated annotation based on experimentally characterized reference BGCs. Depending on the similarity threshold, BGCs within a family are expected to encode identical or similar metabolites, and families therefore serve as indicators of new chemical scaffolds. The use of GCFs represents a logical shift from a focus on single genomes of interest to large genomics datasets, providing a means of regularizing collections of BGCs and their encoded chemical space (Fig. B1A). GCF networks have been used for global analyses of bacterial biosynthetic space (Ref. B6; incorporated by reference in its entirety), for bacterial genome mining at the >10,000 genome scale (Refs. B9, B16; incorporated by reference in their entireties), and, integrated with metabolomics datasets, for large-scale compound and BGC discovery (Refs. B5, B7; incorporated by reference in their entireties). Together with advances in large-scale metabolomics data analysis such as molecular networking (Ref. B17; incorporated by reference in its entirety), the GCF paradigm has helped modernize natural products discovery.
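As an illustration of the thresholding step described above, the following is a minimal sketch, not the implementation used in the experiments herein. It assumes each BGC is reduced to a list of protein domain labels, uses a Jaccard distance on domain sets as a stand-in for the pairwise distance metrics, and treats families as connected components of the resulting network. The function names, the toy BGCs, and the 0.3 threshold are hypothetical.

```python
from itertools import combinations


def jaccard_distance(domains_a, domains_b):
    """Distance between two BGCs based on the fraction of shared domain types."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def families_by_threshold(bgcs, max_distance):
    """Group BGCs into families: connected components of the network in which
    two BGCs are linked when their distance is at or below max_distance."""
    parent = {name: name for name in bgcs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(bgcs, 2):
        if jaccard_distance(bgcs[a], bgcs[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for name in bgcs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical toy BGCs, each described by its protein domains.
toy_bgcs = {
    "bgc_1": ["KS", "AT", "DH", "ER", "KR", "T"],
    "bgc_2": ["KS", "AT", "DH", "KR", "T"],
    "bgc_3": ["A", "T", "C"],
    "bgc_4": ["A", "T", "C", "R"],
}

families = families_by_threshold(toy_bgcs, max_distance=0.3)
singletons = families_by_threshold(toy_bgcs, max_distance=0.1)
print(families)
```

Tightening the threshold (here from 0.3 to 0.1) splits the two toy families into four singletons, which mirrors how the choice of threshold controls whether family members are expected to encode identical or merely similar metabolites.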
Application of GCFs to fungal genomes has been limited to datasets of <100 genomes from well-studied genera such as Aspergillus, Fusarium, and Penicillium (Refs. B13-B15). Despite the availability of thousands of genomes representing a broad sampling of the fungal kingdom, global analyses of the BGC content of these genomes are lacking. As such, knowledge of the overall phylogenetic distribution of GCFs in fungi is limited, and many taxonomic groups have no experimentally characterized BGCs. Experiments were conducted during development of embodiments herein to perform a global analysis of BGCs and their families from a dataset of 1037 genomes from across the fungal kingdom. Across Fungi, the vast majority of GCFs are species-specific, indicating that species-level sampling for genome sequencing and metabolomics will yield significant returns for natural products discovery.
To relate this now-available set of fungal GCF-encoded metabolites to known fungal scaffolds, network analysis of 15,213 fungal compounds was conducted during development of embodiments herein, organizing these into 2,945 molecular families (MFs) (Fig. B1A). Analysis of this joint genomic-chemical space revealed dramatic differences among major fungal taxonomic groups, as well as between bacteria and fungi, thus laying the groundwork for systematic discovery of new compounds and their BGCs from the fungal kingdom.
Despite the availability of thousands of fungal genomes, the biosynthetic space represented within them had not been surveyed systematically prior to the work described herein. To address this gap, a dataset of 1037 fungal genomes was curated, covering a broad phylogenetic swath (Table 4). This selection includes well-studied taxonomic groups such as Eurotiomycetes (Aspergillus and Penicillium genera) and Sordariomycetes (Fusarium, Cordyceps, and Beauveria genera), as well as groups for which little is known regarding their BGCs, such as Basidiomycota and Mucoromycota. This genomic sampling covers a wide range of ecological niches, from forest-dwelling mushrooms to plant endophytes to extremophiles (Ref. B18; incorporated by reference in its entirety).
Each of the 1037 genomes was analyzed using antiSMASH (Ref. B19; incorporated by reference in its entirety), yielding an output of 36,399 BGCs ranging from 5 to 220 kb in length. As has been previously observed (Ref. B20; incorporated by reference in its entirety), the number of BGCs per genome varies dramatically across Fungi (
Organizing Gene Clusters into Families to Map Fungal Biosynthetic Potential
To further assess the ability of fungi to produce new chemical scaffolds, BGCs were grouped into families using the pairwise distance between BGCs and a clustering algorithm to yield GCFs. BGCs from antiSMASH were converted to arrays of protein domains then compared based on the fraction of shared domains and backbone protein domain sequence identity (Refs. B7, B8; incorporated by reference in their entireties). DBSCAN clustering was performed on the resulting distance matrix, resulting in a set of 12,067 GCFs (Fig. B2A) organized into a network (Fig. B3A). Across the fungal kingdom, the distribution of GCFs shows a clear relationship with phylogeny (see yellow streaks in Fig. B2A, Figs. BS1-BS5). In isolated studies of well-characterized strain sets of Aspergillus and Penicillium, GCFs have been thought to be largely genus- or species-specific (Refs. B13, B21, B22); however, here we show that several GCFs span entire subphyla or classes (Fig. B2A). The fraction of GCFs that two organisms share is likewise correlated with phylogenetic distance, evidenced by sets of shared GCFs between closely related taxonomic groups (Fig. BS6; IBG). In order to facilitate visualization of these phylogenetic patterns, a web-based application was developed for hierarchical browsing of GCFs, BGCs, protein domains and annotations for known compound/BGC pairs (http://prospect-fungi.com). Additional details of the site are available in SI Methods.
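The clustering step described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the pipeline used herein: the equal weighting of shared-domain fraction and backbone sequence identity, the `eps` and `min_samples` values, the toy BGCs, and the identity values are all hypothetical, and a small pure-Python DBSCAN is written out so that the example is self-contained.

```python
def bgc_distance(domains_a, domains_b, backbone_identity, weight=0.5):
    """Combine the fraction of shared domains with backbone sequence identity
    (both in [0, 1]) into one distance. The 50/50 weighting is an assumption."""
    a, b = set(domains_a), set(domains_b)
    shared_fraction = len(a & b) / len(a | b)
    similarity = weight * shared_fraction + (1.0 - weight) * backbone_identity
    return 1.0 - similarity


def dbscan(dist, eps, min_samples):
    """DBSCAN on a precomputed distance matrix.
    Returns one label per point; -1 marks noise (a singleton BGC)."""
    n = len(dist)
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_samples:  # the point counts as its own neighbor
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_samples:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical toy input: domain arrays and pairwise backbone identities.
domains = {
    "pks_a": ["KS", "AT", "DH", "ER", "KR", "T"],
    "pks_b": ["KS", "AT", "DH", "ER", "KR", "T"],
    "pks_c": ["KS", "AT", "DH", "KR", "T"],
    "nrps_a": ["A", "T", "C"],
    "nrps_b": ["A", "T", "C", "R"],
}
identity = {
    frozenset(("pks_a", "pks_b")): 0.9,
    frozenset(("pks_a", "pks_c")): 0.7,
    frozenset(("pks_b", "pks_c")): 0.7,
    frozenset(("nrps_a", "nrps_b")): 0.3,
}

names = list(domains)
dist = [
    [
        0.0 if a == b else bgc_distance(
            domains[a], domains[b], identity.get(frozenset((a, b)), 0.0)
        )
        for b in names
    ]
    for a in names
]
labels = dbscan(dist, eps=0.3, min_samples=2)
print(dict(zip(names, labels)))
```

In this toy input the three PKS clusters fall into one family, while the two NRPS clusters are too dissimilar at the chosen `eps` and are reported as noise, which corresponds to singleton BGCs that have no similar partner.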
Experiments were conducted during development of embodiments herein to quantify the relationship between phylogeny and shared GCF content. The protein sequence identity of 290 shared single-copy orthologous genes from the fungal BUSCO dataset (Ref. B23; incorporated by reference in its entirety) was used as a proxy for whole-genome distance. The fraction of GCFs shared between each pair of genomes was computed in pairwise comparisons (Fig. B2B). The result was a clear relationship between genomic distance and shared GCF content, with an average of 75% shared GCFs at the species level, but less than 5% shared GCFs at taxonomic ranks higher than family (
Identifying BGCs that have known metabolite products is an important component of genome mining, enabling researchers to prioritize known versus unknown biosynthetic pathways for discovery. These “genomic dereplication” efforts have been bolstered by the development of the MIBiG repository (Ref. B24; incorporated by reference in its entirety), which contained 213 fungal BGCs with known metabolites as of June 2019. When anchored with known BGCs, the GCF approach enables large-scale annotation of unstudied BGCs based on similarity to reference BGCs, identifying clusters likely to produce known metabolites or derivatives of known metabolites.
Within the dataset, 154 GCFs contained known BGCs from MIBiG, approximately 1% of the 12,067 total GCFs reported here (Fig. BS9). These families collectively include a total of 2,026 BGCs (Fig. BS9), an approximately 10-fold increase in the number of annotated BGCs over that available in MIBiG (Ref. B24; incorporated by reference in its entirety). This expanded set of annotated BGCs and their families was made available for routine genome mining via the web.
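The family-based annotation described above can be sketched in a few lines. This is a simplified illustration, assuming that family membership and a lookup of reference BGCs with known products are already available; the identifiers and the family contents below are hypothetical, and the resulting labels are putative rather than experimentally confirmed.

```python
def annotate_families(families, known_products):
    """Propagate known products from reference BGCs to the other members of
    the same family.

    families: dict mapping a GCF identifier to a list of BGC identifiers.
    known_products: dict mapping reference BGC identifiers to compound names.
    Returns a dict mapping each unannotated BGC to a sorted list of putative
    products inherited from reference BGCs in its family.
    """
    annotations = {}
    for members in families.values():
        products = sorted({known_products[m] for m in members if m in known_products})
        if not products:
            continue  # no reference BGC anchors this family
        for member in members:
            if member not in known_products:
                annotations[member] = products
    return annotations


# Hypothetical toy families; "ref_equisetin" stands in for a reference BGC.
toy_families = {
    "GCF_1": ["ref_equisetin", "bgc_17", "bgc_42"],
    "GCF_2": ["bgc_8", "bgc_9"],
}
toy_known = {"ref_equisetin": "equisetin"}

putative = annotate_families(toy_families, toy_known)
print(putative)
```

Families without a reference BGC, such as GCF_2 in this toy input, are left unannotated and can be prioritized as candidates for unknown biosynthetic pathways.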
To assess the relationship between GCFs and their chemical repertoire, GCF-encoded scaffolds were compared to a dataset of known fungal scaffolds. Analogous to the GCF analysis, network analysis of fungal metabolites was utilized, organizing these compounds into molecular families (MFs) based on Tanimoto similarity, a commonly used metric for determining chemical relatedness (Refs. B25, B26; incorporated by reference in their entireties). To directly relate GCF and MF-encoded metabolite scaffolds, the relationship between chemical similarity and BGC similarity was determined for a set of 154 fungal GCFs with known metabolite products (Fig. BS10). An MF similarity threshold was selected that resulted in similar levels of chemical similarity represented by GCF and MF metabolite scaffolds.
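For reference, Tanimoto similarity between two binary structural fingerprints is the number of shared "on" bits divided by the number of bits set in either fingerprint. The sketch below assumes fingerprints are represented as sets of bit indices; the fingerprints and the 0.5 linking threshold are hypothetical and are not the values used in the experiments herein.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits.
    Returning 0.0 for two empty fingerprints is a convention chosen here."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_pairs(fingerprints, threshold):
    """Return the compound pairs whose similarity meets the threshold.
    These pairs are the edges of a compound network."""
    return [
        (x, y)
        for x, y in combinations(sorted(fingerprints), 2)
        if tanimoto(fingerprints[x], fingerprints[y]) >= threshold
    ]


# Hypothetical fingerprints (sets of bit positions), for illustration only.
toy_fingerprints = {
    "compound_x": {1, 2, 3, 4},
    "compound_y": {2, 3, 4, 5},
    "compound_z": {10, 11},
}

edges = similar_pairs(toy_fingerprints, threshold=0.5)
print(edges)
```

Compounds connected by such edges can then be grouped into molecular families in the same way that BGCs are grouped into GCFs.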
Using this compound network analysis strategy, a dataset of 15,213 fungal metabolites from the Natural Products Atlas (Ref. B27; incorporated by reference in its entirety) was organized into 2,945 MFs (Fig. B3A). Each compound within this network was annotated with chemical ontology information using ClassyFire, a tool for classifying compounds into a hierarchy of terms associated with structural groups, chemical moieties, and functional groups (Table 5) (Ref. B28; incorporated by reference in its entirety). The number of MF scaffolds (2,945) is only about 25% of the number of GCF-encoded scaffolds (12,067) in the 1000-genome dataset. This indicates that even this small genomic sampling of the entire fungal kingdom, estimated to have >1 million species (Ref. B29; incorporated by reference in its entirety), possesses biosynthetic potential that dwarfs known fungal chemical space, not only in terms of individual metabolites but also in terms of metabolite scaffolds. In this joint GCF-MF dataset, molecular families and gene cluster families represent complementary approaches for representing the same metabolite scaffold, such as the tenellin/desmethylbassianin structural class, whose GCF and MF contain BGCs and compounds, respectively (Fig. B3A, middle).
acid 21.18-actone
Lipids and lipid-like molecules. Steroids and steroid derivatives. Steroid lactones. Steroid esters, 7- , 3-cis-6-alph-steroids,
acids and derivatives,
Oxacyclic compounds, Organic oxides, Hydrocarbon derivatives
acid amide
Organic oxygen compounds. Organooxygen compounds, Alcohols and polyols, Tertiary alcohols, Carboximidic acids, Organonitrogen compounds, Organonitrogen compounds, Hydrocarbon derivatives
A
Organic oxygen compounds. Organooxygen compounds, Carbonyl compounds, Ketones, Aryl ketones, Phenylketones, Alkyl-phenylketones, Benzoyl derivatives, Aryl alkyl ketones, Pyrrolidine-2-ones, Vinylogous esters, Secondary carboxylic acid amides, Secondary alcohols, Lactams, Oxacyclic compounds, Dialkyl ethers, Azacyclic compounds, Organonitrogen compounds, Organic oxides, Hydrocarbon derivatives
Organoheterocyclic compounds. Indoles and derivatives. Alpha amino acids and derivatives, Indoles, 2,5-
, Aryl alkyl ketones, Anisoles, N-alkylpiperazines, Alkyl aryl ethers, Vinylogous amides, Tertiary carboxylic acid amides, Pyrrolidines,
, Heteroaromatic compounds, Lactams, Dialkyl peroxides, Oxacyclic compounds, Azacyclic compounds,
Hydrocarbon derivatives
Lipids and lipid-like molecules. Fatty Acyls, Fatty acids and Medium-chain fatty acids, Fatty acid esters, Epoxy fatty acids,
fatty acid, Unsaturated fatty acids, Dicarboxylic acids and derivatives,
esters, Oxacyclic compounds, Epoxides, Dialkyl esters, Carboxylic acids, Organic oxides, Hydrocarbon derivatives, Carbonyl compounds.
Benzenoids. , Aryl ketones, 1-hydroxy-4-unsubstituted benzenoids, 1-hydroxy-2-unsubstituted benzenoids, Aryl chlorides, Vinylogous acids,
, Organic oxides, Hydrocarbon derivatives
Organic oxygen compounds.
compounds, Carbonyl compounds, Ketones, Cyclic ketones, Quinones, Benzoquinones, P-benzoquinones, Vinylogous acids,
, Organic oxides, Hydrocarbon derivatives
Benzenoids. and derivatives,
Alpha-acyloxy ketones, Dicarboxylic acids and derivatives, Carboxylic acid esters, Organic oxides, Hydrocarbon derivatives
Organoheterocyclic compounds. Indoles and derivatives, Pyrrolasdicles, Quinazolines, Indoles, Pyrimidones, Benzenoids, Heteroaromatic compounds, Pyrrolidines, Pyrroles, Lactams, Azacyclic compounds. Carbonyl compounds, Hydrocarbon derivatives, Organic oxides, Organonitrogen compounds, Organopnictogen compounds
Organic acids and derivatives. Carboxylic acids and derivatives. Amino acids, peptides, and analogues. Amino acids and derivatives, Alpha amino acids and derivatives, 3-alkylindoles, Hydroxyindoles, 2,5-dioxopiperazines, 1-hydroxy-2-unsubstituted benzenoids, N-alkylpiperazines. Substituted pyrroles, Tertiary carboxylic acid amides, Pyrrolidines, Heteroaromatic compounds. Secondary carboxylic acid amides, Lactams. Azacyclic compounds. Carbonyl compounds. Hydrocarbon derivatives, Organic oxides, Organonitrogen compounds, Organopnictogen compounds
Organoheterocyclic compounds. Indoles and derivatives, Alpha amino acids and derivatives, 3-alkylindoles, Anisoles, 2,5-dioxopiperazines, N-alkylpiperazines, Alkyl aryl ethers, Substituted pyrroles, Tertiary carboxylic acid amides, Tertiary alcohols, Pyrrolidines, Heteroaromatic compounds, Secondary alcohols, Lactams, Azacyclic compounds,
compounds, Organic oxides, Hydrocarbon derivatives, Carbonyl compounds
Organic oxygen compounds. Organooxygen compounds, Carbonyl compounds, Ketones, Aryl ketones, , Aryl alkyl ketones, Pyrrolidine-2-ones, Furanones, Vinylogous esters, Secondary carboxylic acid amides, Secondary alcohols, Lactams, Oxacyclic compounds, Azacyclic compounds, Organonitrogen compounds, Organic oxides, Hydrocarbon derivatives
Organoheterocyclic compounds. Isobenzofurans, Medium-chain fatty acids, Branched fatty acids, Hydroxy fatty acids, Heterocyclic fatty acids, Fatty acid esters, Unsaturated fatty acids, Dicarboxylic acids and derivatives, Tetrahydrofurans. Tertiary alcohols, , Secondary alcohols, Cyclic alcohols and derivatives, Oxacyclic compounds, Carboxylic acids, Dialkyl ethers, Organic oxides, Carbonyl compounds, Hydrocarbon derivatives
Lignans.
Methoxybenzenes, Anisoles, Alkyl aryl ethers, Sulfuric acid monoesters,
, compounds, organic oxides, Hydrocarbon derivatives
Lipids and lipid-like molecules. Fatty Acyls, Fatty acids and conjugates, Long-chain fatty acids, L-alpha-amino acids, Hydroxy fatty acids, Beta hydroxy acids and derivatives, Amino fatty acids, Unsaturated fatty acids, Dicarboxylic acids and derivatives, Secondary alcohols, Carboxylic acid esters, Amino acids, Polyols, Carboxylic acids, compounds, Organic oxides, Monoalkylamines, Hydrocarbon derivatives, Carbonyl compounds
Alkaloids and derivatives. Ergoline and derivatives, Clavines and derivatives, Indoloquinolines, Benzoquinolines, Pyrroloquinolines, 3-alkylindoles and derivatives, Aralkylamines, Substituted ,
Amino acids and derivatives, Carboxylic acid esters, Monocarboxylic acids and derivatives, Azacyclic compounds, Carbonyl compounds, Hydrocarbon derivatives, Organic oxides, Organopnictogen compounds
Lipids and lipid-like molecules. Prenol lipids, Sesquiterpenoids, Abscisic acids and derivatives, Terpene lactones, Tetracarboxylic acids and derivatives, Ketals, Carboxylic acid orthoesters, Gamma butyrolactones,
, Enoate esters, Oxacyclic compounds, Organic oxides, Hydrocarbon derivatives, Carbonyl compounds
Organic oxygen compounds. Organooxygen compounds, Carbohydrates and carbohydrate conjugates, Glycosyl compounds, Glycosylamines, Hexoses, Quinazetines, Alpha amino acids and derivatives, Indoles and derivatives, , Tertiary carboxylic acid amides, Tertiary alcohols, Heteroaromatic compounds, Cyclic carboximidic acids, Secondary alcohols, Lactams, Hemiaminals, Propargyl-type 1,3-dipolar organic compounds, Polyols, Oxacyclic compounds, Azacyclic compounds, Primary alcohols, Organopnictogen compounds, Organic oxides, Hydrocarbon derivatives, Carbonyl compounds
Organoheterocyclic compounds. Diazanaphthalenes, Benzodiazines, Quinazolines, Alpha amino acids and derivatives, Indoles and derivatives, Pyrimidones, Imidazolidinones. Benzenoids, Tertiary carboxylic acid amides, Tertiary alcohols, Heteroaromatic compounds. Lactams, Secondary carboxylic acid amides, Dialkylamines, Azacyclic compounds, Organopnictogen compounds, Organic oxides, Hydrocarbon derivatives, Carbonyl compounds
Lipids and lipid-like molecules. Prenol lipids, Quinone and hydroquinone lipids, Prenylquinones, Ubiquinones, P-benzoquinones, Vinylogous esters, Vinylogous acids, Carboxylic acid esters, Monocarboxylic acids and derivatives, Organic oxides, Hydrocarbon derivatives
Organoheterocyclic compounds. Tetrahydroisoquinolines, Alpha amino acids and derivatives. Piperidinones, Delta lactams, Aralkylamines, Aminopiperidines, 1-hydroxy-4-unsubstituted benzenoids, 1-hydroxy-2-unsubstituted , Tertiary carboxylic acid amides. Secondary alcohols, Polyols, Azacyclic compounds. Organopnictogen compounds, Organic oxides. Monoalkylamines, Hydrocarbon derivatives, Carbonyl compounds
Organoheterocyclic compounds. Indoles and derivatives, . Alpha amino acids and derivatives, 2,5-dioxopiperazines, Anisoles, Alkyl aryl ethers, N-alkylpiperazines. Heteroaromatic compounds, Pyrroles, Tertiary carboxylic acid amides, Pyrrolidines, Lactams, Dialkyl peroxides, Dialkyl ethers, Azacyclic compounds, Oxacyclic compounds, Alkanolamines, Hydrocarbon derivatives, Carbonyl compounds, Organopnictogen compounds
Organoheterocyclic compounds. Naphthopyrans, Naphthalenes, Alkyl aryl ethers, Pyranones and derivatives, Pyridines and derivatives, Vinylogous esters, Heteroaromatic compounds, Lactones, Carboxylic acid
Oxacyclic compounds, Monocarboxylic acids and derivatives, Azacyclic compounds, Organic
, Hydrocarbon derivatives, Carbonyl compounds, Organonitrogen compounds, Organopnictogen
Organic acids and derivatives. Carboxylic acids and derivatives, Amino acids, peptides, and analogues, Amino acids and derivatives, Alpha amino acids and derivatives, Thiodioxopiperazines, Indoles and derivatives, N-methylpiperazines, Tertiary carboxylic acid amides, Pyrrolidines, Secondary alcohols, Lactams, Azacyclic compounds, Primary alcohols, Organonitrogen compounds, Organic oxides, Hydrocarbon derivatives, Carbonyl compounds
Organoheterocyclic compounds. Indoles and derivatives, Indoles, 3-alkylindoles, Styrenes, Methoxypyrazines, Alkyl aryl ethers, pyrroles,
compounds, Lactams, Organic transition metal salts, Azacyclic compounds, Organopnictogen compounds, Organonitrogen compounds, Organic oxides, Hydrocarbon derivatives
Organoheterocyclic compounds. Indoles and derivatives, Pyridoindoles, Pyridoindolones, Alpha Quinazolines, Alpha amino acids and derivatives, Indoles,
, Piperidinones, Delta lactams, Pyridines and derivatives,
Tertiary alcohols, Heteroaromatic compounds, Azacyclic compounds, Organopnictogen compounds, Organonitrogen compounds, Organic oxides, Hydrocarbon derivatives, Carbonyl compounds
Organoheterocyclic compounds. Benzopyrans, 1-benzopyrans, Dibenzopyrans, Xanthenes, Tricarboxylic acids and derivatives, Hydroxy acids and derivatives, Alkyl aryl ethers, 1-hydroxy-4-unsubstituted benzenoids, 1-hydroxy-2-unsubstituted benzenoids, Vinylogous acids, Methyl esters, Secondary alcohols, Ketones, Cyclic alcohols and derivatives, Polyols, Oxacyclic compounds, Enols, Organic oxides, Hydrocarbon derivatives
Benzenoids.
Aryl ketones, 1-hydroxy-2-unsubstituted benzenoids, 1-hydroxy-4-unsubstituted benzenoids, Alkyl aryl ethers,
, Polyols, Oxacyclic compounds, Dialkyl ethers, Organic oxides, Hydrocarbon derivatives, Primary alcohols
Lipids and lipid-like molecules. Steroids and steroid derivatives, Hydroxysteroids, 1-hydroxysteroids, Naphthopyrans, Naphthalenes, acids and derivatives, Alkyl aryl ethers, Pyranones and derivatives, Pyridines and derivatives, Vinylogous esters,
Oxacyclic compounds, Azacyclic compounds, Organic oxides, Hydrocarbon derivatives, Carbonyl compounds, Organonitrogen compounds, Organopnictogen compounds
indicates data missing or illegible when filed
Diversification of the Equisetin Scaffold Inferred from Gene Cluster Families
To further explore the link between metabolite scaffolds as represented by molecular and gene cluster families, the decalin-tetramic acids were examined, a structural class well represented in our BGC and metabolite datasets. This structural class, including compounds such as equisetin, altersetin, phomasetin, and trichosetin (Fig. BS11) (Refs. B31-B33; incorporated by reference in their entireties), has a wide range of reported biological activities, including antibiotic, anti-cancer, phytotoxic, and HIV integrase inhibitory activity (Ref. B34; incorporated by reference in its entirety). It was reasoned that further exploration of the decalin-tetramic acid structural class would yield insights into the biosynthetic mechanisms for variation of this bioactive scaffold by BGCs within the GCF.
Two closely related GCFs were identified (HYBRIDS_11/HYBRIDS_610) containing known BGCs responsible for biosynthesis of equisetin (Ref. B35; incorporated by reference in its entirety), trichosetin (Ref. B36; incorporated by reference in its entirety), and phomasetin (Ref. B37; incorporated by reference in its entirety) as well as BGCs from Alternaria likely responsible for the biosynthesis of altersetin found in multiple Alternaria species (Refs. B32, B38; incorporated by reference in their entireties). While most fungal GCFs are confined to single species or genera (Fig. B2), the equisetin GCF has an exceptionally broad phylogenetic distribution, with clusters found in the four Pezizomycotina classes Eurotiomycetes, Dothideomycetes, Xylonomycetes, and Sordariomycetes (Fig. B3B, left). The associated equisetin MF is likewise found in a variety of Dothideomycetes and Sordariomycetes (Fig. B3B, right).
The equisetin biosynthetic pathway involves three major steps: assembly of a decalin core via the action of polyketide synthase (PKS) enzyme domains and a Diels-Alderase, formation of an amino acid-derived tetramic acid moiety catalyzed by NRPS domains, and N-methylation of the tetramic acid moiety (Fig. BS12) (Refs. B37, B39; incorporated by reference in their entireties). While the domain structure of the PKS contained in the equisetin GCF remains consistent across fungi, differences in backbone enzyme amino acid sequence and the presence or absence of tailoring enzymes mediate structural variations to the scaffold. The PKS enzymes from Fusarium oxysporum and Pyrenochaetopsis sp. RK10-F058 share 50% sequence identity, a divergence that likely accounts for the additional ketide unit and C-methylation observed in equisetin vs. phomasetin (Fig. B3B). In the NRPS module of the hybrid NRPS-PKS, changes to adenylation domain substrate binding residues are predicted to mediate incorporation of serine (trichosetin, equisetin, and phomasetin) and threonine (altersetin). The Aspergillus desertorum BGC contains adenylation domain substrate binding residues that are highly variant from those found in other clusters within the GCF, indicating its tetramic acid moiety is likely diversified with a different amino acid. The equisetin GCF contains additional variations in the number of enoyl reductase enzymes (one additional in the uncharacterized Penicillium expansum clade), indicating possible differences in the degree of saturation, and a methyltransferase that is expected to mediate changes in tetramic acid N-methylation.
This pattern of biosynthetic variation within a GCF resulting in metabolite diversification indicates that exploring such pairs of GCFs and MFs with knowledge of their taxonomic distribution will be valuable to guide genome mining in the identification of new analogs of compounds with proven therapeutic or agrochemical value. The equisetin GCF is one of only 90 GCFs (representing 0.75% of total GCFs) within our dataset that spanned multiple taxonomic classes (Table 6). These include bioactive scaffolds such as PR-toxin, swainsonine, chaetoglobosin, and cytochalasin (Fig. BS13), which contain variations in tailoring enzyme composition expected to diversify these scaffolds. Given the observed biosynthetic diversity within such “multi-class” GCFs, exploring such pairs of GCFs and MFs represents an attractive approach for discovering new analogs of bioactive metabolites.
Aspergillus nidulans
PKS (KS-AT-DH-ER-KR-T),
Fusarium fujikuroi
Cadophora sp. DSE1049
Aspergillus lentulus
Aspergillus arachidicola
Aspergillus sydowii CBS
Aspergillus clavatus
Alternaria alternata
Aspergillus fumigatus
Fusarium verticillioides
Aspergillus fumigatus
Alternaria alternata
Aspergillus clavatus
Metarhizium acridum
Clohesyomyces aquaticus
Oidiodendron maius Zn
Ophiocordyceps australis (PHH64516)
Colletotrichum orchidophilum (OHF04557)
Cadophora sp. DSE1049
Meliniomyces bicolor E
Pezoloma ericae
Acremonium chrysogenum ATCC
Colletotrichum higginsianum IMI
PKS (KS-AT-DH-MT-ER-KR-T-C), PKS (KS-AT-DH-MT-
Penicillium griseofulvum
PKS (SAT-KS-AT-PT-MT-R),
Penicillium camemberti
Aspergillus sydowii CBS
Aspergillus uvarum CBS
Colletotrichum chlorophyti (OLN93260)
Cordyceps sp. RAO-2017
Pseudogymnoascus sp.
Phialocephala subalpina
Fusarium fujikuroi
Aspergillus ochraceoroseus IBT
Penicillium camemberti
Talaromyces stipitatus
Penicillium subrubescens (OKP00032)
Pseudogymnoascus sp.
Coniochaeta pulveracea
Phialocephala scopiformis (KUJ09200)
PKS (SAT-KS-AT-PT-T-T-TE),
Pseudogymnoascus sp.
Pseudogymnoascus sp.
Aspergillus lentulus
Aspergillus kawachii
Endocarpon pusillum
Penicillium nalgiovense
Trichoderma asperellum
Fusarium oxysporum f.
Scedosporium apiospermum (KEZ41293)
Penicillium griseofulvum
Pseudogymnoascus sp.
Bipolaris victoriae FI3
Coleophoma cylindrospora
PKS (SAT-KS-AT-PT-T-T-TE),
Aspergillus brasiliensis
PKS (KS-AT-DH-MT-ER-KR-T-C-A-T-R), NRPS (A-T-C)
Talaromyces stipitatus
PKS (KS-AT-DH-MT-ER-KR-T),
Penicillium steckii
Aspergillus bombycis
Fusarium avenaceum
Helicocarpus griseus
Aspergillus heteromorphus CBS
Madurella mycetomatis
Fusarium avenaceum
PKS (KS-AT-DH-ER-KR-T),
Pseudogymnoascus sp.
Metarhizium rileyi RCEF
Colletotrichum graminicola M1.001
Bipolaris sorokiniana
Aspergillus bombycis
Beauveria bassiana
Aspergillus steynii IBT
Coleophoma crateriformis
PKS (KS-AT-DH-MT-ER-KR-T),
Capronia coronata CBS
Aspergillus mulundensis
PKS (KS-AT-DH-MT-ER-KR-T),
Cladophialophora carrionii (OCT48933)
Hypoxylon sp. CO27-5
PKS (KS-AT-DH-MT-ER-KR-T-C-A-T-R), NRPS (A-T-R)
Cordyceps fumosorosea
Aspergillus fischeri
NRPS (A-T-C-A-T-R),
Aspergillus costaricaensis
PKS (KS-AT-DH-MT-ER-KR-T),
Aspergillus ochraceoroseus
Cladophialophora carrionii CBS 160.54
Aspergillus lentulus
Amorphotheca resinae
Exophiala oligosperma
Neonectria ditissima
Ophiocordyceps australis
Cladophialophora bantiana CBS 173.52
Penicillium occitanis
NRPS-like (A-T-R),
Cladophialophora bantiana CBS 173.52
Cladophialophora carrionii CBS 160.54
Talaromyces marneffei
Exserohilum turcica
Penicillium camemberti
Having surveyed GCFs across the fungal kingdom, experiments were conducted during development of embodiments herein to compare and contrast this genomic and chemical repertoire to the well-established bacterial canon. A total of 5,453 bacterial genomes whose BGCs were publicly available in the antiSMASH bacterial BGCs database (Ref. B40; incorporated by reference in its entirety) were gathered, resulting in a dataset of 24,024 bacterial BGCs to compare to the dataset of 36,399 fungal BGCs. To visualize the biosynthetic space encompassed by these BGCs, the frequency of protein domains within BGCs for each major taxonomic group was determined. Principal Component Analysis (PCA) of these encoded BGCs showed a phylogenetic bias in this biosynthetic space, with bacteria and fungi occupying distinct regions (Fig. B4A).
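The domain-frequency and PCA steps above can be illustrated with a small, dependency-free sketch. It is an assumption-laden illustration, not the analysis performed herein: the group names and BGC domain lists are hypothetical toy data, only the first principal component is computed, and power iteration on the covariance matrix stands in for a full PCA routine.

```python
from collections import Counter


def domain_frequencies(bgcs_by_group, vocabulary):
    """For each taxonomic group, return the relative frequency of each
    protein domain across all BGCs in that group."""
    rows = {}
    for group, bgcs in bgcs_by_group.items():
        counts = Counter(domain for bgc in bgcs for domain in bgc)
        total = sum(counts.values())
        rows[group] = [counts[d] / total for d in vocabulary]
    return rows


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component."""
    names = list(rows)
    n, p = len(names), len(rows[names[0]])
    means = [sum(rows[g][k] for g in names) / n for k in range(p)]
    centered = [[rows[g][k] - means[k] for k in range(p)] for g in names]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Frequencies sum to 1 per group, so a constant start vector would be
    # mapped to zero by the covariance matrix; use a non-constant one.
    v = [1.0 / (k + 1) for k in range(p)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            raise ValueError("no variance to analyze")
        v = [x / norm for x in w]
    scores = {g: sum(c * x for c, x in zip(row, v)) for g, row in zip(names, centered)}
    return v, scores


# Hypothetical toy BGCs (lists of domains) for four made-up groups.
toy_groups = {
    "bacterial_group_1": [["C", "A", "T", "TE"], ["KS", "AT", "T", "TE"]],
    "bacterial_group_2": [["C", "A", "T", "TE"], ["KS", "AT", "KR", "T", "TE"]],
    "fungal_group_1": [["A", "T", "C"], ["KS", "AT", "DH", "ER", "KR", "T"]],
    "fungal_group_2": [["A", "T", "C"], ["KS", "AT", "DH", "MT", "ER", "KR", "T"]],
}
vocabulary = sorted({d for bgcs in toy_groups.values() for bgc in bgcs for d in bgc})

frequencies = domain_frequencies(toy_groups, vocabulary)
loadings, scores = first_principal_component(frequencies)
print(scores)
```

The sign of a principal component is arbitrary, so only the relative placement of groups along the component is meaningful. In this toy input the two bacterial groups fall on one side and the two fungal groups on the other, driven largely by the terminal thioesterase (TE) domain frequency.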
Dramatic differences in bacterial versus fungal NRPS and PKS assembly line logic were observed. Consistent with prior studies of iterative fungal PKS enzymes (Ref. B41; incorporated by reference in its entirety), fungal PKS BGCs typically encode a single backbone PKS enzyme, while bacterial PKS BGCs contain a median of 1.7 PKS backbone enzymes per cluster (Fig. B4B, right). Fungal NRPS BGCs also usually encode a single backbone enzyme, compared to the multiple backbone enzymes more typically observed in bacterial systems (Fig. B4B, left). Fungal NRPS and PKS enzymes also average ~150% of the size of bacterial backbones (Fig. BS14). In addition to these contrasting backbone enzyme compositions, systematic differences were observed in the top NRPS domain organizations (Fig. BS15), particularly in NRPS termination domains (Fig. B4C). The most common fungal NRPS termination domains are C-terminal condensation domains, recently found to catalyze release of peptide intermediates via intramolecular cyclization (Refs. B42-B44; incorporated by reference in their entireties). The next most common are terminal thioester reductase domains that perform either reductive release to aldehydes or alcohols or release via cyclization (Ref. B45; incorporated by reference in its entirety). This is in stark contrast to bacterial NRPS BGCs, which most commonly terminate with type I thioesterase domains that release intermediates as linear or cyclic peptides (Fig. B4C).
These collective differences between fungal and bacterial BGCs show systematic differences in NRPS biosynthetic logic between these two kingdoms. In the bacterial NRPS canon, a pathway is composed of multiple NRPS genes whose chromosomal order (and the order of catalytic domain “modules” within the encoded polypeptide) corresponds to the order of amino acid monomers in the metabolite product (Fig. B4D, right) (Ref. B46; incorporated by reference in its entirety). In the field of bacterial natural products, the use of this “collinearity rule” to predict metabolite scaffolds is commonplace (Refs. B19, B47, B48; incorporated by reference in their entireties); however, the large number of exceptions to this rule reduces the accuracy of these predictions. The prototypical fungal NRPS (Fig. B4D) primarily involves the action of biosynthetic domains within the same backbone enzyme, rather than multiple NRPS backbones acting in concert. This indicates that efforts to predict fungal NRPS scaffolds will be able to largely bypass the need to account for permutations of multiple NRPS genes, raising the possibility of increased predictive performance compared to bacteria.
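The collinearity rule can be expressed as a short sketch: under the rule, the predicted monomer order is obtained by reading modules in gene order and, within each gene, in module order. The gene names and substrate assignments below are hypothetical, and the sketch deliberately ignores the known exceptions to the rule noted above.

```python
def predict_monomer_order(nrps_genes):
    """Apply the collinearity rule: concatenate the substrates of each module,
    reading genes in chromosomal order and modules in order within each gene.

    nrps_genes: list of dicts, in chromosomal order, each with a 'gene' name
    and a 'module_substrates' list (one predicted substrate per module).
    """
    return [
        substrate
        for gene in nrps_genes
        for substrate in gene["module_substrates"]
    ]


# Hypothetical bacterial-style pathway: several NRPS genes acting in concert.
bacterial_style = [
    {"gene": "nrpsA", "module_substrates": ["Ser", "Gly"]},
    {"gene": "nrpsB", "module_substrates": ["Val"]},
    {"gene": "nrpsC", "module_substrates": ["Leu", "Thr"]},
]

# Hypothetical fungal-style pathway: a single backbone enzyme.
fungal_style = [
    {"gene": "nrps1", "module_substrates": ["Trp", "Ala"]},
]

bacterial_prediction = predict_monomer_order(bacterial_style)
fungal_prediction = predict_monomer_order(fungal_style)
print(bacterial_prediction, fungal_prediction)
```

With a single backbone enzyme there is only one gene order to consider, which is the sense in which fungal predictions can bypass permutations of multiple NRPS genes.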
Having shown that fungi and bacteria are distinct biosynthetically, experiments were conducted during development of embodiments herein to compare these genomics-based insights to the chemical space of known metabolites. A set of 9,382 bacterial compounds was added to the dataset of 15,213 fungal metabolites and analyzed using the same network analysis and chemical ontology workflow described above. PCA was performed to visualize the chemical space of major fungal and bacterial taxonomic groups within this compound dataset.
PCA of bacterial and fungal compounds (Fig. B5A) revealed a trend that parallels the analysis of fungal and bacterial biosynthetic space (Fig. B4A). Bacteria and fungi occupy separate regions of chemical space, differing dramatically in terms of chemical ontology superclass, a high-level descriptor of general structural type (Fig. B5B). Fungi have twice the frequency of lipids and nearly twice the frequency of heterocyclic compounds, a structural group that includes aromatic polyketide-related moieties such as furans and pyrans. Many of the chemical moieties and structural classes that are highly enriched in bacteria or fungi are vital in bioactive scaffolds. This includes moieties such as the bacterial aminoglycoside antibiotics (Ref. B49; incorporated by reference in its entirety), thiazoles present in the bacterial anti-cancer bleomycin family (Ref. B50; incorporated by reference in its entirety), and the steroid ring that forms the core scaffold of steroid drugs such as the fungal metabolite fusidic acid (Ref. B51; incorporated by reference in its entirety) (Fig. B5B). PCA loadings plots similarly reveal differences between bacterial and fungal chemical space, including a high prevalence of peptide-associated chemical ontology terms in bacteria, and lipid and aromatic polyketide terms in fungi (Fig. BS16).
Within the fungal kingdom, differences in PCA of the chemical repertoire of major taxonomic groups were observed (Fig. BS17). Pezizomycotina classes grouped together in chemical space, largely due to a higher proportion of polyketide and peptide-related chemical moieties (Fig. BS18). Basidiomycota are distinct chemically, possessing a much higher proportion of chemical moieties and descriptors associated with terpenes and other lipids. These observations based on chemical space are consistent with the higher proportion of NRPS and PKS BGCs within Pezizomycotina and the prevalence of terpene BGCs within Basidiomycota groups such as Agaricomycotina (Fig. B2B), and further supported by PCA of fungal BGCs, in which fungal phyla represent distinct groups (Figs. BS19 and BS20).
The GCF approach enables the systematic mapping of the biosynthetic repertoire encoded by large groups of fungal genomes. The fungal kingdom is a wealth of untapped biosynthetic potential, with the 1000 genomes analyzed here representing a reservoir of >12,000 new GCF-encoded scaffolds. This genome dataset is only a small subset of the >1 million predicted fungal species (Ref. B29; incorporated by reference in its entirety), indicating that the total biosynthetic potential of the fungal kingdom far surpasses that assembled here.
By organizing biosynthetically related BGCs into families, the GCF approach provides a means of cataloguing and dereplicating genome-encoded MFs. In the field of bacterial natural products discovery, this GCF paradigm has been expanded for automated linking of GCFs to MFs detected by metabolomics and molecular networking analysis, enabling high-throughput genome mining from industrial-scale strain collections (Refs. B5, B7, B29, B52; incorporated by reference in their entireties). Establishing the GCF approach for fungal genomes lays the groundwork for similar GCF-driven large-scale compound discovery efforts from fungi.
Large-scale genome sequencing projects such as the 1000 Fungal Genomes project, whose stated goal is sampling every taxonomic family within Fungi (Ref. B53; incorporated by reference in its entirety), will uncover a large amount of biosynthetic and chemical novelty. However, as 76% of fungal GCFs are species-specific and 16% are genus-specific, such genome sequencing efforts focused on taxonomic families will miss the majority of GCFs. Additional large-scale efforts to sample this biosynthetic space based on “depth” rather than “breadth” are suggested to more efficiently access these genomes. Future projects, now feasible for academic research groups due to ever-decreasing genome sequencing costs, should focus on expanding this dataset with species-level sequencing of taxonomic groups.
The GCF approach provides a means of selecting fungi for compound and BGC discovery via approaches such as heterologous expression (Ref. B54; incorporated by reference in its entirety) based not on taxonomic or phylogenetic markers, but with a strategy that focuses on efficient sampling of biosynthetic pathways. The distribution of GCFs shows groups of organisms with shared GCFs (Fig. BS6), and sampling based on these organism “groups” reduces the number of genomes required to capture the majority of fungal biosynthetic space. Simulated sampling based on shared GCFs indicated that 80% of GCFs from the 386 Eurotiomycete genomes are represented in a sample of only 145 genomes. By contrast, to represent the same number of GCFs, species-level sampling required 189 genomes and random sampling required 263 genomes (Fig. BS21). This indicates that the GCF approach provides a roadmap for systematic characterization of new fungal biosynthetic pathways and their compounds.
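By way of non-limiting illustration, sampling based on shared GCFs can be pictured as a greedy coverage procedure in which each newly selected genome is the one contributing the most GCFs not yet represented. The sketch below is a minimal example under stated assumptions: the function name, the strain names, the GCF identifiers, and the greedy heuristic itself are hypothetical and are not asserted to be the simulation used to generate Fig. BS21.

```python
def greedy_genome_sample(genome_gcfs, target_fraction=0.8):
    """Pick genomes greedily until target_fraction of all GCFs is covered.

    genome_gcfs: dict mapping a genome name to the set of GCF ids it contains.
    Returns the list of selected genomes in the order they were chosen.
    """
    all_gcfs = set().union(*genome_gcfs.values())
    needed = target_fraction * len(all_gcfs)
    covered, selected = set(), []
    remaining = dict(genome_gcfs)
    while len(covered) < needed and remaining:
        # Choose the genome contributing the most not-yet-covered GCFs
        # (names are sorted first so that ties resolve deterministically).
        best = max(sorted(remaining), key=lambda g: len(remaining[g] - covered))
        if not remaining[best] - covered:
            break  # no remaining genome adds a new GCF
        covered |= remaining.pop(best)
        selected.append(best)
    return selected


# Hypothetical toy collection: three strains share GCFs, one is unique.
toy = {
    "strain_A": {"gcf1", "gcf2", "gcf3"},
    "strain_B": {"gcf2", "gcf3"},
    "strain_C": {"gcf3", "gcf4"},
    "strain_D": {"gcf5"},
}
picked = greedy_genome_sample(toy, target_fraction=0.8)
```

In this toy collection, two of the four genomes suffice to represent four of the five GCFs; random or species-level selection would be expected to require more genomes on average for the same coverage.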
Analyses of both chemical and biosynthetic space show that bacteria and fungi represent chemically distinct sources for natural products discovery. Fungal compounds are closer to FDA-approved compounds than bacterial compounds in terms of several chemical properties, including three out of four “Lipinski Rule of Five” properties often used as guidelines for predicting oral bioavailability (Fig. BS22) (Ref. B55; incorporated by reference in its entirety). While many of the most successful natural products violate these rules of thumb, these data indicate that fungal metabolites may be more “druglike” than those occupying bacterial chemical space.
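For reference, the four Rule of Five properties are molecular weight (no more than 500 Da), calculated logP (no more than 5), hydrogen-bond donors (no more than 5), and hydrogen-bond acceptors (no more than 10). The following sketch counts violations from precomputed property values. The function name and the two example property sets are hypothetical and do not correspond to any compound in the dataset described herein; in practice the properties would be computed by a cheminformatics toolkit.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count Rule of Five violations from precomputed properties."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # calculated logP above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Hypothetical property sets, for illustration only.
small_polyketide_like = lipinski_violations(350.4, 2.1, 2, 5)
large_peptide_like = lipinski_violations(1200.0, -1.5, 12, 18)
```

A comparison such as that of Fig. BS22 could then be made by tabulating each property, or the violation count, across fungal, bacterial, and FDA-approved compound sets.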
Compound discovery efforts should be initiated with the understanding that different biological sources will yield distinct chemical space and different types of metabolite scaffolds. The fungal kingdom is rich in aromatic polyketides, while bacteria harbor a higher proportion of peptidic scaffolds. Within the fungal kingdom, Basidiomycota is a rich reservoir of terpene scaffolds, while BGC-rich Pezizomycotina classes are a richer source of polyketides and peptides. These data indicate that distinct taxonomic groups not only possess the capacity for different metabolite scaffolds, but also different types of scaffolds.
Rather than strain selection with the goal of maximizing biodiversity (i.e., the stated purpose of the 1000 Fungal Genomes Project), experiments were conducted during development of embodiments herein for selection of strains based on an optimal degree of overlap in genetic content. The approach requires strains to have some BGCs in common, but also seeks biosynthetic diversity. A goal is to establish an optimal pipeline for strain selection for linked genomics and metabolomics; the study below examines genetic markers as a proxy for GCF overlap in fungal strains.
From 1037 fungal genomes, a set of ~12,000 GCFs was generated and the relationship between GCF similarity and genetic markers was determined. To find genetic marker sequences that could be used as a proxy for GCF overlap in selection of fungal strains, the GCF overlap was plotted vs. three genetic markers that have been previously used for fungal phylogeny (
Experiments were conducted during development of embodiments herein to establish a new fungal bioinformatics pipeline (
The second component of the platform combines state-of-the-art high-resolution mass spectrometry (HRMS) with a cheminformatics pipeline for dereplication of known compounds in metabolite extracts. UHPLC-MS metabolomics data were collected for the same 50 Aspergillus and Penicillium strains analyzed using our GCF analysis workflow. Each strain was grown on four media conditions for expression of diverse metabolites. Metabolite extracts were analyzed using an Agilent 1290 UHPLC and a Q Exactive mass spectrometer dedicated to natural product extract analysis. Metabolomics data were analyzed using molecular networking, an approach that clusters spectra from related metabolites into molecular families for data visualization and annotation.
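Molecular networking commonly relies on a spectral similarity score, such as a cosine score between fragmentation spectra, to decide which spectra belong to the same molecular family. The sketch below computes a plain binned cosine similarity and is illustrative only: the function name, the bin width, and the toy spectra are assumptions, and molecular networking tools typically use a modified cosine that also accounts for precursor mass differences, which is not implemented here.

```python
import math


def binned_cosine(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two spectra given as (m/z, intensity) lists."""
    def to_bins(spectrum):
        # Sum intensities of peaks that fall into the same m/z bin.
        bins = {}
        for mz, intensity in spectrum:
            key = round(mz / bin_width)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins

    a, b = to_bins(spec_a), to_bins(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical toy spectra as (m/z, intensity) pairs.
spectrum_1 = [(100.0, 50.0), (150.0, 100.0)]
spectrum_2 = [(200.0, 80.0), (250.0, 20.0)]
same = binned_cosine(spectrum_1, spectrum_1)
different = binned_cosine(spectrum_1, spectrum_2)
```

Spectra whose pairwise score exceeds a chosen threshold would be connected in the network, and connected groups of spectra would then be treated as molecular families.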
The pipeline uses a metabologenomics approach to connect GCFs to their metabolite products for discovery of new compounds and biosynthetic enzymes. The presence/absence patterns of GCFs and molecular families across a strain collection are compared using a chi-squared test, and statistically significant correlations represent putative biosynthetic relationships. These data are visualized using the Prospect web application (prospect-fungi.com/), which allows targeting of specific GCFs and metabolites for further characterization.
Using 50 strains of Aspergillus and Penicillium, a set of 14 experimentally characterized fungal GCFs from the MIBiG database whose metabolite products were detected was examined. After applying the conservative Bonferroni approach to estimate the False Discovery Rate (FDR) and correct for multiple hypothesis testing, statistically significant correlations were observed for 8 of the 14 known GCFs, a success rate of ~60% (
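As a minimal sketch of the correlation step, the presence/absence of one GCF and one molecular family across a set of strains can be arranged in a 2x2 contingency table and scored with a Pearson chi-squared test with one degree of freedom. The function names, the toy vectors, the significance level, and the number of tests below are hypothetical, and the exact scoring used by the pipeline may differ. For small strain counts the chi-squared approximation is rough, and an exact test may be preferable.

```python
import math


def chi2_presence_absence(gcf_present, mf_present):
    """Pearson chi-squared test (1 d.o.f.) on two presence/absence vectors.

    Returns (chi2 statistic, p-value). No continuity correction is applied.
    """
    pairs = list(zip(gcf_present, mf_present))
    n = len(pairs)
    a = sum(1 for g, m in pairs if g and m)          # GCF present, MF present
    b = sum(1 for g, m in pairs if g and not m)      # GCF present, MF absent
    c = sum(1 for g, m in pairs if not g and m)      # GCF absent, MF present
    d = n - a - b - c                                # both absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a constant vector carries no association signal
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Apply a Bonferroni-adjusted threshold of alpha / n_tests."""
    return p_value < alpha / n_tests


# Hypothetical toy data: 10 strains, GCF and molecular family co-occur exactly.
gcf = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
mf = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
chi2, p = chi2_presence_absence(gcf, mf)
hit_20_tests = bonferroni_significant(p, n_tests=20)
hit_1000_tests = bonferroni_significant(p, n_tests=1000)
```

The toy example shows why the size of the strain collection matters: a perfect co-occurrence across 10 strains survives correction for 20 tests but not for 1000, so larger collections are needed when many GCF and molecular family pairs are tested.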
Experiments will be conducted during development of embodiments herein to expand the fungal metabolomics dataset with, minimally, an additional 250 Aspergillus, Penicillium, and Eurotiales strains, resulting in a total of 300 for this project. Metabolomics data from these strains are annotated using an improved version of this molecular networking cheminformatics pipeline and correlated to biosynthetic pathways as demonstrated here in
Experiments conducted during development of embodiments herein have led to the creation of a web tool known as Prospect which provides a variety of views and a page that allows users to browse BGCs in each of the GCFs we have assigned to date. This includes a side panel that displays all gene clusters present within the family, with genes color-coded by detected protein domains. Compounds associated with experimentally characterized clusters are also visible in this alpha-version of Prospect. Upon selecting a specific gene, a page shows detected protein domains, with links to relevant Pfam database entries and the option to download or perform an NCBI BLAST search with a protein or domain sequence. In addition to this page for viewing GCFs, additional pages display tables allowing users to find GCFs based on taxonomy information, Prospect accession number, biosynthetic type, and experimentally characterized status.
The alpha version of Prospect was designed using a combination of programming frameworks and languages chosen based on their ability to scale to large datasets, their level of creator/developer support, their ability to provide interactive user experiences, and their proven track record and popularity with web developers. The frontend visual component was designed using Angular, a framework commonly used in enterprise software development that is designed by and heavily supported by Google. The backend, responsible for accessing a SQL database housing all genomics and metabolomics data, was designed as a RESTful API using Django, a Python framework with strong community support used by organizations such as Instagram, Mozilla, and NASA.
Using the process above on 50 strains of phylogenetically diverse fungi from the Aspergillus and Penicillium genera,
Correlative analysis highlighted the gene cluster family “hybrids_158”; of the 9 strains that have one of the 9 BGCs in this GCF, their expression of a compound detected by mass spec as an ion at m/z 343.129 is shown in
The following references, some of which are cited above by number, are incorporated herein by reference in their entireties.
The present application claims priority to U.S. Provisional Patent Application Ser. No. 62/362,437 filed Jul. 14, 2016, which is hereby incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2020/059502 | 11/6/2020 | WO |

Number | Date | Country
---|---|---
62932128 | Nov 2019 | US