The presently-disclosed subject matter generally relates to nucleic acid molecules, polypeptide molecules, compositions, and methods for use in modifying metabolism, growth, and/or size of a cell. In particular, certain embodiments of the presently-disclosed subject matter relate to unique aphid polypeptide molecules, related nucleic acid molecules, and use thereof in compositions and methods for modifying cells.
Parasites manipulate hosts using many molecular mechanisms. Insect galls represent one of the most extreme forms of such inter-species manipulation. Insect-induced galls are intricately patterned, nutrient rich, protective homes (Mani, 1964), and they do not reflect simply unpatterned cellular over-proliferation, as observed for microbial galls like the crown gall induced by Agrobacterium tumefaciens. Each insect species appears to induce a novel gall, even when related insect species attack the same plant, implying that each insect species provides unique instructions to re-program latent plant developmental networks (Cook and Gullan, 2008; Crespi and Worobey, 1998; Dodson, 1991; Leatherdale, 1955; Martin, 1938; Parr, 1940; Plumb, 1953; Stern, 1995; Stone and Cook, 1998).
The mechanisms used by insects to induce galls have not previously been described. At least some gall-inducing insects produce phytohormones (Dorchin et al., 2009; McCalla et al., 1962; Suzuki et al., 2014; Tanaka et al., 2013; Tooker and Helms, 2014; Yamaguchi et al., 2012), although it is not yet clear whether insects inject these hormones or whether the hormones contribute to gall growth. However, injection of phytohormones alone probably cannot generate the large diversity of species-specific insect galls. In addition, galling insects induce plant transcriptional changes independent of phytohormone activity (Bailey et al., 2015; Bilder and Irvine, 2017; Nabity et al., 2013; Schultz et al., 2019; Shih et al., 2018; Takeda et al., 2019). Thus, given the complex cellular changes required for gall development, insects probably introduce many molecules to control gall development.
Described herein is a unique aphid gene, determinant of gall color (dgc), that is genetically associated with gall color, providing the first example of an insect gene implicated in gall development.
dgc encodes a “BICYCLE protein,” a member of a new family of diverse secreted proteins that are produced specifically in salivary glands of galling aphids and that are likely injected into plant cells to modify plant physiology, cell biology, and development. BICYCLE proteins may therefore be harnessed to manipulate the physiology, cell biology, and development of cells in desired ways.
The presently-disclosed subject matter meets some or all of the above-identified needs, as will become evident to those of ordinary skill in the art after a study of information provided in this document.
This Summary describes several embodiments of the presently-disclosed subject matter, and in many cases lists variations and permutations of these embodiments. This Summary is merely exemplary of the numerous and varied embodiments. Mention of one or more representative features of a given embodiment is likewise exemplary. Such an embodiment can typically exist with or without the feature(s) mentioned; likewise, those features can be applied to other embodiments of the presently-disclosed subject matter, whether listed in this Summary or not. To avoid excessive repetition, this Summary does not list or suggest all possible combinations of such features.
The presently-disclosed subject matter includes a polynucleotide molecule, which includes a nucleic acid sequence having at least about 75% identity to SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 7, or SEQ ID NO: 8.
In some embodiments, the polynucleotide molecule includes a nucleic acid sequence having the sequence of SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 7, SEQ ID NO: 8, or a functional fragment thereof.
In some embodiments, the polynucleotide molecule is optimized for expression in Arabidopsis thaliana and comprises the sequence of SEQ ID NO: 7. In some embodiments, the polynucleotide molecule is optimized for expression in Oryza sativa and comprises the sequence of SEQ ID NO: 8.
The presently-disclosed subject matter also includes vectors containing one or more of the polynucleotide molecules as disclosed herein.
The presently-disclosed subject matter also includes a composition containing a vector as disclosed herein.
The presently-disclosed subject matter also includes a polypeptide molecule, which includes the amino acid sequence of SEQ ID NO: 5, SEQ ID NO: 6, or a functional fragment thereof.
The presently-disclosed subject matter also includes a composition containing a polypeptide molecule as disclosed herein.
The presently-disclosed subject matter also includes a method of modifying a cell, which involves administering to the cell a vector comprising or a polypeptide encoded by a polynucleotide molecule as disclosed herein. In some embodiments, the cell is selected from the group consisting of a plant cell, an animal cell, a fungal cell, and a bacterial cell.
The presently-disclosed subject matter also includes a method of modulating production of anthocyanins or their precursors in a plant, which involves administering to the plant a vector comprising or a polypeptide encoded by a polynucleotide as disclosed herein. In some embodiments of the method, the production of anthocyanins or their precursors is reduced. In some embodiments of the method, the anthocyanins are selected from peonidin-3,5-diglucoside and malvidin-3,5-diglucoside. In some embodiments of the method, the precursors are selected from cyanidin-3,5-diglucoside, delphinidin-3,5-diglucoside, and petunidin-3,5-diglucoside.
The presently-disclosed subject matter also includes a method of modulating expression of a gene in a cell, comprising: administering to the cell a vector comprising or a polypeptide encoded by the polynucleotide molecule as disclosed herein, wherein the gene is selected from: FAOMT-1, FAOMT-2, GSTF11, GSTF12, UGT75C1, UFGT, ACCA, and combinations thereof. In some embodiments of the method, the cell is selected from the group consisting of a plant cell, an animal cell, a fungal cell, and a bacterial cell. In some embodiments of the method, the gene expression is downregulated.
The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are used, and the accompanying drawings of which:
The details of one or more embodiments of the presently-disclosed subject matter are set forth in this document. Modifications to embodiments described in this document, and other embodiments, will be evident to those of ordinary skill in the art after a study of the information provided in this document. The information provided in this document, and particularly the specific details of the described exemplary embodiments, is provided primarily for clearness of understanding and no unnecessary limitations are to be understood therefrom. In case of conflict, the specification of this document, including definitions, will control.
The presently-disclosed subject matter includes nucleic acid molecules and polypeptide molecules related to a determinant of gall color (DGC) gene and protein, and methods for modifying a plant.
As disclosed herein, a DGC gene is an aphid gene that can produce a DGC protein in the salivary glands of galling aphids. As described further herein, a number of regulatory single nucleotide polymorphisms (SNPs) have been identified that result in turning off expression of the DGC gene. When the DGC gene is turned off, such that the DGC protein is absent from the saliva of a galling aphid, the result is a plant gall that is red. When the DGC gene is expressed, such that the DGC protein is present in the saliva of the galling aphid, a green gall results in the plant.
In this regard, and with reference to the studies described hereinbelow, DGC affects the anthocyanin production pathway, having an impact on enzyme classes (ACCA, UFGT, FAOMT, GSTF), anthocyanins that generate the red color in galls (peonidin-3,5-diglucoside(dg) and malvidin-dg), and anthocyanin precursor molecules found in red galls (cyanidin-dg, delphinidin-dg, petunidin-dg).
The Hormaphis cornu aphid wild type DGC gene has the nucleotide sequence of SEQ ID NO: 1. The Hormaphis cornu aphid wild type DGC protein has the amino acid sequence of SEQ ID NO: 2. Disclosed herein are modified sequences for use in expressing DGC in plants.
The wild type aphid DGC sequence includes a signal secretion sequence, which can be removed to express DGC in the plant while preventing secretion of the expressed DGC protein from the cell. The signal secretion sequence includes about the first 20 amino acids of a bicycle gene protein, such as the DGC gene protein.
In some embodiments, an N-terminal methionine can be added to the modified sequence to provide for proper translation.
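By way of illustration only, the sequence modification described above can be sketched in Python. This is a minimal sketch and not the procedure used to derive any sequence disclosed herein. The function name, the fixed 20-codon default, and the toy input are hypothetical assumptions; the actual length of the signal secretion sequence for a given BICYCLE protein would need to be determined separately.

```python
def remove_signal_sequence(cds: str, signal_codons: int = 20) -> str:
    """Drop the first `signal_codons` codons of a coding sequence and
    make sure the remaining sequence starts with a methionine codon.

    Assumption: the signal sequence is exactly `signal_codons` codons long.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    if len(cds) <= 3 * signal_codons:
        raise ValueError("coding sequence is not longer than the signal sequence")
    mature = cds[3 * signal_codons:]
    # Add ATG only when the truncated sequence does not already begin with it.
    return mature if mature.startswith("ATG") else "ATG" + mature


# Toy input (hypothetical, not a disclosed sequence):
# 20 "signal" codons followed by three further codons.
toy_cds = "ATG" + "GCT" * 19 + "AAAGGTTAA"
modified = remove_signal_sequence(toy_cds)
```

The sketch operates on codons rather than amino acids so that the reading frame of the remaining sequence is preserved.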
In some embodiments, the codons of the nucleotide sequence can be further modified to optimize expression in a particular plant of interest, which can be accomplished by one of ordinary skill in the art using known methods, after studying the present disclosure. In this regard, in some embodiments, one could generate a DNA sequence that contains, in the following order, an enhancer sequence to drive gene expression in a relevant plant tissue, a plant promoter that supports high levels of gene expression, the nucleotide sequence encoding the protein of interest, and finally a plant 3′ UTR that supports high levels of protein translation.
The optimized gene can be expressed in a plant of interest using a vector known by those of ordinary skill in the art, selected with consideration to the plant species and plant organ of interest.
In some embodiments, the codons of the nucleotide sequence and the enhancer, promoter and 3′ UTR sequences could be specified to allow expression in animal, fungal, or bacterial cells.
The presently-disclosed subject matter includes a polynucleotide molecule, comprising a nucleic acid sequence encoding the sequence of SEQ ID NO: 5 or a functional fragment thereof. In some embodiments, the nucleic acid sequence is modified relative to the sequence of SEQ ID NO: 3 for optimized expression in a particular plant species of interest. In some embodiments, the polynucleotide molecule is optimized for expression in Arabidopsis thaliana and comprises the sequence of SEQ ID NO: 7. In some embodiments, the polynucleotide molecule is optimized for expression in Oryza sativa and comprises the sequence of SEQ ID NO: 8.
The presently-disclosed subject matter further includes a polynucleotide molecule, comprising a nucleic acid sequence having at least about 98, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, or 25% identity to SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 7, or SEQ ID NO: 8. In some embodiments, the nucleic acid sequence has at least 75% identity to SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 7, or SEQ ID NO: 8. In some embodiments, the nucleic acid sequence has the sequence of SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 7, or SEQ ID NO: 8. In some embodiments, the nucleic acid sequence is modified relative to SEQ ID NO: 3 to optimize expression in a particular plant species of interest. Such optimization can be achieved, upon study of this document, by one of ordinary skill.
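For illustration, one way a percent identity value of the kind recited above could be computed is sketched below in Python. The sketch assumes the two sequences have already been aligned to equal length with '-' marking gaps, and it assumes identity is defined as matching columns divided by all aligned columns. Other conventions for the denominator are in common use, and this document does not specify one. The function names and the 75% default are illustrative assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Gaps are written as '-'. Columns where both sequences have a gap are
    ignored; identity is matches divided by the remaining aligned columns.
    (Assumed convention; other denominators are also used in practice.)
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no aligned columns to compare")
    # A gap opposite a residue counts as a mismatch.
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def meets_threshold(aligned_a: str, aligned_b: str, threshold: float = 75.0) -> bool:
    """True when the percent identity is at least `threshold`."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Under this convention, two eight-base sequences differing at one position have 87.5% identity and would satisfy a 75% threshold.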
The presently-disclosed subject matter further includes a vector, comprising a polynucleotide molecule as disclosed herein.
The presently-disclosed subject matter further includes a composition comprising a polypeptide molecule comprising the sequence of SEQ ID NO: 5 or SEQ ID NO: 6, formulated for administration to a plant. The presently-disclosed subject matter further includes a composition comprising a polynucleotide molecule as disclosed herein. The presently-disclosed subject matter further includes a composition comprising a vector as disclosed herein.
The presently-disclosed subject matter includes methods of modifying a plant. In some embodiments, the method involves administering to the plant a polynucleotide molecule, a vector, and/or a composition as disclosed herein.
The presently-disclosed subject matter further includes a method of modulating production of anthocyanins or their precursors in a plant, comprising: administering to the plant a polynucleotide molecule, a vector, and/or a composition as disclosed herein. In some embodiments of the method, the production of anthocyanins or their precursors is reduced. In some embodiments of the method, the anthocyanins are selected from peonidin-3,5-diglucoside and malvidin-3,5-diglucoside. In some embodiments of the method, the precursors are selected from cyanidin-3,5-diglucoside, delphinidin-3,5-diglucoside, and petunidin-3,5-diglucoside.
The presently-disclosed subject matter further includes a method of modulating expression of a gene in a plant, comprising: administering to the plant a polynucleotide molecule, a vector, and/or a composition as disclosed herein, wherein the gene is selected from: FAOMT-1, FAOMT-2, GSTF11, GSTF12, UGT75C1, UFGT, ACCA, and combinations thereof.
In some embodiments of the method, the gene expression is downregulated.
While the terms used herein are believed to be well understood by those of ordinary skill in the art, certain definitions are set forth to facilitate explanation of the presently-disclosed subject matter.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of skill in the art to which the invention(s) belong.
All patents, patent applications, published applications and publications, GenBank sequences, databases, websites and other published materials referred to throughout the entire disclosure herein, unless noted otherwise, are incorporated by reference in their entirety.
Where reference is made to a URL or other such identifier or address, it is understood that such identifiers can change and particular information on the internet can come and go, but equivalent information can be found by searching the internet. Reference thereto evidences the availability and public dissemination of such information.
As used herein, the abbreviations for any protective groups, amino acids and other compounds, are, unless indicated otherwise, in accord with their common usage, recognized abbreviations, or the IUPAC-IUB Commission on Biochemical Nomenclature (see, Biochem. (1972) 11(9):1726-1732).
Although any methods, devices, and materials similar or equivalent to those described herein can be used in the practice or testing of the presently-disclosed subject matter, representative methods, devices, and materials are described herein.
In certain instances, nucleotides and polypeptides disclosed herein are included in publicly-available databases, such as GENBANK® and SWISSPROT. Information including sequences and other information related to such nucleotides and polypeptides included in such publicly-available databases are expressly incorporated by reference. Unless otherwise indicated or apparent the references to such publicly-available databases are references to the most recent version of the database as of the filing date of this Application.
The terms “polypeptide fragment” or “fragment”, when used in reference to a reference polypeptide, refer to a polypeptide in which amino acid residues are deleted as compared to the reference polypeptide itself. A fragment can also be a “functional fragment,” in which case the fragment retains some or all of the activity of the reference polypeptide as described herein.
As used herein, an “activity” of a polypeptide refers to any activity exhibited by the reference full-length BICYCLE protein. The activity of a modified polypeptide can be any percentage of the activity of the full-length polypeptide.
The present application can “comprise” (open ended) or “consist essentially of” the components of the present invention as well as other ingredients or elements described herein. As used herein, “comprising” is open ended and means the elements recited, or their equivalent in structure or function, plus any other element or elements which are not recited. The terms “having” and “including” are also to be construed as open ended unless the context suggests otherwise.
Following long-standing patent law convention, the terms “a”, “an”, and “the” refer to “one or more” when used in this application, including the claims. Thus, for example, reference to “a cell” includes a plurality of such cells, and so forth.
Unless otherwise indicated, all numbers expressing quantities of ingredients, properties such as reaction conditions, and so forth used in the specification and claims are to be understood as being modified in all instances by the term “about”. Accordingly, unless indicated to the contrary, the numerical parameters set forth in this specification and claims are approximations that can vary depending upon the desired properties sought to be obtained by the presently-disclosed subject matter.
As used herein, the term “about,” when referring to a value or to an amount of mass, weight, time, volume, concentration or percentage is meant to encompass variations of in some embodiments ±20%, in some embodiments ±10%, in some embodiments ±5%, in some embodiments ±1%, in some embodiments ±0.5%, in some embodiments ±0.1%, in some embodiments ±0.01%, and in some embodiments ±0.001% from the specified amount, as such variations are appropriate to perform the disclosed method.
As used herein, ranges can be expressed as from “about” one particular value, and/or to “about” another particular value. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. For example, if the value “10” is disclosed, then “about 10” is also disclosed. It is also understood that each unit between two particular units is also disclosed. For example, if 10 and 15 are disclosed, then 11, 12, 13, and 14 are also disclosed.
As used herein, “optional” or “optionally” means that the subsequently described event or circumstance does or does not occur and that the description includes instances where said event or circumstance occurs and instances where it does not. For example, an optionally variant portion means that the portion is variant or non-variant.
The presently-disclosed subject matter is further illustrated by the following specific but non-limiting examples. The following examples may include compilations of data that are representative of data gathered at various times during the course of development and experimentation related to the present invention.
The aphid, Hormaphis cornu, induces galls on the leaves of witch hazel, Hamamelis virginiana, in the Eastern United States (
After several days, the basal side of the gall encloses the fundatrix and the gall continues to grow apically and laterally, providing the fundatrix and her offspring with protection and abundant food. After several weeks, the basal side of the gall opens to allow aphids to remove excreta (honeydew) and molted nymphal skins from the gall and, eventually, to allow winged migrants to depart. Continued gall growth requires the constant presence of the fundatrix and gall tissue dies in her absence (Lewis and Walton, 1958; Rehill and Schultz, 2001), suggesting that the fundatrix must continuously inject material into plant tissue to overcome plant defenses.
A Natural Gall Color Polymorphism is Linked to Regulatory Variation in a Novel Aphid Gene, Determinant of Gall Color
Populations of H. cornu include approximately 4% red galls and 96% green galls (
Since GWAS can sometimes produce spurious associations, an independent replication study was performed and it was found that all 11 SNPs were highly significantly associated with gall color in fundatrices isolated from 435 green and 431 red galls (LOD=191-236;
Based on these genetic associations and further evidence presented below, the name determinant of gall color (dgc) was assigned to Horco_16073. dgc encodes a predicted protein of 23 kDa with an N-terminal secretion signal sequence (
Most SNPs associated with green or red galls were found in one of two predominant haplotypes (
Chromosomal Aberrations Cannot Explain Association of 11 SNPs with Gall Color
The possibility was considered that DNA aberrations linked to the 11 associated SNPs might explain the red-green gall polymorphism. No evidence was found for large insertions, deletions, or other chromosomal aberrations in this region (
Gene Ontology Analysis of Plant Genes Differentially Expressed in Galls is Consistent with the Observed Cell Biology of Gall Development
Genes associated with cell division were strongly upregulated in galls (
Regulatory Variants at Dgc Dominantly Silence Dgc Expression
Since all 11 dgc polymorphisms associated with gall color occur outside of dgc exons (
dgc expression is reduced approximately 20-fold in fundatrix salivary glands with dgcRed/dgcGreen (27±22.6 CPM, mean±SD) versus dgcGreen/dgcGreen genotypes (536±352.3 CPM, mean±SD). Thus, dgcRed alleles probably downregulate both the dgcRed and dgcGreen alleles in heterozygotes. By examining exonic SNPs in dgc, it was confirmed that the dgc transcript is strongly downregulated from both dgcRed and dgcGreen alleles in heterozygotes (
High Levels of Dgc Transcription are Associated with Downregulation Specifically of Plant Anthocyanin Genes and Two Anthocyanins
Since red galls are associated with downregulation of only dgc, consideration was given to how the plant responds to changes in this single putative signal. To examine this question, whole-genome differential expression was performed on plant mRNA isolated from galls induced by aphids with dgcRed/dgcGreen versus dgcGreen/dgcGreen genotypes (Supp info). Only eight plant genes were differentially expressed between green and red galls and all eight genes were downregulated in green galls (
The seven most strongly downregulated plant genes all encode enzymes of the anthocyanin biosynthetic pathway (
Six of these enzymes are required for final steps of anthocyanin production and deposition (
Taken together, these observations suggest that dgc represses transcription of seven anthocyanin biosynthetic enzymes. It is not clear how dgc induces specific transcriptional changes in seven plant genes and it may act by altering activity of an upstream regulator of these plant genes.
Aphids Induce Widespread Transcriptomic Changes in Galls
The gall color polymorphism reflects only a small part of the gall phenotype. To estimate how many plant genes are differentially expressed during gall development, differential expression analysis of plant genes in galls versus the surrounding leaf tissue was performed. Approximately 31% of plant genes were upregulated and 34% were downregulated at FDR=0.05 in galls versus leaf tissue (
Thus, approximately 15,000 plant genes are differentially expressed in galls, representing a dramatic re-programming of plant cell biology. If other aphid effector molecules act in ways similar to dgc, which is associated with differential expression of only eight plant genes, then gall development may require injection of hundreds or thousands of effector molecules.
dgc is the Founding Member of a Large Class of Novel Bicycle Genes Expressed Specifically in the Salivary Glands of Gall-Inducing Aphids
To identify additional proteins that aphids may inject into plants to contribute to gall development, the fact that only some individuals in the complex life cycle of H. cornu induce galls was exploited (
Half of these genes (744) were homologous to previously identified genes, many of which had functional annotations. Gene Ontology analysis of the “annotated” genes suggests that they contribute mostly to the demands for high levels of protein secretion in fundatrix salivary glands (
Hierarchical clustering of the unannotated genes by sequence similarity identified one large (476 genes) and one small (43 genes) cluster of related genes, and 222 genes sharing few or no homologs (
Proteins encoded by the large cluster display conservation mainly of a pair of widely spaced cysteine-tyrosine-cysteine (CYC) motifs (
Thus, dgc, which is associated with gall color, is the founding member of a diverse family of genes encoding secreted proteins expressed specifically in the salivary glands of gall forming generations. Bicycle genes are therefore good candidates to encode many of the molecules required to generate the extensive transcriptional changes observed in galls.
It was found that eleven derived regulatory polymorphisms at dgc in the gall forming aphid H. cornu are associated with almost complete silencing of dgc in aphid salivary glands (
BICYCLE Protein Functions
dgc likely encodes a protein that is deposited by aphids into gall tissue, but the mechanism by which this novel aphid protein specifically and dramatically downregulates seven anthocyanin biosynthetic genes remains to be determined. The primary sequences of DGC and other BICYCLE proteins provide few clues to their molecular mode of action. Outside of the N-terminal secretion signal, BICYCLE proteins possess no similarity with previously reported proteins and display no conserved domains that might guide functional studies. The relatively well-conserved C-Y-C motif appears to define a pair of ~50-80 aa domains in each protein and the paired cysteines may form disulfide bonds, which is commonly observed for secreted proteins.
Methods
Imaging of Leaves and Fundatrices Inside Developing Galls
Young Hamamelis virginiana (witch hazel) leaves or leaves with early stage galls of Hormaphis cornu were fixed in Phosphate Buffered Saline (PBS: 137 mM NaCl, 2.7 mM KCl, 10 mM Na2HPO4, 1.8 mM KH2PO4, pH 7.4) containing 0.1% Triton X-100, 2% paraformaldehyde and 0.5% glutaraldehyde (paraformaldehyde and glutaraldehyde were EM grade from Electron Microscopy Services) at room temperature for two hours without agitation to prevent the disruption of the aphid stylet inserted into leaf tissue. Fixed leaves or galls were washed in PBS containing 0.1% Triton X-100, hand cut into small sections (~10 mm²), and embedded in 7% agarose for subsequent sectioning into 0.3 mm thick sections using a Leica Vibratome (VT1000s). Sectioned plant tissue was stained with Calcofluor White (0.1 mg/mL, Sigma-Aldrich, F3543) and Congo Red (0.25 mg/mL, Sigma-Aldrich, C6767) in PBS containing 0.1% Triton X-100 with 0.5% DMSO and Escin (0.05 mg/mL, Sigma-Aldrich, E1378) at room temperature with gentle agitation for 2 days. Stained sections were washed with PBS containing 0.1% Triton X-100. Soft tissues were digested and cleared to reduce light scattering during subsequent imaging using a mixture of 0.25 mg/mL collagenase/dispase (Roche #10269638001) and 0.25 mg/mL hyaluronidase (Sigma Aldrich #H3884) in PBS containing 0.1% Triton X-100 for 5 hours at 37° C. To avoid artifacts and warping caused by osmotic shrinkage of soft tissue and agarose, samples were gradually dehydrated in glycerol (2% to 80%) and then ethanol (20% to 100%) (Ott 2008) and mounted in methyl salicylate (Sigma-Aldrich, M6752) for imaging. Serial optical sections were obtained at 2 μm intervals on a Zeiss 880 confocal microscope with a Plan-Apochromat 10×/0.45 NA objective, at 1 μm with a LD-LCI 25×/0.8 NA objective or at 0.5 μm with a Plan-Apochromat 40×/0.8 NA objective. Maximum projections of confocal stacks or rotation of images were carried out using FIJI (Schindelin et al. 2012).
Hormaphis cornu Genome Sequencing, Assembly, and Annotation
H. cornu aphids were collected from a single gall for genome sequencing. All aphids within the gall were presumed to be clonal offspring of a single fundatrix, because all H. cornu galls that were examined contain only a single fundatrix and the ostiole of the galls was closed at the time this gall was collected, so there is little chance of inter-gall migration. High molecular weight (HMW) DNA was prepared by gently grinding aphids with a plastic pestle against the inside wall of a 2 mL Eppendorf tube in 1 mL of 0.5% SDS, 200 mM Tris-HCl pH 8, 25 mM EDTA, 250 mM NaCl with 10 uL of 1 mg/mL RNAse A. The sample was incubated at 37° C. for 1 hour and then 30 uL of 10 mg/mL Proteinase K was added and the sample was incubated for an additional 1 hr at 50° C. with gentle agitation at 300 RPM. 1 mL of Phenol:Chloroform:Isoamyl alcohol (25:24:1) was added and the sample was centrifuged at 16,000 RCF for 10 min. The supernatant was removed to a new 2 mL Eppendorf tube and the Phenol:Chloroform:Isoamyl alcohol extraction was repeated. The supernatant was removed to a new 2 mL tube and 2.5× volumes of absolute ethanol were added. The sample was centrifuged at 16,000 RCF for 15 min and then washed with fresh 70% ethanol. All ethanol was removed with a pipette and the sample was air dried for approximately 15 minutes and DNA was resuspended in 50 uL TE. This sample was sent to HudsonAlpha Institute for Biotechnology for genome sequencing.
DNA quality control, library preparation, and Chromium 10× linked read sequencing were performed by HudsonAlpha Institute for Biotechnology. Most of the mass of the HMW DNA appeared greater than approximately 50 kb on a pulsed field gel and paired end sequencing on an Illumina HiSeq X10 yielded 816 M reads. The genome was assembled using Supernova (Weisenfeld et al. 2017) using 175 M reads, which generated the best genome N50 of a range of values tested. This 10× genome consisted of 21,072 scaffolds of total length 319.352 MB. The genome scaffold N50 was 839.101 KB and the maximum scaffold length was 3.495 MB.
Dovetail Genomics was engaged to apply HiC and Chicago to generate larger scaffolds. HMW gDNA from the same sample used for 10× genome sequencing was submitted for Chicago, and a separate sample of frozen aphids was submitted for HiC. The Dovetail genome consisted of 11,244 scaffolds of total length 320.34 MB with a scaffold N50 of 36.084 Mb. This genome, named hormaphis_cornu_26Sep2017_PoQx8, contains 9 main scaffolds, each longer than 17.934 Mb, which appear to represent the expected 9 chromosomes of H. cornu (Blackman and Eastop 1994). This assembly also includes the circular genome of the bacterial endosymbiont Buchnera aphidicola of 643,259 bp. BUSCO analysis (Simao et al. 2015) using the gVolante web interface (Nishimura et al., 2017) with the Arthropod gene set revealed that the genome contains 1026 of 1066 (96.25%) complete core genes and 1038 of 1066 (97.37%) partial core genes.
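The assembly statistics reported above can be illustrated with a short Python sketch. The scaffold N50 is computed here under its usual definition (the scaffold length at which the cumulative length of scaffolds, sorted longest first, first reaches half of the total assembly length), and the BUSCO percentages are simple ratios of the counts given above. The function names and the toy scaffold lengths are hypothetical; the assembly software may compute these statistics with minor differences.

```python
def scaffold_n50(lengths):
    """Return N50: the length L such that scaffolds of length >= L
    together contain at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running to total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimal places."""
    return round(100.0 * part / whole, 2)


# Toy scaffold lengths (hypothetical, not the H. cornu assembly).
toy_n50 = scaffold_n50([10, 8, 6, 4, 2])

# BUSCO counts as reported above.
complete_pct = percent(1026, 1066)
partial_pct = percent(1038, 1066)
```

For the toy lengths, the two longest scaffolds (10 and 8) already cover more than half of the total of 30, so the N50 is 8.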
This genome was annotated for protein-coding genes using RNA-seq data collected from salivary glands and carcasses of many stages of the H. cornu life cycle using BRAKER (Altschul et al. 1990; Barnett et al. 2011; Camacho et al. 2009; Hoff et al. 2016, 2019; Li 2013; Lomsadze, Burns, and Borodovsky 2014; Stanke et al. 2006). To increase the efficiency of mapping RNA-seq reads for differential expression analysis, 3′ UTRs were predicted using UTRme (Radío et al. 2018). It was found that UTRme sometimes predicted UTRs within introns. Therefore, a custom R script was applied to remove such predicted UTRs. Later, after discovering the bicycle genes, all predicted bicycle genes were manually annotated, including 5′ and 3′ UTRs, in APOLLO (Lewis et al. 2002) by examining evidence from RNAseq reads aligned to the genome with the Integrative Genomics Viewer (Robinson et al. 2011; Thorvaldsdóttir, Robinson, and Mesirov 2013). It was found that the start sites of many bicycle genes were incorrectly annotated by BRAKER at a downstream methionine, inadvertently excluding predicted putative signal peptides from these genes. RNAseq evidence often supported transcription start sites that preceded an upstream methionine and these exons were corrected in APOLLO. The combined collection of 18,895 automated and 687 manually curated gene models (19,582 total) were used for all subsequent analyses of H. cornu genomic data.
Genome-Wide Association Study of Aphids Inducing Red and Green Galls
Galls produced by H. cornu were collected in the early summer (Table 1) and dissected by making a single vertical cut down the side of each gall with a razor blade to expose the aphids inside. DNA was extracted using the Zymo ZR-96 Quick gDNA kit from the foundress of each gall, because foundresses do not appear to travel between galls. Genomic DNA derived from 47 individuals from red galls and 43 from green galls was tagmented using barcoded adaptors compatible with the Illumina sequencing platform (Hennig et al. 2018). Tagmented samples were pooled without normalization, PCR amplified for 14 cycles, and sequenced on an Illumina NextSeq 500 to generate paired-end 150 bp reads to an average depth of 2.9× genomic coverage. The average sequencing depth before filtering was calculated by multiplying the number of read pairs reported by SAMtools flagstat version 1.3 (Li et al. 2009) by the read length of 150 bp, then dividing by the total genome size (323,839,449 bp).
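The depth calculation described above can be sketched as follows. The read count in the example is a hypothetical placeholder, not a value from this study; only the 150 bp read length and the genome size come from the text. Because a count of read pairs and a count of individual reads differ by a factor of two in sequenced bases, the sketch exposes that choice as an explicit parameter rather than assuming one.

```python
GENOME_SIZE_BP = 323_839_449  # total genome size stated in the text
READ_LENGTH_BP = 150          # read length stated in the text


def mean_depth(n_units, read_length_bp=READ_LENGTH_BP,
               genome_size_bp=GENOME_SIZE_BP, reads_per_unit=1):
    """Average coverage = sequenced bases / genome size.

    reads_per_unit is 1 when n_units counts individual reads and 2 when it
    counts read pairs; the caller has to state which one applies.
    """
    return n_units * reads_per_unit * read_length_bp / genome_size_bp


# Hypothetical per-sample count, chosen only for illustration.
example_count = 6_000_000
print(round(mean_depth(example_count), 2))
```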
[Table 1: aphid species Hormaphis cornu and Hormaphis hamamelidis; host plant Hamamelis virginiana.]
Principal component analysis was performed on the genome-wide polymorphism data to detect any potential population structure that might confound the GWAS. Reads were mapped using bwa mem version 0.7.17-r1188 (Li 2013) and joint genotyped using SAMtools mpileup version 1.3, with the flag -ugf, followed by BCFtools call version 1.9 (Li 2011), with the flag -m. Genotype calls were then filtered for quality and missingness using BCFtools filter and view version 1.9, where only SNPs with MAF>0.05, QUAL>20, and genotyped in at least 80% of the individuals were kept. To limit the number of SNPs for computational efficiency, the SNPs were additionally thinned using VCFtools --thin version 0.1.15 to exclude any SNPs within 1000 bp of each other. PCA was performed using the snpgdsPCA function from the R package SNPRelate version 1.20.1 in R version 3.6.1 (Zheng et al. 2012).
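The filtering and thinning logic described above can be illustrated with a short sketch. This is not the BCFtools or VCFtools implementation: the function names are hypothetical, the thinning is a simple greedy pass over sorted sites, and treating a spacing of exactly 1000 bp as acceptable is an assumption.

```python
def passes_filters(maf, qual, call_rate,
                   min_maf=0.05, min_qual=20, min_call_rate=0.80):
    """Site-level filter using the thresholds stated in the text."""
    return maf > min_maf and qual > min_qual and call_rate >= min_call_rate


def thin_sites(sites, min_distance=1000):
    """Greedy thinning of (chromosome, position) pairs.

    `sites` must be sorted by chromosome and position. A site is kept only
    if it lies at least `min_distance` bp from the last kept site on the
    same chromosome (boundary handling is an assumption).
    """
    kept = []
    last_kept = {}
    for chrom, pos in sites:
        previous = last_kept.get(chrom)
        if previous is None or pos - previous >= min_distance:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept


# Toy coordinates, for illustration only.
toy_sites = [("1", 100), ("1", 600), ("1", 1100), ("1", 1500), ("2", 200)]
print(thin_sites(toy_sites))
```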
Genome-wide association mapping was performed with these low-coverage data by mapping reads with bwa mem version 0.7.17-r1188 and then calculating the likelihood of association with gall color with SAMtools mpileup version 0.1.19 and BCFtools view -vcs version 0.1.19, using BAM files as the input. Association for each SNP was measured by the likelihood-ratio test (LRT) value in the INFO field of the output VCF file, which is the P-value of an association test with one degree of freedom. This method calculates association likelihoods using genotype likelihoods rather than hard genotype calls, ameliorating the issue of low-confidence genotype calls resulting from low-coverage data (Li 2011). The significance threshold was set as the Bonferroni-corrected value for 0.05, which was calculated as 0.05/50,957,130 (the total number of SNPs in the genome-wide association mapping).
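The Bonferroni threshold described above works out to roughly 9.8e-10. A minimal sketch of the arithmetic follows; the helper name is hypothetical, and the two constants are the values stated in the text.

```python
import math

ALPHA = 0.05           # nominal significance level stated in the text
N_TESTS = 50_957_130   # number of SNPs tested, as stated in the text

# Bonferroni correction: divide the nominal level by the number of tests.
THRESHOLD = ALPHA / N_TESTS


def is_significant(p_value, threshold=THRESHOLD):
    """True when a per-SNP P-value falls below the corrected threshold."""
    return p_value < threshold


print(f"{THRESHOLD:.3e}")
# The same threshold on the -log10 scale used for Manhattan plots.
print(round(-math.log10(THRESHOLD), 2))
```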
Enrichment and Sequencing of the Genomic Region Containing Highly Significant GWAS Hits
The low coverage GWAS identified multiple linked SNPs on chromosome 1 that were strongly associated with gall color (
Reads were mapped with bwa mem version 0.7.17-r1188, BAM files were sorted with SAMtools sort version 1.7, duplicate reads were marked with Picard MarkDuplicates version 2.18.0 (http://broadinstitute.github.io/picard/), indels were re-aligned using GATK IndelRealigner version 3.4 (McKenna et al. 2010), and variants were called using SAMtools mpileup version 1.7 and BCFtools call version 1.7 (https://github.com/SAMtools/bcftools). This genotyping pipeline is available at https://github.com/YourePrettyGood/PseudoreferencePipeline (hereafter referred to as PseudoreferencePipeline). SNPs were quality filtered from the VCF file using BCFtools view version 1.7 at DP>10 and MQ>40 and merged using BCFtools merge.
For PCA, the joint genotype calls were filtered for quality and missingness using BCFtools filter and view version 1.9, where only SNPs with MAF>0.05 and genotyped in at least 80% of the individuals were retained. PCA was performed using the snpgdsPCA function from the R package SNPRelate version 1.20.1 in R version 3.6.1.
Association testing was performed using PLINK version 1.90 (Purcell et al. 2007) with minor allele frequency filtered at MAF>0.2. A Hardy-Weinberg equilibrium filter was not applied because the samples were not randomly collected from nature. Red galls are rare in the population and fundatrices from red galls were oversampled to roughly match the number of fundatrices sampled from green galls. Results of the GWAS were plotted using the plotManhattan function of Sushi version 1.24.0 (Phanstiel et al. 2014).
To calculate LD for the 45 kbp region surrounding the 11 GWAS SNPs in
To plot LD for the entire target enrichment in
The LD heatmaps were generated using the R packages vcfR version 1.10.0 (Knaus and Grünwald 2017), snpStats version 1.36.0 (Clayton 2019) and LDheatmap version 0.99.8 (Shin et al. 2006) in R version 3.6.1. The R code used to generate the LD heatmap figure was adapted from code provided at sfustatgen.github.io/LDheatmap/articles/vcfOnLDheatmap.html. The gene models were plotted using the plotGenes function from the R package Sushi version 1.24.0.
Lack of Evidence for Chromosomal Aberrations
To identify possible chromosomal rearrangements or transposable elements that might be linked to the GWAS SNPs, adapters were first trimmed from the H. cornu target enrichment data using Trim Galore! version 0.6.5 and cutadapt version 2.7 (Martin 2011). The trimming pipeline is available at github.com/YourePrettyGood/ParsingPipeline. Reads were then mapped to the H. cornu reference genome with bwa mem version 0.7.17, BAM files were sorted with SAMtools sort version 1.9, and duplicate reads were marked with Picard MarkDuplicates version 2.22.7 (http://broadinstitute.github.io/picard/), all using the MAP function of the PseudoreferencePipeline. The analysis includes 43 high coverage red individuals and 42 high coverage green individuals. The five individuals isolated from red galls that did not carry the associated GWAS SNPs in dgc were excluded since the genetic basis for their gall coloration is unknown.
The BAM file for each individual was subset to contain only the target enrichment region on chromosome 1 from 40,092,625 to 40,892,915 bp using SAMtools view version 1.9, and a merged BAM file was generated for each color using SAMtools merge. The discordant reads were then extracted from each BAM file using SAMtools view with flag -F 1286, and the percentage of discordant reads was calculated as the ratio of the number of discordant reads to the total number of mapped reads for each 5000 bp window.
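The value 1286 is the sum of the SAM flag bits 2, 4, 256, and 1024, so -F 1286 excludes reads that are properly paired, unmapped, secondary alignments, or marked as duplicates. The sketch below decodes a flag value and computes a per-window discordant fraction from lists of mapping positions. The function names, the binning convention, and the toy positions are illustrative assumptions and do not reproduce the scripts used in this analysis.

```python
from collections import Counter

# Bit meanings as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """Return the names of the bits set in a SAM flag value."""
    return [name for bit, name in sorted(SAM_FLAG_BITS.items()) if value & bit]


def discordant_fraction(discordant_positions, mapped_positions, window=5000):
    """Fraction of discordant reads among mapped reads in each window.

    Windows are keyed by their start coordinate (pos // window * window),
    which is an illustrative binning convention.
    """
    mapped = Counter((pos // window) * window for pos in mapped_positions)
    discordant = Counter((pos // window) * window
                         for pos in discordant_positions)
    return {start: discordant[start] / n for start, n in sorted(mapped.items())}


print(decode_flag(1286))
```

Multiplying each fraction by 100 gives the percentage described in the text.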
To further explore the possibility that chromosomal aberrations near the GWAS signal might differ between red- and green-gall producing individuals, the mapping locations of discordant reads in the 100 kbp region near the 11 GWAS hits (40,450,000-40,550,000 bp) were plotted for red individuals, since the H. cornu reference was made from a green individual. The read IDs were obtained for all the discordant reads within the 100 kbp region, and all occurrences of these reads were extracted from the whole genome BAM file, regardless of their mapping location. The paired-end reads were then extracted from the discordant reads BAM file using bedtools bamtofastq version 2.29.2, and bwa mem was used to map these reads as single-end reads, for read 1 and read 2 separately, to a merged reference containing the H. cornu genome and the 343 Acyrthosiphon pisum transposable elements annotated in RepBase (Jurka 1998). Duplicates were removed and the BAM file was sorted using SAMtools rmdup and sort, and the mapping locations of all discordant reads were determined using SAMtools view. The windows on chromosome 1 at 40,400,000-40,499,999 bp and 40,500,000-40,599,999 bp were masked in the genome-wide scatter plot of discordant read mapping because the majority of the discordant reads are expected to map to these regions and displaying their counts would obscure potential signals in the rest of the genome.
Large Scale Survey of 11 dgc SNPs Associated with Gall Color
Aphids were collected from red and green galls as described above for the GWAS directly into Zymo DNA extraction buffer and ground with a plastic pestle. DNA was prepared using the ZR-96 Quick gDNA kit. qPCR assays and amplicon-seq assays were developed to genotype all individuals at all 11 SNPs (primers and related details below). PCR amplicon products were barcoded and samples were pooled for sequencing on an Illumina platform.
Adaptors were trimmed from amplicon reads using Trim Galore! version 0.6.5 and cutadapt version 2.7. The wrapper pipeline is available at github.com/YourePrettyGood/ParsingPipeline. Reads were mapped to a 34 kbp region of chromosome 1 of the H. cornu genome that includes the amplicon SNPs (40,477,000-40,511,000 bp) with bwa mem version 0.7.17, sorted BAM files with SAMtools sort version 1.9, and re-aligned indels using GATK IndelRealigner version 3.4. No marking of duplicates was done given the nature of amplicon sequencing data. To maximize genotyping efficiency and improve accuracy, variant calling was performed with two distinct pipelines: SAMtools mpileup version 1.7 (Li et al., 2009) plus BCFtools call version 1.7, and GATK HaplotypeCaller version 3.4. The mapping and indel re-alignment pipelines are available as part of the MAP (with flag only_bwa) and IR functions of the PseudoreferencePipeline. Using the same PseudoreferencePipeline, variant calling was performed using the MPILEUP function of BCFtools and HC function of GATK. FASTA sequences for each individual, where the genotyped SNPs were updated in the reference space, were then generated for both BCFtools and GATK variant calls using the above PseudoreferencePipeline's PSEUDOFASTA function, with flags “MPILEUP, no_markdup” and “HC, no_markdup” respectively. The BCFtools SNP updating pipeline used bcftools filter, query, and consensus version 1.9, and all sites were masked where MQ<=20 or QUAL<=26 or DP<=5. The HC SNP updating pipeline used GATK SelectVariants and FastaAlternateReferenceMaker version 3.4, and all sites where MQ<50, DP<=5, GQ<90 or RGQ<90 were masked.
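The two sets of masking criteria stated above can be written as simple predicates. This sketch only restates the thresholds from the text; the function names are hypothetical, and treating GQ and RGQ as optional fields (each present only at the sites where the caller reports it) is an assumption about how the annotations appear.

```python
def mask_bcftools_site(mq, qual, dp):
    """Mask rule stated for the BCFtools calls: MQ<=20 or QUAL<=26 or DP<=5."""
    return mq <= 20 or qual <= 26 or dp <= 5


def mask_hc_site(mq, dp, gq=None, rgq=None):
    """Mask rule stated for the HaplotypeCaller calls:
    MQ<50, DP<=5, GQ<90 or RGQ<90.

    gq and rgq default to None when the annotation is absent at a site
    (an assumption made for this sketch).
    """
    if mq < 50 or dp <= 5:
        return True
    if gq is not None and gq < 90:
        return True
    if rgq is not None and rgq < 90:
        return True
    return False


# Toy annotation values, for illustration only.
print(mask_bcftools_site(mq=60, qual=40, dp=12))
print(mask_hc_site(mq=60, dp=12, gq=85))
```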
The variant calls from BCFtools and GATK were then merged with the qPCR genotyping results, and all missing or discrepant genotypes were manually identified. These missing or discrepant genotype calls were manually curated from the indel-realigned BAM files using the following criteria: for heterozygous calls, the site had to have at least two reads supporting each allele, and for homozygous calls, the site had to have at least ten reads supporting the allele and no reads supporting alternative alleles.
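The curation criteria above were applied manually; the sketch below merely encodes the same read-count rules for a biallelic site to make them explicit. The function name, the return labels, and the restriction to two alleles are assumptions made for illustration.

```python
def curate_genotype(ref_reads, alt_reads):
    """Apply the stated read-support rules to a biallelic site.

    Returns "het", "hom_ref", "hom_alt", or None when neither rule is met
    and the genotype would remain missing.
    """
    # Heterozygous: at least two reads supporting each allele.
    if ref_reads >= 2 and alt_reads >= 2:
        return "het"
    # Homozygous: at least ten supporting reads and none for the other allele.
    if ref_reads >= 10 and alt_reads == 0:
        return "hom_ref"
    if alt_reads >= 10 and ref_reads == 0:
        return "hom_alt"
    return None


# Toy read counts, for illustration only.
for counts in [(5, 3), (12, 0), (0, 15), (9, 0), (11, 1)]:
    print(counts, curate_genotype(*counts))
```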
RNA-Seq of Salivary Glands from Aphids Inducing Red and Green Galls
Salivary glands were dissected in PBS from fundatrices isolated from green and red galls, gently pipetted from the dissection tray in <0.5 μL of PBS, and deposited into 3 μL of Smart-seq2 lysis buffer (0.2% Triton-X 100, 0.1 U/μL RNasin® Ribonuclease Inhibitor). RNA-seq libraries were prepared with a single-cell RNA-seq method developed by the Janelia Quantitative Genomics core facility and described previously (Cembrowski et al. 2018).
RNA-seq libraries were prepared as described above for red and green gall samples except that the entire 3 μL sample of salivary glands in lysis buffer was provided as input. Barcoded samples were pooled and sequenced on an Illumina NextSeq 550. An average of 9.0 million reads per sample was obtained. The original oligonucleotides were replaced with the following oligonucleotides to generate unstranded reads from the entire transcript:
Samples were PCR amplified for 18 cycles and the library was prepared using ¼ of the standard Nextera XT sample size and 150 pg of cDNA.
Differential Expression Analyses of Fundatrix Salivary Glands from Red and Green Galls
All differential expression analyses for plant and aphid samples were performed in R version 3.6.1 (R Core Team 2018) and all R Notebooks are provided on GitHub (will be posted prior to publication). Adapters were trimmed using cutadapt version 2.7 and read counts per transcript were calculated by mapping reads to the genome with hisat2 version 2.1.0 (Kim, Langmead, and Salzberg 2015) and counting reads per gene with htseq-count version 0.12.4 (Anders, Pyl, and Huber 2015). In R, technical replicates were examined and pooled, since all replicates were very similar to each other. Exploratory data analysis was performed using interactive multidimensional scaling plots, using the command glMDSPlot from the package Glimma (Su et al. 2017), and outlier samples were excluded from subsequent analyses. Differentially expressed genes were identified using the glmQLFTest and associated functions of the package edgeR (Robinson, McCarthy, and Smyth 2009). Volcano plots were generated using the EnhancedVolcano command from the package EnhancedVolcano version 1.4.0 (Blighe, Rana, and Lewis 2020). Mean-Difference (MD) plots and multidimensional scaling plots were generated using the plotMD and plotMDS functions, respectively, from the package limma version 3.42.0 (Ritchie et al. 2015).
Hamamelis virginiana Genome Sequencing, Assembly, and Annotation
Leaves from a single tree of Hamamelis virginiana were sampled from the Janelia Research Campus forest as follows. Branches containing leaves that were less than 50% expanded were wrapped with aluminum foil and harvested after 40 hours. Leaves were cleared of obvious contamination, including aphids and other insects, and then plunged into liquid N2. Samples were stored at −80° C. and sent on dry ice to the Arizona Genomics Institute, University of Arizona, which prepared HMW DNA from nuclei isolated from the frozen leaves. The Janelia Quantitative Genomics core facility generated a 10× Chromium linked-read library from this DNA and sequenced the library on an Illumina NextSeq 550 to generate 608M linked reads.
The H. virginiana genome was assembled with the supernova commands run and mkoutput version 2.1.1, with options minsize=1000 and style=pseudohap (Weisenfeld et al. 2017). A total of 332 M reads were used in the assembly to achieve raw coverage of 56×, as recommended by the supernova instruction manual. The assembled genome was 886 Mb and the scaffold N50 was 472 kb. BUSCO completeness analysis (Simao et al. 2015) was performed using the gVolante web interface (Nishimura, Hara, and Kuraku 2017) with the plants database. BUSCO_v2/v3 showed 1309 of 1440 (90.9%) completely detected core genes and 1365 of 1440 (94.8%) partially detected core genes in the assembled H. virginiana reference genome.
The assembled genome reference was repeat masked with soft masking using RepeatMasker version 4.0.9 (Smit n.d.). Twenty-five RNA-seq libraries from galls and leaves were used for genome annotation. RNA-seq reads were adapter trimmed using cutadapt version 2.7 and mapped to the genome using HISAT2 version 2.1.0. Genome annotation was performed with BRAKER version 2.1.4 using the RNAseq data to provide intron hints (Altschul et al. 1990; Barnett et al. 2011; Camacho et al. 2009; Hoff et al. 2016, 2019; Li et al. 2009; Lomsadze et al. 2014; Ritchie et al. 2015; Stanke et al. 2006, 2008) and 3′ UTRs were predicted using UTRme (Radío et al. 2018).
H. virginiana RNA Extraction and RNA-Seq Library Preparation
RNA was extracted from frozen H. virginiana leaf or gall tissue as follows. Plant tissue frozen at −80° C. was placed into ZR BashingBead Lysis Tubes (pre-chilled in liquid N2) and pulverized to a fine powder in a Talboys High Throughput homogenizer (Troemer) with minimal thawing. Powdered plant tissue was suspended in extraction buffer (100 mM Tris-HCl, pH 7.5, 25 mM EDTA, 1.5 M NaCl, 2% (w/v) hexadecyltrimethylammonium bromide, 10% (w/v) polyvinylpyrrolidone and 0.3% (v/v) β-mercaptoethanol) and heated to 55° C. for 8 min, followed by centrifugation at 13,000×g for 5 minutes at room temperature to remove insoluble debris (Jordon-Thaden et al. 2015). Total RNA was extracted from the supernatant using the Quick-RNA Plant Miniprep Kit (Zymo Research) with the inclusion of in-column DNase I treatment. RNA-seq libraries were prepared with the Universal Plus mRNA-Seq kit (Nugen).
Differential Expression Analysis of Red Versus Green Galls
Differential expression analysis was performed on red and green galls by collecting paired red and green gall samples from the same leaves. In total, 17 red galls and 22 green galls were collected from 17 leaves. RNA was prepared as described above for plant material and RNA-seq libraries were prepared with a single-cell RNA-seq method modified for plant material described above.
These RNAseq libraries contained on average 4.4 million mapped reads per sample. Reads were quality trimmed and mapped to the transcriptome as described above. Only genes with greater than 1 count per million in at least 15 samples were included in subsequent analyses. The expression analysis model included the effect of leaf blocking.
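The expression filter described above (greater than 1 count per million in at least 15 samples) can be sketched in a few lines. This is an illustrative stand-in, not the R code used for the analysis: the function name and the toy count table are hypothetical, and library size is taken here as the sum of counts in each sample.

```python
def filter_by_cpm(counts, min_cpm=1.0, min_samples=15):
    """Keep genes whose counts per million exceed `min_cpm` in at least
    `min_samples` samples.

    `counts` maps gene name -> list of raw counts, one entry per sample,
    with samples in the same order for every gene.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    library_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    kept = []
    for gene in genes:
        # count / library_size * 1e6 > min_cpm, rearranged to avoid division.
        n_pass = sum(
            1 for j in range(n_samples)
            if counts[gene][j] * 1_000_000 > min_cpm * library_sizes[j]
        )
        if n_pass >= min_samples:
            kept.append(gene)
    return kept


# Toy table with three samples, for illustration only.
toy_counts = {
    "gene_a": [500, 500, 0],
    "gene_b": [499_500, 499_500, 1_000_000],
}
print(filter_by_cpm(toy_counts, min_samples=2))
```

The same function covers the other thresholds used in this specification by changing `min_samples`.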
Gall Pigment Extraction and Analysis
Frozen gall tissue was ground to a powder under liquid nitrogen. Approximately 20 mg of ground gall tissue was suspended in 100 μL methanol (Optima grade, Fisher Scientific) and 400 μL of 5% aqueous formic acid (Optima grade, Fisher Scientific), vortexed for 30 seconds and centrifuged at 8000×g for 2 min at 10° C. The supernatant containing pigment was filtered using a 0.2 μm, 13 mm diameter PTFE syringe filter to remove debris. The colorless pellet was discarded. Authentic anthocyanin pigment standards for malvidin 3,5-diglucoside chloride (Sigma Aldrich, St. Louis, Mo., USA) and peonidin-3,5-diglucoside chloride (Carbosynth LLC, San Diego, Calif., USA) were prepared at 1 mg/mL in 5% aqueous formic acid.
Pigment separation and identification alongside standards was performed on a reverse phase C18 column (Acquity Plus BEH, 50 mm×2.1 mm, 1.7 μm particle size, Waters, Milford, Mass.) using an Agilent 1290 UHPLC coupled to an Agilent 6545 quadrupole time-of-flight mass spectrometer (Agilent Technologies, Santa Clara, Calif., USA) using an ESI probe in positive ion mode. Five μL of filtered pigment extract or a 1:100 dilution of anthocyanin standard was injected. Solvent A consisted of 5% aqueous formic acid and solvent B consisted of 1:99 water/acetonitrile acidified with 5% aqueous formic acid (v/v). The gradient conditions were as follows: 1 min hold at 0% B, 4 min linear increase to 20% B, 5 min linear increase to 40% B, ramp up to 95% B in 0.1 min, hold at 95% B for 5 min, return to 0% B and hold for 0.9 min. The column flow rate was 0.3 mL/min, and the column temperature was 30° C. The MS source parameters for initial anthocyanin detection were as follows: capillary=4000 V, nozzle=2000 V, gas temperature=350° C., gas flow=13 L/min, nebulizer=30 psi, sheath gas temperature=400° C., sheath gas flow=12 L/min. DAD detection was performed at 300 nm and 520 nm, and MS scanning covered 50-1700 m/z at a rate of 2 spectra per second. Iterative fragmentation, followed by targeted MS/MS experiments, was performed using a collision energy of 35. Authentic standards confirmed the presence of peonidin-3,5-diglucoside and malvidin 3,5-diglucoside. The remaining anthocyanin species were identified using UV-Vis spectra, retention time relative to the other species in the sample, [M]+ precursor ions and aglycone fragment ions matching the respective entries in the RIKEN database (Sawada et al. 2012).
Differential Expression Analysis of Galls Versus Leaves
RNA-seq was performed on 36 gall samples and 17 adjacent leaf samples. These gall samples did not overlap with the gall samples used in the red versus green gall comparison described earlier. For larger galls, RNA was isolated separately from basal, medial, and apical gall regions. Libraries were sequenced on an Illumina NextSeq 550 to generate 150 bp paired-end reads with an average of 8.1 million mapped reads per sample. Only genes expressed at greater than 1 count per million in at least 18 samples were included in subsequent analysis. Only paired gall and leaf samples from the same leaf were included, and potential leaf effects were modeled in the differential expression analysis.
To facilitate Gene Ontology (GO) analysis, the UniProt IDs of the differentially expressed genes were obtained by mapping the coding sequences of the H. virginiana genome to the UniProt/Swiss-Prot database (Bateman 2019) using Protein-Protein BLAST 2.7.1 (Altschul et al. 1990; Camacho et al. 2009) and extracting the differentially expressed genes. The WebGestalt 2019 webtool (Liao et al. 2019) was used to perform GO analysis on the differentially expressed genes.
Differential Expression Analysis of H. cornu Organs and Life Stages
RNAseq libraries were generated for fundatrix salivary glands (N=20) and whole bodies (N=8), G2 salivary glands (N=6) and carcasses (N=3), G5 salivary glands (N=6) and carcasses (N=2), and G7 salivary glands (N=5). Libraries were generated as described above for salivary glands except that RNA samples of carcasses and whole bodies were prepared using the Arcturus PicoPure RNA Isolation Kit (Applied Biosystems). Only genes expressed with at least 1 count per million in at least 29 samples were included in subsequent analyses.
Bioinformatic Identification of Bicycle Genes in H. cornu
Genes that were upregulated specifically in the salivary glands of the fundatrix generation were prime candidates for inducing galls. Genes that were upregulated both in salivary glands of fundatrices versus sexuals (G7) and in fundatrix salivary glands were therefore identified. These differentially expressed genes were then separated into genes with and without homologs containing some functional annotation. Homologs with previous functional annotations were identified using three methods: (1) translated query-protein (blastx) and (2) protein-protein (blastp) homology searches using BLAST 2.7.1 (Altschul et al. 1990; Camacho et al. 2009) against the UniProt/Swiss-Prot database (Bateman 2019), and (3) hidden Markov model-based searches with the predicted proteins using hmmscan in HMMER version 3.1b2 (Eddy 2011) against the Pfam database (Finn et al. 2014). For all predicted proteins, SignalP-5.0 (Almagro Armenteros et al. 2019) was used to search for secretion signal peptides and TMHMM version 2.0 was used to search for transmembrane domains (Krogh et al. 2001). Gene Ontology analysis of genes with annotations that were enriched in fundatrix salivary glands was performed by searching for Drosophila melanogaster homologs of differentially expressed genes and using these D. melanogaster homologs as input into gene ontology analysis.
To determine whether any of the genes without detectable homologs in existing protein databases were homologous to one another, sensitive all-against-all homology searches of these genes were performed using jackhmmer in HMMER version 3.1b2. Hierarchical clustering was performed on the quantitative results of the jackhmmer analysis by first calculating distances among genes with the dist function using method canberra and then clustering using the hclust function with method ward.D2, both from the library stats in R (R Core Team 2018). Sequences of the clustered homologs were aligned using MAFFT version 7.419 with default parameters (Katoh 2002; Katoh and Standley 2013), aligned sequences were trimmed using trimAl (Capella-Gutiérrez, Silla-Martínez, and Gabaldón 2009) with parameters -gt 0.50, and sequence logos were generated by importing alignments using the functions read.alignment and ggseqlogo in the R packages seqinr (Charif and Lobry 2007) and ggseqlogo (Wagih 2017), respectively. After identification of the bicycle genes, additional bicycle genes, which might not have been enriched in fundatrix salivary glands, were searched for in the entire H. cornu genome using jackhmmer followed by hierarchical clustering to identify additional putative homologs. As described above, all these putative bicycle genes were manually annotated.
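The distance step that precedes clustering can be illustrated with a plain Canberra distance between per-gene score profiles. This sketch is not the R dist function: it skips terms where both values are zero but does not reproduce any rescaling R may apply for omitted terms, and the gene names and score vectors are hypothetical placeholders rather than jackhmmer output.

```python
def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|) over coordinates,
    skipping coordinates where both values are zero."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator == 0:
            continue
        total += abs(a - b) / denominator
    return total


def distance_matrix(profiles):
    """Pairwise Canberra distances for a dict of gene -> score profile."""
    names = list(profiles)
    return {(p, q): canberra(profiles[p], profiles[q])
            for p in names for q in names}


# Hypothetical score profiles, for illustration only.
toy_profiles = {
    "gene_1": [1.0, 2.0, 0.0],
    "gene_2": [3.0, 2.0, 0.0],
    "gene_3": [0.0, 0.0, 5.0],
}
matrix = distance_matrix(toy_profiles)
print(matrix[("gene_1", "gene_2")])
```

A matrix of this kind is the input that a Ward-linkage hierarchical clustering routine would consume.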
All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference, including the references set forth in the following list:
It will be understood that various details of the presently disclosed subject matter can be changed without departing from the scope of the subject matter disclosed herein. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation.
This application claims priority from U.S. Provisional Application Ser. No. 63/092,942 filed Oct. 16, 2020, the entire disclosure of which is incorporated herein by this reference.
Number | Date | Country |
---|---|---|
63092942 | Oct 2020 | US |