The Sequence Listing associated with this application is provided in XML format in lieu of a paper copy and is hereby incorporated by reference into the specification. The name of the file containing the Sequence Listing is W149-0053US_Seq.xml. The file is 2,642 bytes, was created Jun. 4, 2024, and is being submitted electronically via Patent Center.
This application relates to methods for characterizing the spatial arrangement and sequence of nucleic acids, and more particularly, to such methods with single-cell resolution.
Mammalian genomes are highly organized in the three-dimensional (3D) nuclear space (Dekker et al. 2017), characterized by forming various architectural structures at different genomic scales, such as chromosome territories (CTs) (Cremer and Cremer 2001), large-scale active or repressed compartments (A/B compartments) (Rao et al. 2014), subcompartments (Rao et al. 2014; Xiong and Ma 2019), topologically associating domains (TADs) (Dixon et al. 2012; Nora et al. 2012) and subTADs (Phillips-Cremins et al. 2013; Beagan and Phillips-Cremins 2020), and chromatin loops (Salameh et al. 2020; Tang et al. 2015). Growing evidence has suggested that these genome structural features are intertwined with multiple layers of gene regulation and other genome functions (Oudelaar and Higgs 2021), playing crucial roles in development and disease (Marchal, Sima, and Gilbert 2019; J. Ma and Duan 2019; Zheng and Xie 2019; Misteli 2020; Spielmann, Lupiáñez, and Mundlos 2018). However, it remains poorly understood how the changes of multiscale 3D genome structure in a given single cell inform the cell's transcriptional programming and thereby impact cellular phenotypes in health and disease.
Many of the drawings submitted herein are better understood in color. Applicant considers the color versions of the drawings as part of the original submission and reserves the right to present color images of the drawings in later proceedings.
Molecular and cellular heterogeneity is intrinsic to cell differentiation and tissue development. The recent advent of single-cell technologies has been transformative in overcoming cells' heterogeneous nature. For example, high-throughput single-cell RNA-seq (scRNA-seq) analyses have enabled the identification of cell subtypes at unprecedented resolution in complex tissues (Cao et al. 2020; Calderon et al. 2022). Single-cell Hi-C (scHi-C) technologies, which map chromatin interactions in individual cells (Nagano et al. 2013; Ramani et al. 2017; Nagano et al. 2017; Flyamer et al. 2017; Stevens et al. 2017; Tan et al. 2018, 2021; Li et al. 2019), have allowed the characterization of the 3D genome architecture of distinct cell types in complex tissues (Tan et al. 2021). However, to fully understand the causal dependencies between the 3D genome organization and transcriptional activities in a cell, concurrent measurement of the two molecular properties in the same cell(s) may be required. Although computational methods are able to provide integrative analysis of scHi-C and scRNA-seq to some degree (Tan et al. 2021; Zhang, Zhou, and Ma 2021), it was not previously possible to faithfully match a cell's 3D genome organization with its gene regulation programs based on separately generated scHi-C and scRNA-seq data. While imaging-based technologies can simultaneously visualize and measure both genome architecture and transcripts in single cells, these methods rely on highly specific equipment and are currently limited in throughput (Cardozo Gizzi et al. 2019; Mateo et al. 2019; Su et al. 2020; Takei et al. 2021). Thus, new high-throughput genomic technologies that are able to co-assay 3D genome and gene expression in the same cell are urgently needed.
It has been shown that single-cell multimodal technologies, which can jointly profile multiple molecular phenotypes/genotypes from the same cell, are able to uncover the underlying connections between the different molecular properties of the cell (Macaulay, Ponting, and Voet 2017; Zhu, Preissl, and Ren 2020; Hao et al. 2021). To interrogate the relationship between genome structure and gene regulation at the single cell level, the present disclosure describes techniques related to GAGE-seq (Genome Architecture and Gene Expression by SEQuencing), a highly scalable approach for individual or joint mapping of single-cell landscapes of chromatin interactions and gene expression at low cost. Implementations of GAGE-seq described herein provide high-throughput, single-cell co-assay methods for concurrent measurement of genome-wide 3D chromatin interactions and transcriptome in the same single cells.
This disclosure also describes experimental validation for GAGE-seq. Using GAGE-seq, four different cell lines and two tissue types, including mouse brain and human bone marrow, were profiled. High-quality GAGE-seq datasets were generated with a wide variety of mouse and human cell lines and primary tissue cells, including GM12878, K562, MDS-L, NIH3T3, mouse brain cortex and human bone marrow CD34+ cells. Both single-cell Hi-C and the scRNA-seq in GAGE-seq show high robustness, specificity, sensitivity and reproducibility. Importantly, GAGE-seq was shown to uniquely reveal genome structure-function relationship in primary tissue context, leading to intricate and dynamic connections between cell type-specific 3D genome features and cell type-specific gene expression in single cells that may inform cell fate decision-making during hematopoiesis. Combining GAGE-seq and in situ spatial transcriptome data in the mouse brain further demonstrates the potential of integrative and multi-omic delineation of complex tissues.
In some implementations, GAGE-seq can be implemented in a microfluidic platform or a microfluidic circuit. The term “microfluidic circuit,” and its equivalents, as used herein, can refer to an apparatus that channels, manipulates, or otherwise is configured to contain volumes of a fluid (e.g., sample and/or reagent) in a range from 0.1 microliters (μL) to 999 μL, such as from 1-100 μL, or from 2-25 μL. Similarly, a “microfluidic cartridge,” and its equivalents, may include various components and channels that are configured to accept, retain, or facilitate passage of microfluidic volumes of sample or reagent. Certain implementations described herein can also function with nanoliter volumes (in the range of 10-500 nanoliters (nL), such as 100 nL).
Various implementations described herein relate to techniques for generating DNA libraries that are indicative of 3D organization of chromatin and simultaneous transcriptional activity. In various cases, permeabilized cells are subjected to a protocol that generates both transcriptomic DNA (tDNA) and spatial DNA (sDNA) before the cells are fully lysed and/or the cellular components are removed. As used herein, the terms “transcriptomic DNA,” “tDNA,” and their equivalents, may refer to DNA molecules whose sequences are indicative of the sequences of RNA present in a cell. As used herein, the terms “spatial DNA,” “sDNA,” and their equivalents, may refer to DNA molecules whose sequences are indicative of 3D chromatin structure and genetic sequences in the cell. For instance, sDNA can be utilized in a Hi-C workflow to determine 3D chromatin structure of the cell.
In various implementations, the permeabilized cells/nuclei include cross-linked chromatin and RNA. The DNA in the chromatin, as well as the RNA, may be cross-linked to cellular proteins via protein-protein, protein-DNA and protein-RNA interactions. Thus, the DNA and the RNA are fixed at their respective original position (in situ) within the nucleus. In various implementations, the tDNA may be generated by reverse transcribing the RNA in the permeabilized cells with a primer. The sDNA may be generated by fragmenting the cross-linked chromatin in the cell using at least one first restriction enzyme (RE). Proximity ligation may be performed on the fragments, such that portions of the DNA that are spatially close to one another can be ligated with each other. Subsequently, the sDNA may be generated by further fragmenting the ligated fragments using at least one second restriction enzyme (RE). The tDNA and sDNA may be reverse crosslinked. Reverse crosslinking, for instance, may set free sDNA and tDNA from cellular protein. The tDNA and sDNA may be subsequently isolated from one another into separate libraries, that can be subsequently sequenced (e.g., using nanopore sequencing, sequencing-by-synthesis, or other sequencing techniques known in the art). The sequence read data indicating the sequences of the tDNA library and the sDNA library can be further analyzed in order to determine correlations between 3D chromatin structure in the nucleus of a cell and expression by that cell, for example.
Implementations of the present disclosure are different in several ways from existing technologies. For example, the tDNA and sDNA are generated while the cellular components of the permeabilized cells are present. Thus, various steps of the protocol are performed at temperature ranges, salt concentrations, and other conditions that mimic the physiological conditions of the source of the cells. In some cases, the first RE(s) and/or second RE(s) described herein are different than enzymes used in other techniques. In various examples, techniques described herein are capable of generating tDNA and sDNA in parallel protocols, such that the resulting tDNA and sDNA libraries indicate simultaneous biological processes.
The cells 106, in various cases, include genetic material in the nuclei 107. For instance, the cells 106 may include nucleic acids, such as DNA (e.g., genomic DNA 111, exogenous DNA, or other DNA) and RNA 108 (e.g., messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and the like). In some examples, the cells 106 include chromatin 110. The chromatin 110, in various cases, includes genomic DNA 111 and at least one protein (e.g., histones, such as Histone H2A, Histone H2B, Histone H1, Histone H3, Histone H4, etc.).
The genomic DNA 111, in some examples, wraps around histone proteins to form a nucleosome. The nucleosome structure can be, in various cases, stabilized by additional histone proteins (e.g., H1), and the stabilized structure, for instance, may coil to form a compact structure. Accordingly, the genomic DNA 111 can be accessed, based on the structure of the histones, and used to generate the RNA 108 through the process of transcription, which is indicative of various expressive characteristics of the subject 104. Accessing the genomic DNA 111 from chromatin to generate the RNA 108 depends on various factors that are independent of the DNA sequence, including the chromatin structure, various chromatin remodeling complexes, histone modifications, and other factors. These factors have been implicated in a variety of pathologies, including cancer, neurological diseases, cardiovascular diseases, inflammatory and autoimmune diseases, and developmental disorders, among others.
In various implementations, it may be beneficial to determine the spatial organization and the gene expression of the genetic material of the subject 104. For instance, it may be beneficial to understand the relationship between the three-dimensional organization of the genome in a single cell and the transcriptional activities in the single cell. In some cases, understanding the relationship may facilitate the development of diagnostic, research, and therapeutic tools. These issues can be addressed, in some implementations, by using various methods described herein to characterize the spatial organization of the genome and the gene expression within single cells.
In various implementations of the present disclosure, the cells 106 and/or the nuclei 107 in the sample 102 are crosslinked and permeabilized. For example, the chromatin 110 in the sample 102 may be crosslinked to preserve the spatial organization. Crosslinking may be achieved, in various cases, chemically (e.g., using formaldehyde, or the like) or using another suitable method. Based on crosslinking the cells 106 or the nuclei 107, the cells 106 may be permeabilized via fixation (e.g., acetone fixation, methanol fixation, or the like), using a detergent, or another method known in the art. In some examples, the cells 106 are permeabilized and the nuclei 107 are intact.
The RNA 108 in the permeabilized cells 106 or nuclei 107 is, in some examples, reverse transcribed with a primer 112 that is configured to generate cDNA 114 that includes a tag 116. In various cases, the primer 112 is a poly-thymine (poly-T) primer. The primer 112, in some instances, is a random hexamer primer. The tag 116 may include a biotin, an avidin, a polyhistidine tag, an m6A methyl group, an amine group, or the like. For instance, the primer 112 may include a biotinylated nucleotide. The tag 116 may be configured to ligate with first barcodes 118.
In various implementations, the chromatin 110 in the permeabilized cells 106 or nuclei 107 is fragmented using first enzymes 120. The first enzymes 120 are configured to fragment the chromatin 110 to facilitate proximity ligation of the chromatin 110. The first enzymes 120, in some cases, include two four-cut restriction enzymes. In particular examples, the first enzymes 120 include CviQI and MseI. Applying the first enzymes 120, in various cases, is performed at a temperature in a range of 20 degrees Celsius (20° C.) to 30° C. In some cases, the first enzymes 120 are configured to generate a thymine-adenine (−TA) at the 5′ end of the fragmented chromatin 110. Based on the fragmentation, the chromatin 110 may undergo proximity ligation, enabling identification of chromatin interactions based on the ligated DNA sequence. In various cases, the proximity ligation may be performed at a temperature in a range of 10° C. to 20° C.
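As an illustration of the first fragmentation step, the following minimal Python sketch performs an in silico digest of a toy sequence with MseI (which cuts T^TAA) and CviQI (which cuts G^TAC). The function names, the `ENZYMES` table, and the toy sequence are hypothetical and are introduced only for this example. The sketch models the top strand only and is not part of the described protocol.

```python
import re

# Recognition site and top-strand cut offset within the site.
# MseI cuts T^TAA and CviQI cuts G^TAC, so both leave a 5'-TA overhang.
ENZYMES = {"MseI": ("TTAA", 1), "CviQI": ("GTAC", 1)}


def cut_positions(seq, enzymes=ENZYMES):
    """Return sorted top-strand cut coordinates for all enzymes in `enzymes`."""
    seq = seq.upper()
    cuts = set()
    for site, offset in enzymes.values():
        # A zero-width lookahead also finds overlapping occurrences of a site.
        for match in re.finditer(f"(?={site})", seq):
            cuts.add(match.start() + offset)
    return sorted(cuts)


def digest(seq, enzymes=ENZYMES):
    """Split `seq` at every cut position and return the top-strand fragments."""
    bounds = [0] + cut_positions(seq, enzymes) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical toy sequence containing two MseI sites and one CviQI site.
toy = "ACGTTAACCGGTACGATTTAAGC"
fragments = digest(toy)
```

In this toy case, every fragment downstream of a cut begins with TA on the top strand, which is consistent with the TA at the 5′ end described above.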
In various implementations, the chromatin 110 is fragmented using at least one second enzyme 122. The second enzyme(s) 122, in some examples, are configured to fragment the chromatin 110 to generate an adhesive end of the fragmented chromatin 110, enabling ligation to the first barcodes 118. In various examples, the second enzyme(s) 122 include a restriction enzyme, such as DdeI, or any other suitable enzyme. Applying the second enzyme(s) 122, in various cases, is performed at a temperature in a range of 30° C. to 40° C.
Based on the fragmentation with the second enzymes 122, the first barcodes 118 are applied to the sample 102. The first barcodes 118, in some cases, are configured to ligate to the cDNA 114 and the genomic DNA 111 derived from the fragmented chromatin 110. The first barcodes 118, in various examples, include a plurality of polynucleotides. The DNA sequences may, for instance, have a length in a range of 2 to 20 nucleotides. A well plate, a microfluidic device, tubes, or another mechanism may be used to separate the sample into first volumes. One of the first barcodes 118 may be added to each of the first volumes. For instance, the sample 102 may be added to a 96-well plate, and a distinct barcode may be added to each of 96 wells. In some examples, the same first barcode(s) 118 may be added to more than one of the 96 wells.
After applying the first barcodes 118, in various implementations, second barcodes 123 are applied to the sample 102. The second barcodes 123, in some cases, are configured to ligate to at least one of the first barcodes 118, the cDNA 114, or the genomic DNA 111. For instance, based on applying the first barcodes 118 to the first volumes, the sample 102 may be pooled from the first volumes and redistributed into second volumes. A well plate, a microfluidic device, tubes, or another mechanism may be used to separate the sample into second volumes. The second barcodes 123, in some cases, include a plurality of polynucleotides. The DNA sequences may, for instance, have a length in a range of 2 to 20 nucleotides. In some examples, the first barcodes 118 and the second barcodes 123 are the same. In some examples, some of the sequences of the first barcodes 118 are the same as some of the sequences of the second barcodes 123.
In response to applying the second barcodes 123, the cDNA 114 may be separated from the genomic DNA 111 using the tag 116. For instance, the sample 102 may be pooled from the second volumes, and the sample 102 may undergo reverse crosslinking. The reverse crosslinking may be achieved using proteinase K, or another suitable technique. After the reverse crosslinking process, the cDNA 114 may be separated from the genomic DNA 111 using a complementary tag. For instance, the cDNA 114 may include biotin, and the complementary tag may include streptavidin. The complementary tag may include any agent configured to bind to the tag. For instance, the complementary tag may include streptavidin, biotin, avidin, nitrilotriacetic acid, or the like. The complementary tag may be linked to beads, a surface, a support, or the like that is configured to be isolated from the sample 102. For instance, the sample 102 may be applied to magnetic beads that are conjugated to the complementary tag, and the beads may be isolated from the sample 102 by applying a magnetic field to the sample 102 to isolate the cDNA 114. The sample 102 that remains, in various examples, includes the genomic DNA 111.
In particular implementations, additional barcodes may be applied. For instance, the second volumes may be pooled and separated into third volumes. Third barcodes may be applied to each of the third volumes. In various examples, three, four, five, six, or more than six barcodes may be applied. The number of barcodes, in some cases, may be determined based on the volume of the sample 102 or the number of cells 106 in the sample 102.
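The relationship between the number of barcoding rounds and the number of cells that can be distinguished can be sketched as follows. This is a simplified model, not a description of the disclosed method: it assumes that each cell receives a barcode combination uniformly at random and independently of other cells, and the function names and the cell count in the usage lines are hypothetical.

```python
def barcode_combinations(barcodes_per_round):
    """Number of distinct barcode combinations, e.g. [96, 96] for two 96-well rounds."""
    total = 1
    for n in barcodes_per_round:
        total *= n
    return total


def expected_collision_fraction(n_cells, n_combinations):
    """Expected fraction of cells sharing a combination with at least one other cell.

    Assumes uniform, independent assignment of combinations to cells,
    which is a simplification of split-and-pool barcoding.
    """
    if n_cells < 1:
        return 0.0
    return 1.0 - (1.0 - 1.0 / n_combinations) ** (n_cells - 1)


two_rounds = barcode_combinations([96, 96])        # 96 * 96 = 9216
three_rounds = barcode_combinations([96, 96, 96])  # 96 ** 3 = 884736

# Hypothetical example: 1,000 cells labeled with two or three rounds.
collisions_two = expected_collision_fraction(1000, two_rounds)
collisions_three = expected_collision_fraction(1000, three_rounds)
```

Under this model, adding a barcoding round multiplies the number of combinations and lowers the expected collision fraction for a fixed number of cells, which is one reason the number of rounds may be chosen based on the number of cells 106 in the sample 102.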
A sequencer 124, in various implementations, generates a transcriptomic library and a spatial library by sequencing the cDNA 114 and the genomic DNA 111, respectively. The transcriptomic and spatial libraries may be sequenced, and the first barcodes 118 and second barcodes 123 may be used to determine which of the sequences are associated (e.g., from the same cell, cell population, physical region of a sample, etc.). The sequencing technique may include at least one of a massively parallel sequencing (MPS) technique, next generation sequencing, targeted sequencing, direct sequencing, Sanger sequencing, sequencing-by-synthesis, nanopore sequencing on the tDNA library, or any other suitable nucleic acid sequencing technique.
In some implementations, at least some of the methods and/or reagents described herein may be incorporated into a medical device 126. For instance, the medical device 126 may include reservoirs configured to hold at least one of the sample 102, the primer 112, the first enzymes 120, the second enzyme(s) 122, the first barcodes 118, the second barcodes 123, or any other reagent used in the methods described herein. In some examples, the medical device 126 is configured to introduce a reagent to the sample 102. For example, the medical device 126 may be configured to introduce the sample 102 and the tagged primer 112, the first enzymes 120, the second enzyme(s) 122, or any other reagent described herein to a container. In some examples, the medical device 126 is configured to separate the sample 102 into the first volumes and/or the second volumes. The medical device 126 may be configured to introduce the first barcodes 118 to the first volumes and/or the second barcodes 123 to the second volumes. In some examples, the medical device 126 may be configured to sequence the cDNA 114 and/or the genomic DNA 111. The medical device 126 may be configured to analyze the first and second barcodes 118 and 123 to determine the sequences corresponding to a single cell, a cell population, or the like. In various examples, the medical device 126 may be configured to analyze the transcriptomic and spatial libraries and output a report to a user (e.g., a laboratory technician, a researcher, a trained user, a clinician, a nurse, or the like) or an external device. The medical device 126, in some examples, includes a fluidic device, a microfluidic device, a robotic device, a computer, a processor, the sequencer 124, or any other device that can execute the methods described herein.
In some implementations of the present disclosure, spatial barcodes that are associated with a physical location within the sample 102 are added to the sample 102. For example, the sample 102 may include a tissue slice, and each of the spatial barcodes may be applied to distinct physical regions of the tissue slice. Based on applying the spatial barcodes, the sample 102 may undergo the processes described herein to generate the transcriptomic library and the spatial library. In various implementations, the physical distribution of the transcriptomic and spatial libraries may be determined using the spatial barcodes.
At 202, transcriptomic DNA is generated by applying a primer (e.g., the primer 112) to a sample (e.g., the sample 102). In various examples, the sample includes cells (e.g., the cells 106). The cells may be derived from a subject, or the cells may include synthetic cells. The sample may include permeabilized cells and crosslinked chromatin (e.g., the chromatin 110). In some implementations, the process 200 may include permeabilizing the cells and/or the nuclei (e.g., the nuclei 107) of the cells. In some implementations, the process 200 may include crosslinking the chromatin in the cells. The primer, in various examples, is configured to generate tagged cDNA (e.g., the cDNA 114) by reverse transcribing RNA in the cell. The cDNA may include a tag (e.g., the tag 116) that is configured to isolate the cDNA from the sample. The primer, in some instances, is configured to facilitate the ligation of first barcodes (e.g., the first barcodes 118) to the cDNA.
At 204, first fragments of genomic DNA (e.g., the genomic DNA 111) are generated by applying first enzymes (e.g., the first enzymes 120) to the sample. In various cases, the first enzymes are configured to fragment the crosslinked chromatin in the sample. In particular implementations, the first enzymes include two four-cut restriction enzymes (e.g., MseI, CviQI, or the like). In various examples, the first fragments include a TA at the 5′ end. The first fragments may be ligated by performing proximity ligation.
At 206, second fragments are generated by applying at least one second enzyme (e.g., the second enzyme(s) 122) to the sample. In some examples, the second enzyme(s) are configured to facilitate the ligation of first barcodes to the second fragments.
At 208, the first barcodes are applied to the sample. In some examples, the sample is separated into first volumes and the first barcodes are applied to the first volumes. A particular barcode of the first barcodes may be applied to each of the first volumes. In various cases, a particular barcode of the first barcodes may be applied to more than one of the first volumes. Based on applying the first barcodes, the first volumes may be pooled into the sample.
At 210, second barcodes (e.g., the second barcodes 123) are applied to the sample. In some examples, the sample is separated into second volumes and the second barcodes are applied to the second volumes. A particular barcode of the second barcodes may be applied to each of the second volumes. In various cases, a particular barcode of the second barcodes may be applied to more than one of the second volumes. Based on applying the second barcodes, the second volumes may be pooled into the sample.
At 212, a transcriptomic library and a spatial library of the sample are generated. For example, based on applying the second barcodes, the chromatin in the sample may be reverse crosslinked to generate genomic DNA sequences that include the first and second barcodes. The cDNA that includes the first and second barcodes, in some cases, is isolated from the sample by using a complementary tag. The complementary tag, in various examples, is configured to bind to the tag. For instance, a complementary tag may be linked to a magnetic bead and applied to the sample. A magnetic field may be applied to the sample to isolate the magnetic beads, thereby isolating the cDNA from the genomic DNA sequences. Based on isolating the cDNA, the transcriptomic library can be generated by sequencing the cDNA, and the spatial library can be generated by sequencing the genomic DNA sequences (e.g., by using the sequencer 124). In various implementations, the associated cDNA and genomic DNA sequences (e.g., from the same cell, cell population, physical region of the sample, etc.) can be determined using the first and second barcodes.
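The association of sequences across the two libraries can be illustrated with a short sketch that groups reads by their pair of barcodes. The read tuple layout, the library labels, the function names, and the toy reads are all assumptions made for this example; real sequencing output would first require barcode extraction and error correction, which are not shown.

```python
from collections import defaultdict


def group_by_cell(reads):
    """Group reads by (first barcode, second barcode).

    `reads` is an iterable of (library, barcode1, barcode2, sequence) tuples,
    where library is "transcriptomic" or "spatial" (hypothetical labels).
    """
    cells = defaultdict(lambda: {"transcriptomic": [], "spatial": []})
    for library, bc1, bc2, seq in reads:
        cells[(bc1, bc2)][library].append(seq)
    return dict(cells)


def co_assayed(cells):
    """Keep only barcode pairs that have reads in both libraries."""
    return {
        key: libs
        for key, libs in cells.items()
        if libs["transcriptomic"] and libs["spatial"]
    }


# Hypothetical toy reads: two barcode pairs, one observed in both libraries.
toy_reads = [
    ("transcriptomic", "AAC", "GGT", "ACGTACGT"),
    ("spatial", "AAC", "GGT", "TTGACCAA"),
    ("spatial", "CCA", "TTG", "GATTACAG"),
]
cells = group_by_cell(toy_reads)
paired = co_assayed(cells)
```

Here the barcode pair plays the role of a cell identifier, so that cDNA-derived and genomic DNA-derived sequences carrying the same pair are treated as originating from the same cell or cell population.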
As illustrated, the system 300 can include a memory 302. In various implementations, the memory 302 is volatile (including a component such as Random Access Memory (RAM)), non-volatile (including a component such as Read Only Memory (ROM), flash memory, etc.) or some combination of the two. The memory 302 may include various data, such as at least one component 304. The component(s) 304 can include methods, threads, processes, applications, or any other sort of executable instructions. For instance, the component(s) 304 may include instructions for performing any of the functionality described above with reference to
The memory 302 may include various instructions (e.g., among the component(s) 304), which can be executed by at least one processor 306 to perform operations. In some implementations, the processor(s) 306 includes a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or both CPU and GPU, or other processing unit or component known in the art.
The system 300 can also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage can include removable storage 308 and non-removable storage 310. Tangible computer-readable media can include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. The memory 302, removable storage 308, and non-removable storage 310 are all examples of computer-readable storage media. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Discs (DVDs), Content-Addressable Memory (CAM), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the system 300. Any such tangible computer-readable media can be part of the system 300.
The system 300 also can include input device(s) 312, such as a keypad, a cursor control, a touch-sensitive display, voice input device, etc., and output device(s) 314 such as a display, speakers, printers, etc. In various implementations, the input device(s) 312 can include the sequencer 124 and/or the medical device 126. These devices are well known in the art and need not be discussed at length here. In particular implementations, a user can provide input to the system 300 via a user interface associated with the input device(s) 312 and/or the output device(s) 314.
The system 300 can also include one or more wired or wireless transceiver(s) 316. For example, the transceiver(s) 316 can include a Network Interface Card (NIC), a network adapter, a Local Area Network (LAN) adapter, or a physical, virtual, or logical address to connect to the various base stations or networks contemplated herein, for example, or the various user devices and servers. To increase throughput when exchanging wireless data, the transceiver(s) 316 can utilize Multiple-Input/Multiple-Output (MIMO) technology. The transceiver(s) 316 can include any sort of wireless transceivers capable of engaging in wireless, Radio Frequency (RF) communication. The transceiver(s) 316 can also include other wireless modems, such as a modem for engaging in Wi-Fi, WiMAX, Bluetooth, or infrared communication.
In some implementations, the transceiver(s) 316 can be used to communicate between various functions, components, modules, or the like, that are included in the system 300. For instance, the transceiver(s) 316 can be used to transmit data between the system 300 and the sequencer 124, the medical device 126, an external user equipment (UE), analysis device, or the like.
Implementations of the present disclosure will now be described with reference to an Experimental Example.
This Experimental Example describes GAGE-seq (genome architecture and gene expression by sequencing), a highly scalable and cost-effective method for simultaneous profiling of chromatin interactions and gene expression in single cells. GAGE-seq, due to its combinatorial barcoding strategy, offers higher methodological throughput, as well as greater efficiency and effectiveness than recent technologies such as HiRES (Liu Z, et al. Science 2023; 380:1070-6). GAGE-seq was applied to profile 9,190 cells across diverse mammalian cell lines and tissues, including mouse brain and human bone marrow. Specifically, an experimental and analytical framework was developed to elucidate the connections between multiscale 3D genome features and cell type-specific gene expression, as well as their spatial and temporal interplay.
The present study complies with all pertinent ethical regulations. All the mice used in this study received humane care in compliance with the principles stated in the Guide for the Care and Use of Laboratory Animals, NIH Publication, 1996 edition, and the protocols were approved by the Institutional Animal Care and Use Committee (IACUC) at the University of Washington (Seattle, WA).
Cell lines used.
K562 (#CCL-243, ATCC), GM12878 (#GM12878, Coriell) and NIH3T3 cells (CRL-1658, ATCC) were purchased from the respective vendors. The myelodysplastic cell line MDS-L was a gift from Dr. Kaoru Tohyama (Kawasaki University of Medical Welfare, Japan).
GAGE-seq experimental details.
Preparation of the 96-well plates of barcoded adaptors. Two separate barcoding rounds of ligation reactions are used in GAGE-seq. The design of the scRNA-seq part barcodes resembles that of Split-seq (Rosenberg A B, et al. Science 2018; 360:176-82) and SHARE-seq (Ma S, et al. Cell 2020; 183:1103-1116). The molecular structure of the scHi-C part barcodes is depicted in (
Cell lysis. Crosslinked K562, NIH3T3, GM12878, MDS-L, and human bone marrow CD34+ cells were thawed from −80° C. or liquid nitrogen. 0.2 ml of high-salt lysis buffer 1 (50 mM HEPES pH 7.4, 1 mM EDTA pH 8.0, 1 mM EGTA pH 8.0, 140 mM NaCl, 0.25% Triton X-100, 0.5% IGEPAL CA-630, 10% glycerol, and 1× proteinase inhibitor cocktail (PIC)) was added per 1×10⁶ cells. The cell solution was mixed thoroughly and incubated on ice for 10 min. After this, cells were pelleted at 500×gravity (×g) for 2 min at 4° C. and then resuspended in 0.2 ml of high-salt lysis buffer 2 (10 mM Tris-HCl pH 8, 1.5 mM EDTA, 1.5 mM EGTA, 200 mM NaCl, 1×PIC). The solution was incubated on ice for 10 min. Following this, cells were pelleted at 500×g for 2 min at 4° C. and then resuspended in 200 μL of 1×T4 DNA ligase buffer (NEB, B0202S) containing 0.2% SDS. They were then incubated at 58° C. for 10 min. To quench the reaction, 200 μL ice-cooled 1×NWB and 10 μL 10% Triton X-100 (MilliporeSigma, 93443) were added to the tube. Finally, cells were spun at 500×g for 4 min at 4° C. For crosslinked mouse brain cortex cells, the treatment was simplified. The step involving high-salt lysis buffer 1 and high-salt lysis buffer 2 was omitted, and 0.1% SDS was used for cell lysis.
Reverse transcription. SDS treated cells were resuspended in 400 μL of RT mix (final concentration of 1×RT buffer, 500 mM dNTP, 10 mM Biotinylated RT primers, 7.5% PEG 6000 (VWR, 101443-484), 0.4 U/ml SUPERase⋅In™ RNase Inhibitor, and 25 U/ml Maxima H Minus Reverse Transcriptase (ThermoFisher Scientific of Waltham, MA, EP0752)). The RT primers contain a poly dT tail, a biotin molecule, and a universal ligation overhang. The sample then underwent a series of heating cycles. Initially, it was heated at 50° C. for 10 minutes, then it went through 3 thermal cycles (8° C. for 12 s, 15° C. for 45 s, 20° C. for 45 s, 30° C. for 30 s, 42° C. for 2 min and 50° C. for 3 min). Afterwards, the sample was again incubated at 50° C. for 10 minutes. After reverse transcription, 600 μL of 1×NWB was added, the sample was centrifuged at 500×g for 3 minutes, and the supernatant was then removed.
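The heating schedule described above can be written down as a small data structure, which makes the total programmed hold time easy to check. The sketch below is illustrative only: the `RT_PROGRAM` layout and the function name are hypothetical, and ramp times between temperatures are not included because they are not stated.

```python
# Each stage: (label, repeats, [(temperature in degrees Celsius, hold in seconds), ...])
RT_PROGRAM = [
    ("initial hold", 1, [(50, 600)]),
    ("cycling", 3, [(8, 12), (15, 45), (20, 45), (30, 30), (42, 120), (50, 180)]),
    ("final hold", 1, [(50, 600)]),
]


def total_hold_seconds(program):
    """Sum of all hold times, counting repeated stages; ramp time is excluded."""
    return sum(
        repeats * sum(seconds for _, seconds in steps)
        for _, repeats, steps in program
    )


total_seconds = total_hold_seconds(RT_PROGRAM)
total_minutes = total_seconds / 60
```

Each cycle holds for 432 seconds, so the three cycles plus the two 10-minute incubations amount to 2,496 seconds (41.6 minutes) of programmed hold time.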
1st-round chromatin fragmentation, proximity ligation, and 2nd-round chromatin fragmentation. Cells were resuspended in 400 μL of restriction enzyme (RE) digestion mix (1×T4 ligase buffer (New England Biolabs (NEB) of Ipswich, MA, B0202S), 500 U MseI (NEB, R0525M), 240 U CviQI (NEB, R0639L), 0.32 U/mL Enzymatics RNase Inhibitor, 0.05 U/ml SUPERase RNase Inhibitor), and incubated at room temperature (25° C.) for 2 hr. Cells were then centrifuged at 500×g for 3 minutes at 4° C., and the supernatant was removed. The remaining cell pellet was washed twice with 300 μL of 1×NWB, and as much supernatant was removed as possible. Next, the pellet was resuspended in 200 μL of ligation mix (1×T4 ligation buffer (NEB, B0202S), 50 Units T4 DNA ligase (ThermoFisher Scientific, EL0012), 0.32 U/ml Enzymatics RNase Inhibitor, 0.05 U/ml SUPERase RNase Inhibitor) and incubated at 16° C. overnight. This was followed by adding 20 μL 10×T4 ligation buffer, 1 μL SUPERase RNase Inhibitor and 20 μL DdeI (NEB, R0175L). The sample was then incubated at 37° C. for 1 hr and centrifuged at 500×g for 3 minutes, with the supernatant removed afterwards.
Combinatorial cellular barcoding. Cells were resuspended in 330 μL of ligation mix (1×T4 ligase buffer (NEB, B0202S), 100 Units T4 DNA ligase (ThermoFisher Scientific, EL0012), 0.25 mg/ml BSA (ThermoFisher Scientific, AM2618), 5% PEG-4000 (ThermoFisher Scientific, EL0012), 0.32 U/ml Enzymatics RNase Inhibitor, 0.05 U/ml SUPERase RNase Inhibitor) and distributed into each well (3 μL/well) of the first-round barcoding plate, which already contained 2 μL of CARE-seq 1st-round adaptors in each well. This barcoding plate was then incubated at 25° C. for 3 hr. Afterwards, cells from all 96 wells were pooled into three 1.5 ml tubes, and 5 μL of 10% NP-40 (ABCam of Waltham, MA, ab142227) was added to each tube. This was followed by centrifuging at 500×g for 3 minutes at 4° C. The supernatant was then removed and cells were resuspended in 300 μL 1×NWB containing 0.033% SDS and combined into one 1.5 ml tube. Cells were then pelleted at 500×g for 2 minutes at 4° C. After three additional rounds of washing with 300 μL 1×NWB containing 0.033% SDS, cells were resuspended in 200 μL 1×NWB containing 0.1% SDS and filtered with a 10 μm or 20 μm cell ministrainer (PluriStrainer, pluriSelect of Leipzig, Germany, 43-10010-50 or 43-10020-40). Cells were inspected under a microscope and counted with a hemocytometer. 7,500 cells were diluted with 1.25 ml of a dilution buffer containing 0.4×NEBuffer 2 (NEB, B7002S), 2 mg/ml BSA (ThermoFisher Scientific, AM2618), and 0.08 μM RNA ligation-1 block, and distributed into each well (3 μL/well) of a 96-well plate (the 2nd-round barcoding plate). Then, 2 μL of cell lysis buffer (5×NEBuffer 2, 0.625% SDS) were added to each well of the 2nd-round barcoding plate. The plate was incubated at 60° C. for at least 24 hr.
For the 2nd-round barcoding, 1.5 μL of pre-mixed GAGE-seq adapters (0.2 μM Hi-C-AD2 and 0.17 μM RNA-AD2) were added to the plate, followed by 23.5 μL of ligation mix (3 μL 1×T4 ligase buffer (NEB, B0202S), 0.15 μL 50 mg/ml BSA (ThermoFisher Scientific, AM2618), 1 μL 10% Triton X-100 (MilliporeSigma, 93443), 0.03 μL 20 μM 5′-P-TNA-Nextera-P5-AD, 0.03 μL 20 μM 5′-P-TA-Nextera-P5-AD, 0.03 μl 10 μM RNA ligation-1 block, and 0.8 μl T4 DNA ligase (ThermoFisher Scientific, EL0012)). The ligation was carried out at 25° C. for 24 hr, and then stopped by adding 2 μL of proteinase K digestion mix (0.2 μL proteinase K (ThermoFisher Scientific, AM2546), 0.5 μL 10% SDS and 1.8 μL water) to each well. A reverse crosslinking was carried out at 60° C. for 20 hr.
Reverse crosslinking and separation of scHi-C and scRNA-seq libraries. After reverse crosslinking, the sample in each 96-well plate was pooled into 12 DNA low-binding 1.5 ml tubes (Eppendorf of Hamburg, Germany, 022431021). Genomic DNA (gDNA) and cDNA were precipitated by adding 66 μL 3M Sodium Acetate Solution (pH 5.2) (MilliporeSigma of Burlington, MA, 127-09-3), 1 μL GlycoBlue (ThermoFisher Scientific, AM9515) and 720 μL iso-propanol (MilliporeSigma, 19516) to each tube, followed by incubating at −80° C. for at least 1 hr. The samples were then centrifuged at 15000 revolutions per minute (rpm) for 10 min and the pellet in each tube was resuspended in 30 μL 1×NEBuffer 2 containing 0.15% SDS. After incubation at 37° C. for 10 min, the samples were combined into one DNA low-binding tube. gDNA and cDNA were precipitated by adding 66 μL 3M Sodium Acetate Solution (pH 5.2) and 720 μL iso-propanol, followed by incubating at −80° C. for at least 1 hr. The sample was then centrifuged at 15000 rpm for 10 min and the pellet was resuspended in 100 μL buffer EB (Qiagen of Hilden, Germany, 19086). For each sample of a 96-well plate, 5.5 μL of MyOne C1 Dynabeads were washed twice with 1×B&W-T buffer (5 mM Tris pH 8.0, 1M NaCl, 0.5 mM EDTA, and 0.05% Tween 20) and resuspended in μL of 2×B&W buffer (10 mM Tris pH 8.0, 2M NaCl, and 1 mM EDTA) and added to the sample tube. The mixture was incubated at room temperature for 60 min and put on a magnetic stand to separate supernatant and beads.
Library construction and sequencing. Both scHi-C and scRNA-seq libraries were pooled, and paired-end sequencing (PE 150) was performed on the HiSeq, NextSeq, or NovaSeq platform (Illumina).
Demultiplexing. DNA and RNA reads were assigned to wells based on the two rounds of barcodes. For DNA reads, only read 2 was used for demultiplexing, allowing at most 1 mismatch in each of the two rounds of barcodes. DNA reads with more than 5 mismatches in the region between the two rounds of barcodes (the 9th-23rd nucleotides (nt)) were discarded. After demultiplexing, the first 12 nt were removed from read 1 and the first 35 nt were removed from read 2. For RNA reads, only read 1 was used for demultiplexing, allowing at most 1 mismatch in each barcode round. RNA reads with more than 6 mismatches in the region between the two rounds of barcodes (the 19th-48th nt) or with more than 6 mismatches in the region downstream of the first round of barcode (the 57th-71st nt) were discarded.
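The mismatch-tolerant barcode assignment described above can be sketched as follows. This is a minimal illustration and not the workflow's actual code: the function names, the toy barcode lists, and the barcode positions passed as slices are hypothetical, and ambiguous matches are simply rejected, which is an assumption.

```python
from typing import Optional, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed: str, whitelist: Sequence[str],
                  max_mismatch: int = 1) -> Optional[int]:
    """Index of the unique whitelist barcode within max_mismatch, else None."""
    hits = [i for i, bc in enumerate(whitelist)
            if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatch]
    # Assumption: a read matching zero or several barcodes is left unassigned.
    return hits[0] if len(hits) == 1 else None


def demultiplex(read: str, barcodes_a: Sequence[str], barcodes_b: Sequence[str],
                slice_a: slice, slice_b: slice,
                max_mismatch: int = 1) -> Optional[Tuple[int, int]]:
    """Assign a read to a pair of barcode indices, one per barcoding round.

    slice_a and slice_b give the (hypothetical) barcode positions in the read.
    """
    a = match_barcode(read[slice_a], barcodes_a, max_mismatch)
    b = match_barcode(read[slice_b], barcodes_b, max_mismatch)
    if a is None or b is None:
        return None
    return (a, b)


def trim_reads(read1: str, read2: str, n1: int = 12, n2: int = 35) -> Tuple[str, str]:
    """Remove the first n1 nt of read 1 and the first n2 nt of read 2."""
    return read1[n1:], read2[n2:]
```

With toy 4-nt barcodes, a read carrying one mismatch in the first barcode is still assigned, whereas a read that is two mismatches away from every barcode is dropped.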
The two reference genomes were combined into a single reference genome file used for all GAGE-seq libraries. For DNA reads, Burrows-Wheeler Aligner (BWA) (0.7.17) (Li H, Durbin R. Bioinformatics 2009; 25:1754-60) was used for alignment. The combined reference genome was indexed using command bwa index -a bwtsw. Paired, trimmed DNA reads were aligned to the combined reference genome using command bwa mem -SP5M. For RNA reads, Spliced Transcripts Alignment to a Reference (STAR) (2.7.8a) (Dobin A, et al. Bioinformatics 2013; 29:15-21) was used for alignment. The GENCODE annotation files for human (v36) and mouse (vM25) were downloaded and concatenated. The combined reference genome was indexed using the options --runMode genomeGenerate --sjdbOverhang 100 with the combined GENCODE annotation file. Only read 2 of RNA reads was aligned with the command STAR --outSAMunmapped Within.
Identification of contact pairs from DNA reads. Pairtools (0.3.1.dev1) (Goloborodko A, et al. mirnylab/pairtools: v0.2.0. 2018) was used to identify contact pairs from paired DNA reads with command pairtools parse --walks-policy all --no-flip --min-mapq=10. After that, walk reads (i.e., DNA reads containing multiple ligation sites) were further processed. In this Example, it was assumed that any pair of loci in the same DNA read forms a valid contact pair, and these contact pairs were included in the results.
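The treatment of walk reads can be illustrated with a short sketch. It assumes that a walk read has already been reduced to an ordered list of aligned loci; the Locus type and the function name are hypothetical and are not part of pairtools.

```python
from itertools import combinations
from typing import List, Tuple

Locus = Tuple[str, int]  # (chromosome, position); hypothetical representation


def contacts_from_walk(loci: List[Locus]) -> List[Tuple[Locus, Locus]]:
    """Every unordered pair of loci in one read is treated as a contact pair."""
    return list(combinations(loci, 2))
```

Under this assumption, a read containing n aligned loci contributes n(n−1)/2 contact pairs, so a read with three loci yields three pairs.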
Deduplication of contact pairs. The contact pairs were deduplicated. The genomic positions of the two ends of each contact pair were extracted. Two contact pairs are defined as directly duplicated if their first ends lie within 500 nt of each other and their second ends also lie within 500 nt of each other. If two contact pairs are not directly duplicated, but are directly or indirectly duplicated with a third contact pair, the first two contact pairs are defined as indirectly duplicated. Among each cluster (i.e., connected component) of directly or indirectly duplicated contact pairs, the one with the largest difference between its two ends' genomic positions was retained, and the rest were marked as duplicates.
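The connected-component deduplication can be sketched with a union-find structure. This is an illustrative, quadratic-time sketch rather than the workflow's implementation. The tuple layout, the function names, the inclusive 500-nt threshold, and the requirement that both ends lie on matching chromosomes are assumptions.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, int, str, int]  # (chrom1, pos1, chrom2, pos2); hypothetical layout


def directly_duplicated(a: Pair, b: Pair, tol: int = 500) -> bool:
    """True if both ends of the two contact pairs lie within tol nt of each other."""
    return (a[0] == b[0] and a[2] == b[2]
            and abs(a[1] - b[1]) <= tol and abs(a[3] - b[3]) <= tol)


def deduplicate(pairs: List[Pair], tol: int = 500) -> List[Pair]:
    """Keep one contact pair per connected component of duplicated pairs."""
    parent = list(range(len(pairs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every directly duplicated pair; indirect duplicates end up
    # in the same component through shared neighbours.
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            if directly_duplicated(pairs[i], pairs[j], tol):
                parent[find(i)] = find(j)

    # Within each component, retain the pair with the largest span.
    best: Dict[int, Pair] = {}
    for i, p in enumerate(pairs):
        root = find(i)
        if root not in best or abs(p[3] - p[1]) > abs(best[root][3] - best[root][1]):
            best[root] = p
    return list(best.values())
```

In the test case below, the first and third pairs are 800 nt apart and therefore not directly duplicated, but both are directly duplicated with the second pair, so all three collapse into one cluster.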
Deduplication of RNA reads. The RNA reads were deduplicated. Two RNA reads are defined as directly duplicated if there is at most 1 mismatch in their UMI and if their genomic positions differ by at most 5 nt. The rest of the process is similar to the deduplication of contact pairs. Only one RNA read from each duplicate cluster is retained.
Integration with Multiplexed Error-Robust Fluorescence in situ Hybridization (MERFISH) data. Integration of GAGE-seq data and MERFISH data was done with Seurat (4.1.1) (Chidester B, et al. Nat Genet 2023; 55:78-88). Only scRNA-seq profiles from the GAGE-seq data were used for this integration. In the GAGE-seq mouse brain cortex data, the following cell types of excitatory neurons were used: L2/3 IT CTX a, L2/3 IT CTX b, L2/3 IT CTX c, L4 IT CTX, L4/5 IT CTX, L5 IT CTX, L6 IT CTX, L6 CT CTX a, L6 CT CTX b, L5/6 NP CTX, and L6b CTX. In the MERFISH data, cells from L2/3 IT, L4/5 IT, L5 IT, L5/6 NP, L6 CT, L6 IT, and L6b were used. Each time, the selected cells from GAGE-seq were integrated with one slice from the MERFISH data. All genes detected and expressed in both GAGE-seq and MERFISH were used. The ‘FindIntegrationAnchors’ and ‘IntegrateData’ functions were used with default parameters, except that the number of dimensions was set to 10.
Inference of whole-transcriptome expression and 3D genome features for MERFISH cells. The integrated single-cell expression profiles of GAGE-seq data and MERFISH data were scaled by the ‘ScaleData’ function from Seurat with default parameters, and the first 30 PCs were calculated by the ‘RunPCA’ function. A 50-nearest neighbor regressor was created to estimate whole-transcriptome expression and 3D genome features from the 30-dimensional PC space. The regressor was trained on GAGE-seq data and then applied to the MERFISH data. The Gaussian kernel was used as the weight function. For each MERFISH cell, the bandwidth was defined as the 0.3 quantile of the distances to the 50 nearest neighbors.
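A Gaussian-weighted nearest-neighbor regressor of the kind described above can be sketched in plain Python. The kernel form exp(−d²/(2h²)), the linearly interpolated quantile, and all function names are assumptions made for illustration; the text specifies only the number of neighbors, the Gaussian kernel, and the 0.3-quantile bandwidth.

```python
import math
from typing import List, Sequence


def quantile(sorted_vals: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile of an ascending-sorted sequence."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def knn_kernel_predict(train_x: Sequence[Sequence[float]],
                       train_y: Sequence[Sequence[float]],
                       query: Sequence[float],
                       k: int = 50, q: float = 0.3) -> List[float]:
    """Predict a feature vector for one query cell from its k nearest training cells.

    train_x holds PC coordinates of reference cells, train_y the features to
    transfer (for example expression or 3D genome features).
    """
    # Distances to all training cells, keeping the k nearest.
    nearest = sorted((math.dist(x, query), i) for i, x in enumerate(train_x))[:k]
    # Per-query bandwidth: the q-quantile of the neighbor distances.
    h = max(quantile([d for d, _ in nearest], q), 1e-12)
    weights = [math.exp(-(d * d) / (2.0 * h * h)) for d, _ in nearest]
    total = sum(weights)
    n_out = len(train_y[0])
    return [sum(w * train_y[i][j] for w, (_, i) in zip(weights, nearest)) / total
            for j in range(n_out)]
```

Because the bandwidth is recomputed for each query cell, cells in sparse regions of the PC space use a wider kernel than cells in dense regions.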
Integration with Paired-seq data. The integration of GAGE-seq data with Paired-seq data (Zhu et al. Nat Methods 2021; 18:283-92) was done using Seurat. Only scRNA-seq profiles from the GAGE-seq data and the Paired-seq data were used for this integration. In the GAGE-seq mouse brain cortex data, three cell types were excluded: L2 IT RvPP, L2/3 IT RSP, and L5 IT RSP. In the Paired-seq data, cells from BR_NonNeu_Endothelial, HC_ExNeu_CA1, HC_ExNeu_CA23, HC_ExNeu_DG, HC_ExNeu_Subiculum, and HC_NonNeu_Ependymal were excluded. The ‘SelectIntegrationFeatures’, ‘FindIntegrationAnchors’ and ‘IntegrateData’ functions were used with default parameters.
Inference of accessibility for GAGE-seq cells. The integrated single-cell expression profiles of GAGE-seq data and Paired-seq data were scaled by the ‘ScaleData’ function from Seurat with default parameters. The first 20 PCs were calculated by the ‘RunPCA’ function. To estimate whole-transcriptome expression and 3D genome features from the 40-dimensional PC space, a 50-nearest neighbor regressor was created, which was trained on Paired-seq data and then applied to the GAGE-seq data. The Gaussian kernel was used as the weight function. For each GAGE-seq cell, the bandwidth was set based on the 0.3 quantile of the distances to the 40 nearest neighbors.
Trajectory and pseudotime. The pseudotime of human bone marrow cells was inferred by the ‘sc.tl.diffmap’ and ‘sc.tl.dpt’ functions in Scanpy (1.9.3) (Wolf F A, et al. Genome Biol 2018; 19:15), jointly from the paired scRNA-seq profiles and scHi-C profiles. Specifically, cells in the HSC, MPP, MLP, and B-NK clusters were included. The first 5 PCs of the scRNA-seq profiles were used for the scRNA-based pseudotime and the first 2 PCs of the Fast-Higashi embeddings of the scHi-C profiles were used for the scHi-C-based pseudotime. The 5 scRNA-seq PCs and the 2 scHi-C PCs were then concatenated and used for the joint pseudotime. The ‘sc.pp.neighbors’ function was used to construct the neighbor graph with 30 (scRNA-based and joint pseudotime) or 20 (scHi-C-based pseudotime) nearest neighbors per cell. The ‘sc.tl.diffmap’ and ‘sc.tl.dpt’ functions were applied with 10 diffusion components to learn a latent representation focusing on the trajectory and to infer the pseudotime for single cells. The origin of the trajectory was set based on the average expression level of HSC marker genes previously identified (Zhang Y, et al. Dev Cell 2022; 57:2745-60).
Unsupervised clustering of genes. The clustering of genes was based on the expression and scA/B value. Genes expressed in at least 20 cells were included. To generate features for genes, 1) the expression levels and scA/B values were z-score normalized per gene among all cells, 2) cells were evenly divided into 10 bins based on the pseudotime, and 3) the average values of the expression and scA/B value in each bin were calculated for each gene. This process led to 20 features for each gene. The Louvain clustering algorithm was then applied to genes with 20 neighbors and a resolution of 1.5. The correlation was used as the distance metric.
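The construction of the 20 per-gene features can be sketched as follows. The sketch assumes that "evenly divided" means bins with equal numbers of cells after sorting by pseudotime, and that the z-score uses the population standard deviation; both choices, and all function names, are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Z-score normalize one gene's values across all cells."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def binned_means(values: Sequence[float], order: Sequence[int],
                 n_bins: int = 10) -> List[float]:
    """Average values within n_bins equal-sized groups of cells.

    order lists cell indices sorted by pseudotime; requires len(order) >= n_bins.
    """
    n = len(order)
    return [mean(values[i] for i in order[b * n // n_bins:(b + 1) * n // n_bins])
            for b in range(n_bins)]


def gene_features(expr: Sequence[float], scab: Sequence[float],
                  pseudotime: Sequence[float], n_bins: int = 10) -> List[float]:
    """Concatenate binned expression and binned scA/B values for one gene."""
    order = sorted(range(len(pseudotime)), key=lambda i: pseudotime[i])
    return (binned_means(zscore(expr), order, n_bins)
            + binned_means(zscore(scab), order, n_bins))
```

Each gene is thereby summarized by 10 expression values and 10 scA/B values along the pseudotime axis, and the resulting 20-dimensional vectors can be passed to a graph-based clustering routine.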
Boxplots in all figures show the median, first, and third quartiles, and whiskers extend no further than 1.5× interquartile range. The robustness and reproducibility of GAGE-seq were validated extensively by using multiple cell lines and primary tissue cells (both mouse and human). Blinding was not relevant to the study, thus data collection and analysis were not performed blind to the conditions of the experiments. No statistical method was used to predetermine sample size. The experiments were not randomized.
All sequencing data from this study have been submitted to GEO under the accession #GSE238001. The following publicly available datasets were used in this Example: in situ Hi-C datasets from Rao et al. (Cell 2014; 159:1665-80) (GEO: GSE63525); scHi-C datasets from Nagano et al. (Nature 2013; 502:59-64) (GEO: GSE48262), Nagano et al. (Nature 2017; 547:61-7) (GEO: GSE94489), Ramani et al. (Nat Methods 2017; 14:263-6) (GEO: GSE84920), Kim et al. (PLOS Comput Biol 2020; 16) (4DN Data Portal: 4DNES4D5MWEZ, 4DNESUE2NSGS, 4DNESIKGI39T, 4DNES1BK1RMQ, and 4DNESTVIP977), Tan et al. (Science 2018; 361:924-8) (GEO: GSE117876), Tan et al. (Nat Struct Mol Biol 2019; 26:297-307) (GEO: GSE121791), Tan et al. (Cell 2021; 184:741-758) (GEO: GSE162511), Flyamer et al. (Nature 2017; 544:110-4) (GEO: GSE80006), Gassler et al. (EMBO J 2017; 36:3600-18) (GEO: GSE100569), Stevens et al. (Nature 2017; 544:59-64) (GEO: GSE80280), Collombet et al. (Nature 2020; 580:142-6) (GEO: GSE129029), Lee et al. (Nat Methods 2019; 16:999-1006) (GEO: GSE124391), Liu et al. (Nature 2021; 598:120-8) (GEO: GSE132489), and Mulqueen et al. (Nat Biotechnol 2021; 39:1574-80) (GEO: GSE174226); scRNA-seq datasets from Chen et al. (Nat Biotechnol 2019; 37:1452-7) (GEO: GSE126074), Plongthongkum et al. (Nat Protoc 2021; 16:4992-5029) (GEO: GSE157660), Chen et al. (Nat Methods 2022; 19:547-53) (GEO: GSE178707), Ma et al. (Cell 2020; 183:1103-1116) (GEO: GSE140203), Xu et al. (Nat Methods 2022; 19:1243-9) (ArrayExpress: E-MTAB-11264), Xiong et al. (Nat Methods 2021; 18:652-60) (GEO: GSE158435), Zhu et al. (Nat Struct Mol Biol 2019; 26:1063-70) (GEO: GSE130399), Zhu et al. (Nat Methods 2021; 18:283-92) (GEO: GSE152020), Cao et al. (Science 2018; 361:1380-5) (GEO: GSE117089), Mimitou et al. (Nat Methods 2019; 16:409-12) (GEO: GSE126310), and Zhang et al. (Dev Cell 2022; 57:2745-2760) (GEO: GSE137864); HiRES co-assayed scHi-C and scRNA-seq datasets from Liu et al. (Science 2023; 380:1070-6) (GEO: GSE223917); MERFISH spatial transcriptome datasets from Zhang et al. (Nature 2021; 598:137-43) (Brain Image Library: cf1c1a431ef8d021); Paired-seq co-assayed scRNA-seq and scATAC-seq from Zhu et al. (Nat Methods 2021; 18:283-92) (GEO: GSE152020).
The source code of the GAGE-seq data processing and analysis workflows can be accessed at https://github.com/ma-compbio/GAGE-seq and has also been deposited in Zenodo (https://doi.org/10.5281/zenodo.10888453) (Zhou T. GAGE-seq analysis workflow. 2024). Notebooks provided in the GitHub repository (https://github.com/ma-compbio/GAGE-seq/tree/main/scripts_analysis) detail the integration between GAGE-seq and Paired-seq data for single-cell joint analysis of 3D genome structure, chromatin accessibility, and gene expression.
GAGE-seq is a high-throughput, effective, and robust single-cell multiomics technology that simultaneously profiles the 3D genome and transcriptome in individual cells (
To assess the quality and specificity of GAGE-seq data, experiments were performed using a mixture of human (K562) and mouse (NIH3T3) cell lines (
Validating GAGE-seq in additional cell lines, GM12878 and MDS-L, further confirmed its robustness, specificity, sensitivity, and reproducibility (
To demonstrate the utility of GAGE-seq in unveiling complex cell types based on single-cell 3D genome features and gene expression within a tissue context, the focus was turned to the adult mouse brain cortex, known for its cell type diversity. Applying GAGE-seq on cells from the mouse cortex (8-9 weeks old), 3,296 high-quality joint single-cell profiles of chromatin interactions and transcriptomes were generated. On average, each cell displayed 231,136 chromatin contacts (at 50% duplication rate), with 20,160 UMIs and 1,883 genes per cell (59% duplication rate), in line with the adult mouse whole brain data from the recently published HiRES data (
The disclosed GAGE-seq scRNA-seq data identified 28 known cell types across three major lineages in the mouse cortex, including 15 excitatory neuron subtypes, 8 inhibitory neuron subtypes, and 5 glial cell subtypes, such as astrocytes and oligodendrocytes (
Using GAGE-seq to map the 3D genome and transcriptome of single cells, the in situ variation of the 3D genome in the adult mouse cortex was explored. GAGE-seq scRNA-seq was leveraged as a “bridge” for this analysis. Recently, the spatial transcriptomics method MERFISH successfully discerned the spatial organization of distinct cell populations in the mouse primary motor cortex (Zhang M, supra). The analysis began by integrating the disclosed GAGE-seq scRNA-seq data with the MERFISH data using Seurat (Chidester B, et al. Nat Genet 2023; 55:78-88), establishing a connection between the two datasets.
The excitatory neuron cell types present in both GAGE-seq and MERFISH datasets were focused on. Within the integrated embedding space, cells primarily clustered by cell type, and cells from both datasets integrated cohesively, indicating high correlation between cell types identified by the two methods (
Next the relationship between gene expression and various multiscale 3D genome features was rigorously examined in single cells, including A/B compartments, TAD-like domains, and chromatin loops.
The analysis of the 3,461 genes expressed in inhibitory neurons (n=508) or excitatory neurons (n=1,938) revealed a strong correlation between cell type-specific gene expression and scA/B value, reflecting compartmentalization variations (Tan L, et al. Cell, supra; Zhang R, supra) (
Subsequently the relationship between single-cell insulation score surrounding the gene body and the potential occurrence of domain melting was investigated within the diverse collection of cell types revealed by GAGE-seq. The four genes (Grik2, Dscam, Rbfox1, and Nrxn) known to undergo domain melting were focused on (Winick-Ng W, supra), profiling their scA/B value, single-cell insulation score, and single-cell gene expression. Notably, these genes manifested high expression across almost all 28 cell subtypes revealed by GAGE-seq, with the exception of Dscam and Grik2 in VLMC and Micro cells (
Next the above observed connection between multiscale 3D genome features and gene expression was further confirmed at single cell resolution. Higher gene expression in a cell often corresponded to a higher scA/B value and lower single-cell insulation score in the same cell (
The observations were then confirmed on single loci. As a proof of principle, the Pvalb inhibitory subtype was focused on (including both Pvalb a and Pvalb b). First genes were selected that have 1) significantly higher scA/B values and expression in inhibitory neurons compared to excitatory neurons (
It was next aimed to demonstrate how integrating GAGE-seq with chromatin accessibility data enhances the connection between CREs and target genes. For this, GAGE-seq was integrated with Paired-seq data (from the same mouse cortex region) (Zhu C, supra). Overall, genes with distinct contributions from 3D genome and chromatin accessibility show varied functions (
The integrative analysis of GAGE-seq and chromatin accessibility enhances the connection of CREs to their target genes. The gene expression and transcription start site (TSS)-CRE interaction frequency correlation decreases with greater genomic distance between TSS and CRE (
The joint regulation of gene expression by 3D genome and chromatin accessibility at individual gene loci was explored. A strong correlation was found between Epha4 gene expression and the chromatin interaction frequency with a distal CRE, as well as between Epha4 gene expression and chromatin accessibility at the TSS and the distal CRE in different excitatory neuron subtypes (
Hematopoiesis is a classic model system with well-characterized cell type changes and their associated gene expression signatures, making it an ideal model for exploring the dynamic relationship between 3D genome structure and gene expression. GAGE-seq profiles of 2,815 human bone marrow (BM) CD34+ cells were generated after stringent quality filtering, obtaining an average of 265,336 chromatin contacts (at 50% duplication rate) and detecting on average 5,504 UMIs and 985 genes per cell (at 63% duplication rate), which is in line with the publicly available scRNA-seq datasets. To mitigate the potential impact of the 3D genome's cell-cycle dynamics (Nagano T, supra), the analysis was restricted to high-quality G0/G1 phase cells (837 cells).
Unsupervised clustering of GAGE-seq scRNA-seq data revealed six clusters (five clusters with continuous diffusion and one distinct cluster), each displaying unique gene signatures (
Focusing on four of the six identified cell types (HSC, MPP, MLP and B-NK), which represent early B-NK lineage, GAGE-seq was used to reconstruct the developmental trajectory, demonstrating the dynamic interplay between genome structure and gene expression along this trajectory. Transcriptome and 3D genome-based pseudotime trajectories, inferred from GAGE-seq data, were highly congruent (
Comparisons between marker gene expression and 3D genome features in individual cell types during differentiation pseudotime suggest complex temporal interplay between both scA/B values and single-cell insulation scores with marker gene expressions.
An unsupervised clustering was then performed to further unravel relationships between gene expression and 3D genome features in the B-NK differentiation, based on all genes expressed in at least twenty single cells in the trajectory. 11 distinct gene clusters were identified (
Regarding chromatin domains, a uniform temporal trend was observable in the aggregated single-cell insulation scores across all gene clusters, mirroring the pattern seen in the marker gene sets (
The described high-throughput multiomic single-cell technology, GAGE-seq, delivers an integrative approach to co-assay 3D genome structure and gene expression in individual cells with high resolution. In this Example, it is demonstrated that GAGE-seq can reveal complex cell types from complex tissues not identified by other existing methods. Additionally, its data integration with spatial transcriptomic data points to great potential to reach a deeper understanding of 3D genome variation within complex tissues. Importantly, GAGE-seq also facilitates the reconstruction of differentiation trajectories based on 3D genome features, transcriptomes, or both. The disclosed integration of GAGE-seq with single-cell chromatin accessibility data further highlights the advantage of GAGE-seq in linking CREs and their target genes. The high congruence between these modalities underscores the intimate connection between the temporal variations of the 3D genome and transcriptional rewiring during cell differentiation. GAGE-seq has revealed much more nuanced relationships between 3D genome features and gene expression during bone marrow B-NK lineage differentiation, creating a resource for future studies to disentangle causal gene regulatory changes in differentiation through the lens of 3D genome in single cells.
GAGE-seq is characterized by its efficiency, scalability, robustness, cost-effectiveness, and adaptability. GAGE-seq, along with the described analytical tools, could significantly enhance the current toolkit for single-cell epigenomics. With wide-ranging applications, GAGE-seq can deepen the understanding of genome structure and function, providing insights into normal development and disease pathogenesis. GAGE-seq can be integrated with spatial labeling technologies, producing spatially-resolved scHi-C and scRNA-seq data. GAGE-seq offers the opportunity to integrate different molecular features in single cells, leading to a more comprehensive understanding of genome structure, cellular function, and their spatiotemporal variability.
The features disclosed in the foregoing description, or the following claims, or the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for attaining the disclosed result, as appropriate, may, separately, or in any combination of such features, be used for realizing implementations of the disclosure in diverse forms thereof.
As will be understood by one of ordinary skill in the art, each implementation disclosed herein can comprise, consist essentially of or consist of its particular stated element, step, or component. Thus, the terms “include” or “including” should be interpreted to recite: “comprise, consist of, or consist essentially of.” The transition term “comprise” or “comprises” means has, but is not limited to, and allows for the inclusion of unspecified elements, steps, ingredients, or components, even in major amounts. The transitional phrase “consisting of” excludes any element, step, ingredient or component not specified. The transition phrase “consisting essentially of” limits the scope of the implementation to the specified elements, steps, ingredients or components and to those that do not materially affect the implementation. As used herein, the term “based on” is equivalent to “based at least partly on,” unless otherwise specified.
Unless otherwise indicated, all numbers expressing quantities, properties, conditions, and so forth used in the specification and claims are to be understood as being modified in all instances by the term “about.” Accordingly, unless indicated to the contrary, the numerical parameters set forth in the specification and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by the present disclosure. At the very least, and not as an attempt to limit the application of the doctrine of equivalents to the scope of the claims, each numerical parameter should at least be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. When further clarity is required, the term “about” has the meaning reasonably ascribed to it by a person skilled in the art when used in conjunction with a stated numerical value or range, i.e. denoting somewhat more or somewhat less than the stated value or range, to within a range of ±20% of the stated value; ±19% of the stated value; ±18% of the stated value; ±17% of the stated value; ±16% of the stated value; ±15% of the stated value; ±14% of the stated value; ±13% of the stated value; ±12% of the stated value; ±11% of the stated value; ±10% of the stated value; ±9% of the stated value; ±8% of the stated value; ±7% of the stated value; ±6% of the stated value; ±5% of the stated value; ±4% of the stated value; ±3% of the stated value; ±2% of the stated value; or ±1% of the stated value.
Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the disclosure are approximations, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical value, however, inherently contains certain errors necessarily resulting from the standard deviation found in their respective testing measurements.
The terms “a,” “an,” “the” and similar referents used in the context of describing implementations (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein is intended merely to better illuminate implementations of the disclosure and does not pose a limitation on the scope of the disclosure. No language in the specification should be construed as indicating any non-claimed element essential to the practice of implementations of the disclosure.
Groupings of alternative elements or implementations disclosed herein are not to be construed as limitations. Each group member may be referred to and claimed individually or in any combination with other members of the group or other elements found herein. It is anticipated that one or more members of a group may be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.
Variants of the sequences disclosed and referenced herein are also included. Guidance in determining which amino acid residues can be substituted, inserted, or deleted without abolishing biological activity can be found using computer programs well known in the art, such as DNASTAR™ (Madison, Wisconsin) software. Preferably, amino acid changes in the protein variants disclosed herein are conservative amino acid changes, i.e., substitutions of similarly charged or uncharged amino acids. A conservative amino acid change involves substitution of one of a family of amino acids which are related in their side chains.
Variants of the protein, nucleic acid, and gene sequences disclosed herein also include sequences with at least 70% sequence identity, 80% sequence identity, 85% sequence identity, 90% sequence identity, 95% sequence identity, 96% sequence identity, 97% sequence identity, 98% sequence identity, or 99% sequence identity to the protein, nucleic acid, or gene sequences disclosed herein.
“% sequence identity” refers to a relationship between two or more sequences, as determined by comparing the sequences. In the art, “identity” also means the degree of sequence relatedness between protein, nucleic acid, or gene sequences as determined by the match between strings of such sequences. “Identity” (often referred to as “similarity”) can be readily calculated by known methods, including those described in: Computational Molecular Biology (Lesk, A. M., ed.) Oxford University Press, NY (1988); Biocomputing: Informatics and Genome Projects (Smith, D. W., ed.) Academic Press, NY (1994); Computer Analysis of Sequence Data, Part I (Griffin, A. M., and Griffin, H. G., eds.) Humana Press, NJ (1994); Sequence Analysis in Molecular Biology (Von Heijne, G., ed.) Academic Press (1987); and Sequence Analysis Primer (Gribskov, M. and Devereux, J., eds.) Oxford University Press, NY (1992). Preferred methods to determine identity are designed to give the best match between the sequences tested. Methods to determine identity and similarity are codified in publicly available computer programs. Sequence alignments and percent identity calculations may be performed using the Megalign program of the LASERGENE bioinformatics computing suite (DNASTAR, Inc., Madison, Wisconsin). Multiple alignment of the sequences can also be performed using the Clustal method of alignment (Higgins and Sharp, CABIOS, 5, 151-153 (1989)) with default parameters (GAP PENALTY=10, GAP LENGTH PENALTY=10). Relevant programs also include the GCG suite of programs (Wisconsin Package Version 9.0, Genetics Computer Group (GCG), Madison, Wisconsin); BLASTP, BLASTN, BLASTX (Altschul, et al., J. Mol. Biol. 215:403-410 (1990)); DNASTAR (DNASTAR, Inc., Madison, Wisconsin); and the FASTA program incorporating the Smith-Waterman algorithm (Pearson, Comput. Methods Genome Res., [Proc. Int. Symp.] (1994), Meeting Date 1992, 111-20. Editor(s): Suhai, Sandor. Publisher: Plenum, New York, N.Y.). Within the context of this disclosure it will be understood that where sequence analysis software is used for analysis, the results of the analysis are based on the “default values” of the program referenced. As used herein, “default values” means any set of values or parameters that originally load with the software when first initialized.
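By way of non-limiting illustration, a percent identity value of the kind discussed above can be computed from two sequences that have already been aligned by one of the referenced programs. The following Python sketch is not one of those programs and is not asserted to reproduce their output. It assumes, for illustration only, that gaps are written as "-" and that the denominator is the number of alignment columns in which at least one sequence has a residue; other conventions exist. The function names `percent_identity` and `thresholds_met` are hypothetical.

```python
from typing import List

# Identity thresholds recited for variants in this disclosure (percent).
IDENTITY_THRESHOLDS = (70, 80, 85, 90, 95, 96, 97, 98, 99)


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumptions (illustrative, not from the disclosure): gaps are '-',
    comparison is case-insensitive, and columns where both sequences
    have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("alignment has no comparable columns")
    # A column counts as a match only if both residues are identical
    # and neither is a gap.
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def thresholds_met(identity: float) -> List[int]:
    """Return the recited thresholds that the given identity satisfies."""
    return [t for t in IDENTITY_THRESHOLDS if identity >= t]
```

For example, under these assumptions, two aligned ten-residue sequences that differ at a single position would have 90% sequence identity and would satisfy the 70%, 80%, 85%, and 90% thresholds.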
Variants also include nucleic acid molecules that hybridize under stringent hybridization conditions to a sequence disclosed herein and provide the same function as the reference sequence. Exemplary stringent hybridization conditions include an overnight incubation at 42° C. in a solution including 50% formamide, 5×SSC (750 mM NaCl, 75 mM trisodium citrate), 50 mM sodium phosphate (pH 7.6), 5×Denhardt's solution, 10% dextran sulfate, and 20 μg/ml denatured, sheared salmon sperm DNA, followed by washing the filters in 0.1×SSC at 50° C. Changes in the stringency of hybridization and signal detection are primarily accomplished through the manipulation of formamide concentration (lower percentages of formamide result in lowered stringency), salt conditions, or temperature. For example, moderately high stringency conditions include an overnight incubation at 37° C. in a solution including 6×SSPE (20×SSPE=3M NaCl; 0.2M NaH2PO4; 0.02M EDTA, pH 7.4), 0.5% SDS, 30% formamide, and 100 μg/ml salmon sperm blocking DNA, followed by washes at 50° C. with 1×SSPE, 0.1% SDS. In addition, to achieve even lower stringency, washes performed following stringent hybridization can be done at higher salt concentrations (e.g., 5×SSC). Variations in the above conditions may be accomplished through the inclusion and/or substitution of alternate blocking reagents used to suppress background in hybridization experiments. Typical blocking reagents include Denhardt's reagent, BLOTTO, heparin, denatured salmon sperm DNA, and commercially available proprietary formulations. The inclusion of specific blocking reagents may require modification of the hybridization conditions described above, due to problems with compatibility.
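By way of non-limiting illustration, the salt concentrations implied by the fold-strength notation above (e.g., 5×SSC, 6×SSPE, 0.1×SSC) follow from linear dilution of a 20× stock. The Python sketch below performs that arithmetic. The 20×SSPE composition is taken from the text above; the 20×SSC composition (3 M NaCl, 0.3 M trisodium citrate) is an assumption back-calculated from the 5×SSC values stated above. The names `STOCKS_MM` and `working_concentrations` are hypothetical.

```python
# 20x stock compositions in mM.
# SSPE: as stated in the text (3 M NaCl; 0.2 M NaH2PO4; 0.02 M EDTA).
# SSC: assumed, back-calculated from the stated 5x values
#      (750 mM NaCl, 75 mM trisodium citrate).
STOCK_FOLD = 20
STOCKS_MM = {
    "SSC": {"NaCl": 3000.0, "trisodium citrate": 300.0},
    "SSPE": {"NaCl": 3000.0, "NaH2PO4": 200.0, "EDTA": 20.0},
}


def working_concentrations(buffer: str, fold: float) -> dict:
    """Component concentrations (mM) of a buffer at the given fold strength."""
    if buffer not in STOCKS_MM:
        raise KeyError(f"unknown buffer: {buffer}")
    if fold <= 0:
        raise ValueError("fold strength must be positive")
    # Linear dilution: concentration scales with fold / stock fold.
    return {
        component: conc * fold / STOCK_FOLD
        for component, conc in STOCKS_MM[buffer].items()
    }


if __name__ == "__main__":
    for name, fold in (("SSC", 5), ("SSC", 0.1), ("SSPE", 6), ("SSPE", 1)):
        print(f"{fold}x{name}:", working_concentrations(name, fold))
```

Under these assumptions, 6×SSPE corresponds to 900 mM NaCl, 60 mM NaH2PO4, and 6 mM EDTA, and the 0.1×SSC wash corresponds to 15 mM NaCl and 1.5 mM trisodium citrate.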
Unless otherwise indicated, the practice of the present disclosure can employ conventional techniques of immunology, molecular biology, microbiology, cell biology and recombinant DNA. These methods are described in the following publications. See, e.g., Sambrook, et al. Molecular Cloning: A Laboratory Manual, 2nd Edition (1989); F. M. Ausubel, et al. eds., Current Protocols in Molecular Biology, (1987); the series Methods in Enzymology (Academic Press, Inc.); M. MacPherson, et al., PCR: A Practical Approach, IRL Press at Oxford University Press (1991); MacPherson et al., eds. PCR 2: A Practical Approach, (1995); Harlow and Lane, eds. Antibodies, A Laboratory Manual, (1988); and R. I. Freshney, ed. Animal Cell Culture (1987).
This disclosure refers to several references, including articles, books, conference proceedings, and other publications. Each of these references is incorporated by reference herein in its entirety.
Certain implementations are described herein, including the best mode known to the inventors for carrying out implementations of the disclosure. Of course, variations on these described implementations will become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for implementations to be practiced otherwise than specifically described herein. Accordingly, the scope of this disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by implementations of the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
This application claims priority to U.S. Provisional Patent Application No. 63/471,495, filed on Jun. 6, 2023, which is incorporated by reference herein in its entirety.
This invention was made with government support under Grant No. 1R01HG012303, awarded by the National Human Genome Research Institute, and Grant No. 1R61DA047010, awarded by the National Institute on Drug Abuse. The government has certain rights in the invention.
Number | Date | Country
---|---|---
63471495 | Jun 2023 | US