The present invention generally relates to methods of selectively altering gene expression within, for example, insulated neighborhoods formed by the looping of two CTCF interaction sites occupied by cohesin.
The specification includes lengthy Tables: Table S1E and Table S2A. Lengthy Table S1E has been submitted via EFS-Web in electronic format as follows: File name: S1ETBL.txt, Date created: Sep. 30, 2016; File size: 2,482,827 Bytes and is incorporated herein by reference in its entirety. Lengthy Table S2A has been submitted via EFS-Web in electronic format as follows: File name: S2ATBL.txt, Date created: Sep. 30, 2016; File size: 360,209 Bytes and is incorporated herein by reference in its entirety.
Please refer to the end of the specification for access instructions.
Embryonic stem cells depend on active transcription of genes that play prominent roles in pluripotency (ES cell identity genes) and on repression of genes encoding lineage-specifying developmental regulators (Ng and Surani, 2011; Orkin and Hochedlinger, 2011; Young, 2011). The master transcription factors (TFs) OCT4, SOX2, and NANOG (OSN) form super-enhancers at most cell identity genes, including those encoding the master TFs themselves; these super-enhancers contain exceptional levels of transcription apparatus and drive high-level expression of associated genes (Hnisz et al., 2013; Whyte et al., 2013).
Maintenance of the pluripotent ESC state also requires that genes encoding lineage-specifying developmental regulators remain repressed, as expression of these genes can stimulate differentiation and thus loss of ESC identity. These repressed lineage-specifying genes are occupied by polycomb group proteins in ESCs (Boyer et al., 2006; Lee et al., 2006; Margueron and Reinberg, 2011; Squazzo et al., 2006). The ability to express or repress these key genes in a precise and sustainable fashion is thus essential to maintaining ESC identity.
Recent pioneering studies of mammalian chromosome structure have suggested that chromosomes are organized into a hierarchy of units, which include topologically associating domains (TADs) and gene loops (
TADs, also known as topological domains, are defined by DNA-DNA interaction frequencies, and their boundaries are regions across which relatively few DNA-DNA interactions occur (Dixon et al., 2012; Nora et al., 2012). TADs average 0.8 Mb, contain approximately seven protein-coding genes, and have boundaries that are shared by the different cell types of an organism (Dixon et al., 2012; Smallwood and Ren, 2013). The expression of genes within a TAD is somewhat correlated, and thus some TADs tend to have active genes and others tend to have repressed genes (Cavalli and Misteli, 2013; Gibcus and Dekker, 2013; Nora et al., 2012).
Gene loops and other structures within TADs are thought to reflect the activities of transcription factors (TFs), cohesin, and CTCF (Baranello et al., 2014; Gorkin et al., 2014; Phillips-Cremins et al., 2013; Seitan et al., 2013; Zuin et al., 2014). The structures within TADs include cohesin-associated enhancer-promoter loops that are produced when enhancer-bound TFs bind cofactors such as Mediator that, in turn, bind RNA polymerase II at promoter sites (Lee and Young, 2013; Lelli et al., 2012; Roeder, 2005; Spitz and Furlong, 2012). The cohesin-loading factor NIPBL binds Mediator and loads cohesin at these enhancer-promoter loops (Kagey et al., 2010). Cohesin also becomes associated with CTCF-bound regions of the genome, and some of these cohesin-associated CTCF sites facilitate gene activation while others may function as insulators (Dixon et al., 2012; Parelho et al., 2008; Phillips-Cremins and Corces, 2013; Seitan et al., 2013; Wendt et al., 2008).
The chromosome structures anchored by Mediator and cohesin are thought to be mostly cell-type-specific, whereas those anchored by CTCF and cohesin tend to be larger and shared by most cell types (Phillips-Cremins et al., 2013; Seitan et al., 2013). Despite this picture of cohesin-associated enhancer-promoter loops and cohesin-associated CTCF loops, we do not yet understand the relationship between the transcriptional control of cell identity and the sub-TAD structures of chromosomes that may contribute to this control. Furthermore, there is limited evidence that the integrity of sub-TAD structures is important for normal expression of genes located in the vicinity of these structures.
To gain insights into the cohesin-associated chromosome structures that may contribute to the control of pluripotency in ESCs, we generated a large cohesin ChIA-PET data set and integrated this with other genome-wide data to identify local structures across the genome.
The results show that super-enhancer-driven cell identity genes and repressed genes encoding lineage-specifying developmental regulators occur within insulated neighborhoods formed by the looping of two CTCF interaction sites occupied by cohesin.
Perturbation of these structures demonstrates that their integrity is important for normal expression of genes located in the vicinity of the neighborhoods.
The present disclosure provides compositions and methods for regulating gene expression in a directed fashion.
In one embodiment is provided a method of altering the expression of a gene in an insulated neighborhood (IN) of the genome of a cell comprising contacting an organism comprising said cell with a gene modulatory molecule. Such molecules include, but are not limited to, small molecules, lipids, proteins, peptides, nucleic acids, such as RNA, DNA or any modified version thereof, and combinations thereof.
In one embodiment, expression of the gene is increased.
In one embodiment, the cell is selected from the group consisting of stem cells, bone marrow cells, testis cells, olfactory cells, lung cells, thymus cells, cells of the central nervous system, cells of the brain, spleen cells, MEF cells, MEL cells, heart cells, somatic cells of the limbs, liver cells, and kidney cells.
In one embodiment, the cells are stem cells and said stem cells are embryonic stem cells.
In one embodiment, the insulated neighborhood comprises a topologically associating domain (TAD).
In one embodiment, the topologically associating domain is a super-enhancer domain (SD), and such SDs may be selected from any known SD or any disclosed herein, such as those in Tables S4A and S4B.
In one embodiment, the gene is selected from the group consisting of those in Table S4C.
In one embodiment is provided a method of altering the expression of a gene located in an insulated neighborhood (IN) of the genome of a cell comprising altering the sequence of one or more of the CTCF boundaries of said insulated neighborhood.
In one embodiment, the CTCF boundary is altered via CRISPR technology.
Such alteration may involve either or both of the boundaries of the insulated neighborhood.
Additional embodiments of the present compositions and methods, and the like, will be apparent from the following description, drawings, examples, and claims. As can be appreciated from the foregoing and following description, each and every feature described herein, and each and every combination of two or more of such features, is included within the scope of the present disclosure provided that the features included in such a combination are not mutually inconsistent. In addition, any feature or combination of features may be specifically excluded from any embodiment of the present invention. Additional aspects and advantages of the present invention are set forth in the following description and claims, particularly when considered in conjunction with the accompanying examples and drawings.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Provided herein are compositions and methods for the controlled or selected regulation of gene expression such as those genes found in insulated neighborhoods within the genome.
As used herein, an “insulated neighborhood” is a region of a chromosome bounded by one or more markers.
Modulation of gene expression in an insulated neighborhood can be effected by administration of a gene modulatory compound.
In one embodiment, administration of a gene modulatory compound increases the level of gene expression by 5%, 10%, 15%, 20%, 25%, 30%, 33%, 35%, 40%, 45%, 50%, 52%, 55%, 60%, 65%, 67%, 69%, 70%, 74%, 75%, 76%, 77%, 80%, 85%, 90%, 95% or more than 95%.
In one embodiment, gene expression may be increased by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 1-5, 1-10, 1-20, 1-30, 1-40, 1-50, 2-5, 2-10, 2-20, 2-30, 2-40, 2-50, 3-5, 3-10, 3-20, 3-30, 3-40, 3-50, 4-6, 4-10, 4-20, 4-30, 4-40, 4-50, 5-7, 5-10, 5-20, 5-30, 5-40, 5-50, 6-8, 6-10, 6-20, 6-30, 6-40, 6-50, 7-10, 7-20, 7-30, 7-40, 7-50, 8-10, 8-20, 8-30, 8-40, 8-50, 9-10, 9-20, 9-30, 9-40, 9-50, 10-20, 10-30, 10-40, 10-50, 20-30, 20-40, 20-50, 30-40, 30-50 or 40-50 times the wild type level or such level as is presented by a subject having a disease or disorder associated with the aberrant expression of that gene.
Understanding how the ESC pluripotency gene expression program is regulated is of considerable interest because it provides the foundation for understanding gene control in all cells. There is much evidence that cohesin and CTCF have roles in connecting gene regulation and chromosome structure in ESCs (Cavalli and Misteli, 2013; Dixon et al., 2012; Gibcus and Dekker, 2013; Gorkin et al., 2014; Merkenschlager and Odom, 2013; Phillips-Cremins and Corces, 2013; Phillips-Cremins et al., 2013; Sanyal et al., 2012; Sofueva et al., 2013), but there is limited knowledge of these structures across the genome and scant functional evidence that specific structures actually contribute to the control of important ESC genes.
We describe here organizing principles that explain how a key set of cohesin-associated chromosome structures contributes to the ESC gene expression program. To gain insights into the relationship between transcriptional control of cell identity and control of chromosome structure, we carried out cohesin ChIA-PET and focused the analysis on loci containing super-enhancers, which drive expression of key cell identity genes.
We found that the majority of super-enhancers and their associated genes occur within large loops that are connected through interacting CTCF sites co-occupied by cohesin. These super-enhancer domains, or SDs, typically contain one super-enhancer that loops to one gene within the SD. The SDs appear to restrict super-enhancer activity to genes within the SD because the cohesin ChIA-PET interactions occur primarily within the SD and loss of a CTCF boundary tends to cause inappropriate activation of nearby genes located outside that boundary.
The proper association of super-enhancers and their target genes in such “insulated neighborhoods” is of considerable importance, as the mistargeting of a single super-enhancer is sufficient to cause leukemia (Groschel et al., 2014). The cohesin ChIA-PET data and perturbation of CTCF sites suggest that genes that encode repressed, lineage-specifying, developmental regulators also occur within insulated neighborhoods in ESCs. Maintenance of the pluripotent ESC state requires that genes encoding lineage-specifying developmental regulators are repressed, and these repressed lineage-specifying genes are occupied by nucleosomal histones that carry the polycomb mark H3K27me3 (Boyer et al., 2006; Bracken et al., 2006; Lee et al., 2006; Nègre et al., 2006; Schwartz et al., 2006; Squazzo et al., 2006; Tolhuis et al., 2006).
The majority of these genes were found to be located within a cohesin-associated CTCF-CTCF loop, which we call a polycomb domain (PD). The perturbation of CTCF PD boundary sites caused derepression of the polycomb-bound gene within the PD, suggesting that these boundaries are important for maintenance of gene repression within the PD. CTCF has previously been shown to be associated with boundary formation, insulator activity, and transcriptional regulation (Bell et al., 1999; Denholtz et al., 2013; Felsenfeld et al., 2004; Handoko et al., 2011; Kim et al., 2007; Phillips and Corces, 2009; Schwartz et al., 2012; Sexton et al., 2012; Soshnikova et al., 2010; Valenzuela and Kamakaka, 2006).
Previous reports have also demonstrated that cohesin and CTCF are associated with large loop substructures within TADs, whereas cohesin and Mediator are associated with smaller loop structures that sometimes form within the CTCF-bound loops (de Wit et al., 2013; Phillips-Cremins et al., 2013; Sofueva et al., 2013). CTCF-bound domains have been proposed to confine the activity of enhancers to specific target genes, thus yielding proper tissue-specific expression of genes (DeMare et al., 2013; Handoko et al., 2011; Hawkins et al., 2011).
Our genome-wide study extends these observations by connecting such structures with the transcriptional control of specific super-enhancer-driven and polycomb-repressed cell identity genes and by showing that these structures can contribute to the control of genes both inside and outside of the insulated neighborhoods that contain key pluripotency genes.
The organization of key cell identity genes into insulated neighborhoods may be a property common to all mammalian cell types. Indeed, several recent studies have identified CTCF-bound regions whose function is consistent with ESC SDs (Guo et al., 2011; Wang et al., 2014).
For example, in T cell acute lymphocytic leukemia, Notch1 activation leads to increased expression of a super-enhancer-driven gene found between two CTCF sites that are structurally connected but does not affect genes located outside of the two CTCF sites (Wang et al., 2014).
Future studies addressing the mechanisms that regulate loop formation should provide additional insights into the relationships between transcriptional control of cell identity genes and control of local chromosome structure.
The following examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the drawings.
Compounds useful in the invention include those described herein in any of their pharmaceutically acceptable forms, including isomers such as diastereomers and enantiomers, salts, solvates, and polymorphs, as well as racemic mixtures and pure isomers of the compounds described herein, where applicable.
While a number of exemplary aspects and embodiments have been discussed herein, those of skill in the art will recognize certain modifications, permutations, additions and sub-combinations thereof. It is therefore intended that the following appended claims and claims hereafter introduced are interpreted to include all such modifications, permutations, additions and sub-combinations as are within their true spirit and scope.
All patents, patent applications, patent publications, scientific articles and the like, cited or identified in this application are hereby incorporated by reference in their entirety in order to describe more fully the state of the art to which the present application pertains.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments in accordance with the invention described herein. The scope of the present invention is not intended to be limited to the above Description, but rather is as set forth in the appended claims.
In the claims, articles such as “a,” “an,” and “the” may mean one or more than one unless indicated to the contrary or otherwise evident from the context. Claims or descriptions that include “or” between one or more members of a group are considered satisfied if one, more than one, or all of the group members are present in, employed in, or otherwise relevant to a given product or process unless indicated to the contrary or otherwise evident from the context. The invention includes embodiments in which exactly one member of the group is present in, employed in, or otherwise relevant to a given product or process. The invention includes embodiments in which more than one, or all, of the group members are present in, employed in, or otherwise relevant to a given product or process.
It is also noted that the term “comprising” is intended to be open and permits but does not require the inclusion of additional elements or steps. When the term “comprising” is used herein, the term “consisting of” is thus also encompassed and disclosed.
Where ranges are given, endpoints are included. Furthermore, it is to be understood that unless otherwise indicated or otherwise evident from the context and understanding of one of ordinary skill in the art, values that are expressed as ranges can assume any specific value or subrange within the stated ranges in different embodiments of the invention, to the tenth of the unit of the lower limit of the range, unless the context clearly dictates otherwise.
In addition, it is to be understood that any particular embodiment of the present invention that falls within the prior art may be explicitly excluded from any one or more of the claims. Since such embodiments are deemed to be known to one of ordinary skill in the art, they may be excluded even if the exclusion is not set forth explicitly herein. Any particular embodiment of the compositions of the invention (e.g., any nucleic acid or protein encoded thereby; any method of production; any method of use; etc.) can be excluded from any one or more claims, for any reason, whether or not related to the existence of prior art.
All cited sources, for example, references, publications, databases, database entries, and art cited herein, are incorporated into this application by reference, even if not expressly stated in the citation. In case of conflicting statements of a cited source and the instant application, the statement in the instant application shall control.
Section and table headings are not intended to be limiting.
The following examples are illustrative in nature and are in no way intended to be limiting.
V6.5 murine ESCs were grown on irradiated murine embryonic fibroblasts (MEFs) under standard ESC conditions, as described previously (Whyte et al., 2012). Cells were grown on 0.2% gelatinized (Sigma, G1890) tissue culture plates in ESC media: DMEM-KO (Invitrogen, 10829-018) supplemented with 15% fetal bovine serum (Hyclone, characterized SH3007103), 1,000 U/ml LIF (ESGRO, ESG1106), 100 μM nonessential amino acids (Invitrogen, 11140-050), 2 mM L-glutamine (Invitrogen, 25030-081), 100 U/ml penicillin, 100 μg/ml streptomycin (Invitrogen, 15140-122), and 8 nl/ml of 2-mercaptoethanol (Sigma, M7522).
The CRISPR/Cas9 system was used to create ESC lines with CTCF site deletions. Target-specific oligonucleotides were cloned into a plasmid carrying a codon-optimized version of Cas9 (pX330, Addgene: 42230). The genomic sequences complementary to guide RNAs in the genome editing experiments were:
Cells were transfected with two plasmids expressing Cas9 and sgRNA targeting regions around 200 base pairs up- and downstream of the CTCF binding site, respectively. A plasmid expressing PGK-puroR was also cotransfected, using X-fect reagent (Clontech) according to the manufacturer's instructions. One day after transfection, cells were replated on DR4 MEF feeder layers. One day after replating, puromycin (2 μg/ml) was added for 3 days. Subsequently, puromycin was withdrawn for 3-4 days. Individual colonies were picked and genotyped by PCR.
For the Prdm14 (C1-2), mir-290-295, Pou5f1 and Nanog SDs and Tcfap2e (C1) PD boundary CTCF site deletions, at least two independent clones were expanded and analyzed. Data on
PRDM14 Locus Reference Sequence:
PRDM14 C1-2 Deletion Allele Sequence:
MIR290 Locus Reference Sequence:
MIR290 C1 Deletion Allele Sequence:
POU5F1 Locus Reference Sequence:
POU5F1 C1 Deletion Allele Sequence:
NANOG Locus Reference Sequence:
NANOG C1 Deletion Allele Sequence:
TDGF1 Locus Reference Sequence:
TDGF1 C1 Deletion Allele Sequence:
TCFAP2E Locus Reference Sequence (C1):
TCFAP2E C1 Deletion Allele Sequence:
TCFAP2E Locus Reference Sequence (C2):
TCFAP2E C2 Deletion Allele Sequence:
The CTCF-deletion lines at the Pou5f1 and Prdm14 (C1-2) loci are heterozygous, while the CTCF-deletion lines at the Nanog, Tdgf1, Prdm14 (C1) and miR-290-295 loci are homozygous for the mutation.
Gene Expression Analysis
ESC lines were split off MEFs for two passages. RNA was isolated using Trizol reagent (Invitrogen) or RNeasy purification kit (Promega), and reverse transcribed using oligo-dT primers and SuperScript III reverse transcriptase (Invitrogen) according to the manufacturers' instructions. Quantitative real-time PCR was performed on a 7000 AB Detection System using the following Taqman probes, according to the manufacturer's instructions (Applied Biosystems).
Based on RNA-seq data (Shen et al., 2012), the genes are expressed at the following levels prior to deletion of the CTCF site:
Pou5f1: 79.4 RPKM (rank among 24,827 Refseq transcripts: 232, top 1%)
Prdm14: 2.21 RPKM (rank: 9,745, 39th %)
Slco5a1: 0.93 RPKM (rank: 12,277, 50th %)
miR-295: 18.9 RPKM (rank: 1,902, 8th %)
H2-Q10: 0.48 RPKM (rank: 13,782, 56th %)
Tcf19: 1.03 RPKM (rank: 12,011, 49th %)
Nlrp12: 0.06 RPKM (rank: 17,108, 69th %)
AU018091: 17.1 RPKM (rank: 2,150, 9th %)
Myadm: 14.6 RPKM (mean of multiple splice isoforms) (rank: 2,610, 11th %)
Dppa3: 25 RPKM (rank: 1,320, 5th %)
Tdgf1: 92 RPKM (rank: 167, top 1%)
Lrrc2: 1.2 RPKM (rank: 10,292, 42nd %)
Rtp3: 0.01 RPKM (rank: 14,587, 59th %)
Sox2: 122 RPKM (rank: 100, top 1%)
Nanog: 122 RPKM (rank: 99, top 1%)
Pax6: 0.07 RPKM (rank: 16,941, 68th %)
Gata6: 0.25 RPKM (rank: 14,981, 60th %)
Sox17: 0.15 RPKM (rank: 15,754, 64th %)
Psmb2: 85 RPKM (rank: 203, top 1%)
Tcfap2e: 0.19 RPKM (rank: 15,402, 62nd %)
Ncdn: 3.19 RPKM (rank: 8,388, 24th %)
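The percentile figures in the list above relate each rank to the 24,827 ranked RefSeq transcripts. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the rounding convention used for the listed percentiles is not stated in the text, so the sketch reports unrounded values.

```python
TOTAL_TRANSCRIPTS = 24_827  # number of ranked RefSeq transcripts stated in the text


def rank_percentile(rank, total=TOTAL_TRANSCRIPTS):
    """Express an expression rank as a percentage of all ranked transcripts."""
    if not 1 <= rank <= total:
        raise ValueError("rank must lie between 1 and total")
    return 100.0 * rank / total


# Ranks taken from the list above; lower percentages mean higher expression.
for gene, rank in [("Pou5f1", 232), ("Prdm14", 9_745), ("Tcfap2e", 15_402)]:
    print(f"{gene}: {rank_percentile(rank):.2f}% of transcripts rank at or above it")
```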
Purified DNA from a H3K27me3 ChIP was used to prepare a library for Illumina sequencing. The library was prepared following the Illumina TruSeq DNA Sample Preparation v2 kit protocol as previously described (Whyte et al., 2012).
All ChIP-Seq data sets were aligned using Bowtie (version 0.12.2) (Langmead et al., 2009) to build version MM9 of the mouse genome with the parameters “-k 1 -m 1 -n 2”. Data sets used in this manuscript can be found in Table S6. We used the MACS version 1.4.2 (model-based analysis of ChIP-seq) (Zhang et al., 2008) peak finding algorithm to identify regions of ChIP-seq enrichment over input DNA control. A p-value enrichment threshold of 1e-09 was used for all data sets. For the histone modification H3K27me3, whose signal tends to be broad across large genomic regions, we used MACS (Zhang et al., 2008) with the parameters “-p 1e-09 -no-lambda -no-model”. UCSC Genome Browser (Kent et al., 2002) tracks were generated using MACS wiggle outputs with parameters “-w -S -space=50”.
Enrichment Heatmap
All gene-centric analyses in ESCs were performed using mouse (mm9/NCBI37) RefSeq annotations downloaded from the UCSC genome browser (genome.ucsc.edu). For counting purposes and for assignment of enhancers to target genes (Table S2A-C), we collapsed multiple identical TSSs into one gene-level TSS. Genes were separated into classes of activity as follows: A gene was defined as active if an enriched region for either H3K4me3 or RNA Pol II was located within +/−2.5 kb of the TSS and no enriched region for H3K27me3 was located within the same window. H3K4me3 is a histone modification associated with transcription initiation (Guenther et al., 2007). A gene was defined as Polycomb-occupied if an enriched region for H3K27me3 (representing Polycomb complexes) but not RNA Pol II was located within +/−2.5 kb of the TSS. H3K27me3 is a histone modification associated with Polycomb complexes (Boyer et al., 2006; Lee et al., 2006). A gene was defined as silent if enriched regions for H3K4me3, H3K27me3, and RNA Pol II were all absent from +/−2.5 kb of the TSS. Remaining genes to which we were unable to assign a state were left as unclassified. Overall, there were 15,312 unique active TSSs, 1,091 unique Polycomb-occupied TSSs, 8,477 unique silent TSSs, and 616 unclassified TSSs in mouse ES cells.
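The activity classes defined above can be expressed as a small decision function. The following is a minimal sketch, assuming that the presence or absence of each enriched region within +/−2.5 kb of the TSS has already been determined; the function name and the boolean inputs are hypothetical and are not taken from the analysis code actually used.

```python
def classify_tss(h3k4me3, pol2, h3k27me3):
    """Assign an activity class from three booleans, each stating whether an
    enriched region for that mark lies within +/-2.5 kb of the TSS."""
    if (h3k4me3 or pol2) and not h3k27me3:
        return "active"
    if h3k27me3 and not pol2:
        return "Polycomb-occupied"
    if not (h3k4me3 or pol2 or h3k27me3):
        return "silent"
    # Under these rules, what remains is H3K27me3 together with RNA Pol II.
    return "unclassified"


print(classify_tss(h3k4me3=True, pol2=True, h3k27me3=False))
print(classify_tss(h3k4me3=False, pol2=True, h3k27me3=True))
```

Under these rules, the only combination left unclassified is a TSS carrying both an H3K27me3 region and an RNA Pol II region.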
Co-occupancy of ESC genomic sites by the OCT4, SOX2, and NANOG transcription factors is highly predictive of enhancer activity (Chen et al., 2008) and Mediator is typically associated with these sites (Kagey et al., 2010). We first pooled the reads of ChIP-seq profiles of transcription factors OCT4, SOX2, and NANOG, which were performed in parallel, to create a merged “OSN” ChIP-seq experiment (Whyte et al., 2013). These reads were processed by MACS to create an OSN binding profile for visualization. To define active enhancers, we first identified enriched regions for the merged “OSN” ChIP-seq read pool, and for both Mediator complex components MED1 and MED12 using MACS. Then we used the union of these five sets of enriched ChIP-Seq regions that fell outside of promoters (i.e., regions not overlapping the ±2.5 kb region flanking the RefSeq transcriptional start sites) as putative enhancers.
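The promoter-exclusion step can be illustrated as follows. This is a minimal sketch, assuming enriched regions are given as (chromosome, start, end) tuples with half-open coordinates and TSS positions are given per chromosome; all names are hypothetical, and overlapping regions from different sets are pooled rather than merged, which may differ from the procedure actually used.

```python
PROMOTER_FLANK = 2_500  # +/-2.5 kb around each RefSeq TSS, as stated in the text


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least 1 bp."""
    return a_start < b_end and b_start < a_end


def putative_enhancers(region_sets, tss_by_chrom, flank=PROMOTER_FLANK):
    """Pool enriched regions from several ChIP-seq sets and keep those that
    do not overlap any promoter window."""
    pooled = set()
    for regions in region_sets:
        pooled.update(regions)
    kept = []
    for chrom, start, end in sorted(pooled):
        windows = [(tss - flank, tss + flank) for tss in tss_by_chrom.get(chrom, [])]
        if not any(overlaps(start, end, w_start, w_end) for w_start, w_end in windows):
            kept.append((chrom, start, end))
    return kept


# Toy data, invented for illustration only.
osn = [("chr1", 100, 200), ("chr1", 10_000, 10_500)]
med1 = [("chr1", 10_000, 10_500), ("chr1", 50_000, 50_400)]
print(putative_enhancers([osn, med1], {"chr1": [1_000]}))
```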
All ChIA-PET datasets were processed with a method adapted from a previous computational pipeline (Li et al., 2010). The raw sequences were analyzed for linker barcode composition and separated into non-chimeric PETs with homodimeric linkers (AA or BB linkers) derived from specific ligation products, or chimeric PETs (AB linkers) with heterodimeric linkers derived from nonspecific ligation products. We trimmed the PETs immediately before a perfect match of the first 10 nt of the linker sequences (Linker A with CTGCTGTCCG; Linker B with CTGCTGTCAT). After removing the linkers, only the 5′ ends of the trimmed PETs of at least 27 bp were retained, because the restriction enzyme EcoP15I cuts 27 bp away from its recognition sequence.
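The linker handling described above can be sketched as follows. This is an illustration only, assuming each read consists of a genomic tag followed directly by its linker and that the first 27 nt of each sufficiently long tag are kept; the function names are hypothetical and the sketch is not the adapted pipeline itself.

```python
LINKER_A = "CTGCTGTCCG"  # first 10 nt of linker A, from the text
LINKER_B = "CTGCTGTCAT"  # first 10 nt of linker B, from the text
MIN_TAG_LENGTH = 27      # EcoP15I cuts 27 bp from its recognition sequence


def trim_tag(read):
    """Return (tag, linker_label), or None if no linker is found or the tag
    preceding the linker is shorter than 27 nt."""
    hits = [(read.find(seq), label)
            for seq, label in ((LINKER_A, "A"), (LINKER_B, "B"))
            if seq in read]
    if not hits:
        return None
    position, label = min(hits)  # earliest perfect linker match
    tag = read[:position]
    if len(tag) < MIN_TAG_LENGTH:
        return None
    return tag[:MIN_TAG_LENGTH], label


def classify_pet(read1, read2):
    """Label a read pair as non-chimeric (AA or BB) or chimeric (AB)."""
    first, second = trim_tag(read1), trim_tag(read2)
    if first is None or second is None:
        return None
    kind = "non-chimeric" if first[1] == second[1] else "chimeric"
    return kind, first[0], second[0]


tag = "ACGT" * 7  # 28-nt toy tag, invented for illustration
print(classify_pet(tag + LINKER_A, tag + LINKER_B))
```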
The sequences of the two ends of PETs were separately mapped to the mm9 mouse genome using the bowtie algorithm with the option “-k 1 -m 1 -v 1” (Langmead et al., 2009). These criteria retained only the uniquely mapped reads, with at most a single mismatch for further analysis. Aligned reads were paired with mates using read identifiers and, to remove PCR bias artifacts, were filtered for redundancy. PETs with identical genomic coordinates and strand information at both ends were collapsed into a single PET. The PETs were further categorized into intrachromosomal PETs, where the two ends of a PET were on the same chromosome, and interchromosomal PETs, where the two ends were on different chromosomes. The two ends of all non-chimeric PETs were used to call PET peaks that represent local enrichment of the PET sequence coverage by using MACS 1.4.2 (Zhang et al., 2008) with the parameters “-p 1e-09 -no-lambda -no-model -keepdup=2”.
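The redundancy filter and the split into intrachromosomal and interchromosomal PETs can be sketched as follows. This is a minimal illustration, assuming each PET is a pair of (chromosome, position, strand) tuples whose two ends are always listed in the same order; the function name is hypothetical.

```python
def collapse_and_categorize(pets):
    """Collapse PETs with identical coordinates and strands at both ends,
    then split them by whether both ends lie on the same chromosome."""
    unique = set(pets)  # identical (end1, end2) pairs collapse to one PET
    intra = sorted(p for p in unique if p[0][0] == p[1][0])
    inter = sorted(p for p in unique if p[0][0] != p[1][0])
    return intra, inter


# Toy PETs, invented for illustration only.
pets = [
    (("chr1", 100, "+"), ("chr1", 9_000, "-")),
    (("chr1", 100, "+"), ("chr1", 9_000, "-")),  # duplicate, likely PCR artifact
    (("chr1", 100, "+"), ("chr2", 500, "+")),
]
intra, inter = collapse_and_categorize(pets)
print(len(intra), len(inter))
```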
Chimeric PETs with heterodimeric linkers can be used to estimate the degree of noise in the ChIA-PET dataset. 7% of paired-end ligations involved heterodimeric linkers (AB linkers; Table S1A). Since the frequency of ligations involving heterodimeric linkers (AB linkers) gave an estimate of non-specific homodimeric ligations (AA or BB linkers), we estimated that less than 14% of total homodimeric ligations (AA and BB linkers) were nonspecific. We also counted the chimeric PETs that overlapped with PET peaks at both ends by at least 1 bp. These chimeric PETs represented “non-specific” chromatin interactions. We found that more than 99.8% of the “non-specific” chromatin interactions derived from chimeric PETs overlapping with PET peaks had only 1 chimeric PET; 0.1% of the “non-specific” interactions had 2 chimeric PETs. We thus used a 3 PET cut-off for our high-confidence interactions (Figure S1F). Since contact frequency is expected to scale inversely with genomic distance, we examined the relationship between PET frequency and the genomic distance between the two ends of intrachromosomal PETs. The frequency of non-chimeric PETs with homodimeric linkers was plotted over genomic span in increments of 100 bp (Figure S1E). The scatter plot suggested two populations within intrachromosomal PETs and showed that the vast majority of these PETs were within 4 kb (Figure S1E). We thus used a 4 kb cutoff to remove those PETs that may originate from self-ligation of DNA ends from a single chromatin fragment in the ChIA-PET procedure. In contrast, chimeric PETs with heterodimeric linkers did not show an inverse relationship with genomic distance (Figure S1E, Table S1A).
To identify long-range chromatin interactions, we first removed intrachromosomal PETs of length <4 kb because these PETs may originate from self-ligation of DNA ends from a single chromatin fragment in the ChIA-PET procedure (Figure S1E). We next identified PETs where each end overlapped with a different PET peak (overlap of at least 1 bp).
Operationally, these PETs were defined as putative interactions. Applying a statistical model based upon the hypergeometric distribution identified high-confidence interactions, representing high-confidence physical linking between the PET peaks. Specifically, we first counted the number of PETs originating from each PET peak. We then asked, given the numbers of PETs originating from any two PET peaks, what was the likelihood of seeing the observed number of PETs linking the two PET peaks, using a hypergeometric distribution to generate a p-value for each potential interaction. To correct for multiple hypothesis testing, we derived a background distribution for p-values of interactions through random shuffling of the links between PET ends. Using this background distribution, we controlled the number of false positives in our interaction set by setting a p-value cutoff threshold such that only the top 1% of simulated interactions from the background dataset would be called significant. This threshold, which we term the false positive likelihood in figure legends, was then applied to the actual data. Unlike the Benjamini-Hochberg procedure (Benjamini, 1995), this method did not make any assumption about the distribution of p-values; both methods for multiple hypothesis testing yielded a similar number of interactions (Noble, 2009). For each of the two SMC1 ChIA-PET replicates, two independent PETs were required to call high-confidence interactions between pairs of interacting sites (Table S1C, S1D (not shown); merged data in Table S1E). For the merged SMC1 ChIA-PET dataset, non-chimeric PETs from two replicates were pooled together and three independent PETs were required to call high-confidence interactions (Table S1E).
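The two statistical steps above can be illustrated with a short sketch. The exact parameterization of the hypergeometric model is not given in the text, so the sketch assumes one common choice: the population is the total number of PETs, the "successes" are PETs originating from the first peak, the draws are PETs originating from the second peak, and the p-value is the upper tail at the observed number of linking PETs. All function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k_linking, total_pets, pets_peak_a, pets_peak_b):
    """P(X >= k_linking) for X ~ Hypergeometric(total_pets, pets_peak_a, pets_peak_b)."""
    upper = min(pets_peak_a, pets_peak_b)
    denominator = comb(total_pets, pets_peak_b)
    numerator = sum(
        comb(pets_peak_a, x) * comb(total_pets - pets_peak_a, pets_peak_b - x)
        for x in range(k_linking, upper + 1)
    )
    return numerator / denominator


def empirical_cutoff(background_pvalues, fraction=0.01):
    """Return the p-value cutoff at which only the given fraction of the
    shuffled (background) interactions would be called significant."""
    ranked = sorted(background_pvalues)
    n_called = max(1, int(len(ranked) * fraction))
    return ranked[n_called - 1]


# Toy numbers, invented for illustration only.
p = hypergeom_upper_tail(k_linking=3, total_pets=10_000, pets_peak_a=40, pets_peak_b=50)
cutoff = empirical_cutoff([i / 1000 for i in range(1, 1001)])
print(p, cutoff, p <= cutoff)
```

Deriving the cutoff from shuffled links in this way ties the threshold to the data's own background rather than to an assumed p-value distribution.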
To determine the degree of saturation within our ChIA-PET library (Figure S1H), we modeled the number of sampled genomic positions as a function of sequencing depth by the Michaelis-Menten model. Intrachromosomal PETs with a distance span above our self-ligation cutoff of 4 kb were subsampled at varying depths, and the number of unique genomic positions (defined as the start and end coordinates of the paired PETs) that they occupy were counted. Model fitting using non-linear least-squares regression suggested that we have sampled approximately 70% of the available intrachromosomal PET space, encompassing 2.22/3.17 million positions (Figure S1H).
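The saturation estimate rests on the Michaelis-Menten form, in which the number of unique positions approaches a plateau as sequencing depth grows. The sketch below shows only the model and the ratio reported above; the parameter names are hypothetical, and the non-linear least-squares fit itself is not reproduced.

```python
def michaelis_menten(depth, plateau, half_depth):
    """Expected number of unique positions at a given sequencing depth.
    `plateau` is the total available positions; `half_depth` is the depth at
    which half of the plateau has been sampled."""
    return plateau * depth / (half_depth + depth)


def saturation_fraction(observed_positions, plateau):
    """Fraction of the estimated available positions already sampled."""
    return observed_positions / plateau


# Values from the text: 2.22 million of an estimated 3.17 million positions.
print(f"{saturation_fraction(2.22e6, 3.17e6):.2f}")
```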
We considered whether ChIA-PET data limitations might limit detection of longer range interactions. If sparseness of data were a significant problem, resulting in under-calling of long-range interactions, we would likely miss previously detected long-range interactions. Instead, we detect previously known long-range interactions, e.g. the interaction between Sonic Hedgehog (Shh) and its enhancer in the intron of the nearby Lmbr1 gene (1 Mb away), interactions between the HoxD gene cluster and its distal regulatory sequences (>300 kb away), and interactions between the HoxA gene cluster and its distal regulatory sequences (>500 kb away) (Lehoczky et al., 2004; Lettice et al., 2003; Spitz et al., 2003).
Saturation analysis suggested that each of the two SMC1 ChIA-PET replicates sampled only ~50% of the available intrachromosomal PET space (data not shown). We thus investigated the reproducibility of SMC1 ChIA-PET replicates by examining how often high-confidence interactions from one of the two SMC1 ChIA-PET replicates were supported by PET interactions from the other replicate. Operationally, we counted the percentage of high-confidence interactions from one replicate whose individual end reads overlapped with those from high-confidence interactions identified in the other replicate by at least 1 bp (Figure S1D).
To compare the replicates' genome-wide interaction frequency (Figure S1C), inter-chromosomal PETs and intra-chromosomal PETs below the self-ligation cutoff (4 kb) were filtered out. Each chromosome was partitioned into 10 kb bins and 21 symmetric two-dimensional matrices (all bins×all bins) were constructed for each replicate. These matrices were populated such that element a_{i,j} represented the number of PETs in that replicate with one end in bin i and the other in bin j. PET counts were separately normalized by the number of mapped reads in each replicate as well as the bin size*1000.
This resulted in an RPKM-like metric for all bins in both matrices. Figure S1C represents the relationship between the two replicates, where the X axis represents element a_{i,j} in replicate 1 and the Y axis represents element a_{i,j} in replicate 2. This relationship was also analyzed using the Pearson r.
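The binning and correlation steps might be sketched as below. This is a simplified illustration with hypothetical names. Two simplifications are assumed: only bin pairs populated in at least one replicate are compared (the full matrices would also include pairs that are empty in both), and the per-replicate normalization is omitted because rescaling one replicate by a constant does not change the Pearson r.

```python
from collections import Counter
from math import sqrt

BIN_SIZE = 10_000              # 10 kb bins, as in the text
SELF_LIGATION_CUTOFF = 4_000   # 4 kb self-ligation cutoff, as in the text

def bin_counts(pets):
    """Count PETs per (chrom, bin_i, bin_j) with bin_i <= bin_j.

    `pets` is an iterable of (chrom1, pos1, chrom2, pos2). Inter-chromosomal
    PETs and PETs spanning less than the self-ligation cutoff are skipped.
    """
    counts = Counter()
    for c1, p1, c2, p2 in pets:
        if c1 != c2 or abs(p2 - p1) < SELF_LIGATION_CUTOFF:
            continue
        i, j = sorted((p1 // BIN_SIZE, p2 // BIN_SIZE))
        counts[(c1, i, j)] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def replicate_correlation(pets1, pets2):
    """Pearson r over bin pairs populated in at least one replicate."""
    a, b = bin_counts(pets1), bin_counts(pets2)
    keys = sorted(set(a) | set(b))
    return pearson([a[k] for k in keys], [b[k] for k in keys])
```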
To identify the association of long-range chromatin interactions to different regulatory elements, we assigned the PET peaks of interactions to different regulatory elements, including active enhancers, promoters (+/−2.5 kb of the Refseq TSS), and CTCF ChIP-seq binding sites. Operationally, an interaction was defined as associated with the regulatory element if one of the two PET peaks of the interaction overlapped with the regulatory element by at least 1 base-pair.
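The 1 bp overlap rule can be sketched as follows, assuming half-open (chrom, start, end) intervals. The interval convention, the helper names, and the dictionary of element classes are illustrative assumptions rather than the representation actually used.

```python
def overlaps(a, b, min_bp=1):
    """True if half-open intervals (chrom, start, end) share >= min_bp bases."""
    return a[0] == b[0] and min(a[2], b[2]) - max(a[1], b[1]) >= min_bp

def promoter_window(chrom, tss, flank=2500):
    """Promoter region defined as +/- flank bp around a TSS."""
    return (chrom, max(0, tss - flank), tss + flank)

def annotate(interaction, elements):
    """Return labels of element classes touched by either PET peak.

    `interaction` is a pair of PET-peak intervals; `elements` maps a label
    (e.g. "promoter", "CTCF") to a list of intervals.
    """
    return {
        label
        for label, regions in elements.items()
        if any(overlaps(peak, r) for peak in interaction for r in regions)
    }
```

For example, a PET peak at chr1:12,400-13,000 overlaps the promoter window of a TSS at chr1:10,000 (7,500-12,500) by 100 bp, so the interaction would be labeled as promoter-associated.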
Our analysis identified 2,921 high-confidence interactions involving an enhancer (contains an OCT4/SOX2/NANOG or MED1 or MED12 enriched region and is not located within +/−2.5 kb of an annotated TSS) and a promoter (+/−2.5 kb of an annotated TSS) (
We identified 216 enhancer-promoter interactions that involved super-enhancers (Table S2B), as defined in (Whyte et al., 2013). The high-confidence enhancer-promoter interactions were used to assign super enhancers and typical enhancers to their target genes (Table S2B, S2C). Multiple enhancer constituents that are in close proximity can be computationally stitched together into enhancer regions (true for typical and super-enhancers) as described previously (Hnisz et al., 2013; Whyte et al., 2013).
We identified high confidence interactions overlapping with a super-enhancer or typical enhancer region at one end and a promoter (+/−2.5 kb of a TSS) at the other end (Table S2B, S2C). For 151 super-enhancers with sufficient interaction data, we found that 83% of enhancer assignments to the nearest active gene (including Polycomb-occupied genes) were confirmed/supported by high-confidence interactions.
For 1,477 typical enhancers with sufficient interaction data, we found that 87% of enhancer assignments to the nearest active gene (including Polycomb-occupied genes) were confirmed/supported by high-confidence interaction data.
Genome-wide average representations of ChIA-PET interactions at TADs were created by mapping high-confidence ChIA-PET interactions across TADs (Dixon et al., 2012) (
We next counted the interaction frequency between any two bins in each TAD to produce a 60 by 60 interaction matrix using a method as previously described in Dixon et al., 2012. The numbers in the interaction matrices represent interaction frequencies at the diagonals originating from two bins on the x- and y-axis. Average interaction frequencies across 2,200 TAD interaction matrices were calculated. The upper triangular matrix of the average interaction frequencies was displayed in the units of interactions per bin in
Typical enhancer and super-enhancer regions in murine embryonic stem cells were described previously (Hnisz et al., 2013; Whyte et al., 2013), and their genomic coordinates were downloaded (Table S2B, S2C). The 231 super enhancers were assigned to genes with a combination of ChIA-PET interactions and proximity to their nearest active transcriptional start sites (TSSs). We first used high-confidence SMC1 PET interactions (FDR 0.01, 3 PETs) between super-enhancers and TSS regions (+/−2.5 kb of a TSS) to identify their target genes.
When super-enhancers did not have PET interactions to any TSS regions, they were assigned the nearest active TSSs (including Polycomb-occupied genes) by proximity. Super-enhancers and the TSS regions (+/−2.5 kb of a TSS) of their target genes are considered as SE-gene units. All 231 super-enhancers were assigned to target genes with this method. This approach resulted in a total of 302 SE-gene units because an SE occasionally interacted with multiple genes.
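The two-step assignment (interaction first, proximity as fallback) can be sketched as below. The data layout and the names `assign_targets` and `_overlaps` are hypothetical, and the distance from a TSS to a super-enhancer is assumed to be measured to the nearest edge of the super-enhancer.

```python
def _overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def assign_targets(se, active_tss, interactions, flank=2500):
    """Assign a super-enhancer to genes.

    se:            (chrom, start, end)
    active_tss:    list of (gene, chrom, position) for active or
                   Polycomb-occupied genes
    interactions:  list of (peak_a, peak_b) high-confidence PET peak pairs
    Returns every gene whose TSS window is linked to the SE by an interaction;
    if there is none, returns the single nearest TSS on the same chromosome.
    """
    linked = []
    for gene, chrom, pos in active_tss:
        window = (chrom, pos - flank, pos + flank)
        for a, b in interactions:
            if (_overlaps(a, se) and _overlaps(b, window)) or (
                _overlaps(b, se) and _overlaps(a, window)
            ):
                linked.append(gene)
                break
    if linked:
        return linked
    same_chrom = [(gene, pos) for gene, chrom, pos in active_tss if chrom == se[0]]
    if not same_chrom:
        return []
    # Distance is 0 when the TSS lies inside the super-enhancer.
    nearest = min(same_chrom, key=lambda gp: max(se[1] - gp[1], gp[1] - se[2], 0))
    return [nearest[0]]
```

Because one super-enhancer can return several genes, the number of SE-gene units can exceed the number of super-enhancers, as noted above.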
We next identified SMC1 PET interactions between two CTCF-enriched regions (regardless of whether these CTCF regions were at promoters or enhancers), which we call “CTCF-CTCF PET interactions.” The regions spanned by CTCF-CTCF PET interactions that encompass these SE-gene units are what we call super-enhancer domains. The CTCF-CTCF PET interactions defining super-enhancer domains were required to encompass the TSS regions (+/−2.5 kb of a TSS) and the super-enhancer for each SE-gene unit. When multiple nested CTCF-CTCF PET interactions encompassed an SE-gene unit, we used the smallest CTCF-CTCF PET interaction for simplicity.
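Selecting the smallest enclosing CTCF-CTCF interaction can be sketched as follows. The representation of each loop as a single (chrom, start, end) span, taken here to run from the outer edge of one anchor to the outer edge of the other, is an assumption, and the function name is hypothetical.

```python
def smallest_enclosing_loop(loops, se, tss, flank=2500):
    """Return the shortest loop span containing both the SE and the TSS window.

    loops: list of (chrom, start, end) spans of CTCF-CTCF PET interactions
    se:    (chrom, start, end) of the super-enhancer
    tss:   TSS position on the same chromosome as the SE
    Returns None when no loop encloses the whole SE-gene unit.
    """
    chrom = se[0]
    unit_start = min(se[1], tss - flank)
    unit_end = max(se[2], tss + flank)
    enclosing = [
        loop
        for loop in loops
        if loop[0] == chrom and loop[1] <= unit_start and loop[2] >= unit_end
    ]
    return min(enclosing, key=lambda loop: loop[2] - loop[1], default=None)
```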
We identified 193 Super-enhancer Domains (SDs) containing a total of 191 super-enhancers. We noted that the boundaries of super-enhancers are sensitive to the algorithm that computationally defines super-enhancers. For 4 super-enhancers, one super-enhancer constituent out of the multiple constituent enhancers that define the super-enhancer falls outside of the CTCF-CTCF PET interactions. These 4 CTCF-CTCF PET interactions encompass the target gene TSS regions (+/−2.5 kb of a TSS) and more than 50% of the genomic space covered by the super-enhancer. Therefore, we qualified these 4 CTCF-CTCF PET interactions as Super-enhancer Domains.
Thus, we identified a total of 197 Super-enhancer Domains (SDs) containing a total of 197 boundary CTCF-CTCF PET interactions and 195 super-enhancers (Table S4A, S4B). For the ~15% of super-enhancers that did not qualify for occurrence within an SD by using the high-confidence ChIA-PET data, the interaction dataset (not the high-confidence data) shows that all but one of these super-enhancers are located within CTCF-CTCF loops co-bound by cohesin.
We also performed the same computational analyses for the 8,563 typical enhancers. We found that only 48% (4,128/8,563) of typical enhancers are contained in CTCF-CTCF topological structures similar to SDs.
Developmental regulators in embryonic stem cells frequently exhibit extended binding of Polycomb complex at their promoters spanning 2-35 kb from their promoters (Boyer et al., 2006; Lee et al., 2006). We thus focused on those Polycomb-occupied TSSs that showed enrichment of H3K27me3 spanning greater than 2 kb in size. This distance cutoff was based on analyses performed in (Lee et al., 2006). We noted that ~60% of H3K27me3 regions called by MACS had neighboring H3K27me3 regions within 2 kb. In order to accurately capture the large genomic regions that show enrichment of H3K27me3 signal, we first merged the H3K27me3 regions that were within 2 kb of each other. 546 genes, including 203 encoding transcription factors, showed enrichment of H3K27me3 spanning greater than 2 kb at their promoters.
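The merging of neighboring H3K27me3 regions and the 2 kb size filter can be sketched as follows, assuming half-open (chrom, start, end) intervals and interpreting "within 2 kb" as a gap of at most 2,000 bp. The function names are hypothetical.

```python
def merge_nearby(regions, max_gap=2000):
    """Merge same-chromosome regions separated by at most max_gap bp."""
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            # Extend the previous region instead of starting a new one.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(r) for r in merged]

def broad_regions(regions, min_span=2000):
    """Keep merged regions spanning more than min_span bp."""
    return [r for r in merge_nearby(regions) if r[2] - r[1] > min_span]
```

For example, peaks at 1,000-1,500 and 3,000-4,200 on the same chromosome are 1,500 bp apart and would be merged into one 3,200 bp region, which passes the 2 kb filter.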
We next identified high-confidence CTCF-CTCF PET interactions that encompassed the H3K27me3 regions of these 546 genes at promoters. When multiple nested CTCF-CTCF PET interactions encompassed the H3K27me3 regions, we took the smallest CTCF-CTCF PET interaction for simplicity. We identified 349 Polycomb Domains (PDs) containing a total of 349 boundary CTCF-CTCF PET interactions and 380 Polycomb-associated genes (Table S5A, S5B).
Support for SD and PD Structures from Published Datasets
The existence of Super-enhancer Domains and Polycomb Domains was supported by evidence from published CTCF ChIA-PET datasets (GSE28247) (Handoko et al., 2011). We applied our ChIA-PET processing method to the published CTCF ChIA-PET dataset to identify unique PETs. We then counted the instances where a high-confidence CTCF-CTCF boundary interaction from our ChIA-PET dataset showed a minimum 80% reciprocal overlap with the span of a unique PET from the CTCF ChIA-PET dataset, i.e. 80% of a high-confidence SD boundary interaction region is in common with a CTCF ChIA-PET unique PET and vice versa. To accomplish this, we used BEDtools (https://github.com/arq5x/bedtools2) intersect with parameters -f 0.8 -r -u.
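The 80% reciprocal-overlap criterion can be written out explicitly. The sketch below reimplements the criterion on (chrom, start, end) tuples for illustration only; it is not the BEDtools code, and the function names are hypothetical.

```python
def reciprocal_overlap(a, b, min_frac=0.8):
    """True if the shared bases cover >= min_frac of BOTH intervals."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    return shared >= min_frac * (a[2] - a[1]) and shared >= min_frac * (b[2] - b[1])

def fraction_confirmed(query, reference, min_frac=0.8):
    """Fraction of query intervals with at least one reciprocal match."""
    hits = sum(
        any(reciprocal_overlap(q, r, min_frac) for r in reference) for q in query
    )
    return hits / len(query)
```

Requiring the overlap in both directions prevents a short interval from being "confirmed" merely because it sits inside a much longer one.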
We found that 34% (6,770/20,080) of our CTCF-CTCF interactions were confirmed by a unique PET within the CTCF ChIA-PET dataset, 33% (65/197) of our SD boundary interactions were confirmed by a unique PET within the CTCF ChIA-PET dataset, and 33% (115/349) of our PD boundary interactions were confirmed by a unique PET within the CTCF ChIA-PET dataset (Table S3A). Most Super-enhancer Domains and Polycomb Domains are distinct from the previously described Topologically Associating Domains (TADs).
We compared Super-enhancer Domains and Polycomb Domains to TADs by counting the instances where a Super-enhancer Domain or a Polycomb Domain showed a minimum 80% reciprocal overlap with a TAD. 3% (5/197) of our SDs and 4% (13/349) of our PDs have an 80% reciprocal overlap with a TAD (Dixon et al., 2012). 8% (16/197) of our SDs and 9% (30/349) of our PDs have an 80% reciprocal overlap with a TAD (Filippova et al., 2014) (Table S3A).
The existence of enhancer-promoter and enhancer-enhancer interactions was supported by evidence from published RNA Pol II ChIA-PET datasets (Kieffer-Kwon et al., 2013). We applied our ChIA-PET processing method to the published Pol2 ChIA-PET dataset to identify unique PETs. We then counted the instances where a high-confidence enhancer-promoter or enhancer-enhancer interaction from our Smc1 ChIA-PET dataset showed a minimum 80% reciprocal overlap with a unique PET from the Pol2 ChIA-PET dataset, e.g. 80% of an enhancer-promoter interaction region is in common with a Pol2 ChIA-PET unique PET and vice versa. We found that 82% (2,402/2,921) of our enhancer-promoter interactions were confirmed by a unique PET within the Pol2 ChIA-PET dataset, and 73% (1,969/2,700) of our enhancer-enhancer interactions were confirmed by a unique PET within the Pol2 ChIA-PET dataset (Table S3A).
Several types of structural domains have been previously described, and we expect our interactions to occur largely within their boundaries. Thus, we determined how many of our interactions spanned a boundary. Topologically Associating Domains (TADs) (Dixon et al., 2012) were determined using Hi-C in mouse ESCs; 6% (1,354/23,739) of high-confidence intrachromosomal cohesin-mediated interactions cross a TAD boundary. LOCK (large organized chromatin K9 modification) domains were determined using ChIP data (Wen et al., 2009); 4% (1,053/23,739) of high-confidence intrachromosomal cohesin-mediated interactions cross a LOCK boundary. Lamin-associated domains (LADs) were determined using DamID (Meuleman et al., 2013); 5% (1,180/23,739) of high-confidence intrachromosomal cohesin-mediated interactions cross a LAD boundary (Table S3A).
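One way to count boundary-crossing interactions is sketched below. The operational definition is an assumption, since the text does not state one: an interaction is taken to cross a boundary when a domain boundary position lies strictly between its two anchors. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def crosses_boundary(start, end, boundaries):
    """True if any boundary b satisfies start < b < end.

    `boundaries` must be a sorted list of positions on the same chromosome
    as the interaction.
    """
    n_at_or_before_start = bisect_right(boundaries, start)
    n_before_end = bisect_left(boundaries, end)
    return n_before_end - n_at_or_before_start > 0

def fraction_crossing(interactions, boundaries_by_chrom):
    """Fraction of (chrom, start, end) interactions that cross a boundary."""
    crossing = sum(
        crosses_boundary(s, e, boundaries_by_chrom.get(chrom, []))
        for chrom, s, e in interactions
    )
    return crossing / len(interactions)
```

As a check on the reported figures, 1,354 of 23,739 interactions corresponds to 5.7%, which rounds to the 6% stated for TAD boundaries.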
Genome-wide average “meta” representations of ChIP-seq occupancy of different factors were created by mapping ChIP-seq read density to different sets of regions (
Heatmap representations of ChIP-seq read density of different factors were created by mapping the reads within super-enhancers and/or their target genes across super-enhancer domains (
Heatmap representations of ChIA-PET interactions were created by mapping high-confidence ChIA-PET interactions across Super-enhancer Domains (SD) and Polycomb Domains (PD), which are defined above. We created three types of regions: upstream, SD or PD, and downstream. Upstream and downstream regions are 20% of the SD's or PD's length each. We divided the upstream and downstream regions into 10 equally-sized bins each. We divided the SD or PD into 50 equally-sized bins. To calculate interactions in each bin, we filtered high-confidence interactions in two ways. 1) We required high-confidence interactions to have at least one end in the interrogated region. This removed interactions that are anchored outside of our region of interest. 2) We removed interactions that are not related to the internal structure of the domain. This removed interactions that have one end at an SD or PD border PET peak and the other end outside of the SD or PD.
We considered the whole span of each filtered high-confidence ChIA-PET interaction. The density of such spans in each bin was calculated, where all bins contacting an interaction were incremented by 1. Per row counts were normalized by dividing each bin count by the row maximum and displayed in Heatmaps in
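The binning scheme (10 upstream bins, 50 domain bins, 10 downstream bins) and the per-row normalization can be sketched as follows. The sketch assumes the two filtering steps have already been applied to the spans passed in, and the function names are hypothetical.

```python
def bin_edges(dom_start, dom_end):
    """Return 71 edges delimiting 10 upstream, 50 domain, 10 downstream bins.

    Each flank is 20% of the domain length.
    """
    length = dom_end - dom_start
    flank = length / 5
    edges = [dom_start - flank + flank * k / 10 for k in range(10)]
    edges += [dom_start + length * k / 50 for k in range(50)]
    edges += [dom_end + flank * k / 10 for k in range(11)]
    return edges

def heatmap_row(dom_start, dom_end, spans):
    """Count interaction spans per bin and scale the row by its maximum.

    `spans` is a list of (start, end) for already-filtered interactions; every
    bin touched by a span is incremented by 1.
    """
    edges = bin_edges(dom_start, dom_end)
    row = [0] * 70
    for start, end in spans:
        for b in range(70):
            if start < edges[b + 1] and end > edges[b]:
                row[b] += 1
    top = max(row)
    return [v / top for v in row] if top else row
```

Because every domain is rescaled to the same 70 bins, domains of very different lengths can be stacked as rows of one heatmap.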
An entropy-based measure of Jensen-Shannon Divergence (JSD) was adopted to identify putative SMC1- and CTCF-bound chromatin insulator elements at PD domain boundaries (
We next used JSD as described in (Fuglede and Topsoe, 2004) to quantify the similarity between normalized ChIP-seq patterns and the two pre-defined patterns, which results in a similarity score between each normalized ChIP-seq vector and the ideal vectors described above. We took the top 15 percent of our 20 kb regions ranked by their similarity score and extracted those that were at the boundaries of Polycomb Domains (PD). For robustness, only PD border regions whose average ChIP-seq signal (H3K27me3) within the 20 kb window was above the 60th percentile of all CTCF-enriched regions at the side within the domain and below the 50th percentile of all CTCF-enriched regions at the side outside of the domain were considered as putative chromatin insulator elements.
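For reference, the Jensen-Shannon divergence between two signal vectors can be computed as in the sketch below, using base-2 logarithms so that the value lies between 0 (identical distributions) and 1 (non-overlapping distributions). The normalization of each vector to sum to 1 is an assumption about how the ChIP-seq vectors were prepared, and the function name is hypothetical.

```python
from math import log2

def jensen_shannon_divergence(p, q):
    """JSD (base 2) between two non-negative vectors of equal length.

    Each vector is first normalized to sum to 1. A lower value means the
    two patterns are more similar.
    """
    p_total, q_total = sum(p), sum(q)
    p = [v / p_total for v in p]
    q = [v / q_total for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In use, an observed H3K27me3 profile across a 20 kb window would be compared against an idealized step profile (high on one side of the candidate boundary, low on the other); such a step vector is only an illustration here, because the pre-defined patterns themselves are not reproduced in this section.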
CTCF peaks in 18 tissues/cell types from ENCODE were downloaded from the UCSC table browser (http://genome.ucsc.edu/cgi-bin/hgFileUi?db=mm9&g=wgEncodeLicrTfbs).
We restricted our analysis to autosomal CTCF sites, because these 18 cell types could be derived from mice of different sex or strains. We first took the intersection of our autosomal CTCF peaks in the murine V6.5 ESC 129-C57Bl/6 line and autosomal CTCF peaks in the murine ESC Bruce4 line from ENCODE to account for differences in cells and experimental technique. We next quantified how frequently these autosomal CTCF peaks from ESCs were occupied by CTCF ChIP-seq peaks in 18 tissues/cell types (including ESC Bruce4 cells) from ENCODE. The histogram of CTCF occupancy across 18 tissues/cell types was plotted in
Super-enhancers were identified in mouse neural progenitor cells (NPCs) using ROSE (https://bitbucket.org/young_computation/rose). This code is an implementation of the method used in (Hnisz et al., 2013; Loven et al., 2013).
Briefly, regions enriched in H3K27ac signal were identified using MACS with background control, --keep-dup=auto, and -p 1e-9. These regions were stitched together if they were within 12.5 kb of each other, and enriched regions entirely contained within +/−2 kb from a TSS were excluded from stitching. Stitched regions were ranked by H3K27ac signal therein.
ROSE identified a point at which the two classes of enhancers were separable. Those stitched enhancers falling above this threshold were considered super-enhancers.
Phillips-Cremins et al. performed 5C at 7 genomic loci (Phillips-Cremins et al., 2013). We filtered for statistically significant 5C interactions in mouse NPC by requiring a p value for both replicates <0.05, resulting in 674 interactions. We filtered for CTCF-CTCF interactions by requiring an overlap with a CTCF ChIP-seq enriched region in NPC on both ends, resulting in 32 CTCF-positive 5C interactions. 34% (11/32) of these CTCF 5C interactions in NPCs have an 80% reciprocal overlap with an SMC1 ChIA-PET interaction in mouse ESCs (Table S3B).
For each sample, 2×10⁷ ESCs were crosslinked with 1% formaldehyde for 20 min at RT. The reaction was quenched by the addition of 125 mM glycine for 5 min at RT. Crosslinked ESCs were washed with PBS and resuspended in 10 ml lysis buffer (10 mM Tris-HCl, pH 8.0, 10 mM NaCl, 0.2% NP40 and proteinase inhibitors) and lysed with a Dounce homogenizer. Following BglII digestion overnight, 3C-ligated DNA was prepared as previously described (Lieberman-Aiden et al., 2009).
The 3C interactions at the miR-290-295 and Pou5f1 loci (Figure S4A, S4B) were analyzed by quantitative real-time PCR using custom Taqman probes as previously described (Xu et al., 2011). The amount of DNA in the qPCR reactions was normalized across 3C libraries using a custom Taqman probe directed against the Actb locus. Primer sequences are listed below.
Target Region Primer Name Sequence (5′-3′)
F and R denote forward and reverse primers, respectively.
In brief, murine ESCs (up to 1×10⁸ cells) were treated with 1% formaldehyde at room temperature for 10 min and then neutralized using 0.2 M glycine. The crosslinked chromatin was fragmented by sonication to size lengths of 300-700 bp. The anti-SMC1 antibody (Bethyl, A300-055A) was used to enrich SMC1-bound chromatin fragments. A portion of ChIP DNA was eluted from antibody-coated beads for concentration quantification and for enrichment analysis using quantitative PCR. For ChIA-PET library construction, ChIP DNA fragments were end-repaired using T4 DNA polymerase (NEB) and ligated to either linker A or linker B. After linker ligation, the two samples were combined for proximity ligation in diluted conditions. Following proximity ligation, the paired-end tag (PET) constructs were extracted from the ligation products and the PET templates were subjected to 50×50 paired-end sequencing using Illumina HiSeq 2000.
ChIA-PET was performed as previously described (Chepelev et al., 2012; Fullwood et al., 2009; Goh et al., 2012; Li et al., 2012). Briefly, ES cells (up to 1×10⁸ cells) were treated with 1% formaldehyde at room temperature for 20 min and then neutralized using 0.2 M glycine. The crosslinked chromatin was fragmented by sonication to size lengths of 300-700 bp. The anti-SMC1 antibody (Bethyl, A300-055A) was used to enrich SMC1-bound chromatin fragments. A portion of ChIP DNA was eluted from antibody-coated beads for concentration quantification and for enrichment analysis using quantitative PCR.
For ChIA-PET library construction ChIP DNA fragments were end-repaired using T4 DNA polymerase (NEB). ChIP DNA fragments were divided into two aliquots and either linker A or linker B was ligated to the fragment ends. The two linkers differ by two nucleotides which are used as a nucleotide barcode (Linker A with CG; Linker B with AT) (Table S1A). After linker ligation, the two samples were combined and prepared for proximity ligation by diluting in a 20 ml volume to minimize ligations between different DNA-protein complexes. The proximity ligation reaction was performed with T4 DNA ligase (Fermentas) and incubated without rocking at 22 degrees Celsius for 20 hours.
During the proximity ligation DNA fragments with the same linker sequence were ligated within the same chromatin complex, which generated the ligation products with homodimeric linker composition. However, chimeric ligations between DNA fragments from different chromatin complexes could also occur, thus producing ligation products with heterodimeric linker composition. These heterodimeric linker products were used to assess the frequency of nonspecific ligations and were then removed bioinformatically.
As shown in Figure S1E, all heterodimeric linker ligations, giving rise to chimeric PETs, are by definition nonspecific. Because random intermolecular associations in the test tube are expected to be comparable for linkers A and B, the frequency of random homo- and heterodimeric linker ligations should also be equivalent. In our SMC1 ChIA-PET library, only 7% of paired-end ligations involved heterodimeric linkers (Table S1A). Thus, we estimate that less than 14% of total homodimeric ligations are nonspecific.
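The reasoning behind this bound can be illustrated with a short calculation. This is one possible reading of the estimate, not necessarily the calculation that was performed: if random homodimeric ligations are as frequent as random heterodimeric ones, the nonspecific homodimeric products make up about the same share of all ligations as the heterodimeric products do. The function name is hypothetical.

```python
def nonspecific_homodimer_fraction(hetero_fraction):
    """Estimated share of homodimeric products that are nonspecific.

    Assumes random homodimeric ligations occur as often as random
    heterodimeric ones, so nonspecific homodimers ~= hetero_fraction of all
    ligations, while homodimers in total are 1 - hetero_fraction.
    """
    homo_fraction = 1.0 - hetero_fraction
    return hetero_fraction / homo_fraction
```

With 7% heterodimeric ligations, this reading gives 0.07 / 0.93, or about 7.5% of homodimeric ligations, which is consistent with the more conservative "less than 14%" stated above.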
Following proximity ligation, samples were treated with Proteinase K and DNA was purified. An EcoP15I (NEB) digestion was performed at 37 degrees Celsius for 17 hours to linearize the ligated chromatin fragments. The chromatin fragments were then immobilized on Dynabeads M280 Streptavidin beads. An End-Repair reaction was performed (Epicentre #ER81050), then As were added to the ends with Klenow treatment by rotating at 37 degrees Celsius for 35 minutes. Next, Illumina paired-end sequencing adapters were ligated on the ends and 18 cycles of PCR were performed. The Paired-End-Tag (PET) constructs were extracted from the ligation products and the PET templates were subjected to 50×50 paired-end sequencing using Illumina HiSeq 2000. SMC1 ChIA-PET was performed as previously described (Chepelev et al., 2012; Fullwood et al., 2009; Goh et al., 2012; Li et al., 2012).
ChIA-PET data analysis was performed as previously described (Li et al., 2010), with modifications described in the Extended Experimental Procedures. The high-confidence interactions for the two biological replicate SMC1 ChIA-PET experiments are listed in Tables S1C and S1D (not shown), and those for the merged data set are listed in Table S1E. All data sets used in this study are listed in Table S6.
Raw and processed sequencing data were deposited in GEO under accession number GSE57913 (http://www.ncbi.nlm.nih.gov/geo/).
The organization of mammalian chromosomes involves structural units with various sizes and properties, and cohesin, a structural maintenance of chromosomes (SMC) complex, participates in DNA interactions that include enhancer-promoter loops and larger loop structures that occur within topologically associating domains (TADs) (
The ChIA-PET technique was used because it yields high-resolution (~4 kb) genome-wide interaction data, which is important because most loops involved in transcriptional regulation are between 1 and 100 kb (Gibcus and Dekker, 2013). We hoped to extend previous findings that mapped interactions among regulatory elements across portions of the ESC genome (Denholtz et al., 2013; Phillips-Cremins et al., 2013; Seitan et al., 2013) and gain a detailed understanding of the relationship between transcriptional control of ESC identity genes and control of local chromosome structure. To identify interactions between cohesin-occupied sites, we generated biological replicates of SMC1 ChIA-PET data sets in ESCs totaling ~400 million reads (Table S1A). The two biological replicates showed a high degree of correlation (Pearson's r>0.91, Figures S1C and S1D), so we pooled the replicate data and processed them using an established protocol (Li et al., 2010), with modifications described in the Extended Experimental Procedures (Figure S1 and Table S1A). The data set contained ~19 million unique paired-end tags (PETs) that were used to identify PET peaks (
Genomic data of any type are noisy, and our confidence in the interpretation of DNA interaction data is improved by identifying PETs that represent independent events in the sample and pass statistical significance tests. For this reason, we generated a high-confidence interaction data set (described in Extended Experimental Procedures) by requiring that at least three independent PETs support the identified interaction between two PET peaks. The high-confidence data set consisted of 23,835 interactions that were almost entirely intrachromosomal (99%) and included 2,921 enhancer-promoter interactions, 2,700 enhancer-enhancer interactions, and 7,841 interactions between non-enhancer, non-promoter CTCF sites (
We found that the interaction data supported 83% of super-enhancer assignments to the proximal active gene and 87% of typical enhancer assignments to the proximal active gene (Tables S2B and S2C), with approximately half of the remainder assigned to the second most proximal gene. The interaction data most frequently assigned super-enhancers and typical enhancers to a single gene, with 76% of super-enhancers and 84% of typical enhancers showing evidence of interaction with a single gene. Prior studies have suggested that there can be more frequent interactions between enhancers and genes (Kieffer-Kwon et al., 2013; Sanyal et al., 2012; Shen et al., 2012); our high-confidence data are not saturating and do not address the upper limits of these interactions (Figure S1H and Extended Experimental Procedures).
The catalog of enhancer-promoter assignments provided by these interaction data should prove useful for future studies of the roles of ESC enhancers and their associated factors in control of specific target genes. The majority of cohesin ChIA-PET interactions did not cross the boundaries of previously defined TADs (Dixon et al., 2012; Filippova et al., 2014; Meuleman et al., 2013; Wen et al., 2009) (
Super-enhancers drive expression of key cell identity genes and are densely occupied by the transcription apparatus and its cofactors, including cohesin (Dowen et al., 2013; Hnisz et al., 2013). Analysis of high-confidence cohesin ChIA-PET interaction data revealed a striking feature common to loci containing super-enhancers and their associated genes (
In contrast, only 48% of typical enhancers were found to occur within comparable loops between two CTCF sites. The 197 SDs average 106 kb and most frequently contain one or two genes (Tables S4A and S4C). It was evident that there were cohesin-associated interactions between individual enhancer elements (constituents) of super-enhancers as well as interactions between super-enhancers and the promoters of their associated genes (Figures S3A-S3J).
Indeed, the results suggest that super-enhancer constituents have cohesin-associated interactions with one another (345 interactions) even more frequently than they do with their associated genes (216 interactions). The SDs contain high densities of pluripotency transcription factors, Mediator, and cohesin, together with histone modifications associated with transcriptionally active enhancers and genes (
The cohesin ChIA-PET interaction data and the distribution of the transcription apparatus suggest that the interacting cohesin-occupied CTCF sites tend to restrict the interactions of super-enhancers to those genes within the SD.
Because super-enhancers contain an exceptional amount of transcription apparatus and CTCF has been associated with insulator activity (Essafi et al., 2011; Handoko et al., 2011; Ong and Corces, 2014; Phillips and Corces, 2009; Phillips-Cremins and Corces, 2013), we postulated that SD structures might be necessary for proper regulation of genes in the vicinity of these structures. To test this model, we investigated the effect of deleting SD boundary CTCF sites on expression of genes inside and immediately outside of SDs (
For this purpose, we studied five SDs whose super-enhancer-associated genes play key roles in embryonic stem cell biology (miR-290-295, Nanog, Tdgf1, Pou5f1 [Oct4], and Prdm14). In all cases, we found that deletion of a CTCF site led to altered expression of nearby genes. In four out of five cases, deletion of a CTCF site led to increased expression of genes immediately outside the SDs, and in three of five cases, deletion of a CTCF site caused changes in expression of genes within the SDs. The miR-290-295 locus, which specifies miRNAs with roles in ESC biology, is located within an SD (
These results indicate that normal expression of the miR-290-295 pri-miRNA transcript is dependent on the CTCF boundary site and furthermore that genes located immediately outside of this SD can be activated when the SD CTCF boundary site is disrupted. The Nanog gene, which encodes a key pluripotency transcription factor, is located within an SD shown in
CRISPR-mediated deletion of the boundary CTCF site C1 of the Nanog SD led to a ~40% drop in Nanog transcript levels (
We were not able to obtain a bi-allelic CRISPR-mediated deletion of a boundary CTCF site despite multiple attempts, but we did obtain a mono-allelic deletion of the boundary CTCF site C1 (
We tested whether the super-enhancers from disrupted SD structures show increased interaction frequencies with the newly activated genes outside the SD by using 3C. At two loci where loss of an SD boundary CTCF site led to significant activation of the gene outside the SD (miR-290-295 and Pou5f1), we performed quantitative 3C experiments to measure the contact frequency between the super-enhancers and the genes immediately outside of SDs in wild-type cells and in cells where the SD boundary CTCF site was deleted. In both cases, loss of the CTCF site led to an increase in the contact frequency between the super-enhancers and the genes immediately outside of SDs that were newly activated (Figures S4A and S4B).
We investigated whether altered SD boundaries that affect cell identity genes cause ESCs to express markers consistent with an altered cell state. Indeed, we found that ESCs lacking the miR-290-295 boundary CTCF site C1 exhibit increased expression of the ectodermal marker Pax6 and decreased expression of the endodermal lineage markers Gata6 and Sox17, suggesting that loss of the SD structure is sufficient to affect cell identity (Figure S4C). Previous studies have shown that miR-290-295 null ESCs show an increased propensity to differentiate into ectodermal lineages at the expense of endoderm (Kaspi et al., 2013). In summary, the loss of CTCF sites at the boundaries of SDs can cause a change in the level of transcripts for super-enhancer-associated genes within the SD and frequently leads to activation of genes near these CTCF sites. These results indicate that the integrity of SDs is important for normal expression of genes located in the vicinity of the SD, which can include genes that are key to control of cell identity.
Maintenance of the pluripotent ESC state requires that genes encoding lineage-specifying developmental regulators are repressed, and these repressed lineage-specifying genes are occupied by nucleosomal histones that carry the polycomb-associated mark H3K27me3 (Margueron and Reinberg, 2011; Young, 2011). The mechanisms responsible for maintaining the H3K27me3 mark across short spans of regulatory regions and promoters of repressed genes are not well understood, although CTCF sites have been implicated (Cuddapah et al., 2009; Schwartz et al., 2012; Van Bortle et al., 2012).
Analysis of the H3K27me3-marked genes revealed that they, like the super-enhancer-associated genes, are typically located within a loop between two interacting CTCF sites co-occupied by cohesin (
The majority (78%) of cohesin ChIA-PET interactions originating in PDs occur within the PD boundaries (
We postulated that the CTCF boundaries that form PD structures might be important for repression of the polycomb-marked genes within the PD and investigated the effect of deleting boundary CTCF sites on a PD containing Tcfap2e to test this idea (
CRISPR-mediated deletion of the other boundary CTCF site (C2) caused a 4-fold increase in the expression of Tcfap2e (p<0.001) and had little effect on adjacent genes. These results suggest that the integrity of the CTCF boundaries of PDs is important for full repression of H3K27me3-occupied genes.
A previous study suggested that DNA loops mediated by cohesin and CTCF tend to be larger and more shared among multiple cell types than DNA loops associated with cohesin and Mediator, which represent enhancer-promoter interactions that may be cell type specific (Phillips-Cremins et al., 2013). This led us to postulate that: (1) the interacting CTCF structures of SDs and PDs may be common to multiple cell types and (2) the acquisition of super-enhancers and polycomb binding within these common domain structures will vary based on the gene expression program of the cell type (
To test this model, we compared the SDs identified in ESCs to the corresponding regions in neural precursor cells (NPCs), for which 5C interaction data were available at specific loci (Phillips-Cremins et al., 2013).
We found, for example, that the Nanog locus SD observed in ESCs with ChIA-PET data was also detected by 5C data in NPCs (
In this domain, the Olig1/Olig2 genes are not active and no super-enhancers are formed in ESCs, whereas there are three super-enhancers in NPCs, where these genes are highly expressed (
For regions where 5C interaction data in NPCs and ChIA-PET interaction data in ESCs could be compared, 11 of the 32 interactions between CTCF sites identified in NPCs (about 34%) were supported by interaction data in ESCs (Table S3B), a notable overlap given the sparsity of the interaction data.
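The comparison described above can be illustrated with a minimal sketch, which is not the analysis pipeline used to generate Table S3B. It assumes each interaction is a pair of genomic anchors and counts an NPC interaction as supported when both of its anchors overlap the two anchors of some ESC interaction. The function names, the `slop` tolerance, and all coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]          # (chromosome, start, end)
Interaction = Tuple[Interval, Interval]  # two interacting anchors


def overlaps(a: Interval, b: Interval, slop: int = 0) -> bool:
    """True if two intervals on the same chromosome overlap within `slop` bp."""
    return a[0] == b[0] and a[1] <= b[2] + slop and b[1] <= a[2] + slop


def is_supported(query: Interaction, reference: List[Interaction],
                 slop: int = 0) -> bool:
    """True if both anchors of `query` overlap the anchors of one reference
    interaction, in either orientation."""
    q1, q2 = query
    for r1, r2 in reference:
        same = overlaps(q1, r1, slop) and overlaps(q2, r2, slop)
        swapped = overlaps(q1, r2, slop) and overlaps(q2, r1, slop)
        if same or swapped:
            return True
    return False


def supported_fraction(queries: List[Interaction],
                       reference: List[Interaction],
                       slop: int = 0) -> Tuple[int, int, float]:
    """Return (supported count, total count, fraction supported)."""
    supported = sum(is_supported(q, reference, slop) for q in queries)
    total = len(queries)
    return supported, total, (supported / total if total else 0.0)


# Toy coordinates, invented for illustration only.
npc_interactions = [
    (("chr1", 100, 200), ("chr1", 900, 1000)),
    (("chr1", 100, 200), ("chr1", 5000, 5100)),
    (("chr2", 10, 50), ("chr2", 400, 450)),
]
esc_interactions = [
    (("chr1", 150, 250), ("chr1", 950, 1050)),
    (("chr2", 420, 500), ("chr2", 0, 20)),
]

print(supported_fraction(npc_interactions, esc_interactions))
```

On this toy input, two of the three NPC interactions are supported. Requiring both anchors to overlap is a deliberately strict choice; a looser criterion would report more supported interactions.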
This supports the view that the interacting CTCF structures of ESC SDs may be common to multiple cell types. If the CTCF boundaries of ESC SDs and PDs are common to many cell types, we would expect that the binding of CTCF to the SD and PD boundary sites observed in ESCs will be conserved across multiple cell types.
To test this notion, we examined CTCF ChIP-seq peaks from 18 mouse cell types and determined how frequently CTCF binding occurred across these cell types (
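One way to quantify how frequently a boundary site is bound across cell types is sketched below. This is an illustration under stated assumptions, not the method used here: peaks are assumed to be available as simple genomic intervals per cell type, a site counts as occupied when any peak overlaps it, and the function names, cell type labels, and coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if two intervals share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def occupancy_fraction(site: Interval,
                       peaks_by_cell_type: Dict[str, List[Interval]]) -> float:
    """Fraction of cell types with at least one CTCF peak overlapping `site`."""
    if not peaks_by_cell_type:
        raise ValueError("at least one cell type is required")
    occupied = sum(
        any(intervals_overlap(site, peak) for peak in peaks)
        for peaks in peaks_by_cell_type.values()
    )
    return occupied / len(peaks_by_cell_type)


# Toy data: four hypothetical cell types rather than the 18 examined.
boundary_site = ("chr6", 1000, 1200)
ctcf_peaks = {
    "ESC": [("chr6", 1100, 1300)],
    "NPC": [("chr6", 900, 1050)],
    "liver": [("chr6", 5000, 5200)],
    "MEF": [("chr6", 1150, 1180)],
}

print(occupancy_fraction(boundary_site, ctcf_peaks))
```

In this toy example the site is bound in three of the four cell types. Applying the same calculation to every SD and PD boundary site would give the distribution of occupancy frequencies across cell types.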
The following Tables are referenced throughout the specification.
The patent application contains a lengthy table section. A copy of the tables is available in electronic form from the USPTO web site. An electronic copy of the tables will also be available from the USPTO upon request and payment of the fee set forth in 37 CFR 1.19(b)(3).
The Tables referenced herein were previously submitted in U.S. Provisional Application No. 62/234,770, and are hereby incorporated by reference in their entirety.
This application claims the benefit of U.S. Provisional Application No. 62/234,770, filed Sep. 30, 2015. The entire teachings of the above application are incorporated herein by reference.
This invention was made with government support under Grant Number HG002668 awarded by the National Institutes of Health. The government has certain rights in the invention.
Number | Date | Country
---|---|---
62234770 | Sep 2015 | US