T cell antigen specificity, mediated via T cell receptors (TCRs), is a hallmark of cellular immunity. TCRs are heterodimeric proteins found on the T cell surface, commonly comprised of an α- and β-chain. The TCR α- and β-chain genes are composed of discrete V, D (β-chain only) and J segments that are joined by somatic recombination during T cell development. This genetic rearrangement generates a highly diverse TCR repertoire (estimated to range from 1015 to 1061 possible receptors in human) to ensure efficient control of viral infections and other pathogen-induced diseases. TCR diversity is primarily exhibited in complementarity determining region (CDR) loops (CDR1, CDR2 and CDR3), which engage peptides that are presented by major histocompatibility complex (MHC) proteins, and therefore directly determines the specificity of T cell pMHC binding.
Although the factors underlying TCR-pMHC recognition are not fully understood, recent studies have shown that T cells binding to a particular pMHC share common TCR sequence features and, in select cases, it is possible to predict the specific binding probability of an unseen TCR sequence based on learned TCR sequence features. However, these studies were limited by the quantity and diversity of training data generated by traditional single multimer sorting or antigen re-exposure assays. Further understanding of TCR-pMHC specific binding requires innovation in both computational and experimental methods. 10× Genomics recently published a dataset generated from their highly multiplexed pooled dextramer binding immune profiling platform that couples feature-barcoded dextramers and single cell TCR sequencing. This approach makes it feasible to generate high-dimensional pMHC specific binding data at the single cell level with paired T cell α- and β-chain sequences, whereas other large-scale pooled multimer approaches only estimate the composition of pMHC specific binding T cells.
As with any other high throughput technology, highly multiplexed dextramer binding data are often associated with low signal-to-noise ratios. This makes it bioinformatically challenging to reliably identify TCR-pMHC binding events using such large-scale binding datasets. Unexpectedly high cross-HLA and cross-pMHC associations were observed from the binding events that 10× Genomics provided (
As next-generation screening technologies have increased the volume of available TCR-pMHC binding data, state-of-the-art functional classifiers to computationally validate and subsequently predict TCR-pMHC specific recognition have become more feasible. While the results from initial TCR-pMHC binding classifiers are encouraging, they were only trained using CDR loop sequences and thus unable to learn the overall complex sequence patterns from full-length TCR sequences, resulting in sub-optimal prediction accuracy for highly diverse pMHC binding TCRs. Leveraging the ability of deep learning algorithms to learn complex patterns, several deep learning frameworks were recently proposed to uncover binding patterns in large, highly complex TCR sequence datasets.
In this study, a computational framework for mapping, computationally validating, and predicting TCR-pMHC specific recognition using highly multiplexed dextramer binding data is described.
Disclosed are methods comprising receiving single cell sequencing data comprising single cell sequence data, dextramer sequence data, and single cell T-Cell Receptor (TCR) sequence data; filtering, from the dextramer sequence data, based on the single cell sequence data, data associated with low-quality cells; adjusting, based on a measure of background noise, the dextramer sequence data; filtering, from the dextramer sequence data, based on the single cell TCR-data, data according to a presence or an absence of an α-chain or a β-chain; and identifying data remaining in the normalized filtered dextramer sequence data as associated with reliable TCR-pMHC binding events.
Disclosed are methods comprising receiving single cell sequence data, dextramer sequence data, and single cell T Cell Receptor (TCR) sequence data; determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a number of genes; removing, from the dextramer sequence data, data associated with cells having a number of genes outside of a gene threshold range; determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a fraction of mitochondrial gene expression; removing, from the dextramer sequence data, data associated with cells having a fraction of mitochondrial gene expression that exceeds a gene expression threshold; determining, based on the dextramer sequence data, sorted dextramer sequence data wherein the sorted dextramer sequence data comprises sorted test dextramer sequence data and negative control dextramer sequence data and unsorted dextramer sequence data, wherein the unsorted dextramer sequence data comprises unsorted test dextramer sequence data; determining, for each cell represented in the dextramer sequence data, based on the negative control dextramer sequence data, a maximum negative control dextramer signal; determining, for each cell represented in the dextramer sequence data, based on the sorted test dextramer sequence data, a maximum sorted dextramer signal; determining, for each cell represented in the dextramer sequence data, based on the unsorted test dextramer sequence data, a maximum unsorted dextramer signal; estimating, based on the maximum negative control dextramer signals, a dextramer binding background noise; estimating, based on the maximum sorted dextramer signals and the maximum unsorted dextramer signals, a dextramer sorting gate efficiency; determining, based on the dextramer binding background noise and the dextramer sorting gate efficiency, a measure of background noise; subtracting, for each cell represented in the dextramer sequence data, the measure of background noise from a dextramer signal associated with each cell; performing, for each cell represented in the dextramer sequence data, cell-wise normalization on the dextramer signals associated with each cell; performing, for each cell represented in the dextramer sequence data, pMHC-wise normalization; determining, for each cell represented in the dextramer sequence data, based on the single cell TCR sequence data, a presence or an absence of at least one α-chain and at least one β-chain; removing, from the normalized dextramer sequence data, based on the presence or the absence of the at least one α-chain and the at least one β-chain, data associated with cells having only an α-chain, only a β-chain, or multiple α- or β-chains; and identifying data remaining in the normalized dextramer sequence data as associated with reliable TCR-pMHC binding events.
Disclosed are methods comprising performing TCR-pMHC binding specificity data normalization on dextramer sequence data to identify a plurality of TCR-pMHC binding events; determining, based on the normalized dextramer sequence data, a training dataset comprising a plurality of TCR sequences wherein each TCR sequence is associated with a binding affinity; determining, based on the plurality of TCR sequences, a plurality of features for a predictive model; training, based on a first portion of the training dataset, the predictive model according to the plurality of features; testing, based on a second portion of the training dataset, the predictive model; and outputting, based on the testing, the predictive model.
Disclosed are methods comprising presenting, to a trained predictive model, an unknown TCR sequence, wherein the trained predictive model is trained based on a training data set derived according to the disclosed methods; and predicting, by the trained predictive model, a binding affinity.
Disclosed are methods comprising receiving single cell sequence data, dextramer sequence data, and single cell T Cell Receptor (TCR) sequence data, determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a number of genes, removing, from the dextramer sequence data, data associated with cells having a number of genes outside of a gene threshold range, determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a fraction of mitochondrial gene expression, removing, from the dextramer sequence data, data associated with cells having a fraction of mitochondrial gene expression that exceeds a gene expression threshold, determining, based on the dextramer sequence data, sorted dextramer sequence data wherein the sorted dextramer sequence data comprises sorted test dextramer sequence data and negative control dextramer sequence data, determining, for each cell represented in the dextramer sequence data, based on the negative control dextramer sequence data, a maximum negative control dextramer signal, determining, for each cell represented in the dextramer sequence data, based on the sorted test dextramer sequence data, a maximum sorted dextramer signal, estimating, based on the maximum negative control dextramer signals and the maximum sorted dextramer signals, a dextramer binding background noise, determining, for each cell represented in the dextramer sequence data, based on the single cell TCR sequence data, a presence or an absence of at least one α-chain and at least one β-chain, removing, from the dextramer sequence data, based on the presence or the absence of the at least one α-chain and the at least one β-chain, data associated with cells having only an α-chain, only a β-chain, or multiple α- or β-chain, determining, for each dextramer binding to a given cell represented in the dextramer sequence data, a ratio of dextramer signal within the cell to a sum of all dextramers binding to the cell (a measure of the dextramer binding specificity to the cell), determining, for each dextramer binding to a given TCR clonotype of each cell represented in the dextramer sequence data, a fraction of T cells within a clone binding to a particular dextramer (a measure of the dextramer binding specificity to the clonotype to which the cell belongs), determining, for each dextramer binding to a given cell represented in the dextramer sequence data, based on the measure the of the dextramer binding specificity to the cell and the measure of the dextramer binding specificity to the clonotype to which the cell belongs, a corrected dextramer signal associated with each dextramer binding to the cell, performing, for each cell represented in the dextramer sequence data, cell-wise normalization on the dextramer signals associated with each cell, performing, for each cell represented in the dextramer sequence data, pMHC-wise normalization, and identifying, based on a threshold, data remaining in the normalized dextramer sequence data as associated with reliable TCR-pMHC binding events.
Disclosed are apparatus configured to perform any of the disclosed methods.
Disclosed are computer readable mediums having processor-executable instructions embodiment thereon configured to cause an apparatus to perform any of the disclosed methods.
Additional advantages of the disclosed method and compositions will be set forth in part in the description which follows, and in part will be understood from the description, or may be learned by practice of the disclosed method and compositions. The advantages of the disclosed method and compositions will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several embodiments of the disclosed method and compositions and together with the description, serve to explain the principles of the disclosed method and compositions.
The disclosed method and compositions may be understood more readily by reference to the following detailed description of particular embodiments and the Example included therein and to the Figures and their previous and following description.
It is understood that the disclosed method and compositions are not limited to the particular methodology, protocols, and reagents described as these may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention which will be limited only by the appended claims.
It must be noted that as used herein and in the appended claims, the singular forms “a.”, “an.” and “the” include plural reference unless the context clearly dictates otherwise. Thus, for example, reference to “a TCR” includes a plurality of such TCRs, reference to “the dextramer” is a reference to one or more dextramers and equivalents thereof known to those skilled in the art, and so forth.
The term “subject” or “donor” may refer to an animal, such as a mammalian species (preferably human) or avian (e.g., bird) species. More specifically, a subject or donor can be a vertebrate, e.g., a mammal such as a mouse, a primate, a simian or a human. Animals include farm animals, sport animals, and pets. A subject or donor can be a healthy individual, an individual that has symptoms or signs or is suspected of having a disease or a predisposition to the disease, or an individual that is in need of therapy or suspected of needing therapy. In some embodiments, the subject donor is human, such as a human who has, or is suspected of having, cancer.
The term “barcode,” as used herein, generally refers to a label that may be attached to a molecule (e.g., dextramer, cell) to convey information about the molecule. For example, a DNA barcode can be a polynucleotide sequence attached to each dextramer and a common sequencing barcode can be a polynucleotide sequence attached during sequencing. This barcode can then be sequenced. The presence of the same barcode on multiple sequences may provide information about the origin of the sequence. For example, a barcode may indicate that the sequence came from a particular dextramer. A barcode can also indicate that a sequence came from a particular cell/dextramer combination.
As used herein, the terms “sequencing” or “sequencer” refer to any of a number of technologies used to determine the sequence of a biomolecule, e.g., a nucleic acid such as DNA or RNA. Exemplary sequencing methods include, but are not limited to, targeted sequencing, single molecule real-time sequencing, exon sequencing, electron microscopy-based sequencing, panel sequencing, transistor-mediated sequencing, direct sequencing, random shotgun sequencing, Sanger dideoxy termination sequencing, whole-genome sequencing, sequencing by hybridization, pyrosequencing, duplex sequencing, cycle sequencing, single-base extension sequencing, solid-phase sequencing, high-throughput sequencing, massively parallel signature sequencing, emulsion PCR, co-amplification at lower denaturation temperature-PCR (COLD-PCR), multiplex PCR, sequencing by reversible dye terminator, paired-end sequencing, near-term sequencing, exonuclease sequencing, sequencing by ligation, short-read sequencing, single-molecule sequencing, sequencing-by-synthesis, real-time sequencing, reverse-terminator sequencing, nanopore sequencing, 454 sequencing, Solexa Genome Analyzer sequencing, SOLiD™ sequencing, MS-PET sequencing, and a combination thereof. In some embodiments, sequencing can be performed by a gene analyzer such as, for example, gene analyzers commercially available from Illumina or Applied Biosystems.
A “polynucleotide”, “nucleic acid”, “nucleic acid molecule”, or “oligonucleotide” refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages. Typically, a polynucleotide comprises at least three nucleosides. Oligonucleotides often range in size from a few monomeric units, e.g. 3-4, to hundreds of monomeric units. Whenever a polynucleotide is represented by a sequence of letters, such as “ATGCCTG,” it will be understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes adenosine, “C” denotes cytosine, “G” denotes guanosine, and “T” denotes thymidine, unless otherwise noted. The letters A, C, G, and T may be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art.
The term “DNA (deoxyribonucleic acid)” refers to a chain of nucleotides comprising deoxyribonucleosides that each comprise one of four nucleobases, namely, adenine (A), thymine (T), cytosine (C), and guanine (G). The term “RNA (ribonucleic acid)” refers to a chain of nucleotides comprising four types of ribonucleosides that each comprise one of four nucleobases, namely; A, uracil (U), G, and C. Certain pairs of nucleotides specifically bind to one another in a complementary fashion (called complementary base pairing). In DNA, adenine (A) pairs with thymine (T) and cytosine (C) pairs with guanine (G). In RNA, adenine (A) pairs with uracil (U) and cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand. As used herein, “nucleic acid sequencing data,” “nucleic acid sequencing information,” “nucleic acid sequence,” “nucleotide sequence”, “genomic sequence,” “genetic sequence,” or “fragment sequence,” or “nucleic acid sequencing read” denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine or uracil) in a molecule (e.g., a whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, or fragment) of a nucleic acid such as DNA or RNA. It should be understood that the present teachings contemplate sequence information obtained using all available varieties of techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, and electronic signature-based systems.
“Optional” or “optionally” means that the subsequently described event, circumstance, or material may or may not occur or be present, and that the description includes instances where the event, circumstance, or material occurs or is present and instances where it does not occur or is not present.
Throughout the description and claims of this specification, the word “comprise” and variations of the word, such as “comprising” and “comprises,” means “including but not limited to,” and is not intended to exclude, for example, other additives, components, integers or steps. In particular, in methods stated as comprising one or more steps or operations it is specifically contemplated that each step comprises what is listed (unless that step includes a limiting term such as “consisting of”), meaning that each step is not intended to exclude, for example, other additives, components, integers or steps that are not listed in the step.
“Exemplary” means “an example of” and is not intended to convey an indication of a preferred or ideal configuration. “Such as” is not used in a restrictive sense, but for explanatory purposes.
Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, also specifically contemplated and considered disclosed is the range from the one particular value and/or to the other particular value unless the context specifically indicates otherwise. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another, specifically contemplated embodiment that should be considered disclosed unless the context specifically indicates otherwise. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint unless the context specifically indicates otherwise. Finally, it should be understood that all of the individual values and sub-ranges of values contained within an explicitly disclosed range are also specifically contemplated and should be considered disclosed unless the context specifically indicates otherwise. The foregoing applies regardless of whether in particular cases some or all of these embodiments are explicitly disclosed.
In some aspects, the methods and systems described can identify reliable TCR-pMHC bindings by analyzing multi-omics high-throughput binding data. The methods and systems may be referred to herein as ICON (Integrative COntext-specific Normalization).
Disclosed are methods of receiving single cell sequence data, dextramer sequence data, and single cell receptor sequence data; filtering, from the dextramer sequence data, based on the single cell sequence data, data associated with low-quality cells; adjusting, based on a measure of background noise, the dextramer sequence data; filtering, from the dextramer sequence data, based on the single cell receptor data, data according to a presence or an absence of specific receptor sequences; and identifying data remaining in the normalized filtered dextramer sequence data as associated with reliable receptor-pMHC binding events.
The single cell sequence data and corresponding receptor sequence data can be from several cell types, including T cells (αβ or γδ) and B cells. Thus, as an example, disclosed are methods of receiving single cell sequence data, dextramer sequence data, and single cell TCR sequence data; filtering, from the dextramer sequence data, based on the single cell sequence data, data associated with low-quality cells; adjusting, based on a measure of background noise, the dextramer sequence data; filtering, from the dextramer sequence data, based on the single cell TCR-data, data according to a presence or an absence of an α-chain or a β-chain; and identifying data remaining in the normalized filtered dextramer sequence data as associated with reliable TCR-pMHC binding.
1. Data Acquisition
Disclosed are methods of acquiring, receiving, and/or determining multi-omics high-throughput binding data. As shown in
In some aspects, the multi-omics high-throughput binding data can be previously generated and incorporated into the disclosed methods. In some aspects, the multi-omics high-throughput binding data can be generated as part of the disclosed methods.
In some aspects, as shown in
In some aspects, once the cell type of interest has been sorted, the sorted cells can then be sorted for cells that bind a particular peptide-major histocompatibility complex (MHC) (pMHC). In some aspects, cells can be combined with a set of dextramers, for example, dCODE™ dextramers. In some aspects, the dCODE™ Dextramer® technology can be used. The dextramers can comprise two or more MHCs, a peptide presented by each MHC, and a DNA barcode. In some aspects, a pool of dextramers are used. In some aspect, a pool of dextramers can comprise, but is not limited to, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 single dextramers each comprising a different pMHC. In some aspects, a pool of dextramers comprises two or more of each of the single dextramers comprising a different pMHC. In some aspects, the two or more MHCs on a single dextramer are the same and therefore present the same peptide. In some aspects the MHC can be a MHC class I (MHC I) or MHC class II (MHC II). In some aspects, the DNA barcode comprises one or more primer sequences, a peptide-MHC (pMHC) specific barcode, and a unique molecular identifier. In some aspects, the dextramers can further comprise a label. For example, the label can be a fluorescent label. In some aspects, cells that bind a particular pMHC are sorted based on the label on the dextramer. In some aspects, cells that bind a particular pMHC are sorted based on a labeled antibody specific to the dextramer.
In some aspects, the cell sorting for specific cell types and the cell sorting for cells recognizing a dextramer can be performed simultaneously or consecutively.
In some aspects, after sorting of the cells that bound to dextramers comprising pMHCs, each cell and the corresponding dextramer can be sequenced. In some aspects, the cell sequence and the dextramer sequence (e.g., the DNA barcode sequence from the dextramer) all have a common sequencing barcode which allows one to determine which cell sequences were associated with which dextramer sequences. In some aspects, the Next GEM technology can be used for sequencing. The common sequencing barcode is different than the DNA barcode found on the dextramers.
In some aspects, the sequencing of the cells that bound to dextramers comprising pMHCs provides the sequence data 104 which may comprise single cell sequence data, dextramer sequence data, and single cell receptor sequence data. In some aspects, the single cell sequence data comprises sequences from the entire cellular genome or transcriptome. Thus, in some aspects, single cell sequence data comprises gene expression data. In some aspects, the dextramer sequence data comprises the DNA barcode sequence. In some aspects, single cell receptor sequence data comprises a sequence of a specific receptor. For example, single cell receptor sequence data comprises single cell TCR or B cell receptor (BCR) sequence data. In some aspects, single cell TCR sequence data comprises paired TCR sequence data. In some aspects, paired TCR sequence data comprises sequence data for the a chain and the β chain, if present, for each cell. In some aspects, paired TCR sequence data comprises sequence data for the γ chain and the δ chain, if present, for each cell. Thus, for each method and example described herein, the sequencing of the alpha chains and beta chains can be exchanged for sequencing of the gamma chains and delta chains.
Returning to the system 100 shown
In some aspects, the ICON module 108 can be configured to analyze the received sequence data 104 (e.g., multi-omics high-throughput binding data, single cell sequence data, dextramer sequence data, single cell receptor sequence data, etc.). The sequence data 104 may include sequence information as well as meta information. The sequence data 104 can be stored in any suitable file format including, for example, VCF files, FASTA files or FASTQ files, as are known to those of skill in the art. FASTA and FASTQ are common file formats used to store raw sequence reads from high throughput sequencing. FASTQ files store an identifier for each sequence read, the sequence, and the quality score string of each read. FASTA files store the identifier and sequence only. Other file formats are contemplated.
In some aspects, as shown in
Filtering of low-quality cells from the sequence data 104 at step 310 may comprise single cell RNA-seq based filtering of low-quality cells. The ICON module 108 can be configured to filter out low quality cells such as doublets and dead cells. The cells with an unexpected high number of genes for T cells detected (e.g. >2500 genes per cell) may be categorized as doublets and cells with a high fraction of mitochondrial gene expression (e.g. ratio of mitochondrial gene expression UMIs to the total gene expression UMIs>0.4) or too few numbers of genes detected (<200 genes per cell) may be classified as dead cells. Data associated with the low quality cells may be removed from the sequence data 104 (e.g., the dextramer sequence data).
In an embodiment, filtering of low-quality cells from the sequence data 104 at step 310 may comprise determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a number of genes, removing, from the dextramer sequence data, data associated with cells having a number of genes outside of a gene threshold range (the gene threshold range may be, for example, about 200 to about 2,500 genes), determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a fraction of mitochondrial gene expression, and removing, from the dextramer sequence data, data associated with cells having a fraction of mitochondrial gene expression that exceeds a gene expression threshold. The gene expression threshold can be about 40 percent of total unique molecular identifier counts.
Adjusting the sequence data 104 for background noise at step 320 may comprise single cell dCODE-Dextramer-seq based background adjustment. In an aspect, two types of background noise controls that were designed for the dextramer binding assays include negative control dextramers from dextramer stained and sorted CD8+ T cells (NC_dex, denoted as nc), and the dextramer stained CD8+ T cells without sorting on dextramer (Dex_unsorted, denoted as du). To inspect signal and noise distributions, the maximum dextramer signal in UMI (Unique Molecular Identifier) of each cell may be selected to represent the best binding of each cell. Specifically, the non-specific dextramer binding signal of a cell may be represented as Max(nc1, . . . , ncn), the maximum dextramer signal of n negative control dextramers included the dextramer pool. The dextramer binding signal of a cell from a dextramer stained and sorted sample (Dex_sorted, denoted as ds) may be represented as Max(ds1, . . . , dsm), the maximum dextramer signal in UMI of m testing dextramers. Similarly, the dextramer binding signal of a cell from a Dex_unsorted sample may be represented as Max(du1, . . . , dum). P99.9 of the non-specific dextramer binding signals in UMI may be selected as a non-specific dextramer binding cutoff (absolute outliers of negative dextramer controls may be excluded).
To estimate the potential noise introduced by the cell sorting process, the accumulative distributions of dextramer binding signals between Dex_sorted and Dex_unsorted samples may be compared to determine a cutoff for dextramer sorting efficiency. Kolmogorov-Smirnov test (KS test) p-values may be calculated by comparing the accumulative curves of dextramer sorted and dextramer unsorted samples using each data point (dextramer UMI) as a sliding window. The dextramer UMI which defines the largest difference of dextramer binding signals between Dex_sorted and Dex_unsorted (argmax Ds,u) may be used as a threshold for estimating dextramer sorting efficiency. The measure of estimated background noise (d) of dextramer sorted samples may be defined as:
d=Max(P99.9,argmax Ds,u)
The dextramer signals (UMI) for each testing dextramer of sorted cells may be corrected by subtracting the measure of estimated background noise (d):
E
c
=E
s
−d
In an embodiment, adjusting the data for background noise at step 320 may comprise determining, based on the dextramer sequence data, sorted dextramer sequence data and unsorted dextramer sequence data. The sorted dextramer sequence data can comprise sorted test dextramer sequence data (dex_sorted) and negative control dextramer sequence data (nc_dex). The unsorted dextramer sequence data, can comprise unsorted test dextramer sequence data (dex_unsorted). The method 300, at step 320, may determine, for each cell represented in the dextramer sequence data, based on the negative control dextramer sequence data (nc_dex), a maximum negative control dextramer signal (Max(nc1, . . . , ncn)). The method 300, at step 320, may determine, for each cell represented in the dextramer sequence data, based on the sorted test dextramer sequence data (dex_sorted), a maximum sorted dextramer signal (Max(ds1, . . . , dsm)). The method 300, at step 320, may determine, for each cell represented in the dextramer sequence data, based on the unsorted test dextramer sequence data (dex_unsorted), a maximum unsorted dextramer signal (Max(du, . . . , dum)).
The method 300, at step 320, may estimate, based on the maximum negative control dextramer signals, a dextramer binding background noise (P99.9) and estimate, based on the maximum sorted dextramer signals and the maximum unsorted dextramer signals, a dextramer sorting gate efficiency (argmax Ds, u). The dextramer sorting gate efficiency may be determined, for example, by the maximum difference between Max(ds1, . . . , dsm) of the sorted test dextramer sequence data and Max(du, . . . , dum) of the unsorted dextramer sequence data.
The method 300, at step 320, may determine, based on the dextramer binding background noise (P99.9) and the dextramer sorting gate efficiency (argmax Ds, u), a measure of background noise (d) and subtract, for each cell represented in the dextramer sequence data, the measure of background noise (d) from a dextramer signal associated with each cell (Ec=Es−d).
In an embodiment, selecting T cells with paired αβ chains in the sequence data 104 at step 330 may comprise determining, for each cell represented in the dextramer sequence data, based on the single cell TCR sequence data, a presence or an absence of at least one α-chain and at least one β-chain and removing, from the dextramer sequence data, based on the presence or the absence of the at least one α-chain and the at least one β-chain, data associated with cells having only an α-chain, only a β-chain, or multiple α- or β-chains. Step 330 may comprise removing any data from the dextramer sequence data that is not associated with cells with single paired γδ chains. Thus, the same steps for adjusting background noise at step 320 can be performed with regards to the presence or absence of the γ chain and/or δ chain.
Selecting T cells with paired αβ chains in the sequence data 104 at step 330 may comprise removing any data from the dextramer sequence data that is not associated with cells with single paired αβ chains. The single cell receptor sequence data (e.g., single cell TCR-seq data), may be used to determine data associated with T cells that have only α-chain, only β-chain, and multiple α- or β-chains and such data may be removed from the sequence data 104 (e.g., the dextramer sequence data). For T cells with multiple α- or β-chains detected, the α- or β-chains with highest UMI counts may be assigned to each T cell. For example, if one T cell has 4 α-chains and 4 β-chains detected, from the list of all β-chains, the β-chain with the highest UMI may be selected. Similarly for α-chains. The selected α- or β-chains from this process may be assigned to the cell.
The method 300, at step 340, may comprise applying dextramer signal correction to the sequence data 104. At step 340, dextramer signals in the sequence data 104 may be corrected, resulting in corrected dextramer sequence data. Each dextramer has an optimal binding condition, however it is impossible to arrange the experimental conditions such that a multiplexed dextramer binding assay is optimal for every dextramer. This results in multiple dextramers binding to the same T cell/clone. To correct for this effect, dextramer signals may be penalized if simultaneously binding to the same T cell/clone, using the following technique.
Defining the background noise subtracted dextramer signal for the ith T cell binding the jth dextramer as Eij, further denote the fraction of dextramer signal due to binding of the jth dextramer for the ith T cell as:
Denoting the TCR clonotype of the ith T cell as ki, and the number of T cells belonging to clonotype ki that bind dextramer j as Tk
Using these quantities, calculate the corrected dextramer signal for the ith T cell binding the jth dextramer as:
S
ij
=E
ij(RCij)2RTkj.
The method 300, at step 350, may normalize the corrected dextramer sequence data by performing, for each cell represented in the dextramer sequence data, cell-wise normalization on the dextramer signals associated with each cell and/or performing, for each cell represented in the dextramer sequence data, pMHC-wise normalization. Such normalization can result in normalized dextramer sequence data. Step 350 may further comprise binder identification. To make all the dextramer binding signals comparable, the corrected dextramer binding signals may be log-ratio normalized across 44 testing dextramers within a cell. pMHC-wise normalization may subsequently be conducted based on Log-Rank distribution. Normalized dextramer UMI>0 was empirically chosen as the cutoff for pMHC specific binders.
In an embodiment, the corrected dextramer sequence data may be normalized at step 350. For example, a cell-wise normalization may be performed based on Log-Rank distribution for each cell and/or a pMHC-wise normalization may be performed to make the dextramer binding signals comparable to each other. The adjusted dextramer binding signals of sorted cells Ec may normalized across the testing dextramers, then across all cells as the following equation:
Ec′>=0.9 may be empirically determined as a cutoff for pMHC specific binders
The method 300, at step 360, may further identify data remaining in the normalized dextramer sequence data as associated with reliable TCR-pMHC binding events. Such data may be considered a portion of a training data set for use in a machine learning process. The resulting processed sequence data 104 (e.g., the training data set) may be provided to the predictive module 110.
Turning now to
The training data set 410 may comprise one or more receptor sequences, one or more gene identifiers, a binding status, and an identifier of a peptide to which the receptor sequence bound (if any). The binding status may indicate “yes” for a receptor sequence that bound to a peptide or “no” for a receptor sequence that did not bind to a peptide. For receptor sequences that bound to a peptide, the identifier of the peptide can be used to identify an antigen associated with the peptide. Such data may be derived in whole or in part from the sequence data 104 processed by the ICON module 108. In an embodiment, TCR-CDR3 amino acid sequences may be determined from the sequence data 104, including associated V, D, and J gene identifiers, a label indicating binding status (Yes, No), and an identifier of a peptide to which the TCR-CDR3 amino acid sequences bound. The TCR-CDR3 amino acid sequences may be encoded into numbers to represent the 20 possible amino acids. Padding may be applied to sequences as needed. The V and J gene identifiers may be one-hot encoded to provide a categorical and discrete representation of the gene identifiers in numerical space. The encoded TCR-CDR3 amino acid and V and J gene identifiers may be concatenated together to represent one TCR record and associated with the label indicating binding status (Yes, No). The label may further indicate the specific peptide to which the TCR bound. One or more TCR records may be combined to result in the training data set 410.
A subset of the TCR records may be randomly assigned to the training data set 410 or to a testing data set. In some implementations, the assignment of data to a training data set or a testing data set may not be completely random. In this case, one or more criteria may be used during the assignment. In general, any suitable method may be used to assign the data to the training or testing data sets, while ensuring that the distributions of yes and no labels are somewhat similar in the training data set and the testing data set.
The training module 420 may train the ML module 430 by extracting a feature set from a plurality of TCR records (e.g., labeled as yes) in the training data set 410 according to one or more feature selection techniques. The training module 420 may train the ML module 430 by extracting a feature set from the training data set 410 that includes statistically significant features of positive examples (e.g., labeled as being yes) and statistically significant features of negative examples (e.g., labeled as being no).
The training module 420 may extract a feature set from the training data set 410 in a variety of ways. The training module 420 may perform feature extraction multiple times, each time using a different feature-extraction technique. In an example, the feature sets generated using the different techniques may each be used to generate different machine learning-based classification models 440. For example, the feature set with the highest quality metrics may be selected for use in training. The training module 420 may use the feature set(s) to build one or more machine learning-based classification models 440A-440N that are configured to indicate whether a new receptor sequence (e.g., with an unknown binding status) is likely or not likely to bind to a peptide or pMHC.
The training data set 410 may be analyzed to determine any dependencies, associations, and/or correlations between features and the yes/no labels in the training data set 410. The identified correlations may have the form of a list of features that are associated with different yes/no labels. The term “feature,” as used herein, may refer to any characteristic of an item of data that may be used to determine whether the item of data falls within one or more specific categories. By way of example, the features described herein may comprise one or more sequence patterns, amino acid sequences of one or both alpha and beta chains, names of v and j gene segments of one or both alpha and beta chains.
A feature selection technique may comprise one or more feature selection rules. The one or more feature selection rules may comprise an feature occurrence rule. The feature occurrence rule may comprise determining which features in the training data set 410 occur over a threshold number of times and identifying those features that satisfy the threshold as candidate features.
A single feature selection rule may be applied to select features or multiple feature selection rules may be applied to select features. The feature selection rules may be applied in a cascading fashion, with the feature selection rules being applied in a specific order and applied to the results of the previous rule. For example, the feature occurrence rule may be applied to the training data set 410 to generate a first list of features. A final list of candidate features may be analyzed according to additional feature selection techniques to determine one or more candidate feature groups (e.g., groups of features that may be used to predict binding). Any suitable computational technique may be used to identify the candidate feature groups using any feature selection technique such as filter, wrapper, and/or embedded methods. One or more candidate feature groups may be selected according to a filter method. Filter methods include, for example, Pearson's correlation, linear discriminant analysis, analysis of variance (ANOVA), chi-square, combinations thereof, and the like. The selection of features according to filter methods are independent of any machine learning algorithms. Instead, features may be selected on the basis of scores in various statistical tests for their correlation with the outcome variable (e.g., yes/no).
As another example, one or more candidate feature groups may be selected according to a wrapper method. A wrapper method may be configured to use a subset of features and train a machine learning model using the subset of features. Based on the inferences that drawn from a previous model, features may be added and/or deleted from the subset. Wrapper methods include, for example, forward feature selection, backward feature elimination, recursive feature elimination, combinations thereof, and the like. As an example, forward feature selection may be used to identify one or more candidate feature groups. Forward feature selection is an iterative method that begins with no feature in the machine learning model. In each iteration, the feature which best improves the model is added until an addition of a new variable does not improve the performance of the machine learning model. As an example, backward elimination may be used to identify one or more candidate feature groups. Backward elimination is an iterative method that begins with all features in the machine learning model. In each iteration, the least significant feature is removed until no improvement is observed on removal of features. Recursive feature elimination may be used to identify one or more candidate feature groups. Recursive feature elimination is a greedy optimization algorithm which aims to find the best performing feature subset. Recursive feature elimination repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. Recursive feature elimination constructs the next model with the features remaining until all the features are exhausted. Recursive feature elimination then ranks the features based on the order of their elimination.
As a further example, one or more candidate feature groups may be selected according to an embedded method. Embedded methods combine the qualities of filter and wrapper methods. Embedded methods include, for example, Least Absolute Shrinkage and Selection Operator (LASSO) and ridge regression which implement penalization functions to reduce overfitting. For example, LASSO regression performs L1 regularization which adds a penalty equivalent to absolute value of the magnitude of coefficients and ridge regression performs L2 regularization which adds a penalty equivalent to square of the magnitude of coefficients.
After the training module 420 has generated a feature set(s), the training module 420 may generate a machine learning-based classification model 440 based on the feature set(s). A machine learning-based classification model may refer to a complex mathematical model for data classification that is generated using machine-learning techniques. In one example, the machine learning-based classification model 440 may include a map of support vectors that represent boundary features. By way of example, boundary features may be selected from, and/or represent the highest-ranked features in, a feature set.
The training module 420 may use the feature sets extracted from the training data set 410 to build a machine learning-based classification model 440A-440N for each classification category (e.g., yes, no). In some examples, the machine learning-based classification models 440A-440N may be combined into a single machine learning-based classification model 440. Similarly, the ML module 430 may represent a single classifier containing a single or a plurality of machine learning-based classification models 440 and/or multiple classifiers containing a single or a plurality of machine learning-based classification models 440.
The extracted features (e.g., one or more candidate features) may be combined in a classification model trained using a machine learning approach such as discriminant analysis; decision tree; a nearest neighbor (NN) algorithm (e.g., k-NN models, replicator NN models, etc.); statistical algorithm (e.g., Bayesian networks, etc.); clustering algorithm (e.g., k-means, mean-shift, etc.); neural networks (e.g., reservoir networks, artificial neural networks, etc.); support vector machines (SVMs); logistic regression algorithms; linear regression algorithms; Markov models or chains; principal component analysis (PCA) (e.g., for linear models); multi-layer perceptron (MLP) ANNs (e.g., for non-linear models); replicating reservoir networks (e.g., for non-linear models, typically for time series); random forest classification; a combination thereof and/or the like. The resulting ML module 430 may comprise a decision rule or a mapping for each candidate feature to assign an binding status to a new receptor sequence.
In an embodiment, the training module 420 may train the machine learning-based classification models 440 as a convolutional neural network (CNN). The CNN may comprise at least one convolutional feature layer and three fully connected layers leading to a final classification layer (softmax). The final classification layer may finally be applied to combine the outputs of the fully connected layers using softmax functions as is known in the art.
The candidate feature(s) and the ML module 430 may be used to predict the binding statuses (and associated peptides) of a plurality of TCR records in the testing data set. In one example, the result for each TCR record includes a confidence level that corresponds to a likelihood or a probability that the receptor sequence will bind to a peptide. The confidence level may be a value between zero and one, and it may represent a likelihood that the receptor sequence belongs to a yes/no binding status with regard to one or more peptides. In one example, when there are two statuses (e.g., yes and no), the confidence level may correspond to a value p, which refers to a likelihood that a particular receptor sequence belongs to the first status (e.g., yes). In this case, the a value 1-p may refer to a likelihood that the particular receptor sequence belongs to the second status (e.g., no). In general, multiple confidence levels may be provided for each test receptor sequence and for each candidate feature when there are more than two statuses. A top performing candidate feature may be determined by comparing the result obtained for each test receptor sequence with the known yes/no binding status for each test receptor sequence. In general, the top performing candidate feature will have results that closely match the known yes/no binding statuses.
The top performing candidate feature(s) may be used to predict the yes/no binding status of a receptor sequence with regard to one or more peptides. For example, a new TCR sequence may be determined/received. The new TCR sequence may be provided to the ML module 430 which may, based on the top performing candidate feature, classify the new TCR sequence as either binding (yes) or not binding (no) and an indication of the binding peptide(s).
The training method 500 may determine (e.g., access, receive, retrieve, etc.) first sequence data that has been processed by the ICON module 108 at step 510. The sequence data may comprise a labeled set of receptor sequences. The labels may correspond to binding status (e.g., yes or no) and identification of peptide(s) to which the receptor sequence bound.
The training method 500 may generate, at step 520, a training data set and a testing data set. The training data set and the testing data set may be generated by randomly assigning labeled receptor sequences to either the training data set or the testing data set. In some implementations, the assignment of labeled receptor sequences as training or testing samples may not be completely random. As an example, a majority of the labeled receptor sequences may be used to generate the training data set. For example, 75% of the labeled receptor sequences may be used to generate the training data set and 25% may be used to generate the testing data set.
The training method 500 may determine (e.g., extract, select, etc.), at step 530, one or more features that can be used by, for example, a classifier to differentiate among different classification of binding status (e.g., yes vs. no) with regard to one or more peptides. As an example, the training method 500 may determine a set features from the labeled receptor sequences. In a further example, a set of features may be determined from labeled receptor sequences different than the labeled receptor sequences in either the training data set or the testing data set. In other words, labeled receptor sequences may be used for feature determination, rather than for training a machine learning model. Such labeled receptor sequences may be used to determine an initial set of features, which may be further reduced using the training data set.
The training method 500 may train one or more machine learning models using the one or more features at step 540. In one example, the machine learning models may be trained using supervised learning. In another example, other machine learning techniques may be employed, including unsupervised learning and semi-supervised. The machine learning models trained at 540 may be selected based on different criteria depending on the problem to be solved and/or data available in the training data set. For example, machine learning classifiers can suffer from different degrees of bias. Accordingly, more than one machine learning model can be trained at 540, optimized, improved, and cross-validated at step 550.
The training method 500 may select one or more machine learning models to build a predictive model at 560. The predictive model may be evaluated using the testing data set. The predictive model may analyze the testing data set and generate predicted binding statuses at step 570. Predicted binding statuses may be evaluated at step 580 to determine whether such values have achieved a desired accuracy level. Performance of the predictive model may be evaluated in a number of ways based on a number of true positives, false positives, true negatives, and/or false negatives classifications of the plurality of data points indicated by the predictive model.
For example, the false positives of the predictive model may refer to a number of times the predictive model incorrectly classified a receptor sequence as binding that was in reality not binding. Conversely, the false negatives of the predictive model may refer to a number of times the machine learning model classified an receptor sequence as not binding when, in fact, the receptor sequence was binding. True negatives and true positives may refer to a number of times the predictive model correctly classified one or more receptor sequences as binding or non-binding. Related to these measurements are the concepts of recall and precision. Generally, recall refers to a ratio of true positives to a sum of true positives and false negatives, which quantifies a sensitivity of the predictive model. Similarly, precision refers to a ratio of true positives a sum of true and false positives. When such a desired accuracy level is reached, the training phase ends and the predictive model (e.g., the ML module 430) may be output at step 590; when the desired accuracy level is not reached, however, then a subsequent iteration of the training method 500 may be performed starting at step 510 with variations such as, for example, considering a larger collection of sequence data.
In an embodiment, provided is a flexible framework for the study of TCR-pMHC specificity, referred to herein as TCRAI. In an embodiment, TCRAI may utilize Tensorflow 2. TCRAI is highly modularized and allows for adjustment to model architecture. Any number of V(D)J genes and CDR regions of the TCR may be defined as inputs to the model in textual form. A selection may be made with regard to how to process these inputs into numerical form in a non-learnable way, via “processor” objects that convert text to numerical representations. These numerical inputs can then be further processed in learnable ways via “extractor” objects that form blocks of the neural network and give as their output vector representations of the input data, referred to herein as TCRAI fingerprints. TCRAI fingerprints may be concatenated into a single TCRAI fingerprint describing the input TCR via a single numerical vector. The TCRAI fingerprint may then be passed through a “closer” object which forms the final block of the neural network architecture, producing a prediction on the input TCR. TCRAI provides several such pre-built processors, extractors, and closers. TCRAI may be configured to perform binomial, multinomial, regression, and/or other tasks by choosing to construct a different closer object. In an embodiment, TCRAI may be used to build a model to make predictions of whether a given TCR can bind a specific pMHC complex.
In an embodiment, TCRAI may make use of 1D convolutions and batch normalization for CDR3 sequences and lower dimensional representations for genes, resulting in model regularization and forcing the model to learn stronger gene associations.
In an embodiment, the input information of the TCR may be processed into a numerical format. For each CDR3 sequence, amino acids may be converted to integers, and the integer vectors may be encoded into a one-hot representation. For V and J genes, a dictionary of gene type to integer may be built for each V and J gene and used to convert each gene to an integer.
The neural network architecture applied to the processed input information may include embedding layers and convolutional networks. Specifically, processed CDR3 residues may be embedded into a 16-dimensional space via a learned embedding, and the resulting numeric CDR3s may be fed through one or more (e.g., 3) 1D convolutional layers. In an embodiment, filters of dimensions [64,128,256], kernel widths [5,4,4], and strides [1,3,3] may be used. Each convolution may be activated by an exponential linear unit activation and followed by dropout and batch normalization. Following these three convolutional blocks, global max pooling may be applied to the final features, this process encodes each CDR3 by a vector of length 256, a “CDR3 fingerprint.” The processed gene input for each gene may be one-hot encoded and embedded into a reduced dimensional space (e.g., 16 for V genes, and 8 for J genes) via a learned embedding, giving a “gene fingerprint” of each gene as a vector. The fingerprints of all selected CDR3s and genes may then be concatenated together into a single vector, a “TCRAI fingerprint.” The TCRAI fingerprint may be passed through one final full-connected layer to give binomial predictions (single output value, sigmoid activation), regression predictions (single output, no activation), or multinomial predictions (multiple output values, softmax activation).
In an embodiment, TCR sequencing files may be collected as a raw csv formatted multi-omics high-throughput binding data. Sequencing files may be parsed to take the amino acid sequence of the CDR3 after removing unproductive sequences. Clones with different nucleotide sequences, but the same matched amino acid sequence from CDR3s, and the V, D, J genes may be aggregated together under one TCR. Thus, each TCR record may include single paired α and β TCR chains, with CDR3 amino acid sequence and V, J genes for each chain.
The data may be split into a training set (e.g., 76.5%), a validation set (e.g., 13.5%), and a left-out test set (e.g., 10%) for each model, and subsequently a 5-fold Monte-Carlo cross-validation (MCCV) may be performed on the training set. The model may be trained by minimizing the cross-entropy loss via the Adam optimizer, and the cross-entropy loss may be weighted by weights 1/(number of classes*fraction of samples in that class) for each class. Early stopping may be engaged, via a left-out validation dataset, to prevent overfitting, in which the model ceases training if the validation loss increases for more than 5 epochs and the weights of the model with minimal validation loss are restored. In the event of training a large number of models, only the learning rate and batch size need be tuned during cross-validation. After cross-validation the optimally performing hyperparameters may be selected and the model may be re-trained on the full training set, using the validation set to control early-stopping. The re-trained model may then be evaluated on the left-out test set.
TCRAI models may produce both a prediction for a TCR to bind a specific pMHC (or one of many pMHCs, in the multinomial case), and a numerical vector (TCRAI fingerprint) (e.g., by encoding paired αβ chain CDR3 amino acid sequences and the V and J genes of each TCR into a one-dimensional input vector) that describes that TCR within the context of the question of whether it can bind that pMHC.
In an embodiment, the distribution of fingerprints may be analyzed to identify groups of TCRs with different binding modalities. The fingerprints can be reduced to a two-dimensional space, for example, using UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. When using a model trained on one dataset and inferring fingerprints on another unseen dataset, the UMAP projector can be fit with TCRs from the training dataset and transform the TCRs from the unseen set using that projector.
When clustering TCR fingerprints, the fingerprints of all TCRs of the dataset can be projected into two-dimensional space as described above, and then those TCRs that are strong true positives (STPs, binomial prediction >0.95) can be selected. These STPs can then be clustered, for example using a k-means classifier, in the two-dimensional space. Other clustering algorithms may be used. TCRs from within in each cluster can then collected and used to construct CDR3 motif logos (using weblogo), gene-usage, and/or UMI distributions by pairing the unique TCR clonotypes within the cluster with all repeated clonotypes in the high throughput data.
In an aspect, the trained predictive model (e.g., machine learning classifier) may be used to predict a binding status of a TCR sequence with regard to one or more peptides. A TCR sequence may be presented to the machine learning classifier. The machine learning classifier may predict a likelihood that the TCR sequence will bind to one or more specific peptides. Similarly, a plurality of TCR sequences may be presented to the machine learning classifier. The machine learning classifier may predict, for each TCR sequence in the plurality of TCR sequences, a likelihood that each TCR sequence will bind to one or more specific peptides. In an aspect, the machine learning classifier can generate a TCR-peptide map as shown in the example output below.
A TCR-peptide map thus generated may be used to rapidly identify peptides that a subject's TCR sequences are likely to bind to. A biological sample (e.g., blood) may be obtained from a subject, cells isolated, and sequenced. The subject's TCR sequences may be identified and compared to the TCR-peptide map to identify peptides most likely to bind to the subject's TCR sequences.
In some aspects, identifying and evaluating antigen-specific T cells can be used to better understand the activities of drugs in mono- and combination therapy settings, identify features of potent anti-tumor T cells, screen for immunogenic epitopes in a haplotype-relevant manner, develop new vaccine and TCR therapies, and develop peptide binding algorithms based on TCR sequence features.
In some aspects, disclosed are methods of identifying a subject using binding patterns of the subject's TCRs. For example, blood can be drawn (first blood draw), cells from the blood can be processed via a single cell-based immune profiling platform, and the resulting data can be processed according to the ICON methods described herein. In some aspects, the cells are exposed to a variety of dextramers comprising pMHCs from a wide range of immunogens. After performing an ICON method as described herein, a reliable TCR binding pattern can be determined. In some aspects, a TCR binding pattern represents the specificity of TCRs to the immunogens on the dextramers. Blood can then be drawn at a different time point (days, weeks, months, years later) from the first blood draw (second blood draw). In some aspects, it would be expected that the second blood draw would likely comprise T cells having TCRs with different sequences than what was present in the first blood draw since there are about 1015 possible TCR sequences, however, the TCR binding pattern is unlikely to change. The cells from the second blood draw can be exposed to the same dextramers as used for the first blood draw and the resulting data analyzed according to an ICON method. Regardless of the different TCR sequences, the binding data of the first blood draw and second blood draw can be compared and used to determine if they are both from the same subject.
In some aspects, disclosed are methods of identifying a subject using machine learning to predict the binding patterns of the subject's TCRs. Reliable TCR binding data can be identified according to an ICON method as described herein. In some aspects, the reliable TCR binding data can be used to train a machine learning classifier as described herein. The trained machine learning classifier can be used to predict specificity TCR binding pattern of a subject. In some aspects, blood can be drawn (first blood draw) and a TCR binding pattern can be predicted using the trained machine learning classifier. Blood can then be drawn at a different time point (days, weeks, months, years later) from the first blood draw (second blood draw). In some aspects, it would be expected that the second blood draw would likely comprise T cells having TCRs with different sequences than what was present in the first blood draw since there are about 1015 possible TCR sequences, however, the TCR binding pattern is unlikely to change. Regardless of the different TCR sequences, the trained machine learning classifier may be used to predict a second TCR binding pattern using data derived from the second blood draw. It is possible to predict that the second blood draw is from the same subject as the first blood draw based on the TCR signatures.
In some aspects, a TCR or BCR binding pattern can be established using the described methods. In some aspects, having reliable TCR data identified using the methods described herein allows someone, such as a medical professional, to infer the antigenic history or vaccine history of a subject. In some aspects, reliable TCR data identified using the ICON methods described herein allows someone, such as a medical professional, to infer what pathogens a subject has been exposed to or even what countries the subject has visited. For example, the presence of TCR binding data to pathogens only present in Africa can indicate that the subject has been to Africa and been exposed to those pathogens.
In some aspects, reliable TCR data identified using the ICON methods described herein can assess a current immunologic state of a subject. For example, blood can be drawn (first blood draw), cells from the blood can be processed via a single cell-based immune profiling platform, and the resulting data can be processed according to the ICON methods described herein, resulting in TCR binding data. In some aspects, the dextramers used in establishing the TCR binding data comprise tumor specific pMHCs. Thus, once the TCR binding data has been normalized using an ICON method, and reliable TCR binding data is established, the presence of predicted tumor specific TCRs can be determined. For example, the reliable TCR data can be used in the disclosed machine learning (CNN) methods and therefore the blood from the subject can be analyzed for the presence of predicted tumor specific TCRs. Thus, the presence of tumor specific TCRs can result in early detection of cancer before any tumors or cancer symptoms are detected.
In some aspects, disclosed are methods for selecting T cells for T cell-based therapies. In some aspects, training data can be accumulated using the disclosed methods of machine learning classifying. In some aspects, the classifier can assign probabilities of a pMHC binding to each TCR sequence tested. In some aspects, the TCR sequence tested is associated with a T cell, wherein the T cell can be from a primary or secondary cell culture. This avoids needed to perform binding assays on all T cells being tested to determine if each T cell has a TCR specific to the different pMHCs. Instead, the classifier is relied on for determining the probability of TCR-pMHC binding. Those TCRs, and thus T cell comprising that TCR), classified as being highly selective to a specific pMHC can then be used for T cell therapies. In some aspects, T cells identified through the machine learning classifier can provide safer cell therapies than those T cells identified through binding assays because only the most reliable binding data was used to create the training data used to classify the TCRs associated with the T cells selected.
In some aspects, disclosed are methods for immune monitoring. In some aspects, blood can be drawn from a subject undergoing immunotherapy (e.g. vaccine treatment; immune checkpoint treatment), the cells, particularly the T cells, can be classified, based on the training data established in the disclosed machine learning approaches, as having a specificity to the epitope of interest or not. In some aspects, if a T cell is determined to have specificity to an epitope of interest then one can infer that the subject will be or is responsive to the immunotherapy. For example, if the immunotherapy is a vaccine that triggers an immune response to a cancer specific antigen, then T cells obtained from the subject would be classified based on their probability of binding to the cancer specific antigen. If T cells are selected as having a high probability of binding to the cancer specific antigen based on the training data obtained using the single cell immune profiling technology and ICON, then the subject would be considered to be a responder to the immunotherapy (e.g. vaccine).
In some aspects, disclosed are methods of TCR epitope mapping using the disclosed methods. In some aspects, TCR epitope mapping is a term that refers to the process of identifying the specific (in some cases the shortest) amino acid sequence of the epitope of a specific antigen that is recognized by T-cell (CD4+ and/or CD8+) receptors, and at the same time has the potential to stimulate a long lasting and a cytotoxic immune response. While performing the disclosed single cell immune profiling platform technology, dextramers can be used wherein all the different epitopes from one or more antigens of interest can be presented on dextramers. In other words, a single dextramer can comprise a pMHC wherein the peptide of the pMHC is a single epitope from one or more antigens of interest and enough dextramers are used so that every epitope of the one or more antigens of interest are present in the pMHC on the dextramers. T cells can be exposed to the dextramers in the disclosed single cell immune profiling platform with the dextramers comprising a single epitope from one or more antigens of interest and wherein enough dextramers are used so that every epitope of the one or more antigens of interest are present in the pMHC on the dextramers. The single cell sequence data, dextramer sequence data, and single cell TCR sequence data obtained from the single cell immune profiling can provide data about the T cells that bound to the different dextramers (e.g. epitopes). The single cell immune profiling data is then processed using ICON as described herein, therefore resulting in binding data for those cells that had the most reliable binding to one or more epitopes of the one or more antigens of interest. In some aspects, machine learning classification of TCRs that bind to the one or more epitopes of the one or more antigens of interest can be used to predict which T cells from a subject might be reactive against a particular antigen (e.g. tumor antigen).
The materials described above as well as other materials can be packaged together in any suitable combination as a kit useful for performing, or aiding in the performance of, the disclosed method. It is useful if the kit components in a given kit are designed and adapted for use together in the disclosed method. For example disclosed are kits for generating single cell sequencing data, the kit comprising reagents for single cell immune profiling. In some aspects, the kits can comprise one or more of the disclosed dextramers comprising pMHCs. In some aspects, the kits can comprise Next GEM sequencing materials. In some aspects, the kits can comprise multi-omics high-throughput binding data comprising one or more of single cell sequence data, dextramer sequence data, and/or single cell receptor sequence data.
The following examples illustrate the present methods and systems as they relate to colorectal cancer detection. The following Examples are not intended to be limiting thereof.
1. Results
i. Multi-Omics High-Throughput TCR-pMHC Binding Data.
10× Genomics recently generated an expansive, publicly available TCR-pMHC binding dataset. In their initial report, the binding profile of over 150,000 CD8+ T cells from four HLA haplotyped healthy donors (
Described herein is a highly multiplexed dextramer binding dataset generated at the single cell level. 10× Genomics used a simple approach to determine pMHC binding TCRs by applying global cutoffs for background noise and non-specific dextramer binding to all donors. However, an unexpectedly high number of promiscuous cross-HLA and cross-peptide associations were found from the TCR-pMHC binding events identified by this approach, particularly in donors 3 and 4 (
To robustly identify reliable binding events from such high-throughput TCR-pMHC binding data, ICON was developed, an Integrative COntext-specific Normalization method (
ii. The TCR-pMHC Binding Events Identified from 10× Genomics High-Throughput Data.
Applying ICON, a total of 20,843 CD8+ T cells were identified from 1,514 unique T cell clones that bind to 29 pMHCs from three donors (
Donors 1 and 2, who possess the most common HLA haplotype (A*02:01) in the dextramer pool (
Interestingly, 37% of TCRs with shared β-chains are paired with different α-chains. This rate is slightly lower (30.9%) for shared TCR α-chains. The majority of TCRs (˜92%) with shared α- or β-chains bind to the sample pMHC, but ˜8% of them recognize different pMHCs (
The dual specificity of TCR (specificity versus degeneracy) has been suggested as an important feature of the immune response mechanism that sufficiently distinguishes self from foreign peptides to avoid autoimmune reactivity while maintaining broad antigenic coverage. Indeed, highly specific, yet promiscuous, TCR-pMHC interactions were observed. 98.7% of unique TCRs bind to one specific pMHC and the remaining TCRs interact with 2 or 3 pMHCs (
iii. Convolutional Neural Network (CNN) Based Classification of T Cell Antigen Specificity.
With this large, diverse TCR-pMHC binding dataset, more robust functional classifiers for computationally validating or prioritizing these binding events are desired. Recent work demonstrated that Convolutional Neural Networks (CNNs) can learn high dimensional information from TCR sequences and thus, may robustly predict TCR-pMHC binding. A CNN-based framework was adapted for validating and/or predicting TCR-pMHC binding. In brief, the paired αβ chain CDR3 amino acid sequences were encoded as well as the V and J genes of each TCR into a one-dimensional input vector. Specifically, trainable embeddings were used to encode the CDR3 amino acid sequences and the V and J gene segments were transformed into vectors. The CNN structure may comprise one convolutional feature layer and three fully connected layers leading to a final classification layer (
To evaluate the performance of this CNN-based model, eleven pMHC-specific binding T cell repertoires were collated generated by traditional single multimer binding and antigen re-exposure assays as a gold-standard dataset (
iv. Classification of pMHC Binding Repertoires Identified from the 10× Genomics High-Throughput Data.
Next the CNN-based classifier was applied to the top seven pMHC binding repertoires identified from the 10× Genomics binding data (
Historically, TCR β-chain sequencing was often used to infer T cell antigen binding specificity due to its higher combinatorial potential compared to the α-chain. To quantitatively evaluate the contribution of TCR α- and β-chains in predicting TCR-pMHC interaction, either the α-chain or β-chain was used in lieu of paired αβ chains as input to the CNN-based classifier. The performance with paired αβ chains is better than α- or β-chain alone with an average increase of 16% in the AUC (
To further understand conserved TCR sequence features underlying the classification, the motif conservation of CDR3 amino acid sequences were explored from the ten most predictive TCR sequences for each of these seven pMHC repertoires (
v. Immune Phenotypes of pMHC Binding CD8+ T Cells.
The combined information of antigen specificity and T cell phenotype has been reported to be important to clinical success of immunotherapies, such as vaccination. The multi-omics data generated by the 10× Genomics immune profiling platform enables the association of T cell antigen specificity with various T cell phenotypes. Using gene (single cell RNA-seq) and surface protein (CITE-seq) expression levels from this multi-omics dataset, pMHC binding CD8+ T cells were separated into subpopulations (Methods and
98.6% of pMHC binding T cells were memory cells enriched in expanded T cell clones (
Although the majority of pMHC binding T cells expressed a memory phenotype, 1.3% of them were naïve cells. These naïve cells had more diverse pMHC interactions than non-naïve cells and were often bound to endogenous antigens, tumor-associated antigens (e.g. MART-1), or to antigens derived from viruses for which the donor was purportedly seronegative (e.g. HIV) (
A method (ICON) that can identify reliable TCR-pMHC interactions was developed by significantly increasing signal-to-background ratios in the highly multiplexed 10× Genomics TCR-pMHC binding data. Having appropriate controls (negative control dextramers and dextramer-unsorted T cell sample) is essential to accurately estimate the background noise, a factor that was found to be indispensable to reliably identify TCR-pMHC binding events. While ICON was developed on one dataset consisting of a single pool of multiplexed dextramers, this method can be generalized to query pMHC-TCR binding data from a broader range of pMHC dextramer pools as more multiplexed datasets are generated.
In this study, the robustness of this CNN-based classifier in predicting TCR-pMHC specific binding was demonstrated, indicating that this computational prediction can potentially be used to study virtually (versus experimentally) T cell antigen specific recognition. Immune monitoring of T cell antigen specific recognition has been applied to determine the immune responses against specific antigens (e.g. tumor-specific antigens and peptide vaccines) and their possible correlation with clinical outcome in patients receiving immunotherapies. However, experimentally mapping TCR sequences to antigen specificity is costly and labor intensive. With adequate training data for a particular pMHC, the classifier presented here can assign probabilities of the pMHC binding to each TCR sequence of interest without conducting binding assays. In this study, the multinomial prediction mode of this classifier (
The results indicate that a large portion (>30%) of TCRs that bind to a specific pMHC share a single chain and differ in the second chain, emphasizing that T cell clonality must be determined by data with paired αβ chains. Additionally, 8% of these TCRs that share a single chain can bind to different pMHCs. This is in line with the predictive power of TCR antigen specificity using paired TCR chains is 16% greater than using either chain alone. Thus, single cell paired αβ chain sequencing is likely to be more powerful to accurately interrogate T cell repertoire clonality and TCR-pMHC binding specificity.
The ability to assess biologically-relevant T cell reactivities is important to interrogate and monitor immune responses to pathogens and other disease states. It was observed that the majority of the T cell reactivities recovered (98.6%) were matched with the appropriate HLA type/supertype, and further, that the phenotypes of multimer positive cells were largely restricted to memory T cell compartments, indicating that relevant memory reactivities from prior functional T cell responses are resolvable with this technology. Paired αβ TCR sequencing revealed multiple TCR sequences that were specific for individual multimers, reinforcing the broad antigenic immune responses to common viral challenges.
While a low degree of HLA mismatched reactivities were recovered, these were significantly enriched in non-expanded naïve T cells relative to memory subsets, potentially revealing antigen-specific interactions to previously unexposed targets or those that did not culminate in functional T cell responses. Additionally, it is expected that a range of TCR avidities were recovered in these experiments, which might contribute to the detection of unexpected binding patterns. Dextramers are highly multimerized and likely to detect a broader range of TCR binding avidity than traditional tetramer reagents. Furthermore, a range of fluorescent dextramer intensities were sorted in the multimer-positive gating, so even low-frequency, lower-avidity TCR interactions were captured in this highly-sensitive single cell assay.
3. Methods
i. The 10× Genomics Single Cell Immune Profiling Datasets
10× Genomics data used for this study were downloaded from: support.10×genomics.com/single-cell-vdj/datasets
ii. Single-Cell RNA-Seq Data QC
CD8+ cells from each donor were selected for the downstream analysis by the following criteria: number of RNA features <=2500 and >200 genes detected per cell, and mitochondria percentage is less than 40 percent of the total UMI (unique molecular identifier) counts.
iii. Classification pMHC Binding T Cell
Seurat V3 single-cell sequencing analysis R package33, 34 was used for the classification analysis based on single cell RNA-seq data. Since the significant enrichment of TCR VJ gene usages was observed in identified pMHC binding T cells, the TCR genes were taken out from the classification. So, cell clusters will not be dominated by their shared VJ gene usage. Then, all other gene expression of identified binding T cells was normalized and scaled using Seurat V3 default parameters. PCA was run on normalized and transformed UMI counts on variably expressed genes. Top 10 PCs were used for the cell classification. UMAP was used for classification visualization (
iv. Generating CDR3 Motifs from the Most Predictive pMHC Binding TCR Pairs
The CDR3 amino acid sequences of α and β chains from the ten most predictive TCRs were aligned using COBALT (www.ncbi.nlm.nih.gov/tools/cobalt/cobalt.cgi). Aligned CDR3 amino acid sequences were input into WebLogo35 with default parameters to generate motifs.
v. Curation of Reported pMHC Specific Binding Paired TCRs
Raw files were downloaded from VDJdb28 (vdjdb.cdr3.net/) and The Pathology-associated TCR database36 (friedmanlab.weizmann.ac.il/McPAS-TCR/). The data was processed to get pMHC TCR binding following the following criteria: for VDJdb, paired α- or β-chain CDR3 amino acid sequences were required for each “complex.id”; TCRs annotated with “source” were removed from 10× genomics; data was filtered for “Species”=“Human”. For McPAS-TCR, known “Epitope.ID” were required in the full data and having “CDR3.alpha.aa” and “CDR3.beta.aa”; Similarly for VDJdb, Human TCRs were filtered for.
vi. Normalization of TCR-pMHC Binding Data
An Integrative COntext-specific Normalization (ICON) method was developed. It takes the multi-omics single cell sequencing data generated from the 10× Genomics Immune Map platform as input data and performs TCR-pMHC binding specificity data normalization to identify reliable binding events. The multi-omics dataset includes single cell RNA-seq, paired αβ chain single cell TCR-seq, dCODE-Dextramer-seq and cell surface protein expression sequencing—also named CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing). ICON includes the following major steps (
Single cell RNA-seq based filtering of low-quality cells. It filters out low quality cells such as doublets and dead cells. The cells with an unexpectedly high number of genes for T cells detected (e.g. >2500 genes per cell) were categorized as doublets and cells with a high fraction of mitochondrial gene expression (e.g. ratio of mitochondrial gene expression UMIs to the total gene expression UMIs>0.4) or too few numbers of genes detected (<200 genes per cell) were classified as dead cells. (
Single cell dCODE-Dextramer-seq based background adjustment. There are two types of background noise controls that were designed for the dextramer binding assays and were used in the analysis: one is negative control dextramers (n=6) from dextramer stained and sorted CD8+ T cells (NC_dex, denoted as nc), and the other is dextramer stained CD8+ T cells without sorting on dextramer (Dex_unsorted, denoted as du). To inspect signal and noise distributions, the maximum dextramer signal in UMI (Unique Molecular Identifier) of each cell was chosen to represent the best binding of each cell. Specifically, the non-specific dextramer binding signal of a cell is represented as Max (nc1, . . . , nc6), the maximum dextramer signal of the 6 negative control dextramers included the dextramer pool. The dextramer binding signal of a cell from a dextramer stained and sorted sample (Dex_sorted, denoted as ds) is represented as Max(ds1, . . . , ds44), the maximum dextramer signal in UMI of the 44 testing dextramers. Similarly, the dextramer binding signal of a cell from a Dex_unsorted sample is represented as Max(du, . . . , du44). The distributions of these three types of dextramer signals before ICON process are shown in
To estimate the potential noise introduced by the cell sorting process, the accumulative distributions of dextramer binding signals were compared between Dex_sorted and Dex_unsorted samples to determine the cutoff for dextramer sorting efficiency (
d=Max(P99.9,argmax Ds,u)
The dextramer signals (UMI) for each 44 testing dextramer of sorted cells was corrected by subtracting the estimated background (
E
c
=E
s
−d
Then, a cell-wise normalization was conducted based on Log-Rank distribution for each cell. A pMHC-wise normalization was performed to make the dextramer binding signal comparable to each other. The adjusted dextramer binding signals of sorted cells E_c were normalized across 44 testing dextramers, then across all cells as the following equation. E_c{circumflex over ( )}′>=0.9 was chosen empirically as the cutoff for pMHC specific binders (
Selecting T cells with single paired αβ chains based on single cell TCR-seq. T cells were removed that have only α-chain, only β-chain, and multiple α- or β-chains. Only the T cells with the single paired αβ chains were used in this study.
The ICON normalization process was performed for each donor separately.
vii. Antigen-Specific T Cell Expansion and Antigen Re-Exposure to Identify MART-1 Binding T Cells
Peripheral blood mononuclear cells (PBMC) from HLA A*02:01 individuals were isolated by Ficoll-Paque Plus gradient isolation. PBMC were seeded to culture plates in T cell media (CellGenix dendritic cell media, cat #20801-0500+5% human serum AB (Sigma, cat #H3667))+1% penicillin/streptomycin/L-glutamine (ThermoFisher, cat #10378-016), the T cell supporting cytokines IL-7 and IL-15 at 5 ng/ml (CellGenix, cat #1410-050 and 1413-050, respectively), and IL-2 at 10 U/ml (Peprotech, cat #200-0), and the A*02:01-restricted MART-1 epitope ELAGIGILTV at 10 ug/ml (Genscript). Cultures were fed with fresh media and cytokines every two days for one week. On day seven of culture, cells were stained with the fluorescently-tagged dextramer HLA-A*02:01 MART-1 ELAGIGILT (Immudex, cat #WB2162-PE) to assess antigen specific CD8+ T cell expansion by flow cytometry. For antigen re-exposure assays, the peptide was added to T cell expansion cultures after 7 days of expansion. Twenty-four hours following re-stimulation, cells were collected and stained with fluorescently-labeled antibodies for CD3 (BD Biosciences, cat #612750), CD8 (BD Biosciences, cat #612889), CD69 (BD Biosciences, cat #564364), CCR7 (Biolegend, cat #353218), CD45RO (Biolegend, cat #304238), CD137 (Biolegend, cat #309828), and CD25 (Biolegend, cat #356104). Utilizing an Astrios cell sorter (Beckman Coulter), fluorescence activated cell sorting (FACS) gating on forward scatter plot, side scatter plot, and fluorescent channels was set to select live cells while excluding debris and doublets. a 100 μm nozzle was used to sort single CD3+CD8+CD45RO+CD137+ cells for further processing.
Sorted cells were then loaded onto a Chromium Single Cell 5′ Chip (10× Genomics, cat #) and processed them through the Chromium Controller to generate GEMs (Gel Beads in Emulsion). RNA-Seq libraries were prepared with the Chromium Single Cell 5′ Library & Gel Bead Kit (10× Genomics, cat #) following the manufacturer's protocol.
viii. Regeneron Oligo-Tagged Dextramer Staining and Sorting for 10× Genomics Donor 3 and Donor 4
10× Genomics kindly provided cryopreserved donor 3 and donor 4 PBMCs for use in reassessing CD8+ T cell dextramer binding ability. CD8+ T cells were enriched using Miltenyi CD8+ T cell negative enrichment (Mitenyi). The cells were then incubated for 45 minutes with benzonase (Millipore) and dasatinib (Axon) before being stained with oligo-tagged dextramer pools (Immudex,
TCR sequence similarity distance-based classification recently reported a weighted hamming distance-based method, TCRdist, to predict TCR-pMHC binding specificity based on the sequence space of TCR CDR regions guided by structural information on pMHC binding. Nearest-neighbor (NN) distance (the average TCRdist between a receptor and its nearest-neighbor receptors within the repertoire) was further calculated to measure receptor density within repertoires. For each pMHC repertoire, binders were defined to be TCRs that bind to the given pMHC. NN-distances were calculated between each binding TCR and each set of pMHC binders with the given TCR removed. The NN distances were separated based on the known specificity of each TCR. Receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) were calculated for the binary classifier of each pMHC using the plotROC R package38. In brief, ROC curves were generated by calculating sensitivity and specificity at several NN distance thresholds for each classifier—classifying TCRs as binding to a given pMHC if their NN distance falls below the given threshold.
ix. CNN-Based Classification
The weighted binary classifier was adapted based on a deep learning framework, which includes three major steps with adjustments made to accommodate the specific needs.
x. Input Data Formatting
TCR sequencing files were collected as a raw csv formatted file from 10× Genomics. Sequencing files were parsed to take the amino acid sequence of the CDR3 after removing unproductive sequences. Clones with different nucleotide sequences but the same matched amino acid sequence from CDR3s and the V, D, J genes were aggregated together under one TCR. Thus, each TCR record used here includes single paired α and β TCR amino acids sequences of CDR3, V, and J genes. For the model running with α-chain only, TCRB-CDR3 amino acid sequences, β-chain genes were removed from the input. Similar removal was done for the β-chain only model.
xi. Data Transformations
Each TCR-CDR3 amino acid sequence was encoded into numbers to represent the 20 possible amino acids. Only sequences that comply with IUPAC (International Union of Pure and Applied Chemistry) amino acids were kept. 0-padding was applied to a maximum length of 40 for TCRs of different length. A trainable embedded layer was used to further extract features from the amino acid sequences. The V and J genes were one-hot encoded to provide a categorical and discrete representation of the gene names in numerical space. The encoded sequences and gene names were concatenated together to represent one TCR record. This data transformation process was applied before training all networks.
xii. Single TCR Sequence Classifier
This method was adapted, where they provided a general conventional neural network architecture to train TCR and focused on sample or repertoire level prediction. Optimizing single TCR sequence prediction was focused on. To achieve this, T cell clone size was removed from the input data. In addition, a single translationally invariant layer was applied to the sequence followed by three fully connected convolutional layers to a final output layer. The network was trained using an Adam Optimizer (learning rate=0.001) to minimize the cross-entropy loss between the soft-maxed-logits and the one-hot encoded representation of the discrete categorical outputs of the network. This approach was modified by using a biologically meaningful kernel size of 439 to capture potential motifs. To account for the unbalanced class representation in the training data, weighted cross-entropy loss function was applied using the following formula:
Σi=0nwc*(ŷi−yi),wc=n/nc
wc is the weights computed using the inverted frequency of TCR sequences for each class. C represents one class; nc is the total TCR in one class; n is total number of TCRs; ŷi, yi represent predicted and actual class for each TCR sequence.
A Monte-Carlo Cross Validation (MCCV) training was conducted by holding a certain number of TCRs for validation and testing, respectively. The validation group of sequences was used to implement an early stopping algorithm. Here, 20 iterations were taken of Monte-Carlo sampling. A Receiver Operating Characteristic (ROC) curve for the sequence classifier was computed based on the testing set after averaging on all MCCV predictions.
1. Results
i. Identification of pMHC Specific Binding TCRs from High-Throughput Binding Data
10× Genomics recently generated an expansive, publicly available TCR-pMHC binding dataset. In their initial report, the binding profile of over 150,000 CD8+ T cells from four HLA haplotyped healthy donors (Table 1, donors 1 to 4) was assessed across 44 pMHC dextramers using a single cell-based immune profiling platform Immune Map to directly detect antigen binding to T cells, while simultaneously sequencing T cell αβ chain pairs and transcriptomes (
Described herein is a highly multiplexed dextramer binding dataset generated at the single cell level with paired T cell α- and β-chain sequences. 10× Genomics applied global cutoffs for background noise and non-specific dextramer bindings to all donors and dextramers to identify pMHC binding TCRs(18). Unsurprisingly, an unexpectedly high number of promiscuous TCR-pMHC binding events was found that 10× Genomics provided (
Applying ICON, a total of 53,062 CD8+ T cells belonging to 5,721 unique T cell clones that bind to 37 pMHCs from five donors were identified (
Among all pMHC binding TCRs, 99% of total TCRs (96% of unique TCRs) bind to nine pMHCs: B*08:01_RAKFKQLL_BZLF1_EBV (# of T cells: 18,468/# of unique TCRs: 479), A*02:01_GILGFVFTL_Flu-MP_Influenza (# of T cells: 8,365/# of unique TCRs: 1,095), A*11:01_IVTDFSVIK_EBNA-3B_EBV (# of T cells: 5,438/# of unique TCRs: 149), A*03:01_KLGGALQAK_IE-1_CMV (# of T cells: 3,899/# of unique TCRs: 2,865), A*11:01_AVFDRKSDAK_EBNA-3B_EBV (# of T cells: 1,579/# of unique TCRs: 95), A*02:01_GLCTLVAML_BMLF1_EBV (# of T cells: 1,886/# of unique TCRs: 117), A*02:01_ELAGIGILTV_MART-1_Cancer (# of T cells: 297/# of unique TCRs: 293), B*35:01_IPSINVHHY_pp65_CMV (# of T cells: 6,986/# of unique TCRs: 280) and A*02:01_NLVPMVATV_pp65_CMV (# of T cells: 5,612/# of unique TCRs: 164) (
ii. TCRAI: A Neural Network Classifier of T Cell Antigen Specificity
With large and diverse TCR-pMHC binding events identified, robust functional classifiers for rapidly validating these binding events are desired. Recent work demonstrated that neural networks can learn high dimensional information from TCR sequences and thus, may robustly predict TCR-pMHC binding.
A Python package, TCRAI, has been developed utilizing Tensorflow 2, providing a flexible framework for the study of TCR-pMHC specificity (
To evaluate the performance of TCRAI, a literature search for currently available methods was performed (table 3) and the classifier was compared to four major methods in this field: GLIPH2, DeepTCR, NetTCR and TCRdist. For the comparison, eight pMHC-specific binding T cell repertoires were collated with at least 50 unique paired αβ chain TCRs generated by traditional single multimer binding or antigen re-exposure assays as a gold-standard dataset (table 4 and Materials and Methods). Three of the methods DeepTCR, NetTCR and TCRdist are, like TCRAI, predictive models. The area under the ROC (receiver operator characteristic) curve (AUROC/AUC), a standard measure of classification success, of these prediction models indicates that TCRAI and DeepTCR, with similar neural network frameworks, perform better than TCRdist and NetTCR. Overall, TCRAI has more consistent and better performance than DeepTCR (
iii. Classification of pMHC Binding TCRs Identified from the High-Throughput Data
TCRAI was then applied to the nine most abundant pMHC binding repertoires ICON identified from the high throughput data (
To further validate the performance of TCRAI, four pMHC repertoires (A*02:01_ELAGIGILTV_MART-1, A*02:01_GILGFVFTL_Flu-MP, A*02:01_GLCTLVAML_BMLF1_EBV and A*02:01_NLVPMVATV_pp65_CMV) were used that also have binding TCRs in the curated public dataset. TCRAI was trained using the four repertoires identified from the high throughput dataset to predict the four curated repertoires.
iv. Characterization of pMHC Specific TCRs
To investigate the properties of TCRs that bind a given pMHC, how TCRAI classifier models arrange TCRs within their fingerprint space were analyzed (Materials and Methods). TCR fingerprints from a classifier model allow for the discovery of specific groups of TCRs with conserved gene usage and CDR3 motifs. These groups often exhibit different binding abilities and divergent structural binding modalities.
Clustering TCRs to A*02:01_GILGFVFTL_Flu-MP_Influenza leads to two well-separated clusters in the TCRAI fingerprint space (
The TCRs binding to the other eight pMHCs were also characterized. The result for A*02:01_GLCTLVAML_BMLF1_EBV binding TCRs is particularly interesting. In previous studies, a dominant public TCR constructed from TRBV20-1/TRBJ1-2/TRAV5/TRAJ31 has been observed. However, previous analyses of the TCR population binding to this pMHC have focused on TRAV5 TCRs, to which the population is heavily biased. The current experiments unbiasedly identified 5 clusters of TCRs in the TCRAI fingerprint space (
v. Immune Phenotypes of pMHC Binding CD8+ T Cells
The combined information of antigen specificity and T cell phenotype has been reported to be important to clinical success of immunotherapies, such as vaccination. The multi-omics data generated by the Immune Map platform enables the association of T cell antigen specificity with T cell phenotypes. Using gene (single cell RNA-seq) and surface protein (CITE-seq, cellular indexing of transcriptomes and epitopes by sequencing) expression from this multi-omics dataset, pMHC binding CD8+ T cells was grouped into subpopulations (
96% of pMHC binding T cells were memory cells and were enriched in expanded T cell clones (
Although the majority of pMHC binding T cells expressed a memory phenotype, 4% of them were naïve cells. These naïve cells had more diverse pMHC interactions than non-naïve cells and were often bound to tumor-associated antigens (e.g. MART-1), endogenous antigens, or to antigens derived from viruses for which the donor was purportedly seronegative (e.g. HPV) (
2. Discussion
High-throughput TCR-pMHC binding data present an attractive pathway for furthering the understanding of TCR antigen recognition. However, this type of data is often associated with high noise to signal ratios. Herein is presented a framework of computational tools including a novel method ICON that can identify reliable TCR-pMHC interactions by significantly increasing signal-to-noise ratios in the highly multiplexed TCR-pMHC binding data with good sensitivity and specificity. ICON computes the noise corrected dextramer signal in a parameter free manner, making it easily generalizable to pMHC-TCR binding data from a broader range of pMHC dextramer pools and potentially extendible to the normalization of protein binding signals in single cell space, such as CITE-seq.
In this study, a Python package TCRAI was developed, with which the robustness of deep-learning classifiers in predicting TCR-pMHC specific bindings is demonstrated. Due to the importance of the CDR3 region in determining the specificity of TCRs to a given antigen, it is tempting to build a predictive model harnessing only this information, as others have. However, due to highly conserved gene usage for many pMHCs, the VJ gene usage is found to be an important predictive element of TCRAI, particularly in the case of few unique pMHC binding TCRs in the dataset. The predictive performance of models that receive CDR3 information outperform gene-level only models in the case where there are more than at least on the order of 100 pMHC binding TCRs was observed (
It has been shown that TCRAI can not only perform state-of-the-art classification of TCR-pMHC specific binding but can also identify groups of TCRs with differing binding profiles. Partnering the dextramer UMI counts with TCR sequence information allowed for the investigation of differing binding abilities between these groups. The findings indicate that as the volume of high-throughput TCR pMHC binding data grows, so will the ability to discover new TCR motifs and pair these with not only UMI, but also wider multi-Omics data. The ability to investigate, for example, different transcriptional regulation of T cell receptor signaling between groups of TCRs with different binding mechanisms would be very exciting not only for broad scientific questions, but also for the development of T cell therapeutics.
T cell antigen specific recognition can potentially be studied virtually (versus experimentally) using TCRAI. Immune monitoring of T cell antigen specific recognition has been applied to determine the immune responses against specific antigens (e.g. SARS-COV2, tumor-specific antigens and peptide vaccines) and their possible correlation with disease severity, clinical outcome in patients receiving immunotherapies. However, experimentally mapping TCR sequences to antigen specificity is costly and labor intensive. With adequate training data for a particular pMHC, the TCRAI classifier presented here can assign probabilities of pMHC binding to each TCR sequence of interest without conducting binding assays. In this study, the multinomial prediction mode of this classifier has been validated (
The ability to assess biologically relevant T cell reactivities is important for the interrogation and monitoring of immune responses to pathogens and other disease states. Most of the T cell reactivities recovered (94%) were matched with the appropriate HLA type/supertype, and further, that the phenotypes of multimer positive cells were largely restricted to memory T cell compartments, indicating that relevant memory reactivities from prior functional T cell responses are resolvable with this technology. Paired αβ TCR sequencing revealed multiple TCR sequences that were specific for individual multimers, reinforcing the broad antigenic immune responses to common viral challenges.
While a low degree of HLA mismatched reactivities were recovered, these were significantly enriched in non-expanded naïve T cells relative to memory subsets, potentially revealing antigen-specific interactions to previously unexposed targets or those that did not culminate in functional T cell responses. Additionally, a range of TCR avidities could be recovered in these experiments, which can contribute to the detection of unexpected binding patterns. Dextramers are highly multimerized and likely to detect a broader range of TCR binding avidity than traditional tetramer reagents. Furthermore, a range of fluorescent dextramer intensities were sorted in the multimer-positive gating, so even low-frequency, lower-avidity TCR interactions were captured in this highly sensitive single cell assay.
3. Materials and Methods
i. The 10× Genomics Single Cell Immune Profiling Datasets
10× Genomics data used for this study were downloaded from: support.10×genomics.com/single-cell-vdj/datasets
ii. Identification of pMHC Binding T Cell Phenotypes
Seurat V3 single-cell sequencing analysis R package was used for the classification analysis based on single cell RNA-seq data. Since the significant enrichment of TCR VJ gene usages was observed in identified pMHC binding T cells, the TCR genes were taken out from the classification. So, cell clusters will not be dominated by their shared VJ gene usage. Then, all other gene expression of identified binding T cells was normalized and scaled using Seurat V3 default parameters. PCA was run on normalized and transformed UMI counts on variably expressed genes. Top 10 PCs were used for the cell classification. UMAP was used for classification visualization.
iii. Curation of Reported pMHC Specific Binding Paired TCRs
Raw files were downloaded from VDJdb(42) (vdjdb.cdr3.net/) and The Pathology-associated TCR database (friedmanlab.weizmann.ac.il/McPAS-TCR/). The data was processed to get pMHC TCR binding following the following criteria: for VDJdb, paired α- or β-chain CDR3 amino acid sequences were required for each “complex.id”; TCRs annotated with “source” from 10× genomics were removed; “Species”=“Human” was filtered for. For McPAS-TCR, known “Epitope.ID” were required in the full data and having “CDR3.alpha.aa” and “CDR3.beta.aa”; Similarly, for VDJdb, were filtered for Human TCRs.
iv. Normalization of High-Throughput TCR-pMHC Binding Data
ICON, an Integrative COntext-specific Normalization method, was developed to identify reliable TCR-pMHC interactions. It takes multi-omics single cell sequencing data generated from a multiplexed multimer binding platform, like 10× Genomics Immune Map as input data, including single cell RNA-seq, paired αβ chain single cell TCR-seq, dCODE-Dextramer-seq and cell surface protein expression sequencing—also named CITE-seq. ICON includes the following major steps (
Step 1: Single Cell RNA-Seq Based Filtering of Low-Quality Cells
It filters out low quality cells such as doublets and dead cells. The T cells with an unexpectedly high number of genes (e.g. >2500 genes per cell) were categorized as doublets and cells with a high fraction of mitochondrial gene expression (e.g. ratio of mitochondrial gene expression to the total gene expression >0.2) or too few genes detected (<200 genes per cell) were classified as dead cells (
Step 2: Single Cell dCODE-Dextramer-Seq Based Background Estimation
Six negative control dextramers were designed for estimating the background noise from the multiplexed dextramer binding assay. To inspect signal and noise distributions, the maximum dextramer signals in UMI (Unique Molecular Identifier) of negative control dextramers and test dextramers for each cell were used to represent the worst noise and best dextramer binding of each T cell. The density distributions of these two types of dextramer signals are shown in
Step 3: Selecting T Cells with Paired αβ Chains Based on Single Cell TCR-Seq
T cells that have only a single chain were removed. For T cells with multiple a- or β-chains detected, the ones with highest UMI counts were assigned to each T cell.
Step 4: Dextramer Signal Correction
Each dextramer has its own optimal binding condition, however it is impossible to arrange the experimental conditions such that a multiplexed dextramer binding assay is optimal for every dextramer. This results in multiple dextramers binding to the same T cell/clone, as observed in this high throughput dataset (
Defining the background noise subtracted dextramer signal for the ith T cell binding the jth dextramer as Eij, the fraction of dextramer signal due to binding of the jth dextramer for the ith T cell is further denoted as:
Denoting the TCR clonotype of the ith T cell as ki, and the number of T cells belonging to clonotype ki that bind dextramer j as T_(kij), the fraction of T cells that belong to clonotype ki that bind the ith dextramer is denoted as:
Using these quantities, the corrected dextramer signal is calculated for the ith T cell binding the jth dextramer as:
S
ij
=E
ij(RCij)2RTkj
Step 5: Cell- and pMHC-Wise Dextramer Signal Normalization and Binder Identification
To make all the dextramer binding signals comparable, the corrected dextramer binding signals were log-ratio normalized across 44 testing dextramers within a cell. pMHC-wise normalization was subsequently conducted based on Log-Rank distribution. Normalized dextramer UMI>0 was empirically chosen as the cutoff for pMHC specific binders.
v. Regeneron Oligo-Tagged Dextramer Staining and Sorting
CD8+ T cells were enriched from healthy donor PBMC using Miltenyi CD8+ T cell negative enrichment (Mitenyi). The cells were then incubated for 45 minutes with benzonase (Millipore) and dasatinib (Axon) before being stained with oligo-tagged dextramer pools (Immudex, see Table 2) for 30 minutes at room temperature. Cells were then stained with fluorescently labeled for CD3 (BD Biosciences, cat #612750), CD4 (BD Biosciences, cat #563919, CD8 (BD Biosciences, cat #612889), CCR7 (Biolegend, cat #353218), and CD45RA (Biolegend, cat #304238) and CITE-seq antibodies for an additional 30 minutes on ice. Utilizing an Astrios cell sorter (Beckman Coulter), fluorescence activated cell sorting (FACS) gating on forward scatter plot, side scatter plot, and fluorescent channels was set to select live cells while excluding debris and doublets. A 100 μm nozzle was used to sort single CD3+CD8+dextramer+ cells for further processing.
vi. Building a Neural Network Based Classifier TCRAI
Though TCRAI provides a flexible framework for the design of TCR classifiers, a specific and consistent architecture was used throughout this work, which is described in detail below. Aside from its flexible architecture, some key differences from the DeepTCR architecture are the use of 1D convolutions and batch normalization for the CDR3 sequences, and lower dimensional representations for the genes. These changes give improved model regularization and force the model to learn stronger gene associations.
In order to process the input information of the TCR into numerical format the following method was applied. For each CDR3 sequence, amino acids are first converted to integers, and subsequently these integer vectors are encoded into a one-hot representation. For the V and J genes a dictionary of gene type to integer is separately built for each V and J gene and use these to convert each gene to an integer.
The neural network architecture applied to the processed input information includes embedding layers, and convolutional networks. Specifically, processed CDR3 residues were embedded into a 16-dimensional space via a learned embedding, and the resulting numeric CDR3s are fed through 3 1D convolutional layers, with filters of dimensions, kernel widths and strides. Each convolution is activated by an exponential linear unit activation and is followed by dropout and batch normalization. Following these three convolutional blocks, global max pooling is applied to the final features, this process encodes each CDR3 by a vector of length 256, a “CDR3 fingerprint”. The processed gene input for each gene is one-hot encoded and embedded into a reduced dimensional space (16 for V genes, and 8 for J genes) via a learned embedding, giving a “fingerprint” of each gene as a vector. The fingerprints of all selected CDR3s and genes are concatenated together into a single vector, the “TCRAI fingerprint.” The TCRAI fingerprint is passed through one final full-connected layer to give binomial predictions (single output value, sigmoid activation), regression predictions (single output, no activation), or multinomial predictions (multiple output values, softmax activation). Binomial and multinomial predictions are focused on in this work.
TCR sequencing files were collected as a raw csv formatted file from 10× Genomics. Sequencing files were parsed to take the amino acid sequence of the CDR3 after removing unproductive sequences. Clones with different nucleotide sequences but the same matched amino acid sequence from CDR3s and the V, D, J genes were aggregated together under one TCR. Thus, each TCR record used here includes single paired α and β TCR chains, with CDR3 amino acid sequence and V, J genes for each chain.
The data is split into training (76.5%), validation (13.5%), left-out test set (10%) for each model, and subsequently a 5-fold Monte-Carlo cross-validation (MCCV) is performed on the training set. The model is trained by minimizing the cross-entropy loss via the Adam optimizer, and the cross-entropy loss is weighted by weights 1/(number of classes*fraction of samples in that class) for each class. Early stopping is engaged, via a left-out validation dataset, to prevent overfitting, in which the model ceases training if the validation loss increases for more than 5 epochs and the weights of the model with minimal validation loss are restored. Due to the large number of models being trained here, only the learning rate and batch size are tuned during cross-validation. After cross-validation the optimally performing hyperparameters are chosen and the model is re-trained on the full training set, using the validation set to control early-stopping. The re-trained model is then evaluated on the left-out test set.
vii. TCRAI Fingerprint Analysis
TCRAI models produce both a prediction for a TCR to bind a specific pMHC (or one of many pMHCs, in the multinomial case), and a numerical vector “fingerprint” that describes that TCR within the context of the question of whether it can bind that pMHC. In order to gain an understanding of how the model works, and to identify groups of TCRs with different binding modalities, the distribution of these fingerprints is analyzed. UMAP is used to reduce the fingerprints to a two-dimensional space. When using a model trained on one dataset and inferring fingerprints on another unseen dataset, the UMAP projector is fit with TCRs from the training dataset and the TCRs transformed from the unseen set using that projector.
When clustering TCR fingerprints, the fingerprints of all TCRs of the dataset into two-dimensional space are projected as described above, and then those TCRs that are strong true positives are selected (STPs, binomial prediction >0.95). These STPs are then clustered using a k-means classifier in the two-dimensional space. TCRs from within in each cluster are then collected and used to construct CDR3 motif logos (using weblogo), gene-usage, and UMI distributions by pairing the unique TCR clonotypes within the cluster with all repeated clonotypes in the high throughput data.
viii. DeepTCR Modification
The DeepTCR method was adapted to construct a binary classifier with the adjustments as described below.
For each TCR record the single paired α and β TCR chains were used, with CDR3 amino acid sequence and V, J genes for each chain only, in line with the inputs provided to the TCRAI package. That is, clonality, MHC, or D gene usage was not included to the DeepTCR model. The final output layer was adjusted to give a single binomial output, and hyperparameters of the model were optimized for the problem at hand in the context of the DeepTCR framework.
The computing device 4101 and the server 4102 can be a digital computer that, in terms of hardware architecture, generally includes a processor 4108, memory system 4110, input/output (I/O) interfaces 4112, and network interfaces 4114. These components (4108, 4110, 4112, and 4114) are communicatively coupled via a local interface 4116. The local interface 4116 can be, for example, but not limited to, one or more buses or other wired or wireless connections, as is known in the art. The local interface 4116 can have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.
The processor 4108 can be a hardware device for executing software, particularly that stored in memory system 4110. The processor 4108 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computing device 4101 and the server 4102, a semiconductor-based microprocessor (in the form of a microchip or chip set), or generally any device for executing software instructions. When the computing device 4101 and/or the server 4102 is in operation, the processor 4108 can be configured to execute software stored within the memory system 4110, to communicate data to and from the memory system 4110, and to generally control operations of the computing device 4101 and the server 4102 pursuant to the software.
The I/O interfaces 4112 can be used to receive user input from, and/or for providing system output to, one or more devices or components. User input can be provided via, for example, a keyboard and/or a mouse. System output can be provided via a display device and a printer (not shown). I/O interfaces 41412 can include, for example, a serial port, a parallel port, a Small Computer System Interface (SCSI), an infrared (IR) interface, a radio frequency (RF) interface, and/or a universal serial bus (USB) interface.
The network interface 4114 can be used to transmit and receive from the computing device 4101 and/or the server 4102 on the network 4104. The network interface 4114 may include, for example, a 10BaseT Ethernet Adaptor, a 100BaseT Ethernet Adaptor, a LAN PHY Ethernet Adaptor, a Token Ring Adaptor, a wireless network adapter (e.g., WiFi, cellular, satellite), or any other suitable network interface device. The network interface 4114 may include address, control, and/or data connections to enable appropriate communications on the network 4104.
The memory system 4110 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, DVDROM, etc.). Moreover, the memory system 4110 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory system 4110 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 4108.
The software in memory system 4110 may include one or more software programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of
For purposes of illustration, application programs and other executable program components such as the operating system 4118 are illustrated herein as discrete blocks, although it is recognized that such programs and components can reside at various times in different storage components of the computing device 4101 and/or the server 4102. An implementation of the training module 220 can be stored on or transmitted across some form of computer readable media. Any of the disclosed methods can be performed by computer readable instructions embodied on computer readable media. Computer readable media can be any available media that can be accessed by a computer. By way of example and not meant to be limiting, computer readable media can comprise “computer storage media” and “communications media.” “Computer storage media” can comprise volatile and non-volatile, removable and non-removable media implemented in any methods or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Exemplary computer storage media can comprise RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
In an embodiment, the ICON module 108 and/or the predictive module 110 may be configured to perform a method 4200, shown in
The method 4200 may comprise determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a number of genes at step 4202.
The method 4200 may comprise removing, from the dextramer sequence data, data associated with cells having a number of genes outside of a gene threshold range at step 4203. By way of example, the gene threshold range may be from about 200 genes to about 2,500 genes.
The method 4200 may comprise determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a fraction of mitochondrial gene expression at step 4204.
The method 4200 may comprise removing, from the dextramer sequence data, data associated with cells having a fraction of mitochondrial gene expression that exceeds a gene expression threshold at step 4205. The gene expression threshold can be about 40 percent of total unique molecular identifier counts.
The method 4200 may comprise determining, based on the dextramer sequence data and unsorted dextramer sequence data at step 4206. The sorted dextramer sequence data can comprise sorted test dextramer sequence data and negative control dextramer sequence data. The unsorted dextramer sequence data can comprise unsorted test dextramer sequence data.
The method 4200 may comprise determining, for each cell represented in the dextramer sequence data, based on the negative control dextramer sequence data, a maximum negative control dextramer signal at step 4207. The maximum negative control dextramer signal may be expressed as (Max(nc1, . . . , ncn)), wherein n is a number of negative control dextramers.
The method 4200 may comprise determining, for each cell represented in the dextramer sequence data, based on the sorted test dextramer sequence data, a maximum sorted dextramer signal at step 4208. The maximum sorted dextramer signal may be expressed as (Max(ds1, . . . , dsm)), wherein m is a number of test dextramers.
The method 4200 may comprise determining, for each cell represented in the dextramer sequence data, based on the unsorted test dextramer sequence data, a maximum unsorted dextramer signal at step 4209. The maximum unsorted dextramer signal may be expressed as (Max(du, . . . , dum)), wherein m is the number of test dextramers.
The method 4200 may comprise estimating, based on the maximum negative control dextramer signals, a dextramer binding background noise at step 4210. The dextramer binding background noise may comprise determining (P99.9).
The method 4200 may comprise estimating, based on the maximum sorted dextramer signals and the maximum unsorted dextramer signals, a dextramer sorting gate efficiency at step 4211. The dextramer sorting gate efficiency may be expressed as (argmax Ds, u). The dextramer sorting gate efficiency may be determined as a maximum difference between (Max (ds1, . . . , dsm)) and (Max(du, . . . , dum)).
The method 4200 may comprise determining, based on the dextramer binding background noise and the dextramer sorting gate efficiency a measure of background noise at step 4212. The measure of background noise may be expressed as (d).
The method 4200 may comprise subtracting, for each cell represented in the dextramer sequence data, the measure of background noise from a dextramer signal associated with each cell at step 4213. Subtracting the measure of background noise from a dextramer signal associated with each cell may comprise evaluating (Ec=Es−d).
The method 4200 may comprise performing, for each cell represented in the dextramer sequence data, cell-wise normalization on the dextramer signals associated with each cell at step 4214. Performing cell-wise normalization may comprise evaluating:
The method 4200 may comprise performing, for each cell represented in the dextramer sequence data, pMHC-wise normalization at step 4215. Performing pMHC-wise normalization may comprise evaluating:
The method 4200 may comprise determining, for each cell represented in the dextramer sequence data, based on the single cell TCR sequence data, a presence or an absence of at least one α-chain and at least one β-chain at step 4216.
The method 4200 may comprise removing, from the normalized dextramer sequence data, based on the presence or the absence of the at least one α-chain and the at least one β-chain, data associated with cells having only an α-chain, only a β-chain, or multiple α- or β-chains at step 4217.
The method 4200 may comprise identifying data remaining in the normalized dextramer sequence data as associated with reliable TCR-pMHC binding events at step 4218.
The method 4200 may further comprise training a predictive model based on the data associated with reliable TCR-pMHC binding events. The method 4200 may further comprise predicting a binding status of a newly presented receptor sequence according to the trained predictive model.
In an embodiment, the ICON module 108 and/or the predictive module 110 may be configured to perform a method 4300, shown in
The method 4300 may comprise filtering, from the dextramer sequence data, based on the single cell sequence data, data associated with low-quality cells at step 4320. Filtering, from the dextramer sequence data, based on the single cell sequence data, data associated with low-quality cells can comprise determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a number of genes, removing, from the dextramer sequence data, data associated with cells having a number of genes outside of a gene threshold range, determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a fraction of mitochondrial gene expression, and removing, from the dextramer sequence data, data associated with cells having a fraction of mitochondrial gene expression that exceeds a gene expression threshold. The gene threshold range may be from about 200 genes to about 2,500 genes. The gene expression threshold can be about 40 percent of total unique molecular identifier counts.
The method 4300 may comprise adjusting, based on a measure of background noise, the dextramer sequence data at step 4330. The method 4300 may further comprise determining, based on the dextramer sequence data, sorted dextramer sequence data wherein the sorted dextramer sequence data comprises sorted test dextramer sequence data and negative control dextramer sequence data and unsorted dextramer sequence data, wherein the unsorted dextramer sequence data comprises unsorted test dextramer sequence data. The method 4300 may further comprise determining, for each cell represented in the dextramer sequence data, based on the negative control dextramer sequence data, a maximum negative control dextramer signal, determining, for each cell represented in the dextramer sequence data, based on the sorted test dextramer sequence data, a maximum sorted dextramer signal, and determining, for each cell represented in the dextramer sequence data, based on the unsorted test dextramer sequence data, a maximum unsorted dextramer signal. The maximum negative control dextramer signal may be expressed as (Max(nc1, . . . , ncn)), wherein n is a number of negative control dextramers. The maximum sorted dextramer signal may be expressed as (Max(ds1, . . . , dsm)), wherein m is a number of test dextramers. The maximum unsorted dextramer signal may be expressed as (Max(du, . . . , dum)), wherein m is the number of test dextramers.
Adjusting, based on the measure of background noise, the dextramer sequence data can comprise estimating, based on the maximum negative control dextramer signals, a dextramer binding background noise, estimating, based on the maximum sorted dextramer signals and the maximum unsorted dextramer signals, a dextramer sorting gate efficiency, determining, based on the dextramer binding background noise and the dextramer sorting gate efficiency, the measure of background noise (d), and subtracting, for each cell represented in the dextramer sequence data, the measure of background noise from a dextramer signal associated with each cell. The measure of background noise may be expressed as (d). Subtracting the measure of background noise from a dextramer signal associated with each cell may comprise evaluating (Ec=Es−d). The method 4300 may further comprise normalizing the dextramer sequence data. Normalizing the dextramer sequence data can comprise performing, for each cell represented in the dextramer sequence data, cell-wise and normalization on the dextramer signals associated with each cell and/or performing, for each cell represented in the dextramer sequence data, pMHC-wise normalization. Performing cell-wise normalization may comprise evaluating:
Performing pMHC-wise normalization may comprise evaluating:
The method 4300 may comprise filtering, from the dextramer sequence data, based on the single cell TCR-data, data according to a presence or an absence of an α-chain or a β-chain at step 4340. Filtering, from the dextramer sequence data, based on the single cell TCR-data, data according to the presence or the absence of the α-chain or the β-chain can comprise determining, for each cell represented in the dextramer sequence data, based on the single cell TCR sequence data, a presence or an absence of at least one α-chain and at least one β-chain and removing, from the normalized dextramer sequence data, based on the presence or the absence of the at least one α-chain and the at least one β-chain, data associated with cells having only an α-chain, only a β-chain, or multiple α- or β-chains.
The method 4300 may comprise identifying data remaining in the normalized filtered dextramer sequence data as associated with reliable TCR-pMHC binding events at step 4350.
The method 4300 may further comprise training a predictive model based on the data remaining in the normalized filtered dextramer sequence data. The method 4300 may further comprise predicting a binding status of a newly presented receptor sequence according to the trained predictive model.
In an embodiment, the ICON module 108 and/or the predictive module 110 may be configured to perform a method 4400, shown in
The method 4400 may comprise determining, based on the normalized dextramer sequence data, a training dataset comprising a plurality of TCR sequences wherein each TCR sequence is associated with a binding affinity at step 4420. Determining, based on the normalized dextramer sequence data, the training dataset comprising the plurality of TCR sequences wherein each TCR sequence is associated with a binding affinity can comprise determining, for each TCR sequence of the plurality of TCR sequences, a paired αβ chain CDR3 amino acid sequence, a V gene identifier, and a J gene identifier and encoding, for each TCR sequence of the plurality of TCR sequences, the paired αβ chain CDR3 amino acid sequence, the V gene segment sequence, and the J gene segment sequence into a one-dimensional input vector. Encoding, for each TCR sequence of the plurality of TCR sequences, the paired αβ chain CDR3 amino acid sequence comprises transforming each alphabetical representation of an amino acid into a numerical representation of the amino acid. Encoding, for each TCR sequence of the plurality of TCR sequences, the V gene identifier and the J gene identifier comprises one hot encoding to generate a categorical and discrete representation of gene names in numerical space.
The method 4400 may further comprise clustering the one-dimensional input vectors into one or more clusters. Clustering the one-dimensional input vectors into one or more clusters comprising applying a KNN clustering algorithm to the one-dimensional input vectors. The one or more clusters are indicative binding strength.
The method 4400 may comprise determining, based on the plurality of TCR sequences, a plurality of features for a predictive model at step 4430. The predictive model can comprise a weighted binary classifier or a Convolutional Neural Network (CNN).
The method 4400 may comprise training, based on a first portion of the training dataset, the predictive model according to the plurality of features at step 4440. Training, based on the first portion of the training dataset, the predictive model according to the plurality of features comprises a training Convolutional Neural Network (CNN). Training, based on a first portion of the training dataset, the predictive model according to the plurality of features comprises applying a class-weighted cost function.
The method 4400 may comprise testing, based on a second portion of the training dataset, the predictive model at step 4450.
The method 4400 may comprise outputting, based on the testing, the predictive model at step 4460.
The method 4400 may further comprise presenting, to the trained predictive model, an unknown TCR sequence and predicting, by the trained predictive model, a binding affinity.
In an embodiment, the ICON module 108 and/or the predictive module 110 may be configured to perform a method 4500, shown in
The method 4500 may comprise predicting, by the trained predictive model, a binding affinity at step 4520. The predictive model can comprise a weighted binary classifier or a Convolutional Neural Network (CNN).
The method 4500 may comprise determining, based on the normalized dextramer sequence data, a training dataset comprising a plurality of TCR sequences wherein each TCR sequence is associated with a binding affinity. The training dataset can comprise a plurality of TCR sequences wherein each TCR sequence is associated with a binding affinity. The training data set can comprise can comprise a paired αβ chain CDR3 amino acid sequence, a V gene identifier, a J gene identifier, and a binding affinity (e.g., yes/no).
The method 4500 may comprise training, based on a first portion of a training dataset, the predictive model according to the plurality of features. Training, based on the first portion of the training dataset, the predictive model according to the plurality of features comprises training a Convolutional Neural Network (CNN). Training, based on the first portion of the training dataset, the predictive model according to the plurality of features comprises training a Convolutional Neural Network (CNN) with a single translationally invariant layer applied to each TCR sequence followed by three fully connected convolutional layers to a final output layer. Training, based on a first portion of the training dataset, the predictive model according to the plurality of features comprises applying a class-weighted cost function. Training, based on the first portion of the training dataset, the predictive model according to the plurality of features comprises training a Neural Network by embedding the one-hot encoded V and J genes of each chain of the TCR sequence via learned embeddings, and concatenating these embeddings together with the output of a Convolutional Neural Network for each CDR3, which is fed the embedded CDR3, forming a 1D numerical vector representing the TCR, followed by passing each numeric TCR sequence through a final fully connected layer.
In an embodiment, the ICON module 108 and/or the predictive module 110 may be configured to perform a method 4600, shown in
The method 4600 may comprise determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a number of genes at 4602.
The method 4600 may comprise removing, from the dextramer sequence data, data associated with cells having a number of genes outside of a gene threshold range at 4603.
The method 4600 may comprise determining, for each cell represented in the dextramer sequence data, based on the single cell sequence data, a fraction of mitochondrial gene expression at 4604.
The method 4600 may comprise removing, from the dextramer sequence data, data associated with cells having a fraction of mitochondrial gene expression that exceeds a gene expression threshold at 4605.
The method 4600 may comprise determining, based on the dextramer sequence data, sorted dextramer sequence data wherein the sorted dextramer sequence data comprises sorted test dextramer sequence data and negative control dextramer sequence data at 4606.
The method 4600 may comprise determining, for each cell represented in the dextramer sequence data, based on the negative control dextramer sequence data, a maximum negative control dextramer signal at 4607.
The method 4600 may comprise determining, for each cell represented in the dextramer sequence data, based on the sorted test dextramer sequence data, a maximum sorted dextramer signal at 4608.
The method 4600 may comprise estimating, based on the maximum negative control dextramer signals and the maximum sorted dextramer signals, a dextramer binding background noise at 4609.
The method 4600 may comprise determining, for each cell represented in the dextramer sequence data, based on the single cell TCR sequence data, a presence or an absence of at least one α-chain and at least one β-chain at 4610.
The method 4600 may comprise removing, from the dextramer sequence data, based on the presence or the absence of the at least one α-chain and the at least one β-chain, data associated with cells having only an α-chain, only a β-chain, or multiple α- or β-chains at 4611.
The method 4600 may comprise determining, for each dextramer binding to a given cell represented in the dextramer sequence data, a ratio of dextramer signal within the cell to a sum of all dextramers binding to the cell (a measure of the dextramer binding specificity to the cell) at 4612. Determining, for each dextramer binding to a given cell represented in the dextramer sequence data, a ratio of dextramer signal within the cell to a sum of all dextramers binding to the cell may comprise determining a background noise subtracted dextramer signal Eij, for the ith T cell binding the jth dextramer and determining a fraction of dextramer signal due to binding of the jth dextramer for the ith T cell by evaluating:
The method 4600 may comprise determining, for each dextramer binding to a given TCR clonotype of each cell represented in the dextramer sequence data, a fraction of T cells within a clone binding to a particular dextramer (a measure of the dextramer binding specificity to the clonotype to which the cell belongs) at 4613. Determining, for each dextramer binding to a given TCR clonotype of each cell represented in the dextramer sequence data, a fraction of T cells within a clone binding to a particular dextramer may comprise determining a TCR clonotype ki, of the ith T cell, determining a number of T cells, Tk
The method 4600 may comprise determining, for each dextramer binding to a given cell represented in the dextramer sequence data, based on the measure the of the dextramer binding specificity to the cell and the measure of the dextramer binding specificity to the clonotype to which the cell belongs, a corrected dextramer signal associated with each dextramer binding to the cell at 4641. Determining, for each dextramer binding to a given cell represented in the dextramer sequence data, based on the measure the of the dextramer binding specificity to the cell and the measure of the dextramer binding specificity to the clonotype to which the cell belongs, a corrected dextramer signal associated with each dextramer binding to the cell may comprise determining the corrected dextramer signal for the ith T cell binding the jth dextramer by evaluating:
S
ij
=E
ij(RCij)2RTkj.
The method 4600 may comprise performing, for each cell represented in the dextramer sequence data, cell-wise normalization on the dextramer signals associated with each cell;
The method 4600 may comprise performing, for each cell represented in the dextramer sequence data, pMHC-wise normalization at 4615.
The method 4600 may comprise identifying, based on a threshold, data remaining in the normalized dextramer sequence data as associated with reliable TCR-pMHC binding events at 4616.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the method and compositions described herein. Such equivalents are intended to be encompassed by the following claims.
This application claims priority to U.S. Provisional Application No. 63/111,395, filed Nov. 9, 2020, U.S. Provisional Application No. 63/090,498, filed Oct. 12, 2020, and U.S. Provisional Application No. 63/013,480, filed Apr. 21, 2020, which are herein incorporated by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
63111395 | Nov 2020 | US | |
63090498 | Oct 2020 | US | |
63013480 | Apr 2020 | US |