The present disclosure is drawn to analyses and de novo elucidation of high-confidence peptide sequences (e.g., intact native peptide sequences) from complex mixtures of proteins. The provided methods and compositions include the use of an algorithm and program that combines data from parallel sequence analyses of polypeptide samples, where, e.g., intact native polypeptides are compared to enzymatically or chemically cleaved polypeptides.
Proteomics research focuses on large-scale studies to characterize the proteome, the entire set of proteins, in a living organism. Historically, it has been challenging to determine the sequences of larger intact native peptides from complex samples de novo from tandem mass spectra without the aid of genomic, transcriptomic and/or proteomic sequence data. Typically, identification of protein sequences utilizing liquid chromatography-tandem mass spectrometry (LC-MS/MS) is performed either through database-search methods (given that a set of predictive sequences for a certain organism is readily available) or de novo sequencing based on fragmentation data alone. While an essential step in the determination of novel peptide sequences, de novo sequencing often yields lower accuracies and less conclusive results compared to the former due to a lack of reference information. Accordingly, an improved method for de novo sequencing could be highly beneficial for the evaluation of organisms lacking suitable pre-existing sequence databases, allowing better identification of peptides of interest (e.g., antimicrobial peptides).
The present disclosure is drawn to methods and compositions for large scale analyses and elucidation of high-confidence peptide sequences (e.g., intact native peptide sequences) from complex mixtures of proteins. Said method enables one to generate a computed de novo polypeptide sequence. The terms “peptide” and “polypeptide” may be used interchangeably herein. The provided methods and compositions include the use of an algorithm and program that combines data from parallel analyses of polypeptide samples where starting polypeptide sequences (e.g., intact native polypeptide sequences) are compared to enzymatically or chemically cleaved polypeptide sequences wherein the polypeptide sequences are grouped into clusters with sequences aligned within the clusters. For each cluster of sequences, a computed sequence is then generated and may factor in potential parameters, such as for example, residue alignment between sequences, positional confidence scores, etc.
In aspects of the present disclosure, a polypeptide sequencing method includes: dividing a sample including a mixture of polypeptides into at least two fractions, where the at least two fractions include a starting fraction and at least one other fraction; contacting polypeptides in each of the at least one other fraction with at least one agent for cleavage of the polypeptides within the at least one other fraction, to provide at least one fragmented fraction; analyzing polypeptides in the starting fraction and polypeptides in the at least one fragmented fraction to provide starting sequences for polypeptides in the starting fraction and fragment sequences for polypeptides in the at least one fragmented fraction; grouping the starting sequences and the fragment sequences into at least one cluster, where each of the at least one cluster includes at least one starting sequence of the starting sequences and at least one fragment sequence of the fragment sequences; and for each cluster of the at least one cluster, generating a computed sequence for the respective cluster based on values in the at least one starting sequence and values in the at least one fragment sequence of the respective cluster, where the computed sequence serves as a computationally determined sequence for a polypeptide in the sample.
In various embodiments of the method, the sample is derived from a source selected from the group consisting of: soil, water, air, agricultural, or industrial compositions.
In various embodiments of the method, the sample is a biological sample.
In various embodiments of the method, the biological sample is derived from an animal species with an immune system that produces molecules to defend against infections.
In various embodiments of the method, the mixture of polypeptides in the sample is enriched based on size, charge, or affinity chromatography.
In various embodiments of the method, the cleavage of the polypeptides is achieved enzymatically or chemically.
In various embodiments of the method, the at least one agent for cleavage of the polypeptides includes proteolytic enzymes.
In various embodiments of the method, the starting fraction includes intact native polypeptides.
In various embodiments of the method, the grouping the starting sequences and the fragment sequences into at least one cluster is based on applying at least one clustering criterion, where the at least one clustering criterion is applied to at least one of: comparison of two starting sequences, comparison of two fragment sequences, or comparison of a starting sequence with a fragment sequence.
In various embodiments of the method, the at least one clustering criterion includes at least one criterion using confidence scores for the starting sequences and confidence scores for the fragment sequences.
In various embodiments of the method, for each cluster of the at least one cluster, the generating the computed sequence for the respective cluster includes: positionally aligning the at least one starting sequence and the at least one fragment sequence of the respective cluster with each other, to provide aligned sequences having a plurality of aligned positions; and for each position of the plurality of aligned positions of the aligned sequences, determining a computed value for the respective position based on values of the aligned sequences at the respective position, where the computed values for the plurality of aligned positions form the computed sequence for the respective cluster.
In various embodiments of the method, the positionally aligning the at least one starting sequence and the at least one fragment sequence of the respective cluster with each other uses Gotoh local alignment technique and BLASTP scoring parameters.
In various embodiments of the method, the computed values for the plurality of aligned positions are consensus values.
In various embodiments of the method, the computed values for the plurality of aligned positions are determined further based on confidence scores for values of the aligned sequences at the plurality of aligned positions.
In various embodiments of the method, the method further includes, for each cluster of the at least one cluster: iterating, for at least one iteration, the generating the computed sequence for the respective cluster, where for each iteration, the generated computed sequence from the prior iteration is used as a starting sequence of the respective iteration.
In various embodiments of the method, the analyzing the polypeptides in the starting fraction and the polypeptides in the at least one fragmented fraction includes analyzing the polypeptides in the starting fraction and the polypeptides in the at least one fragmented fraction using tandem mass spectrometry.
In various embodiments of the method, the generating the computed sequence for the respective cluster includes generating the computed sequence for the respective cluster using a machine learning model.
In various embodiments of the method, the machine learning model is a sequence to sequence model.
In accordance with aspects of the present disclosure, a system is disclosed for polypeptide sequencing of a sample containing a mixture of polypeptides divided into at least two fractions, where the at least two fractions include a starting fraction and at least one other fraction, and where polypeptides in each of the at least one other fraction are contacted with at least one agent for cleavage of the polypeptides within the at least one other fraction to provide at least one fragmented fraction. The system includes at least one processor, and at least one memory storing instructions which, when executed by the at least one processor, cause the system at least to perform any of the methods of the aforementioned paragraphs.
In accordance with aspects of the present disclosure, a processor-readable medium stores instructions for polypeptide sequencing of a sample containing a mixture of polypeptides divided into at least two fractions, where the at least two fractions include a starting fraction and at least one other fraction, and where polypeptides in each of the at least one other fraction are contacted with at least one agent for cleavage of the polypeptides within the at least one other fraction to provide at least one fragmented fraction. The instructions, when executed by at least one processor of a system, cause the system to at least perform any of the methods of the aforementioned paragraphs.
The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques described in this disclosure will be apparent from the description and drawings, and from the claims.
In order to better understand the subject matter that is disclosed herein and to exemplify how it may be carried out in practice, embodiments will now be described, by way of non-limiting example, with reference to the accompanying drawings. With specific reference to the drawings, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the disclosure.
The present disclosure is drawn to methods and compositions for large scale analyses and elucidation of high-confidence polypeptide sequences (e.g., intact native computed de novo polypeptide sequences) from a complex mixture of polypeptides. The provided methods and compositions include the use of an algorithm and program that combines clustering and alignment data from parallel analyses of polypeptide samples where derived intact native polypeptide sequences are compared to derived enzymatically or chemically fragmented polypeptide sequences.
In the following description, certain specific details are set forth in order to provide a thorough understanding of disclosed aspects. However, one skilled in the relevant art will recognize that aspects may be practiced without one or more of these specific details or with other methods, components, materials, etc.
Reference throughout this specification to “one aspect” or “an aspect” means that a particular feature, structure, or characteristic described in connection with the aspect is included in at least one aspect. Thus, the appearances of the phrases “in one aspect” or “in an aspect” in various places throughout this specification are not necessarily all referring to the same aspect. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more aspects.
As described below, a novel technology has been developed that combines data from parallel analyses of samples to elucidate higher confidence sequences of intact native peptides present in complex mixtures. The provided technology is unique, as the algorithm and associated software are usable for large-scale analyses and the elucidation of high-confidence intact native peptide sequences from complex mixtures.
The provided technology may be used for generating a computed de novo polypeptide sequence and may include: (i) dividing a sample comprising a mixture of polypeptides into multiple samples; designating one of the multiple samples an intact native fraction and designating the one or more additional multiple samples as fragmented fraction(s); (ii) contacting each of the one or more designated fragmented fractions with an agent for digestion of the polypeptides within the fragmented fraction(s); (iii) analysis of the intact native fraction polypeptides and the fragmented fraction of polypeptides using tandem mass spectrometry (MS/MS) for de novo determination of the peptide sequence of polypeptides in the intact native and fragmented fractions; (iv) grouping of peptide sequences from the different fractions based on clustering; (v) aligning the sequences within the clusters; and (vi) generating a computed sequence for each cluster thereby generating a computed de novo sequence.
Referring to
In one specific non-limiting aspect, the sample 101 is derived from an animal species with an immune system that produces molecules to defend against infections, e.g., anti-microbials or antivirals. In one embodiment, said animal species includes various reptilian species and said provided methods may be used to identify novel antimicrobial and antiviral therapeutics. Said reptiles include, for example, the American alligator or Komodo dragon. In various embodiments, the animal species includes various amphibian species.
Methods for purification and enrichment 111 of particular classes of polypeptides from a sample 101 (e.g., particle harvesting, solid-phase extraction or precipitation, etc.) may optionally be utilized to decrease the complexity of the polypeptide sample prior to analysis. In certain embodiments, the proteins may be enriched based on size, charge, or affinity chromatography. In one example, an acetonitrile (ACN)-based extraction method may be used that depletes high abundance and high molecular weight proteins from a sample prior to mass spectrometric analysis. In a specific embodiment, polypeptide enrichment 111 may be accomplished using acetonitrile/formic acid precipitation. In a specific embodiment, an 80% acetonitrile/0.1% formic acid (FA) precipitation solvent and a 75% acetonitrile/0.1% FA precipitation solvent may be used to enrich for small molecular weight proteins.
Once the enriched polypeptide sample 102 is obtained, the sample 102 is divided (112) into two or more fractions 103a, 103b, which can be processed and analyzed in parallel. One fraction 103a is designated the starting fraction (which in some embodiments may be referred to as an intact native fraction), and one or more additional fractions 103b are designated to become fragmented fraction(s) 104. The fragmented fractions 104 contain samples of the polypeptide mixture which are contacted with an agent for cleavage of the polypeptides within the sample, while the starting fraction 103a (e.g., intact native fraction) lacks contact with a cleavage agent.
For cleavage 114 of the polypeptide mixtures, such cleavage 114 may be achieved through the use of various different enzymes or chemicals. Proteolytic enzymes (proteases) that may be used for cleavage of polypeptide samples include, but are not limited to, trypsin, trypsin-LysC, chymotrypsin, Glu-C, papain, elastase, endoproteinase Arg-C, Endoproteinase Glu-C, Endoproteinase Lys-C, pepsin and carboxypeptidase. In an embodiment, the polypeptide sample 103b to be cleaved (to become the fragmented fraction 104) may be divided into one or more samples with each sample being proteolytically cleaved 114 using a different enzyme. The multiple enzymes recognize and specifically hydrolyze peptides at different positions in their sequences ideally yielding polypeptide fragments with overlapping sequences.
In one non-limiting illustration, the samples 103b are divided into two groups (not shown), with one being proteolyzed using trypsin and another with Glu-C. These two enzymes recognize and specifically hydrolyze peptides at different positions in their sequences, ideally yielding fragments with overlapping sequences. The proteolyzed peptide fragments 104 can then be analyzed via, e.g., MS/MS.
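By way of non-limiting illustration, the effect of using two enzymes with different specificities can be sketched in silico. The sketch below is not part of the disclosed method; the function name `digest` is hypothetical, and the cleavage rules are simplified assumptions (trypsin cleaving after K or R unless the next residue is P, and Glu-C cleaving after E). The input is a fictitious sequence used only for illustration.

```python
def digest(sequence, cleave_after, blocked_by=""):
    """Split a sequence after every residue in `cleave_after`.

    A site is skipped when the following residue is in `blocked_by`.
    Both rule sets are simplified assumptions for illustration only.
    """
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        # Never cleave after the final residue; that would yield an empty fragment.
        if residue in cleave_after and next_residue and next_residue not in blocked_by:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    fragments.append(sequence[start:])
    return fragments


fictitious = "WHEREAREMYKEYSTHISTIME"
tryptic = digest(fictitious, cleave_after="KR", blocked_by="P")  # assumed trypsin rule
glu_c = digest(fictitious, cleave_after="E")                     # assumed Glu-C rule
print(tryptic)
print(glu_c)
```

Under these assumed rules, the tryptic fragment EMYK and the Glu-C fragment MYKE share the residues MYK, which is the kind of overlap that later allows fragment sequences to be grouped and aligned.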
As an alternative to proteolytic cleavage, the polypeptide samples 103b may be fragmented (114) using chemical-based cleavage. For example, cyanogen bromide (CNBr) cleaves at methionine (Met) residues; BNPS-skatole cleaves at tryptophan (Trp) residues; formic acid cleaves at aspartic acid-proline (Asp-Pro) peptide bonds; hydroxylamine cleaves at asparagine-glycine (Asn-Gly) peptide bonds, and 2-nitro-5-thiocyanobenzoic acid (NTCB) cleaves at cysteine (Cys) residues.
Following the cleavage 114 of sample polypeptides 103b, the cleaved samples 104 together with the starting fraction 103a (e.g., intact native sample), may optionally undergo desalting 115, and are then analyzed via, e.g., tandem mass spectrometry (MS/MS), nanopore technology, and/or Edman degradation sequencing, among other techniques 116. MS/MS will be used as an example of block 116 herein, but it will be understood that other analysis techniques may be employed at block 116. Accordingly, any disclosure that employs MS/MS shall be treated as though the disclosure employs other analysis techniques, as well. The tandem MS/MS 116 may be conducted in conjunction with electron-transfer dissociation (ETD), electron-transfer/higher-energy collisional dissociation (EThCD), higher-energy collisional dissociation (HCD) or other fragmentation chemistry which those skilled in the art will recognize. The MS/MS data are then analyzed 117 to assign amino acid sequences de novo to polypeptides in the starting (e.g., intact native) fraction 103a and assign amino acid sequences to polypeptides in the fragmented fraction 104. The sequences for the starting fraction 103a may be referred to herein as starting sequences 107a, and the sequences for the fragmented fraction 104 may be referred to herein as the fragment sequences 107b. Determination of the sequences 107a, 107b may be accomplished based on the MS/MS output using, for example, software such as PEAKS® provided by Bioinformatics Solutions Inc., among other possible software. Such analysis 117 may also provide average and positional confidence scores for the assigned sequences 107a, 107b and/or for individual amino acids in the sequences 107a, 107b based on how well the sequences are supported by the MS/MS spectral data.
In accordance with aspects of the present disclosure, in blocks 116 and 117, providing the sequences based on MS/MS data may provide more accurate sequences for shorter sequences than for longer sequences and/or may provide higher confidence for shorter sequences than for longer sequences.
The example of
The starting sequences 107a and the fragment sequences 107b are clustered and aligned 118, which will be described in more detail in connection with
In aspects of the present disclosure, the grouping process may perform a pairwise comparison between each pair of sequences. In the example, the pairwise comparisons include three comparisons—a first comparison 212, a second comparison 214, and a third comparison 216. In various embodiments, the pairwise comparisons may determine the degree of overlap between two sequences. In various embodiments, two sequences may overlap completely such that a longer sequence encompasses a shorter sequence. In various embodiments, two sequences may overlap partially such that a beginning or ending portion of one sequence does not overlap with the other sequence.
In the first comparison 212, the two sequences 202, 204 have several single-position overlaps (e.g., S, T, W, etc.) and have a two-position overlap (i.e., MY). In the illustrated example, for the two sequences 202, 204 having length 21 and length 14, a two-position overlap is not sufficient to group the two sequences 202, 204 into the same cluster. In the second comparison 214, the two sequences 202, 206 have several single-position overlaps (e.g., A, P, R, W, etc.) and have a three-position overlap (i.e., RDA). In the illustrated example, for the two sequences 202, 206 having length 21 and length 13, the three-position overlap is not sufficient to group the two sequences 202, 206 into the same cluster. In the third comparison 216, the two sequences 204, 206 have several single-position overlaps and have a six-position overlap (i.e., EMYKEY) formed by a beginning portion of sequence 204 and an end portion of sequence 206. In the illustrated example, for the two sequences 204, 206 having length 14 and length 13, the six-position partial overlap is sufficient to group the two sequences 204, 206 into the same cluster. As shown in
The clustering criteria for how much overlap is sufficient to group two sequences into the same cluster can be configured in various ways. For example, in various embodiments, if the two sequences have lengths L1 and L2 and if the overlap is continuous and has a length of at least min(L1, L2)/3, the two sequences may be grouped into the same cluster. In various embodiments, an overlap may be required to have a minimum length, such as three or another number. In various embodiments, the overlapping portion of two sequences may be non-continuous. For example, two sequences AABAA and AACAA have a four-character overlap that is non-continuous, such that the overlapping characters are separated by one or more non-overlapping characters. A non-continuous overlap may be sufficient to group the two sequences into the same cluster. In various embodiments, in a non-continuous overlap, there may be a criterion that overlapping characters may only be separated from each other by a maximum number of non-overlapping characters, such as separated by only one non-overlapping character, or another number. In various embodiments, different clustering criteria may be applied for grouping two sequences having complete overlap than for grouping two sequences having partial overlap. In various embodiments, different clustering criteria may be applied for grouping two sequences having continuous overlap than for grouping two sequences having non-continuous overlap. Such and other clustering criteria for grouping two sequences into the same cluster are contemplated to be within the scope of the present disclosure.
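As a minimal, non-limiting sketch of the overlap criteria described above, the following computes a continuous overlap as the longest run of identical residues shared by two sequences, and a non-continuous overlap as the largest number of matching positions at any relative offset when matches are separated by at most a set number of mismatches. The function names, the default minimum overlap of three, and the default maximum gap of one are assumptions chosen for illustration, not a definitive implementation.

```python
def longest_continuous_overlap(a, b):
    """Length of the longest run of identical residues shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous residue of both sequences.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def gapped_overlap(a, b, max_gap=1):
    """Most matching positions at any offset, allowing at most `max_gap`
    consecutive mismatches between matches (assumed parameterization)."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches, gap = 0, 0
        for i in range(max(0, offset), min(len(a), offset + len(b))):
            if a[i] == b[i - offset]:
                matches += 1
                gap = 0
                best = max(best, matches)
            else:
                gap += 1
                if gap > max_gap:
                    matches = 0  # run broken; start counting again
    return best


def same_cluster(a, b, min_overlap=3):
    """Continuous-overlap criterion: at least min(L1, L2)/3 and a minimum length."""
    overlap = longest_continuous_overlap(a, b)
    return overlap >= min_overlap and overlap >= min(len(a), len(b)) / 3


print(longest_continuous_overlap("STHISTI", "HISTIME"))
print(gapped_overlap("AABAA", "AACAA"))
print(same_cluster("STHISTI", "HISTIME"), same_cluster("WHPRDAREMYKEY", "STHISTI"))
```

In this sketch, the fictitious sequences STHISTI and HISTIME share the continuous run HISTI and satisfy the criterion, whereas WHPRDAREMYKEY and STHISTI share only a single residue and do not.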
In various embodiments, each of the sequences may be accompanied by confidence scores for the sequences and/or confidence scores for individual positions in the sequences. For example, when the analysis 117 of
In accordance with aspects of the present disclosure, when confidence scores are available for sequences, they may be used in the clustering criteria for grouping sequences into clusters. In various embodiments, one or more thresholds may be used in connection with the confidence scores. For example, if two sequences have a sufficient overlap length (e.g., overlap length above a length threshold), but one or both confidence scores are below a score threshold, the two sequences may not be grouped into the same cluster. In various embodiments, where individual position scores are available, they may be used in the clustering criteria for grouping sequences into clusters. In various embodiments, clustering criteria may require that each overlapping position has a position confidence score above a score threshold. In various embodiments, clustering criteria may require that a certain percentage or certain number of overlapping positions have a position confidence score above a score threshold. In various embodiments, confidence scores for sequences as well as confidence scores for positions in a sequence may both be used in clustering criteria. Such and other embodiments are contemplated to be within the scope of the present disclosure.
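One possible way to combine sequence-level and position-level confidence thresholds is sketched below. The function name `confident_enough`, the 0-to-1 score scale, and the default threshold values are hypothetical choices for illustration; the sketch treats a score at or above the threshold as passing.

```python
def confident_enough(sequence_scores, overlap_position_scores,
                     score_threshold=0.8, min_fraction=1.0):
    """Return True when a pair of sequences passes the confidence criteria.

    sequence_scores: overall confidence score of each of the two sequences.
    overlap_position_scores: per-position scores at the overlapping positions.
    min_fraction: fraction of overlapping positions that must pass
                  (1.0 requires every overlapping position to pass).
    """
    # Sequence-level check: reject if either sequence is below the threshold.
    if any(score < score_threshold for score in sequence_scores):
        return False
    # Position-level check applies only when position scores are available.
    if not overlap_position_scores:
        return True
    passing = sum(score >= score_threshold for score in overlap_position_scores)
    return passing / len(overlap_position_scores) >= min_fraction


position_scores = [0.9, 0.95, 0.7, 0.9, 0.9]  # fictitious values
print(confident_enough([0.9, 0.85], position_scores, min_fraction=0.75))
print(confident_enough([0.9, 0.85], position_scores, min_fraction=1.0))
print(confident_enough([0.9, 0.4], position_scores, min_fraction=0.75))
```

Such a check would typically be applied together with an overlap-length criterion, so that two sequences are grouped only when both the overlap and the confidence requirements are met.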
The result of the grouping process of
In aspects of the present disclosure, the grouping process may perform a pairwise comparison between each pair of fragment sequences as well as between each pair of starting sequences and fragment sequences. In the example, the pairwise comparisons between the nine fragment sequences 301-309 include thirty-six comparisons, and the pairwise comparisons between the nine fragment sequences 301-309 and the three starting sequences 202, 204, 206 include twenty-seven comparisons.
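The comparison counts in this example can be checked with a short enumeration. The sketch below uses the reference numerals of the example only as labels; the variable names are hypothetical.

```python
from itertools import combinations, product

fragment_ids = list(range(301, 310))   # nine fragment sequences 301-309
starting_ids = [202, 204, 206]         # three starting sequences

# Every unordered pair of fragment sequences.
fragment_pairs = list(combinations(fragment_ids, 2))
# Every starting sequence paired with every fragment sequence.
mixed_pairs = list(product(starting_ids, fragment_ids))

print(len(fragment_pairs), len(mixed_pairs), len(fragment_pairs) + len(mixed_pairs))
```

The enumeration yields thirty-six fragment-to-fragment comparisons and twenty-seven starting-to-fragment comparisons, for sixty-three comparisons in total.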
As mentioned above in connection with
Continuing with the example, and to describe some of the sixty-three comparisons, the pairwise comparison of fragment sequence 305 (STHISTI) and fragment sequence 308 (HISTIME) has a five-position overlap (HISTI) for the two sequences of length 7, which in the example is sufficient to group the two fragment sequences 305, 308 into the same cluster. Comparing the same two fragment sequences 305, 308 to the first cluster 222 and second cluster 224, fragment sequence 305 overlaps with the first cluster 222 by just single positions and overlaps with the second cluster 224 by a three-position continuous overlap (i.e., STH). Fragment sequence 308 overlaps with the first cluster 222 by just single positions and overlaps with the second cluster 224 by a three-position non-continuous overlap (i.e., T_ME). Because both fragment sequences 305, 308 are clustered together, and both have greater overlap with the second cluster 224 than with the first cluster 222, and the overlap length is three positions for both fragment sequences 305, 308 with the second cluster 224, this is sufficient information to group the fragment sequences 305, 308 into the second cluster 224. The other pairwise comparisons are performed in the same manner to group the fragment sequences 301-309 into clusters. As mentioned above, if confidence scores for sequences and/or confidence scores for positions in sequences are available, such scores may be used in the clustering criteria and may be used with one or more thresholds.
The result of the comparisons is that fragment sequences 301, 302, 303, and 307 are grouped into the first cluster 222, and fragment sequences 304, 305, 306, 308, and 309 are grouped into the second cluster 224, which is shown in
As mentioned above, the sequences in a cluster are grouped because they were computationally estimated to be sequences for the same polypeptide. In accordance with aspects of the present disclosure, the information in the sequences of a cluster can be analyzed to generate a computed sequence that serves as an estimate of the true sequence of the polypeptide corresponding to the cluster.
Based on the overlapping values between the sequences, the sequences can be aligned with each other as shown in
Notably, based on the alignment, the aligned fragment sequences show a total of 22 positions (positions a through v), while the starting sequence only has 21 positions. This difference of one position indicates that, more likely than not, the starting sequence may be missing a position (rather than the fragment sequences having an extraneous position). This is because, as mentioned above, providing the sequences based on MS/MS data may provide more accurate sequences for shorter sequences than for longer sequences and/or may provide higher confidence scores for shorter sequences than for longer sequences.
In various embodiments, when the length of the starting sequence is different from the length of the aligned fragment sequences, an approach for determining the location of the missing position in the starting sequence may start by marking positions as possible missing positions or as possible non-missing positions using various criteria. Using the same example, positions that may possibly be a non-missing position may include positions that have a starting sequence value that matches at least one fragment sequence value in the same position (e.g., positions a, b, d, etc.). Other criteria are contemplated. Based on such criteria, the markings for such positions are shown in
In various embodiments, after marking the positions, a specific position can be identified where the specific position is marked as a possible missing position, and positions after the specific position contain more marked possible missing positions than marked possible non-missing positions. In the example, the specific position fitting such criteria is position m, which is marked as a possible missing position, and after position m, there are more marked possible missing positions (five) than marked possible non-missing positions (four). Then, a placeholder value (such as a null/empty value or a space value, etc.) can be inserted into position m in the starting sequence to shift the remaining positions in the starting sequence by one position, and the result can be evaluated to determine whether it provides fewer marked possible missing positions. In the example of
As shown by the examples in
In accordance with aspects of the present disclosure, each of the aligned sequences at each position is analyzed to determine a computed value for that position. In various embodiments, a computed value may be the value that has more occurrences than other values in the position, which may be referred to as a “consensus” value. For example, in the first cluster 222, position f has three values from the various sequences: A, O, and O. Because O has more occurrences than A, the consensus value for position f would be O. In various embodiments, if a position has only one value from the starting sequence and one value from a fragment sequence (such as position c), the value from the fragment sequence can be selected as the computed value for the position. In various embodiments, if a position has different values from all sequences, the value from the shortest sequence may be selected as the computed value. In various embodiments, when confidence scores for each position of each sequence are available, the confidence scores can be considered in determining the consensus value. For example, continuing with position c as an example, suppose value I has a confidence score of 0.8, and value O has a confidence score of 0.9. Because the confidence score for value O is higher than the confidence score for the value I, the computed value for position c may be selected to be O based on the confidence scores.
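A minimal sketch of determining a computed value for a single aligned position is given below. The function names are hypothetical. Summing the confidence scores per value is an assumption made for illustration; the description above only states that confidence scores can be considered.

```python
from collections import Counter, defaultdict


def consensus_value(residues):
    """Value with strictly more occurrences than any other, or None on a tie."""
    if not residues:
        return None
    ranked = Counter(residues).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: another rule (e.g., prefer the fragment value) must decide
    return ranked[0][0]


def score_weighted_value(scored_residues):
    """Value with the highest total confidence (assumed combination rule).

    scored_residues: iterable of (residue, confidence score) pairs, non-empty.
    """
    totals = defaultdict(float)
    for residue, score in scored_residues:
        totals[residue] += score
    return max(totals, key=totals.get)


print(consensus_value(["A", "O", "O"]))                  # position f of the example
print(consensus_value(["I", "O"]))                       # position c: no majority
print(score_weighted_value([("I", 0.8), ("O", 0.9)]))    # position c with scores
```

When the count-based consensus returns no majority, as for position c, a fallback such as selecting the fragment sequence value or the higher-scoring value may be applied, consistent with the embodiments described above.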
Determining the computed values at each position, the fictitious computed sequence 421 is determined to be IFORGOTMYPASSWORDAGAIN, as shown in
In various embodiments, after determining the computed sequence 421, the alignment process of operation 118 and the computed sequence generation 410 can be iterated again for the first cluster 222 using the computed sequence 421 as the starting sequence. In various embodiments, any total number of iterations of the alignment process of operation 118 and the computed sequence generation 410 can be performed, such as one iteration, two iterations, or more than two iterations.
Using the same process for the second cluster 224, the second cluster 224 includes the following starting sequences and fragment sequences.
WHPRDAREMYKEY
Based on the overlapping values between the sequences, they can be aligned with each other as shown in
Based on the alignment, the aligned fragment sequences show a total of 22 positions (positions a through v), while the starting sequences together only have 21 positions. This difference of one position indicates that, more likely than not, the starting sequence may be missing a position (rather than the fragment sequences having an extraneous position). This is because, as mentioned above, providing the sequences based on MS/MS data may provide more accurate sequences for shorter sequences than for longer sequences and/or may provide higher confidence scores for shorter sequences than for longer sequences.
The marking of possible missing positions and possible non-missing positions is shown in
In the example, position q is marked as a possible missing position, and after position q, there are more marked possible missing positions (five) than marked possible non-missing positions (zero), as shown in
As shown by the examples in
Using position q as the location of the missing position in the starting sequence, and determining the computed values at each position, the fictitious computed sequence 422 is determined to be WHEREAREMYKEYSTHISTIME, as shown in
In various embodiments, after determining the computed sequence 422, the alignment process of operation 118 and the computed sequence generation 410 can be iterated again for the second cluster 224 using the computed sequence 422 as the starting sequence. In various embodiments, any total number of iterations of the alignment process of operation 118 and the computed sequence generation 410 can be performed, such as one iteration, two iterations, or more than two iterations.
The examples described above in connection with
In aspects of the present disclosure, the computed sequences may be validated based on consistency with raw data (i.e., fragment ions supporting residue corrections made) and based on BLAST results against the sample organism's genome or transcriptome (if available).
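A consistency check against raw data could, for example, compare theoretical fragment ions of a computed sequence with observed MS/MS peaks. The sketch below computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses and reports the fraction of theoretical ions found within a tolerance. The function names, the 0.02 Da tolerance, and the scoring rule are assumptions for illustration, and modifications and higher charge states are ignored.

```python
from typing import List, Tuple

# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide: str) -> Tuple[List[float], List[float]]:
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [MONO[r] for r in peptide]
    b_ions, y_ions = [], []
    total = 0.0
    for m in masses[:-1]:            # b ions grow from the N-terminus
        total += m
        b_ions.append(total + PROTON)
    total = 0.0
    for m in reversed(masses[1:]):   # y ions grow from the C-terminus
        total += m
        y_ions.append(total + WATER + PROTON)
    return b_ions, y_ions


def supported_fraction(peptide: str, observed_mz: List[float], tol: float = 0.02) -> float:
    """Fraction of theoretical b/y ions matched by an observed peak within `tol` Da."""
    b_ions, y_ions = fragment_ions(peptide)
    theoretical = b_ions + y_ions
    if not theoretical:
        return 0.0
    hits = sum(any(abs(t - o) <= tol for o in observed_mz) for t in theoretical)
    return hits / len(theoretical)
```

Because leucine and isoleucine have the same residue mass, a check of this kind cannot distinguish them.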
The foregoing disclosure described an approach for computing an amino acid sequence for a polypeptide using clustering, alignment, and computational techniques. Other approaches are contemplated to be within the present disclosure, such as machine learning approaches.
In accordance with aspects of the present disclosure, various machine learning models may be used to provide a computed sequence based on receiving one or more starting sequences and/or one or more fragment sequences. An example of a machine learning model for such a task is a sequence to sequence model, which receives an input sequence and provides an output sequence. A sequence to sequence model includes an encoder and a decoder. The encoder processes the input sequence to provide a vector, and the decoder processes the vector to provide an output sequence. The encoder and the decoder can each include a recurrent neural network (RNN), such as a long short-term memory (LSTM) network. Persons skilled in the art will understand how to implement sequence to sequence models. For training the sequence to sequence model, the input sequence(s) can be starting sequences and/or fragment sequences having known ground truth sequences, and the model can be trained so that the output sequences match the ground truth sequences as closely as possible.
The sequence to sequence model is merely an example, and other machine learning models are contemplated to be within the scope of the present disclosure. In various embodiments, the machine learning model(s) may be applied to one or more starting sequences and/or fragment sequences that have not been clustered. In various embodiments, the machine learning model(s) may be applied to one or more starting sequences and/or fragment sequences that have been clustered and that belong to the same cluster.
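Before any sequence to sequence model can be trained, the amino acid strings must be turned into integer tokens. The sketch below shows one possible encoding, in which a starting sequence and its fragment sequences are joined with a separator token to form the model input and the ground truth sequence is wrapped in start and end tokens to form the target. The vocabulary, the special tokens, and the function names are assumptions, and the model itself is not shown.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, SOS, EOS, SEP = 0, 1, 2, 3          # hypothetical special tokens
TOKEN = {aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)}


def _pad(ids: List[int], max_len: int) -> List[int]:
    """Right-pad a token list to `max_len`, rejecting inputs that are too long."""
    if len(ids) > max_len:
        raise ValueError("sequence longer than max_len")
    return ids + [PAD] * (max_len - len(ids))


def encode_input(starting: str, fragments: List[str], max_len: int) -> List[int]:
    """Encoder input: starting sequence, then each fragment after a SEP token."""
    ids = [TOKEN[aa] for aa in starting]
    for frag in fragments:
        ids.append(SEP)
        ids.extend(TOKEN[aa] for aa in frag)
    return _pad(ids, max_len)


def encode_target(ground_truth: str, max_len: int) -> List[int]:
    """Decoder target: ground truth sequence wrapped in SOS and EOS tokens."""
    return _pad([SOS] + [TOKEN[aa] for aa in ground_truth] + [EOS], max_len)


print(encode_input("AC", ["D"], 6))   # [4, 5, 3, 6, 0, 0]
print(encode_target("AC", 5))         # [1, 4, 5, 2, 0]
```

Per-position confidence scores, where available, could be supplied to the model as an additional input channel alongside these tokens.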
Referring now to
At block 510, the operation involves dividing a sample comprising a mixture of polypeptides into at least two fractions, where the at least two fractions include a starting fraction and at least one other fraction. The starting fraction includes polypeptides that may be intact native polypeptides. The at least one other fraction includes polypeptides to be cleaved.
At block 520, the operation involves contacting polypeptides in each of the at least one other fraction with at least one agent for cleavage of the polypeptides within the at least one other fraction, to provide at least one fragmented fraction. The at least one agent may perform the cleavage enzymatically or chemically. The result of the cleavage is that the fragmented fraction(s) include fragments of the polypeptides in the starting fraction.
At block 530, the operation involves analyzing polypeptides in the starting fraction and polypeptides in the at least one fragmented fraction based on tandem mass spectrometry (MS/MS) to provide starting sequences for polypeptides in the starting fraction and fragment sequences for polypeptides in the at least one fragmented fraction. The MS/MS provides data that may be used by software such as PEAKS® to provide an amino acid sequence for the polypeptides and may be used to provide confidence scores for the amino acid sequences and/or confidence scores for individual positions in the amino acid sequences. The sequences for the polypeptides in the starting fraction may be sequences for intact native polypeptides. The sequences for the polypeptides in the fragmented fraction(s) may be sequences for fragments of the native polypeptides.
At block 540, the operation involves grouping the starting sequences and the fragment sequences into at least one cluster, where each of the at least one cluster includes at least one starting sequence of the starting sequences and at least one fragment sequence of the fragment sequences. The grouping of sequences into clusters may be based on one or more clustering criteria. The clustering criteria may use confidence scores if they are available.
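As a minimal sketch of one possible clustering criterion, each fragment sequence can be assigned to the starting sequence with which it shares the longest contiguous run of residues, provided the run meets a minimum length. This criterion, the `min_overlap` threshold, and the function names are assumptions; confidence scores are not used here. All example strings other than WHPRDAREMYKEY are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def longest_shared_run(a: str, b: str) -> int:
    """Length of the longest contiguous substring common to `a` and `b`."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def cluster(
    starting_seqs: List[str], fragment_seqs: List[str], min_overlap: int = 4
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group fragments under their best-matching starting sequence.

    Returns (clusters, unassigned). A cluster is kept only if it holds at
    least one fragment, so every cluster has a starting sequence and at
    least one fragment sequence.
    """
    clusters: Dict[str, List[str]] = {s: [] for s in starting_seqs}
    unassigned: List[str] = []
    for frag in fragment_seqs:
        scored = [(longest_shared_run(s, frag), s) for s in starting_seqs]
        best_len, best_seq = max(scored, key=lambda t: t[0], default=(0, ""))
        if best_len >= min_overlap:
            clusters[best_seq].append(frag)
        else:
            unassigned.append(frag)
    return {s: f for s, f in clusters.items() if f}, unassigned


groups, leftover = cluster(
    ["WHPRDAREMYKEY", "GGGGSSSSTTTT"], ["AREMYKEYS", "SSSTT", "QQQQ"]
)
print(groups)    # {'WHPRDAREMYKEY': ['AREMYKEYS'], 'GGGGSSSSTTTT': ['SSSTT']}
print(leftover)  # ['QQQQ']
```

An exact shared run is a strict criterion; because de novo sequences contain residue errors, a tolerant similarity measure may be preferable in practice.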
At block 550, the operation involves, for each cluster of the at least one cluster, generating a computed sequence for the respective cluster based on values in the at least one starting sequence and values in the at least one fragment sequence of the respective cluster, where the computed sequence serves as a computationally determined sequence for a polypeptide in the sample. The computed sequence may be a consensus sequence for an intact native polypeptide. In various embodiments, generating the computed sequence involves positionally aligning the at least one starting sequence and the at least one fragment sequence of a cluster with each other, to provide aligned sequences having a plurality of aligned positions, and for each position of the plurality of aligned positions of the aligned sequences, determining a computed value for the respective position based on values of the aligned sequences at the respective position. Then, the computed values for the plurality of aligned positions form the computed sequence for the cluster.
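The per-position computation at block 550 can be sketched as a confidence-weighted vote: at each aligned position, the residue with the largest summed confidence becomes the computed value. The sketch assumes the sequences are already aligned and expressed as an offset, a residue string, and per-residue confidence scores. The weighting rule, the function name, and the example scores and fragments are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Aligned = Tuple[int, str, List[float]]  # (offset, residues, per-residue confidence)


def computed_sequence(aligned: List[Aligned]) -> str:
    """Confidence-weighted consensus over already-aligned sequences."""
    votes: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for offset, residues, confidences in aligned:
        if len(residues) != len(confidences):
            raise ValueError("one confidence score is required per residue")
        for i, (residue, score) in enumerate(zip(residues, confidences)):
            votes[offset + i][residue] += score
    if not votes:
        return ""
    out = []
    for pos in range(max(votes) + 1):
        column = votes.get(pos)
        # Positions covered by no sequence are reported as a gap.
        out.append(max(column, key=column.get) if column else "-")
    return "".join(out)


aligned = [
    (0, "WHPRDARE", [0.9, 0.9, 0.3, 0.9, 0.3, 0.9, 0.9, 0.9]),  # starting sequence
    (0, "WHERE", [0.8] * 5),                                     # fragment
    (2, "EREARE", [0.8] * 6),                                    # fragment
]
print(computed_sequence(aligned))  # WHEREARE
```

In this example the low-confidence residues P and D of the starting sequence are outvoted by the two fragments, which agree on E at both positions.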
The computing components include an electronic storage 610, a processor 620, a memory 640, and a network interface 630. The various components may be communicatively coupled with each other. The processor 620 may be and may include any type of processor, such as a single-core central processing unit (CPU), a multi-core CPU, a microprocessor, a digital signal processor (DSP), a System-on-Chip (SoC), or any other type of processor. The memory 640 may be a volatile type of memory, e.g., RAM, or a non-volatile type of memory, e.g., NAND flash memory. The memory 640 includes processor-readable instructions that are executable by the processor 620 to cause the system to perform various operations, including those mentioned herein, such as the operations described in connection with of
The electronic storage 610 may be and include any type of electronic storage used for storing data, such as a hard disk drive, a solid state drive, and/or an optical disc, among other types of electronic storage. The electronic storage 610 stores processor-readable instructions for causing the system to perform its operations and stores data associated with such operations, such as storing data relating to any of the sequences, clusters, or confidence scores, among other data. The electronic storage 610 may be a non-transitory processor readable medium. The network interface 630 may implement networking technologies, such as Ethernet, Wi-Fi, and/or other wired or wireless networking technologies.
The components shown in
The embodiments disclosed herein are examples of the disclosure and may be embodied in various forms. For instance, although certain embodiments herein are described as separate embodiments, each of the embodiments herein may be combined with one or more of the other embodiments herein. Specific structural and functional details disclosed herein are not to be interpreted as limiting, but as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present disclosure in virtually any appropriately detailed structure. Like reference numerals may refer to similar or identical elements throughout the description of the figures.
The phrases “in an embodiment,” “in embodiments,” “in various embodiments,” “in some embodiments,” or “in other embodiments” may each refer to one or more of the same or different embodiments in accordance with the present disclosure. A phrase in the form “A or B” means “(A), (B), or (A and B).” A phrase in the form “at least one of A, B, or C” means “(A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).”
The systems, devices, and/or servers described herein may utilize one or more processors to receive various information and transform the received information to generate an output. The processors may include any type of computing device, computational circuit, or any type of controller or processing circuit capable of executing a series of instructions that are stored in a memory. The processor may include multiple processors and/or multicore central processing units (CPUs) and may include any type of device, such as a microprocessor, graphics processing unit (GPU), digital signal processor, microcontroller, programmable logic device (PLD), field programmable gate array (FPGA), or the like. The processor may also include a memory to store data and/or instructions that, when executed by the one or more processors, cause the one or more processors to perform one or more methods and/or algorithms.
Any of the herein described methods, programs, algorithms or codes may be converted to, or expressed in, a programming language or computer program. The terms “programming language” and “computer program,” as used herein, each include any language used to specify instructions to a computer, and include (but are not limited to) the following languages and their derivatives: Assembler, Basic, Batch files, BCPL, C, C+, C++, Delphi, Fortran, Java, JavaScript, machine code, operating system command languages, Pascal, Perl, PL1, Python, scripting languages, Visual Basic, metalanguages which themselves specify programs, and all first, second, third, fourth, fifth, or further generation computer languages. Also included are database and other data schemas, and any other meta-languages. No distinction is made between languages which are interpreted, compiled, or use both compiled and interpreted approaches. No distinction is made between compiled and source versions of a program. Thus, reference to a program, where the programming language could exist in more than one state (such as source, compiled, object, or linked) is a reference to any and all such states. Reference to a program may encompass the actual instructions and/or the intent of those instructions.
It should be understood that the foregoing description is only illustrative of the present disclosure. Various alternatives and modifications can be devised by those skilled in the art without departing from the disclosure. Accordingly, the present disclosure is intended to embrace all such alternatives, modifications and variances. The embodiments described with reference to the attached drawing figures are presented only to demonstrate certain examples of the disclosure. Other elements, steps, methods, and techniques that are insubstantially different from those described above and/or in the appended claims are also intended to be within the scope of the disclosure.
This application claims the benefit of and priority to U.S. Provisional Application No. 63/518,552, filed on Aug. 9, 2023, the entire contents of which are incorporated by reference herein.
This invention was made with government support under grant number 2131062 awarded by the National Science Foundation. The government has certain rights in the invention.
Number | Date | Country
---|---|---
63518552 | Aug 2023 | US