In some embodiments, the present invention is related to computer methods/systems for optimizing search results by querying a plurality of databases according to false discovery rates (also referred to as random hit rates).
Numerous data sources are being created and maintained all over the world. The number of data sources almost guarantees that part or all of a query can be answered using one of these data sources. The mere task of executing a query can be daunting regardless of whether the scope of the query is within the confines of a local computing system, a private network, a local area network, or the World Wide Web. The process of querying these data sources is made more difficult because users must decide which data sources are sufficiently reliable to yield a meaningful search result. For example, the user must consider both the relative accuracy of the sources and the timeliness of the data contained within the sources. These and other shortcomings are addressed herein.
In the field of bioinformatics, attempts are made to assign a putative source (e.g. translation from RNA, synthesis from DNA, etc.) to de novo peptide sequences. By definition, the putative sources are not fully experimentally confirmed and are, thus, flawed.
Described herein are embodiments of methods, systems, and devices generally directed to assigning a putative source to a de novo peptide sequence and/or creating a workflow for performing said assignment. In some embodiments, the present invention includes workflows that provide increased confidence in the assigned putative source in the absence of experimental confirmation of the source assignment.
In one embodiment, a putative source of a peptide sequence can be determined based at least in part on one or more searches of the peptide sequence within one or more databases such that the one or more searches are performed in order of increasing random hit rate until the putative source is determined. The random hit rate for each respective search can be determined based at least in part on a number of random peptide sequences that are found by the respective search. The one or more databases can include, but are not limited to: an expanded human proteome database, a human genome database, a non-endogenous proteome database, additional databases, and combinations thereof. The one or more searches can include, but are not limited to: a linear human proteome search for the peptide sequence within the expanded human proteome database, a linear human genome search of translations of the human genome database, a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database, a linear non-endogenous search for the peptide sequence within the non-endogenous proteome database, a cis-spliced search within the expanded human proteome database, and a trans-spliced search within the expanded human proteome database. Each of the searches can indicate a respective potential source of the searched peptide sequence when the respective search finds a match. The putative source determined for the searched peptide sequence can be the potential source identified by the search step that has the lowest random hit rate among the search steps that found a match for the searched peptide sequence.
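By way of non-limiting illustration, the ordered search described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the names `SearchStep`, `exact_match`, `one_mismatch`, and `assign_putative_source` are hypothetical, the "database" is a toy in-memory list of protein strings, and the random hit rate values are invented placeholders assumed to have been estimated beforehand.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SearchStep:
    source_label: str             # potential source reported when the step finds a match
    random_hit_rate: float        # assumed to have been estimated beforehand
    finds: Callable[[str], bool]  # True when the step finds the peptide


def exact_match(proteins: Sequence[str]) -> Callable[[str], bool]:
    """Linear search: the peptide occurs as a contiguous substring of a protein."""
    return lambda peptide: any(peptide in protein for protein in proteins)


def one_mismatch(proteins: Sequence[str]) -> Callable[[str], bool]:
    """Linear mismatch search: some window differs from the peptide at exactly one position."""
    def finds(peptide: str) -> bool:
        n = len(peptide)
        for protein in proteins:
            for i in range(len(protein) - n + 1):
                window = protein[i:i + n]
                if sum(a != b for a, b in zip(window, peptide)) == 1:
                    return True
        return False
    return finds


def assign_putative_source(peptide: str, steps: Sequence[SearchStep]) -> Optional[str]:
    """Run the steps in order of increasing random hit rate and stop at the first match."""
    for step in sorted(steps, key=lambda s: s.random_hit_rate):
        if step.finds(peptide):
            return step.source_label
    return None


# Toy data; the sequences and rates below are illustrative placeholders only.
toy_proteome = ["MKTAYIAKQRQISFVK", "GSHMLEDPVAG"]
steps = [
    SearchStep("linear mismatch (proteome)", 0.02, one_mismatch(toy_proteome)),
    SearchStep("linear (proteome)", 0.0001, exact_match(toy_proteome)),
]
```

In this sketch, a peptide found by the exact search is never passed to the mismatch search, so the label it receives comes from the step with the lowest random hit rate that found it.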
In one embodiment, peptide source search steps can be ordered to generate a peptide source assignment workflow. A plurality of random peptide sequences can be generated and each of the random peptide sequences can be searched by each peptide source search step. A random hit rate can be determined for each peptide source search step based at least in part on a number of the plurality of random peptide sequences found by the peptide source search step. The peptide source search steps can be ordered in the workflow from lowest random hit rate to highest random hit rate. The random hit rate can increase as the number of found random peptide sequences increases. The peptide source search steps can include, but are not limited to: a linear human proteome search for the peptide sequence within the expanded human proteome database, a linear human genome search of translations of the human genome database, a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database, a linear non-endogenous search for the peptide sequence within the non-endogenous proteome database, a cis-spliced search within the expanded human proteome database, and a trans-spliced search within the expanded human proteome database.
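By way of non-limiting illustration, the workflow generation described above may be sketched as follows. The sketch assumes uniformly random peptides over the 20 standard amino acids and uses two stand-in predicates in place of real database searches; the function names, the toy protein, and the peptide count and length are hypothetical choices made only for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_peptides(count: int, length: int, seed: int = 0) -> List[str]:
    """Generate `count` uniformly random peptide sequences of a fixed length."""
    rng = random.Random(seed)
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length)) for _ in range(count)]


def random_hit_rate(finds: Callable[[str], bool], peptides: List[str]) -> float:
    """Fraction of the random peptides that the search step finds."""
    hits = sum(1 for peptide in peptides if finds(peptide))
    return hits / len(peptides)


def order_workflow(
    steps: Dict[str, Callable[[str], bool]], peptides: List[str]
) -> List[Tuple[str, float]]:
    """Return (step name, random hit rate) pairs, lowest rate first."""
    rates = {name: random_hit_rate(finds, peptides) for name, finds in steps.items()}
    return sorted(rates.items(), key=lambda item: item[1])


# Stand-in search steps over a toy protein (assumptions, not real searches).
toy_protein = "MKTAYIAKQRQISFVK"
search_steps = {
    # strict: the peptide must occur contiguously in the protein
    "linear": lambda peptide: peptide in toy_protein,
    # permissive: every residue of the peptide occurs somewhere in the protein
    "permissive": lambda peptide: set(peptide) <= set(toy_protein),
}

workflow = order_workflow(search_steps, random_peptides(count=1000, length=3))
```

Because every peptide found by the strict step is also found by the permissive step, the strict step cannot have the higher random hit rate and is therefore placed first in the resulting workflow.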
Described herein are embodiments of methods of an invention comprising generating a plurality of simulated random queries; determining, based on applying the plurality of simulated random queries to each source of a plurality of sources, a number of matches associated with each source; determining, based on the numbers of matches associated with each source, a false discovery rate associated with each source; and generating, based on the false discovery rates, a query support data structure configured to facilitate application of a new query to the plurality of sources.
Also described are embodiments of the methods comprising receiving a query; applying, based on a query support data structure, the query to one or more sources of a plurality of sources; determining, based on a query result, a label associated with a source of the plurality of sources associated with the query result; and applying the label to the query.
The various steps of the methods disclosed herein, or steps carried out by the systems disclosed herein, may be carried out at the same or different times, in the same or different geographical locations, e.g., countries, and/or by the same or different people.
Additional advantages of the embodiments of the methods and systems will be set forth in part in the description which follows, and in part will be understood from the description, or may be learned by practice of the embodiments of the methods and systems. The advantages of the disclosed embodiments of the methods and systems will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several embodiments of the disclosed methods and systems and together with the description, serve to explain the principles of the disclosed methods and systems.
The disclosed methods and systems may be understood more readily by reference to the following detailed description of particular embodiments and the Example included therein and to the Figures and their previous and following description.
It is understood that the disclosed methods and systems are not limited to the particular methodology, protocols, and reagents described as these may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention which will be limited only by the appended claims.
It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural reference unless the context clearly dictates otherwise. Thus, for example, reference to “a peptide” includes a plurality of such peptides, reference to “the peptide” is a reference to one or more peptides and equivalents thereof known to those skilled in the art, and so forth.
The term “peptide” can be used interchangeably with “polypeptide” and refers to a polymeric form of amino acids of any length, which can include genetically coded and non-genetically coded amino acids, chemically or biochemically modified or derivatized amino acids, and peptides having modified peptide backbones. In some aspects, the term peptide refers to a string of two or more naturally occurring amino acids.
“Optional” or “optionally” means that the subsequently described event, circumstance, or material may or may not occur or be present, and that the description includes instances where the event, circumstance, or material occurs or is present and instances where it does not occur or is not present.
Throughout the description and claims of this specification, the word “comprise” and variations of the word, such as “comprising” and “comprises,” means “including but not limited to,” and is not intended to exclude, for example, other additives, components, integers or steps. In particular, in methods stated as comprising one or more steps or operations it is specifically contemplated that each step comprises what is listed (unless that step includes a limiting term such as “consisting of”), meaning that each step is not intended to exclude, for example, other additives, components, integers or steps that are not listed in the step.
As used herein, the term “computer-readable representation of protein sequence” can include a sequence listing of a protein itself, a genetic sequence (e.g. DNA, RNA) from which a protein sequence can be derived through a process (e.g. transcription, translation) understood by a person skilled in the pertinent art, and/or portions thereof. Similarly, as used herein, the term “computer-readable representations of translations from ribonucleic acids (RNAs)” can include a sequence listing of a protein or peptide that can be translated (at least in theory) from the RNAs as understood by a person skilled in the pertinent art, a genetic sequence of the RNA, a genetic sequence of DNA from which the RNAs can (at least in theory) be transcribed as understood by a person skilled in the pertinent art, and/or portions thereof. As used herein, the term “computer-readable representations of translations from RNAs” can refer to specific types of RNA including messenger RNAs (mRNAs), non-coding RNAs, long non-coding RNAs, micro RNAs, and other types of RNAs as understood by a person skilled in the pertinent art. Computer-readable representations of translations from a specific type of RNA can include a sequence listing of a protein or peptide that can be translated (at least in theory) from the specific type of RNAs as understood by a person skilled in the pertinent art, a genetic sequence of the specific type of RNA, a genetic sequence of DNA from which the specific type of RNA can (at least in theory) be transcribed as understood by a person skilled in the pertinent art, and/or portions thereof.
The terms “random hit rate” and “false discovery rate” are used interchangeably herein and are understood to mean a frequency at which randomly generated inputs are found by a search of a database.
An “individual” or “subject” or “animal” refers to humans, veterinary animals (e.g., cats, dogs, cows, horses, sheep, pigs, etc.) and experimental animal models of diseases (e.g., mice, rats). In some embodiments, the subject is a human.
Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of skill in the art to which the disclosed methods belong. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present method and compositions, the particularly useful methods, devices, and materials are as described. Publications cited herein and the material for which they are cited are hereby specifically incorporated by reference. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such disclosure by virtue of prior invention. No admission is made that any reference constitutes prior art. The discussion of references states what their authors assert, and applicants reserve the right to challenge the accuracy and pertinence of the cited documents. It will be clearly understood that, although a number of publications are referred to herein, such reference does not constitute an admission that any of these documents forms part of the common general knowledge in the art.
Devices and/or components of the system 100 may connect to and/or communicate with each other via a network 106. The network 106 may be a public network, a private network, and/or a combination thereof. The network 106 may support any wired and/or wireless communication technology and/or technique. For example, the network 106 may include and/or support a cellular network, a data network, a content delivery network, a fiber-optic network, and/or any other type of network.
The system 100 may include a user device 102 (e.g., a computing device, a client device, a smart device, etc.). The user device 102 may comprise a communication element 103 for providing an interface to a user to interact with the user device 102 and/or any other device/component of the system 100. The communication element 103 may be any interface for presenting and/or receiving information to/from the user, such as user feedback. An interface may include a display and/or interactive interface (e.g., a keyboard, a touchscreen, a mouse, an audio controller, etc.). An interface may include a communication interface such as a web browser (e.g., Internet Explorer®, Mozilla Firefox®, Google Chrome®, Safari®, or the like). Other software, hardware, and/or interfaces may be used to provide communication between the user and one or more of the user device 102 and/or any other device/component of the system 100. The communication element 103 may request or query various files from a local source and/or a remote source, such as computing devices 107-112, and/or any other device/component of the system 100. The computing devices 107-112 may be disposed locally or remotely relative to the user device 102.
The communication element 103 may transmit/send data to a local or remote device, such as the computing devices 107-112, and/or any other device/component of the system 100 via wired and/or wireless communication techniques. For example, the communication element 103 may utilize any suitable wired communication technique, such as Ethernet, coaxial cable, fiber optics, and/or the like. The communication element 103 may utilize any suitable long-range communication technique, such as Wi-Fi (IEEE 802.11), BLUETOOTH®, cellular, satellite, infrared, and/or the like. The communication element 103 may utilize any suitable short-range communication technique, such as BLUETOOTH®, near-field communication, infrared, and the like.
The user device 102 may receive and/or analyze data/information, such as query information and/or the like. For example, the user device 102 may receive data/information, query information, and/or the like via the communication element 103. The data/information, query information, and/or the like may include any type of information, such as statistical queries, analytical queries, industry-specific queries (e.g., immunopeptidomics-related queries, bioinformatic-related queries, biotechnology-related queries, healthcare-related queries, business-related queries, chemistry-based queries, mathematical-based queries, etc.).
The user device 102 may include a query module 105 that may analyze data/information, such as query information and/or the like. The query module 105 may be software, hardware, and/or a combination of software and hardware. The query module 105 may be configured for natural language processing, syntax determination/analysis, query language (coding) processing/analysis, and/or the like.
The user device 102 (e.g., the query module 105) may receive and/or generate a query. For example, the user device may receive and/or generate a query such as “Was the health inspection score for XYZ restaurant the same in 2020 as it was in 2019?” In another example, the user device may receive and/or generate a query such as “What was the health inspection score for XYZ restaurant in 2020?” The query module 105 may use, for example, natural language processing, syntax determination/analysis, query language (coding) processing/analysis, and/or the like to determine/identify portions/components of the query. The portions/components of the query may include one or more data constraints, predicates, text strings, syntax elements, semantic components, and/or the like. The query module 105 may combine portions/components of the query to, for example, determine/generate a set expression.
Query-based set expression(s) may be applied to a data/information source and/or system to determine a result and/or the accuracy of results. A result may be an indication of an aggregate value/amount of data records, for example, a number/quantity of matches, hits, correspondences, and/or the like between portions/components of the query and one or more data records stored by and/or associated with the source and/or system. The number/quantity of matches, hits, correspondences, and/or the like may be evaluated and/or compared against a threshold, such as a data discovery threshold. If the number/quantity of matches, hits, correspondences, and/or the like satisfies and/or exceeds the data discovery threshold, the query module 105 may create a data record, provide an indication of, and/or assign a label to the source and/or system. The label may indicate, for example, the type and/or quantity of matches, hits, correspondences, and/or the like associated with the source and/or system. The label may indicate any data/information relevant to queries applied to the source and/or system and/or a corresponding result.
The user device 102 may evaluate the efficacy of any source and/or system for outputting a result of a query. For example, the user device 102 (e.g., the query module 105, etc.) may send queries to and/or process queries based on one or more data sources. For simplicity and example, the computing devices 107-112 may represent one or more data sources and/or one or more search engines. Although not shown, the computing devices 107-112 may each represent a plurality of associated data sources, systems, devices, repositories, and/or the like. For example, the computing devices 107-112 may each include and/or be associated with a database (e.g., a data store, a data repository, etc.). The databases may include any type of databases, such as the Internet, in-memory/centralized databases, distributed databases, operational databases, relational databases, cloud-based databases, object-oriented databases, query language-based databases (e.g., NoSQL, etc.), graph databases, and/or the like. The databases may include any data/information. In an embodiment, each of the computing devices 107-112 may represent a different search engine configured to search the same database (e.g., the Internet).
To evaluate the efficacy of the computing devices 107-112 for outputting a result of a query, the user device 102 (e.g., the query module 105, etc.) may apply one or more queries to one or more of the computing devices 107-112 and determine false discovery rates (FDRs) associated with the computing devices 107-112. For example, the user device 102 (e.g., the query module 105, etc.) may determine/generate a plurality of simulated random queries. The plurality of simulated random queries may be, for example, uniform random queries, weighted random queries, and/or any other type of query. The plurality of simulated random queries may be, for example, immunopeptidomics-related queries and/or bioinformatics/biotechnology-related queries, such as queries associated with a plurality of simulated random peptide sequences. The plurality of simulated random queries may be generated by any known technique. For example, a random number/letter/word generator may be used to generate a plurality of simulated random queries and/or test queries/cases. The quantity of simulated random queries may vary based upon the type of query, which may impact, for example, a number of combinations and/or permutations of the simulated queries. For example, a number of simulated queries for restaurants, airfare, and the like may differ from a number of simulated queries for DNA, RNA, and/or amino acid sequences. In an embodiment, the number of simulated queries may be constrained by a specified length of the simulated queries. For example, the simulated queries may be limited to a number of characters and/or words. In some embodiments, the number of simulated queries may range anywhere from, and including, 10 queries to 10,000,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 10 queries to 1,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 10 queries to 10,000 queries.
In some embodiments, the number of simulated queries can be, but is not limited to, 10 queries to 100,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 10 queries to 1,000,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 100 queries to 1,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 100 queries to 10,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 100 queries to 100,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 100 queries to 1,000,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 1,000 queries to 100,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 10,000 queries to 100,000 queries. In some embodiments, the number of simulated queries can be, but is not limited to, 100,000 or more queries. In some embodiments, the number of simulated queries can be, but is not limited to, 1,000,000 or more queries. In some embodiments, the number of simulated queries can be at least 100,000 queries. In some embodiments, the number of simulated queries can be at least 1,000,000 queries.
For example, the query module 105 may use an application such as MySQL and/or the like to generate a plurality (e.g., tens, hundreds, millions, etc.) of simulated random queries, and/or test queries/cases via a suitable grammar/format. A suitable grammar may be any grammar, language, syntax, encoding, and/or the like understood/executable by the query module 105. The query module 105 may use query templates to generate queries of any suitable grammar/format. Query templates may be generated according to a scripting language. A query template may map and/or correspond to a particular test case. The query module 105 may determine a result and/or expected result for a query determined from a query template by applying the query to a source and/or system, such as the computing devices 107-112.
The query module 105 may generate/determine random queries based on a query determined, for example, from a query template. The query module 105 may apply the random queries to each of the computing devices 107-112 and determine which of the computing devices 107-112 output a positive and/or expected result. The output and/or expected result may be, for example, based on the ability of the computing devices 107-112 to process any given semantic and/or syntax of a query and retrieve data/information associated with the semantic and/or syntax. The user device 102 may determine/generate, for example, based on the output of each of the computing devices 107-112, a false discovery rate associated with each of the computing devices 107-112.
In an aspect, randomly generated queries may be incorrect, nonsensical, and/or illogical queries designed to evaluate the false discovery rate of any source and/or system, such as the computing devices 107-112. For example, a query template may be used to generate a query such as “What is the price for an airplane ticket to Dubai?” The query module 105 may determine/generate incorrect, nonsensical, and/or illogical versions and/or permutations of the query, such as: “what is the price for an apple to daylight,” “when is the price of an airplane to develop,” “Dubai airplane ticket currency,” “airflow ticket when the price is low,” etc. Incorrect, nonsensical, and/or illogical versions and/or permutations of a query may be determined based on, for example, synonyms, phonetic relationships, and/or the like of elements (e.g., predicates, constraints, conditions, indicators, portions, etc.) of the query. Incorrect, nonsensical, and/or illogical versions and/or permutations of a query may be determined by rearranging elements of a query. Incorrect, nonsensical, and/or illogical versions and/or permutations of a query may be determined by any method.
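By way of non-limiting illustration, one of the techniques mentioned above, rearranging elements of a query, may be sketched as follows. The sketch treats whitespace-separated words as the query elements, which is a simplifying assumption; the function name `scrambled_variants` and the fixed seed are hypothetical.

```python
import random
from typing import List


def scrambled_variants(query: str, count: int, seed: int = 0) -> List[str]:
    """Produce `count` word-order rearrangements of `query` that differ from the original."""
    rng = random.Random(seed)
    words = query.split()
    if len(set(words)) < 2:
        return []  # no rearrangement can differ from the original
    variants: List[str] = []
    while len(variants) < count:
        shuffled = words[:]
        rng.shuffle(shuffled)
        if shuffled != words:  # discard the identity permutation
            variants.append(" ".join(shuffled))
    return variants


template_query = "what is the price for an airplane ticket to Dubai"
variants = scrambled_variants(template_query, count=5)
```

Each variant contains the same words as the template query in a different order; synonym-based or phonetic substitutions would require additional resources that are not sketched here.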
The query module 105 may determine how frequently the computing devices 107-112 output results for incorrect, nonsensical, and/or illogical versions and/or permutations of a query, such as a plurality of random queries. How frequently the computing devices 107-112 output the results to incorrect, nonsensical, and/or illogical versions and/or permutations of a query may be indicated by and/or correspond to the number of matches associated with each of the computing devices 107-112. The false discovery rate (FDR) for any given computing device 107-112 may be determined as a function of the number of matches and the number of the plurality of random queries. Determining the FDR for the computing device 107-112 based on the number of matches associated with each computing device 107-112 may include dividing the number of matches by the number of the plurality of random queries. In an embodiment, determining the FDR may take into account a relevancy score associated with a match provided by the computing device 107-112. For example, a search engine may identify a match and assign a relevancy score to the match indicating how relevant the match is to the query. Each search engine may use a proprietary relevancy scoring technique. A match may count towards an FDR determination if a simulated query returns a match with a relevancy score exceeding a threshold.
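By way of non-limiting illustration, the FDR determination described above (the number of counted matches divided by the number of random queries, optionally counting only matches whose relevancy score exceeds a threshold) may be sketched as follows. The input representation, one best relevancy score per random query with `None` meaning no match, and the sample scores are assumptions made for illustration.

```python
from typing import Iterable, Optional


def false_discovery_rate(
    best_scores: Iterable[Optional[float]], relevancy_threshold: float = 0.0
) -> float:
    """FDR = (random queries with a counted match) / (random queries applied).

    `best_scores` holds, for each random query, the best relevancy score the
    source returned, or None when the source returned no match at all.
    """
    scores = list(best_scores)
    if not scores:
        raise ValueError("at least one random query is required")
    matches = sum(
        1 for score in scores if score is not None and score > relevancy_threshold
    )
    return matches / len(scores)


# Illustrative scores for ten random queries applied to one source.
sample_scores = [None, 0.2, None, 0.9, None, None, 0.95, None, None, None]
fdr_any_match = false_discovery_rate(sample_scores)          # 3 of 10 queries matched
fdr_relevant = false_discovery_rate(sample_scores, 0.5)      # 2 of 10 exceed 0.5
```

Raising the relevancy threshold can only lower or preserve the computed rate, since fewer matches are counted against the same number of random queries.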
The user device 102 may, based on the false discovery rates associated with each of the computing devices 107-112, determine/generate a query support data structure configured to facilitate the application of a new query to the computing devices 107-112. For example, in an embodiment, the computing devices 107-112 may be, include, and/or be associated with search engines (e.g., Google®, Yahoo®, Bing®, Firefox®, etc.) and/or a similar data source, data repository, and/or data access system.
The order of the data sources may be based on a false discovery rate associated with each source. The query support data structure 200 may indicate one or more search techniques for one or more of the data sources 107-112. For example, the query support data structure 200 may, in column 202, indicate a plurality of search techniques for a single data source (e.g., the data source 107, etc.), a single search technique for the data sources 107-112, and combinations thereof. The query support data structure 200 may comprise an identifier, in column 201, of a data source of the data sources 107-112, indicated in an order according to a false discovery rate. The false discovery rate may optionally be indicated, for example, in column 203. Data sources associated with a lower false discovery rate may be searched before data sources with a higher false discovery rate are searched. For each data source indicated in the query support data structure 200, additional data may be included. The additional data may comprise one or more of: a location of the data source, a query syntax, one or more query parameters, combinations thereof, and/or the like.
The query may be labeled based on which data source returns a query result. The label may be indicative of source data/information associated with the query. For example, the label may indicate one or more levels of accuracy of results returned by a source based on the query. As another example, the label may indicate one or more of: text data, multimedia data, statistical data, historical data, private/secured data, public data, and/or any other label of the type of data returned by a source based on the query.
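By way of non-limiting illustration, a query support data structure with the three columns described above may be sketched as follows. The class name `SupportEntry`, the field names, the `extra` mapping for additional data, and all values shown are hypothetical; only the column roles (201: identifier, 202: search technique, 203: false discovery rate) are taken from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SupportEntry:
    source_id: str               # column 201: identifier of the data source
    search_technique: str        # column 202: search technique for that source
    false_discovery_rate: float  # column 203: optional FDR used for ordering
    extra: Dict[str, str] = field(default_factory=dict)  # e.g. location, query syntax


def build_query_support(entries: List[SupportEntry]) -> List[SupportEntry]:
    """Order the rows so that lower-FDR entries are searched first."""
    return sorted(entries, key=lambda entry: entry.false_discovery_rate)


# Illustrative rows; a single source may appear with several search techniques.
query_support = build_query_support([
    SupportEntry("107", "fuzzy match", 0.05, {"location": "local"}),
    SupportEntry("108", "exact match", 0.01),
    SupportEntry("107", "exact match", 0.001, {"location": "local"}),
])
```

A new query can then be applied row by row, so that a source and technique pairing with a lower false discovery rate is always tried before one with a higher rate.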
By way of example, shown in
Additionally, each data source of the plurality of data sources 307-309 may be associated with a threshold, such as a data discovery threshold applied to relevancy scores of matches. A data discovery threshold may be a system-defined threshold and/or a user-defined threshold. In an embodiment, a data source associated with a low false discovery rate may be associated with a low data discovery threshold as the data source is generally associated with “good” results and any matches from the data source should be subject to less strict relevancy requirements. A data source associated with a high false discovery rate may be associated with a high data discovery threshold as the data source is associated with less “good” results and any matches from the data source should be subject to stricter relevancy requirements. In another embodiment, a data source associated with a low false discovery rate may be associated with a high data discovery threshold as the data source is generally associated with “good” results and is more likely to contain a relevant result. A data source associated with a high false discovery rate may be associated with a low data discovery threshold as the data source is associated with less “good” results and a low data discovery threshold may be necessary in order to determine a relevant result. In an embodiment, a data discovery threshold may be determined and/or set by a user, for example, via a user interface.
In an embodiment, each data source of the plurality of data sources 307-309 may be associated with the same or a different data discovery threshold. For example, when a query is applied to a first data source, a first data discovery threshold may dictate that a match exists only if the match has a relevancy score greater than the first data discovery threshold (e.g., 85%). If no match satisfies the first data discovery threshold, the query may be applied to a second data source associated with a second data discovery threshold that dictates that a match exists only if the match has a relevancy score greater than the second data discovery threshold (e.g., 90%). If no match satisfies the second data discovery threshold, then the query may be applied to a third data source associated with a third data discovery threshold that dictates that a match exists only if the match has a relevancy score greater than the third data discovery threshold (e.g., 95%). If no match satisfies the third data discovery threshold, then no results are output.
When applying the query 300 to the data sources, if a match is found that satisfies a data discovery threshold (e.g., a system-determined threshold, a user-configurable threshold, etc.) for the query 300 in and/or via the data source 307, the result may receive a first label (Highly Accurate Results) at 312 and all relevant and/or possible results may be included in the output. Otherwise, the query 300 may be applied to a next data source 308. If a match is found that satisfies a data discovery threshold (e.g., a system-determined threshold, a user-configurable threshold, etc.), the result may receive a second label (Likely Accurate Results) at 313 and all relevant and/or possible results may be included in the output. Otherwise, the query 300 may be applied to a next data source 309. If a match is found that satisfies a data discovery threshold (e.g., a system-determined threshold, a user-configurable threshold, etc.), the result may receive a third label (Accurate Results) at 314 and all relevant and/or possible results may be included in the output. If no matches are determined/identified, the non-result may receive a fourth label (No Results) at 316.
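By way of non-limiting illustration, the labeled cascade described above may be sketched as follows, using the example thresholds (85%, 90%, 95%) and labels from the description. The stand-in search functions, their canned results, and the names `run_cascade` and `Stage` are hypothetical; real data sources would be queried over a network or database interface that is not sketched here.

```python
from typing import Callable, List, Sequence, Tuple

Match = Tuple[str, float]  # (result text, relevancy score in the range 0 to 1)
Stage = Tuple[Callable[[str], List[Match]], float, str]  # (search, threshold, label)


def run_cascade(query: str, stages: Sequence[Stage]) -> Tuple[str, List[str]]:
    """Apply the query stage by stage; return the label and results of the first
    stage with at least one match scoring above that stage's threshold."""
    for search, threshold, label in stages:
        accepted = [text for text, score in search(query) if score > threshold]
        if accepted:
            return label, accepted
    return "No Results", []


# Stand-in data sources returning canned matches (illustrative only).
def source_307(query: str) -> List[Match]:
    return [("record A", 0.80)]


def source_308(query: str) -> List[Match]:
    return [("record B", 0.93), ("record C", 0.50)]


def source_309(query: str) -> List[Match]:
    return [("record D", 0.99)]


stages = [
    (source_307, 0.85, "Highly Accurate Results"),
    (source_308, 0.90, "Likely Accurate Results"),
    (source_309, 0.95, "Accurate Results"),
]
label, results = run_cascade("example query", stages)
```

Here the only match from the first source falls below its threshold, so the query proceeds to the second source, whose qualifying match determines the label; the third source is never consulted.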
Turning now to an exemplary embodiment of the disclosed methods and systems to de novo peptide sequencing,
Tandem mass spectrometry (MS/MS) has become a leading high-throughput technology for protein identification. A tandem mass spectrometer 504 may be configured for ionizing a mixture of peptides in a sample 502 with different peptide sequences and measuring their respective parent mass/charge ratios, selectively fragmenting each peptide into pieces and measuring the mass/charge ratios of the fragment ions. The tandem mass spectrometer 504 may be, as non-limiting examples, a Linear Ion Trap Mass spectrometer (LTQ) combined with a Fourier Transform Ion Cyclotron Resonance Mass Spectrometer (LTQ-FT). Thus, a tandem mass spectrum can be viewed as a collection of fragment masses from a single peptide. This collection, or set, of fragment masses, or fragment mass values, is a “fingerprint” that identifies the peptide. The peptide sequencing problem is then to derive the sequence of the peptides given their MS/MS spectra. For an ideal fragmentation process and an ideal mass spectrometer, the sequence of a peptide could be easily determined by converting the mass differences of the consecutive ions in a spectrum to the corresponding amino acids. This ideal situation would occur if the fragmentation process could be controlled so that each peptide was cleaved between every two consecutive amino acids and a single charge was retained on only the N-terminal piece. In practice, however, the fragmentation processes in mass spectrometers are not ideal.
The problem for tandem mass spectrometry peptide sequencing is, given a spectrum S, the ion types Δ, and the mass m, finding a peptide of mass m with the maximal match to spectrum S. Peptide fragmentation in a tandem mass spectrometer can be characterized by a set of numbers Δ={δ1, . . . , δk} representing ion types. A δ-ion of a partial peptide P′⊂P is a modification of P′ that has mass m(P′)−δ. For tandem mass spectrometry, the theoretical spectrum of peptide P can be calculated by subtracting all possible ion types {δ1, . . . , δk} from the masses of all partial peptides of P (i.e., every partial peptide generates k masses in the theoretical spectrum). An (experimental) spectrum S={s1, . . . , sm} is a set of masses of fragment ions. A match between spectrum S and peptide P is the number of masses that experimental and theoretical spectra have in common.
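As a non-limiting illustration, the theoretical spectrum and the match count defined above can be sketched in Python. The sketch assumes that the partial peptides of P are its proper prefixes and suffixes and uses nominal (integer) residue masses for a small subset of amino acids so that masses can be compared exactly. Both choices, and all function names, are assumptions of this sketch.

```python
# Nominal integer residue masses for an illustrative subset of amino acids.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99,
                "T": 101, "L": 113, "I": 113, "N": 114, "H": 137}


def mass(peptide):
    """Mass m(P') of a (partial) peptide as the sum of its residue masses."""
    return sum(RESIDUE_MASS[aa] for aa in peptide)


def partial_peptides(peptide):
    """Proper prefixes and suffixes of the peptide (assumed definition)."""
    n = len(peptide)
    return [peptide[:i] for i in range(1, n)] + [peptide[i:] for i in range(1, n)]


def theoretical_spectrum(peptide, deltas):
    """Subtract every ion type delta from the mass of every partial peptide."""
    return {mass(pp) - d for pp in partial_peptides(peptide) for d in deltas}


def match_score(spectrum, peptide, deltas):
    """Number of masses shared by the experimental and theoretical spectra."""
    return len(set(spectrum) & theoretical_spectrum(peptide, deltas))
```

A practical implementation would compare real-valued masses within a tolerance rather than requiring exact equality.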
LTQ-FT mass spectrometers can generate on the order of 100,000 spectra per day per machine. Software is a significant and limiting factor in mass spectrometry proteomics analysis—typical large datasets may require days or weeks of computational time on expensive computers or grids. Most peptide identification algorithms use database search methods that match the spectra against a protein database.
A computing device 512 may be configured to analyze the mass spectrometry data (e.g., the fragment mass spectrum 506) generated by the mass spectrometer 504 to identify one or more amino acids based upon a comparison of information derived from the mass spectrometry data to information contained within a protein sequence library 508. In some implementations, a user operating the computing device 512 may access a mass spectrometry data analyzer 514 executing upon the computing device 512. In some implementations, the user supplies the mass spectrometry data generated by the mass spectrometer 504 to the mass spectrometry data analyzer 514. The user, in other implementations, selects the mass spectrometry data from available mass spectrometry data (e.g., previously downloaded, transferred, or otherwise made available to the computing device 512 by the mass spectrometer 504). In some implementations, the mass spectrometer 504 includes the computing device 512. For example, the computing device 512 may be implemented as one or more computer processors functioning within a mass spectrometer system. Each implementation is understood to describe additional embodiments of the method and system described herein.
In some implementations, the mass spectrometry data analyzer 514 calculates additional data from the mass spectrometry data. For example, based upon the experimental information contained within the mass spectrometry data, the mass spectrometry data analyzer 514 may calculate a mass-charge ratio of ions (e.g., calculated as centroids of the peaks in the so-called “profile” spectra), the relative intensities of the peaks, and/or electric charge.
In an embodiment, sub-sequences contained in the protein sequence library 508 are used as a basis for predicting a plurality of mass spectra 510. The predicted mass spectra 510 of the sub-sequences may be compared, using the mass spectrometry data analyzer 514 of the computing device 512, to the experimentally-derived fragment spectrum 506 to identify one or more of the predicted mass spectra which most closely match the experimentally-derived fragment spectrum 506.
In an embodiment, de novo peptide sequencing may be implemented using, for example, a spectrum graph approach, wherein a spectrum is represented as a graph with peaks as vertices that are connected by edges if their mass difference corresponds to the mass of an amino acid. The vertices of the spectrum graph are further scored based on peak intensities and neutral losses, and a peptide sequence is obtained by finding a longest path in the graph. De novo peptide sequencing can be viewed as a search in the database of all possible peptides. For a typical spectrum identified in a database search, there may be hundreds, and even thousands, of very different peptide sequences that match the spectrum. As a result, de novo peptide sequencing algorithms output multiple peptide reconstructions rather than a single reconstruction.
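As a non-limiting illustration, a simplified spectrum graph can be sketched in Python. The sketch treats each peak mass as a vertex, connects two vertices when their mass difference equals the nominal mass of an amino acid, and finds a longest path by dynamic programming over the peaks in increasing mass order. It does not score vertices by peak intensity or neutral losses, it uses integer masses for a subset of residues, and it returns only one reconstruction. These simplifications and all names are assumptions of this sketch.

```python
# Nominal integer residue masses for an illustrative subset of amino acids.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99,
                "T": 101, "L": 113, "I": 113, "N": 114, "H": 137}

# Invert the table: one mass may correspond to several residues (e.g. I and L).
MASS_TO_RESIDUES = {}
for residue, residue_mass in RESIDUE_MASS.items():
    MASS_TO_RESIDUES.setdefault(residue_mass, []).append(residue)


def longest_path_sequence(peaks):
    """Return the residue string spelled by a longest path in the spectrum graph."""
    peaks = sorted(set(peaks))
    # best[p] = (number of edges, sequence) of the best path ending at peak p
    best = {p: (0, "") for p in peaks}
    for j, q in enumerate(peaks):
        for p in peaks[:j]:
            residues = MASS_TO_RESIDUES.get(q - p)
            if residues and best[p][0] + 1 > best[q][0]:
                # Ambiguous residues of equal mass are written as a bracketed set.
                label = (residues[0] if len(residues) == 1
                         else "[" + "".join(sorted(residues)) + "]")
                best[q] = (best[p][0] + 1, best[p][1] + label)
    return max(best.values(), key=lambda entry: entry[0])[1]
```

Because the peaks are processed in increasing mass order, every edge points from a lighter peak to a heavier one, so the graph is acyclic and the longest path can be found in a single pass.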
In an embodiment, the protein sequence library 508 may comprise a spectral dictionary that may be used to generate a full length peptide reconstruction with a high probability of containing the correct peptides. However, an unsolved problem is how many reconstructions must be generated to avoid losing the correct peptide. Generating too few peptides will lead to false negative errors while generating too many peptides will lead to false positive errors. Some de novo algorithms output a single or a fixed number (decided before the search) of peptides. For some spectra, generating only one reconstruction may be enough to guarantee finding the correct peptide while in other cases (even with the same parent mass), a thousand reconstructions may be insufficient. The problem of generating varying numbers of reconstructions for each spectrum becomes particularly important for long peptides with the increasing complexity of the search space.
Predicted peptide sequences resulting from the comparison of the mass spectrometry data to the protein sequence library 508 by the mass spectrometry data analyzer 514 may be provided to a query module 505. The query module 505 may be configured for identifying a source of a peptide sequence using a plurality of data sources 518A-518N in communication with the query module via a network 520. The plurality of data sources 518A-518N may comprise any number and any type of data source. The plurality of data sources 518A-518N may each include and/or be associated with a database (e.g., a data store, a data repository, etc.). The databases may include any type of databases, such as in-memory/centralized databases, distributed databases, operational databases, relational databases, cloud-based databases, object-oriented databases, query language-based databases (e.g., NoSQL, etc.), graph databases, and/or the like. The databases may include any data/information, such as data/information associated with peptides and/or the like.
In an embodiment, the data sources 518A-518N may comprise an expanded human proteome database. The expanded human proteome database can include computer-readable representations of protein sequences. The expanded human proteome database can include computer-readable representations of translations of non-coding RNAs. The expanded human proteome database can include long non-coding RNAs (lncRNAs). The expanded human proteome database can include micro RNAs (miRNAs), which are a type of non-coding RNA. The expanded human proteome database can include RNA transcribed from human endogenous retroviruses (HERVs). The expanded human proteome database can further include messenger RNAs (mRNAs), which canonically code for proteins. In some embodiments, at least a portion of the computer-readable representations of protein sequences of the expanded human proteome database can be associated with a specific subject so the workflow can assign a subject-specific putative source to de novo peptide sequences derived from the subject.
The expanded human proteome database can include peptides from non-canonically translated regions of the human genome, i.e., peptides from regions annotated as non-coding. The expanded human proteome database can include a portion or all of OpenProt, and/or one or more databases including similar data as a portion or all of OpenProt as understood by a person skilled in the pertinent art. OpenProt is disclosed, for example, in Brunet M. A., Brunelle M., Lucier J.-F., Delcourt V., Levesque M., Grenier F., et al. (2019). OpenProt: A More Comprehensive Guide to Explore Eukaryotic Coding Potential and Proteomes. Nucleic Acids Res. 47, D403-D410. 10.1093/nar/gky936, which is incorporated herein by reference in its entirety. The expanded human proteome database can include computer-readable representations of protein sequences representing translations of non-coding RNA by virtue of including a portion or all of OpenProt and/or one or more databases including non-coding RNA sequences and/or translations thereof. OpenProt uses a polycistronic model of eukaryotic genomes and includes all open reading frames (ORFs) at least 30 codons long.
The expanded human proteome database can include translations of lncRNAs, i.e., from non-canonically translated regions of the human genome. LncRNAs were first characterized as mRNA-like non-coding RNAs in that they undergo splicing and have features such as a poly(A) signal/tail, while an arbitrary criterion of ‘transcripts longer than 200 nucleotides’ was later added to their ‘definition’. The expanded human proteome database can include a portion or all of NONCODE, and/or one or more databases including similar data as a portion or all of NONCODE as understood by a person skilled in the pertinent art. NONCODE is disclosed, for example, in Bu, D. et al. NONCODE v3.0: Integrative annotation of long noncoding RNAs. Nucleic Acids Res. 40, D210-5 (2012), which is incorporated herein by reference in its entirety. The expanded human proteome database can include computer-readable representations of protein sequences representing translations of lncRNA by virtue of including a portion or all of NONCODE and/or one or more databases including lncRNA sequences and/or translations thereof.
The expanded human proteome database can include translations of miRNAs, a type of non-coding RNA with a length of about 22 bases. Typically, miRNAs regulate gene expression by blocking translation of specific mRNAs and causing their degradation. The expanded human proteome database can include a portion or all of miRBase, and/or one or more databases including similar data as a portion or all of miRBase as understood by a person skilled in the pertinent art. miRBase is disclosed, for example, in Kozomara, A., Birgaoanu, M. & Griffiths-Jones, S. miRBase: from microRNA sequences to function. Nucleic Acids Res. 47, D155-D162 (2019), which is incorporated herein by reference in its entirety. The expanded human proteome database can include computer-readable representations of protein sequences representing translations of miRNA by virtue of including a portion or all of miRBase and/or one or more databases including miRNA sequences and/or translations thereof.
The expanded human proteome database can include transcriptions of HERVs, human genome sequences corresponding to endogenous viral elements. The expanded human proteome database can include a portion or all of gEVE, and/or one or more databases including similar data as a portion or all of gEVE as understood by a person skilled in the pertinent art. gEVE is disclosed, for example, in Nakagawa, S. & Takahashi, M. U. gEVE: a genome-based endogenous viral element database provides comprehensive viral protein-coding sequences in mammalian genomes. Database (Oxford). (2016) doi:10.1093/database/baw087, which is incorporated herein by reference in its entirety. The expanded human proteome database can include computer-readable representations of protein sequences representing translations of HERVs by virtue of including a portion or all of gEVE and/or one or more databases including HERV sequences and/or translations thereof.
The expanded human proteome database can include mRNAs by virtue of including a portion or all of UniProt and/or one or more databases including similar data as a portion or all of UniProt as understood by a person skilled in the pertinent art. The expanded human proteome database can include UniProt, to the extent that OpenProt utilizes UniProt, by virtue of the expanded human proteome database including OpenProt. Additionally, or alternatively, UniProt or a portion thereof can be included separately from OpenProt within the expanded human proteome database. In a preferred embodiment, the expanded human proteome database includes UniProt reviewed and/or one or more databases including similar data as a portion or all of UniProt reviewed as understood by a person skilled in the pertinent art. In some embodiments, the expanded human proteome database includes UniProt unreviewed and/or one or more databases including similar data as a portion or all of UniProt unreviewed as understood by a person skilled in the pertinent art.
The expanded proteome database can be stored in a single memory or distributed across multiple memories. The expanded proteome database can include multiple disparate databases that can be queried as one database through a single query of a workflow such as, but not limited to the workflow illustrated in
In an embodiment, the data sources 518A-518N can include a human genome database including all or a portion of the human genome, from which computer-readable representations of proteins can be computationally synthesized. The human genome includes approximately three billion base pairs of deoxyribonucleic acid (DNA) that make up the entire set of chromosomes of the human organism. The human genome includes the coding regions of DNA, which encode all the genes (between 20,000 and 25,000) of the human organism, as well as the non-coding regions of DNA, which do not encode any genes. In some embodiments, the human genome database can include the entirety of the human genome including coding and non-coding regions of DNA. In some embodiments, the human genome database can include non-coding portions and/or frame reads of the human genome, excluding portions and/or frame reads of the human genome from which the mRNA and non-coding RNA of the expanded human proteome database are transcribed. In some embodiments, proteins can be computationally synthesized based on one, two, three, four, five, and/or six frame translations of all or a portion of the human genome, such that some portions of the human genome may or may not be translated using the same number of frame reads as other portions of the human genome.
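As a non-limiting illustration, a six-frame translation of a DNA sequence can be sketched in Python as follows. The sketch uses the standard genetic code, writes stop codons as "*", and writes any codon containing a character other than A, C, G, or T as "X". The function names and the "+1" to "-3" frame labels are assumptions of this sketch.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate consecutive codons; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

Translating fewer than six frames for a given region, as contemplated above, would amount to selecting a subset of the returned frames.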
In an embodiment, the data sources 518A-518N can include a non-endogenous proteome database including computer-readable representations of proteins and/or peptides originating from sources non-endogenous to humans including, but not limited to, bacterial sources, viral sources, and other organisms. In an embodiment, the non-endogenous proteome database can include the NCBI BLAST database, and/or one or more databases including similar data as a portion or all of NCBI BLAST as understood by a person skilled in the pertinent art. NCBI BLAST is disclosed, for example, in Johnson, M. et al. NCBI BLAST: a better web interface. Nucleic Acids Res. 36, W5-9 (2008), which is incorporated herein by reference in its entirety. The data sources 518A-518N can include computer-readable representations of protein sequences representing translations of sources non-endogenous to humans by virtue of including a portion or all of NCBI BLAST and/or one or more databases including such sequences and/or translations thereof.
In an embodiment, the data sources 518A-518N can include computer-readable representations of proteins and/or peptides that are subject-specific, associated with an individual subject. These subject-specific data can be incorporated into one or more databases disclosed herein (e.g. expanded human proteome database, human genome database, non-endogenous proteome database, etc.) and/or included in a separate subject-specific database.
The query module 505 may utilize a query support data structure 516 to guide the identification process. The query support data structure 516 may indicate an order of search steps in which the query is to be applied to the plurality of data sources. The order may be based on a random hit rate associated with each search step. The query support data structure 516 may indicate one or more search techniques for one or more of the plurality of data sources 518A-518N. The query support data structure 516 may indicate a plurality of search techniques for a single data source, a single search technique for a plurality of data sources 518A-518N, and combinations thereof.
The query support data structure 516 can include a peptide source assignment workflow for assigning a putative source to a peptide sequence input to the workflow, wherein the putative source indicates a most likely origin of the peptide sequence. Each search step of the query support data structure 516 can include a peptide source search step indicating a respective potential source of the peptide sequence when the peptide source search step finds a match. A linear expanded human proteome source can be indicated by a linear human proteome search for the peptide sequence within the expanded human proteome database. A linear genome source can be indicated by a linear human genome search of translations of the human genome database. A linear mismatch can be indicated by a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database, a linear mismatch search for peptides having a mismatch to a peptide derived from a translation of the human genome, and/or a linear mismatch search of a subject-specific database. A linear non-endogenous proteome source can be indicated by a linear non-endogenous search for the peptide sequence within the non-endogenous proteome database. A cis-spliced human proteome source can be indicated by a cis-spliced search of the expanded human proteome database. A trans-spliced human proteome source can be indicated by a trans-spliced search of the expanded human proteome database. The putative source assigned to the peptide sequence can be the potential source found earliest in the workflow, i.e., the potential source indicated by the matching search step having the lowest random hit rate.
In an embodiment, the query support data structure 600 may have been previously generated or may be generated as needed. The query support data structure 600 may be generated by, for example, generating a plurality of simulated random queries, determining, based on applying the plurality of simulated random queries to each search step, a number of matches associated with each search step, determining, based on the numbers of matches associated with each search step, a random hit rate associated with each search step, and generating, based on the random hit rates, the query support data structure configured to facilitate application of a new query to the plurality of sources. The plurality of simulated random queries may comprise at least one of a plurality of uniform random queries or a plurality of weighted random queries. Uniform random queries (e.g., peptide sequences) may be generated by randomly sampling all amino acids uniformly. Weighted random queries (e.g., peptide sequences) may be generated by randomly sampling amino acids with frequencies of amino acids matching those found in vertebrates. Determining, based on the numbers of matches associated with each source, the random hit rate associated with each search step may comprise a function of the number of matches and a number of the plurality of simulated random queries. As a non-limiting example, the random hit rate associated with each source may be determined by dividing the number of matches by a number of the plurality of simulated random queries. The random hit rate may further be dependent upon the size and/or complexity of the data source being searched.
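As a non-limiting illustration, the generation of simulated random queries, the computation of a random hit rate as the number of matches divided by the number of simulated queries, and the ordering of search steps by increasing random hit rate can be sketched in Python. The function names and the representation of a search step as a callable returning True on a match are assumptions of this sketch. Weighted sampling takes the amino acid frequencies as an input, and no particular frequency table is implied.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptides(count, length, weights=None, seed=0):
    """Generate simulated random peptide queries.

    weights=None samples all amino acids uniformly; otherwise `weights`
    gives one relative frequency per letter of AMINO_ACIDS.
    """
    rng = random.Random(seed)
    return ["".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))
            for _ in range(count)]


def random_hit_rate(search_step, queries):
    """Fraction of simulated random queries for which the step finds a match."""
    hits = sum(1 for query in queries if search_step(query))
    return hits / len(queries)


def order_by_random_hit_rate(search_steps, queries):
    """Return (name, rate) pairs sorted by increasing random hit rate."""
    rates = {name: random_hit_rate(step, queries)
             for name, step in search_steps.items()}
    return sorted(rates.items(), key=lambda item: item[1])
```

The sorted list returned by order_by_random_hit_rate plays the role of the search order recorded in the query support data structure.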
In an embodiment, the mass spectrometry data may be used, or processed and then used, as a query to be applied to one or more of the plurality of data sources 518A-518N according to the query support data structure 600. The query may be further processed prior to being applied to one or more of the plurality of data sources 518A-518N. In an embodiment, one or more permutations of the query may be determined. For example, one or more permutations of a peptide sequence may be determined and the one or more permutations used as queries in addition to the original query. For example, a peptide sequence provided as the query to the workflow of the query support data structure 600 can include one or more ambiguous residues. For example, leucine (L) and isoleucine (I) have the same mass; therefore, it is impossible to differentiate them in de novo sequencing. To account for this, for a given peptide containing I/L, all permutations of I and L residues may be considered such that the associated permutated peptide sequences are provided as queries to the workflow of the query support data structure 600. For example, for the peptide “ATTSLLHN” (SEQ ID NO:1), four possible permutations exist: ATTSLLHN (SEQ ID NO:1), ATTSLIHN (SEQ ID NO:2), ATTSILHN (SEQ ID NO:3), and ATTSIIHN (SEQ ID NO:4). Each permutated peptide sequence may be used as a query. Each permutated peptide sequence can be assigned a respective putative source according to the peptide source assignment workflow of the query module 505. The assigned putative sources of the permutations are, in turn, potential sources for the provided peptide sequence having ambiguous residue(s). The potential source indicated by the peptide source search step having the lowest random hit rate can be assigned as the putative source of the provided peptide sequence having ambiguous residue(s). Further, the permutations of the provided peptide can be filtered to remove those permutations not assigned the putative source.
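As a non-limiting illustration, the enumeration of I/L permutations described above can be sketched in Python. The function name il_permutations is an assumption of this sketch.

```python
from itertools import product


def il_permutations(peptide):
    """Enumerate every peptide obtained by writing each I or L as either I or L."""
    options = [("I", "L") if residue in "IL" else (residue,)
               for residue in peptide]
    return ["".join(choice) for choice in product(*options)]


permutations = il_permutations("ATTSLLHN")
```

For a peptide with n ambiguous I/L positions, the sketch returns 2^n sequences, including the original sequence.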
At a first peptide source search step 703, a query (701 and 702) may be applied to an expanded human proteome database to identify an identical match. If an identical match is found for any permutation, the peptide sequence may be labeled as “Linear,” at 704 and all possible protein sources of the peptide may be included in the output of the workflow. The peptide sequence 701 and permutations 702 found by the linear human proteome search for the peptide sequence within the expanded human proteome database 703 can be assigned a linear expanded human proteome source. The assigned source can be included in the output of the workflow. The permutations found by the linear human proteome search within the expanded human proteome database 703 can be included in the output of the workflow.
At a second peptide source search step 705, BLAT, or a similar alignment tool, may be used to apply the query (701 and 702) to the frames of the translated human genome. BLAT is disclosed, for example, in Genome Res. 2002 April; 12(4): 656-664. BLAT--The BLAST-Like Alignment Tool, which is incorporated herein by reference in its entirety. An example BLAT command may be, as a non-limiting example, “blat -t=dnax -q=prot -minScore=7 -stepSize=1 hg38.2bit Fasta_query output.psl”, followed by “psl2bed < output.psl > perfect_match.bed”. If an identical match is found, the peptide sequence may be labeled as “Linear,” at 706 and possible source sequences may be included in the output. The peptide sequence 701 and permutations 702 thereof found by the linear human genome search 705 can be assigned a linear genome source. The assigned source can be included in the output of the workflow.
At a third peptide source search step 707, the peptide sequences of the query (701 and 702) may be mapped to the expanded human proteome database 703, permitting a number of mismatches (as a non-limiting example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like mismatches). In an embodiment, the number of mismatches may be 1. An example BLAT command may be, for example, “blat -t=prot -q=prot -minScore=7 -stepSize=1 combined DB.processed.fasta Fasta_query output_blat_hits.psl”. If a peptide sequence with a mismatch is found (by way of example, 1 mismatch), the peptide sequence may be labeled as “one mismatch” at 708. The peptide sequence 701 and permutations thereof 702 found by the linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database 707 are assigned, as a source, the linear mismatch of the expanded human proteome. The assigned source can be included in the output of the workflow.
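As a non-limiting illustration, a mismatch search can be sketched in Python using a sliding-window comparison that counts substituted positions. This is a simplified stand-in for an alignment tool such as BLAT and is not the BLAT algorithm. The function name, the dictionary representation of the database, and the choice to report only windows with at least one mismatch are assumptions of this sketch.

```python
def mismatch_hits(peptide, proteins, max_mismatches=1):
    """Find windows in `proteins` that differ from `peptide` by
    1..max_mismatches substitutions (no insertions or deletions).

    `proteins` maps a protein name to its sequence (toy representation).
    Exact matches are skipped because an earlier linear step handles them.
    """
    hits = []
    k = len(peptide)
    for name, sequence in proteins.items():
        for start in range(len(sequence) - k + 1):
            window = sequence[start:start + k]
            mismatches = sum(a != b for a, b in zip(peptide, window))
            if 0 < mismatches <= max_mismatches:
                hits.append((name, start, window))
    return hits
```

Each hit records the protein name, the zero-based start position, and the database window, so that the source sequence can be reported in the output of the workflow.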
At a fourth peptide source search step 709, the peptide sequences of the query (701 and 702) may be mapped to other organisms at 709, for example by using the BLAST NCBI tool. If any identical matches (e.g., a homologous match) are found the results may be annotated as “LINEAR BLAST” at 710. The peptide sequence 701 and permutations thereof 702 found by the linear non-endogenous search for the peptide sequence within the non-endogenous proteome database 709 are assigned the linear non-endogenous proteome source. The assigned source can be included in the output of the workflow. In some embodiments, the fourth peptide source search step 709 can be omitted, and the workflow illustrated in
At a fifth peptide source search step 711, the peptide sequences of the query (701 and 702) may be fragmented into 2 or more fragments (where each fragment is greater than 1 amino acid). The fragments may be used as a query applied to the expanded human proteome database. If there is a match for both fragments in the same protein, the peptide sequence may be labeled as “cis-spliced” at 712. The peptide sequence 701 and permutations thereof 702 found by the cis-spliced search of the expanded human proteome database 711 are assigned the cis-spliced human proteome source. The assigned source can be included in the output of the workflow.
At a sixth peptide source search step 713, if there are hits for both fragments in two different proteins, the peptide sequence may be labeled as “trans-spliced” at 714. The peptide sequence 701 and permutations thereof 702 found by the trans-spliced search of the expanded human proteome database 713 are assigned the trans-spliced human proteome source. The assigned source can be included in the output of the workflow. In some embodiments, the sixth peptide source search step 713 can be omitted, and the workflow illustrated in
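As a non-limiting illustration, the cis-spliced and trans-spliced searches can be sketched in Python for the two-fragment case. The sketch splits the peptide at every position that leaves at least two residues on each side and looks up each fragment by exact substring search. It does not check the order of, or distance between, fragments within a protein. The function name and the dictionary representation of the database are assumptions of this sketch.

```python
def classify_spliced(peptide, proteins, min_fragment=2):
    """Return "cis-spliced", "trans-spliced", or None for a peptide.

    `proteins` maps a protein name to its sequence (toy representation).
    A cis-spliced assignment takes precedence over a trans-spliced one,
    mirroring the order of the fifth and sixth search steps.
    """
    trans_found = False
    for cut in range(min_fragment, len(peptide) - min_fragment + 1):
        left, right = peptide[:cut], peptide[cut:]
        left_hits = {name for name, seq in proteins.items() if left in seq}
        right_hits = {name for name, seq in proteins.items() if right in seq}
        if left_hits & right_hits:
            return "cis-spliced"      # both fragments found in one protein
        if left_hits and right_hits:
            trans_found = True        # fragments found in different proteins
    return "trans-spliced" if trans_found else None
```

Because short fragments occur frequently by chance in a large database, a search of this kind tends to have a high random hit rate, which is consistent with placing it late in the workflow.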
Any remaining peptide sequences may be labeled as not assigned (N/A) at 715. The workflow can halt advancement to a subsequent peptide source search step upon assigning a putative source to a peptide sequence of the query 701, 702.
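As a non-limiting illustration, the halting behavior of the workflow can be sketched in Python. The search steps are supplied as an ordered list of (label, search function) pairs, assumed to be sorted by increasing random hit rate. The function name and this representation are assumptions of this sketch.

```python
def assign_putative_source(queries, ordered_steps):
    """Run the search steps in order and halt at the first step that matches.

    `queries` holds the peptide sequence and its I/L permutations.
    `ordered_steps` is a list of (label, search_fn) pairs, where
    search_fn(peptide) returns True on a match (hypothetical layout).
    Returns the label and the queries that matched at that step.
    """
    for label, search_fn in ordered_steps:
        matched = [query for query in queries if search_fn(query)]
        if matched:
            return label, matched   # later steps are never consulted
    return "N/A", []


# Toy example: a one-protein database and an exact-match linear step.
toy_proteome = ["MKATTSLIHNQ"]
linear_step = ("Linear", lambda p: any(p in seq for seq in toy_proteome))
source, kept = assign_putative_source(["ATTSLLHN", "ATTSLIHN"], [linear_step])
```

In the toy example only the permutation present in the database is retained, which corresponds to filtering out permutations that were not assigned the putative source.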
Returning to
Examples presented herein generally include a peptide source assignment workflow having search steps sequenced in order of increasing random hit rates and methods and systems for using and generating the peptide source assignment workflow. Examples presented infra are specific to labeling of peptides, although other applications, including those disclosed supra, can be performed following a similar methodology. The examples presented infra can reduce false labeling of peptides as cis-spliced and trans-spliced compared to previous systems and methodologies.
Antigen presenting cells use major histocompatibility (MHC) complexes I or II to present peptides to CD8+ or CD4+ T cells, respectively. Characterization of the peptides presented to T cells, known as the immunopeptidome, is being studied in the fields of infectious disease, autoimmunity, and cancer immunotherapy. Cancer-associated MHC-presented peptides that elicit an immune response are possible safe and effective targets for cancer immunotherapy. The discovery and characterization of the immunopeptidome can be achieved using a multitude of technologies such as whole-exome sequencing, RNA sequencing, ribosome profiling, and tandem mass spectrometry (MS/MS) based peptide sequencing. While next-generation sequencing approaches can characterize the potential endogenous immunopeptidome, only direct detection of peptides, such as by MS/MS, can provide experimental evidence for the existence of peptides presented by MHC complexes. Notably, besides using peptides bound to MHC complexes for characterization of an immunopeptidome, peptides can also originate from multiple other genetic and transcription-based aberrations. Examples of additional means for identifying aberrant peptides include cancer-specific gene and transposon overexpression (e.g., but not limited to, cancer-testis genes, transposons, and human endogenous retroviruses (HERVs)), alternative splicing, stop codon readthrough, or alternative open-reading frame translation.
Immunopeptidomics using peptide-MHC elution followed by MS/MS traditionally requires a reference database of potential peptides that might be detected. Recent advances in peptide spectra matching software allow omitting reference database searches to perform de novo sequencing, whereby the software identifies the sequences of unknown peptides, post-translational modifications (PTMs) and amino acid substitutions directly from MS/MS spectra. Using these methods, the diversity of peptides that may be bound to the MHC complex can be understood, but not their protein sources. For MHC I, the canonical mechanism of presenting peptides starts with proteasomal cleavage of proteins within the cytoplasm, generating fragments between 8 and 12 amino acids in length. Those peptides are then bound to the MHC I complex before its translocation to the cell membrane. However, some studies have suggested that, in addition to cleavage, proteasomes can catalyze the reverse reaction, ligating small peptides together in a process called proteasome-catalyzed peptide splicing (PCPS). While canonical cleavage generates peptides whose sequences are identical to the parental protein (herein called linear), pieces of spliced peptides can be from the same protein (herein called cis-spliced) or, theoretically, from different proteins (herein called trans-spliced).
Prior attempts to use a de novo sequencing approach to identify peptides of unknown origin (“cryptic peptides”) have identified many of these cryptic peptides as likely being generated through post-translational splicing. However, the abundance and even the existence of spliced peptides is a matter of controversy in the field. A strategy was previously developed to identify spliced peptides in the MHC-I immunopeptidome by mass spectrometry. A database was generated containing all possible cis-spliced peptides, allowing for MS/MS spectra to be queried for cis-spliced peptides. It was reported that about 30% of p-HLA are short-distance cis-spliced peptides. The same group also developed a pipeline for mapping the MHC class I spliced immunopeptidome of cancer cells. The study suggested a substantial (~25%) portion of peptides can be mapped to cis-spliced sequences in HCT116 and HCC1143 cell lines derived from colon and breast carcinomas, respectively. Trans-spliced peptides were excluded from analysis since their occurrence in vivo is controversial, and their addition to a database would massively increase its complexity. Later, a bioinformatics workflow called Hybridfinder was developed to identify linear, cis-spliced, and trans-spliced peptides. Hybridfinder first searches for exact matches of peptides in the UniProt human protein sequence database, then searches for all possible cis-spliced and then trans-spliced forms of that peptide in the human proteome. Hybridfinder was used to analyze MS/MS data containing peptides eluted from MHC I complexes purified from seventeen HLA-monoallelic cell lines. Cis- and trans-spliced peptides were found to represent up to 45% of MHC-bound peptides.
1. Expanding the Search for Sources of Non-Canonical Human Peptides
A strategy is disclosed herein for determining the order of putative sources when assigning sources to de novo sequenced peptides. The strategy is used to develop a peptide source assignment workflow that searches for the sources of peptides amongst multiple sources in a specific order, with the order optimized to minimize assignment of peptides to incorrect sources. For example, assignment of de novo peptides to post-translational cis- or trans-splicing occurs by chance extremely often, and most peptides can be attributed to other sources for which a match is less likely to occur by chance. As disclosed herein, a rigorous derivation of the optimal order of a peptide source assignment is presented and the workflow's utility in identifying the most plausible sources of de novo peptides is presented, thus furthering the understanding of the immunopeptidome.
Previous studies have shown that up to 45% of MHC-bound peptides do not map identically to the UniProt human proteome. The workflow disclosed herein includes databases developed of several other potential sources from which unmapped peptides may also stem. Peptides from non-canonically translated regions of the human genome, e.g., peptides from regions annotated as non-coding, were searched. For this source, OpenProt was used, which includes all open reading frames (ORFs) at least 30 codons long and which was supplemented with the rest of the human genome translated into six frames. Translations of known transcribed elements were also included, including long non-coding RNAs (lncRNAs), micro RNAs (miRNAs), and HERVs, which may be spliced and therefore contain sequences not found via translating genomic DNA. Below, this combination of sources is referred to as the expanded human proteome database. In addition, unknown SNPs, missense mutations, or recurrent errors in either transcription, translation, or MS amino acid identification could generate peptides with a single mismatch to a sequence encoded in the human proteome. Mismatched peptides were searched for using BLAT to align de novo peptide sequences to the expanded human proteome database with a single mismatch allowed. Finally, some peptides may originate from other organisms, especially bacterial or viral sources. For these sources, de novo peptide sequences were searched for in the BLAST database (see Methods).
2. Optimal Ordering of Putative Sources Through Estimation of Random Hit Rate
For each potential source (e.g., the computing devices 107-112 of
The random sequences were used to estimate the random hit rate of each potential source of peptides (
An enhanced peptide mapping pipeline to assign sources for peptides in order of increasing random hit rate was designed. When using either set of simulated data to order peptide sources by random hit rates, the enhanced pipeline searches for peptide sources in the following order: 1) the expanded human proteome database (assigned as linear), 2) the non-coding regions of the human genome using BLAT (assigned as linear), 3) single mismatch peptides in the expanded human proteome (assigned as linear), 4) the BLAST database (assigned as linear), 5) cis-spliced and 6) trans-spliced peptides in the expanded human proteome. When ordered in series, 4,495/5,000 (90%) of uniform random sequences and 4,847/5,000 (97%) of weighted random sequences are found with this pipeline (
3. Peptides Whose Sequences are Assigned with Higher Confidence During De Novo Sequencing are Identified in Earlier Parts of the Assignment Workflow
A test was performed to determine whether the proportion of peptides found in real experiments is consistent with the final order. The peptide source assignment workflow was applied to six novel immunopeptidomics data sets from IM9 and Raji cell lines (see Methods). During de novo sequencing, amino acid calls are given local confidence scores; the quality of the sequencing across the peptide can be quantified by the average local confidence (ALC %) score. The ALC % score is generated by MS/MS and associated with each de novo peptide sequence 701 (
It was hypothesized that peptides with higher ALC % are more likely to be assigned to more reliable sources with lower random hit rates, i.e., sources earlier in the workflow. Indeed, across six experiments the majority of peptides with the highest ALC % were found in the first source of the pipeline (linear expanded human proteome source), in stark contrast to the pattern of sources found for randomly generated peptides (
The second fragmentation in MS/MS experiments (MS2 scans) can be inherently noisy due to poor fragmentation or ionization of certain peptides. To evaluate the proportion of ambiguous de novo calls as a function of ALC %, a set of MS2 scans was taken from IM9 cell lines for which both de novo identified peptides as well as conventional database calls were available. As hypothesized, as the de novo ALC % goes down, so does the proportion of peptide calls that agree between de novo and conventional database searches (
4. Re-Analysis of Monoallelic Cell Line Data Using the Peptide Source Assignment Workflow
Peptide identification by the peptide source assignment workflow was compared versus hybridfinder on peptides eluted from MHC complexes on the data set from the hybridfinder publication: immunopeptidomics from a collection of cell lines engineered to express a single HLA allele. See
The peptide source assignment workflow presented here shows that putative spliced peptides are likely peptides stemming from mutated DNA sequences, non-canonically spliced RNA sequences, non-canonically translated regions of the human genome, mismatched human sequences or bacterial proteins. Altogether, 20% of peptides are assigned as spliced peptides with the workflow presented here, down from 29% using hybridfinder (
In some embodiments, the method of the present invention reduces identification of spliced peptides by 20-60%. In some embodiments, the method of the present invention reduces identification of spliced peptides by 30-60%. In some embodiments, the method of the present invention reduces identification of spliced peptides by 40-60%. In some embodiments, the method of the present invention reduces identification of spliced peptides by 50-60%. In some embodiments, the method of the present invention reduces identification of spliced peptides by 20-50%. In some embodiments, the method of the present invention reduces identification of spliced peptides by 30-40%.
In some embodiments, the method of the present invention reduces identification of spliced peptides by 5-70%. In some embodiments, the method of the present invention reduces identification of spliced peptides by 14-60%.
At each step, the random peptides mapping results were used to estimate how many peptides were likely found by chance. When compared to weighted random peptides, more peptides detected in cell lines were assigned as linear (P<1e-308, two-sided Fisher's exact test), linear with a single mismatch (P=3.16e-05), linear from the BLAST database (P=0.00126), cis-spliced (P=9.9e-92) and trans-spliced (P=3.29e-73) (
5. Peptides that Map Throughout the Human Genome are Enriched for Expressed Regions
The genomic annotations of peptides that map within the human genome, but outside of the UniProt proteome, were examined. The first three steps of the pipeline can map a peptide to regions of the human genome. In the first step, peptides that map exclusively in the OpenProt database land in ORFs that are not in the UniProt human proteome. The location of these peptides in the human genome was analyzed (
6. Peptides Identified by BLAST are not Enriched for any Microbial Genus
The search within the BLAST database has the highest random hit rate for linear peptides in the peptide source assignment workflow. While peptides from cell lines had modestly more matches in the BLAST database than would be expected based on the uniform or weighted random data (
7. Recurrent Novel Peptide
To identify reclassified common peptides across multiple datasets, peptides shared by more than three cell lines were selected. For example, QSPVALRPL (SEQ ID NO:5) is highly recurrent, and was identified as trans-spliced by the hybridfinder algorithm, but was reclassified as linear using the disclosed pipeline. The same peptide is listed in the Immune Epitope Database as being a part of an unidentified protein. Upon further inspection, this is an out-of-frame peptide in the FAM96A gene, a pro-apoptotic tumor suppressor in gastrointestinal stromal tumors (see, e.g., Schwamb et al. Int. J. Cancer (2015) September 15; 137(6):1318-29 incorporated by reference herein in its entirety). If out-of-frame translation is specific to cancer samples, this peptide could be a cancer immunotherapy target.
With the increasing number of peptides identified in immunopeptidomics experiments using de novo sequencing, the need for better characterization of the immunopeptidome is more pressing than ever. Previous studies identified PCPS as the primary source of peptides of unknown protein identity. Described herein is a peptide source assignment workflow that assigns parental proteins of de novo sequenced peptides from several sources with lower random hit rate than the set of all possible PCPS peptides. It was found that 32% of putative PCPS peptides can be explained by single mismatches with known proteins, translation of supposedly untranslated parts of the human genome, or bacterial and viral peptides. Not surprisingly, the majority of peptides are encoded by known expressed regions. Finally, a recurrent out-of-frame peptide was identified in the tumor suppressor gene FAM96A that could be of interest as a cancer immunotherapy target.
1. Datasets
i. Simulated Random Peptides
Two sets of peptide sequences were simulated for random hit rate estimation. The “random” built-in python library was used to produce sets of 8-12 length amino acid sequences, 1,000 peptides for each length, a total of 5,000 random peptides in each set. For the first simulated peptide sequence set, all amino acids have an equal probability of being incorporated into a sequence; this set is referred to as “uniform random”. In the second set, the amino acids have a probability of being incorporated that matches their frequency in vertebrates; this set is referred to as “weighted random”. The two sets of peptide sequences are included in Table 1.
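The simulation described above can be sketched as follows. This is a minimal illustration rather than the code actually used: the function name, the seeds, and in particular the weights are assumptions. The placeholder weights below are NOT vertebrate amino acid frequencies; they only show where such frequencies would be supplied.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def simulate_peptides(weights=None, lengths=range(8, 13), per_length=1000, seed=0):
    """Return per_length random peptides for each length in lengths.

    weights: optional dict mapping amino acid -> relative frequency.
    With weights=None every amino acid is equally likely ("uniform random").
    """
    rng = random.Random(seed)
    w = None if weights is None else [weights[aa] for aa in AMINO_ACIDS]
    peptides = []
    for length in lengths:
        for _ in range(per_length):
            peptides.append("".join(rng.choices(AMINO_ACIDS, weights=w, k=length)))
    return peptides


# "Uniform random" set: 5 lengths x 1,000 peptides = 5,000 peptides.
uniform_random = simulate_peptides()

# "Weighted random" set. Placeholder weights for illustration only;
# real vertebrate frequencies would be substituted here.
placeholder_weights = {aa: 1.0 for aa in AMINO_ACIDS}
placeholder_weights["L"] = 2.0
weighted_random = simulate_peptides(weights=placeholder_weights, seed=1)
```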
ii. IM9 and Raji Cell Line Immunopeptidomics
Three replicates of IM9 and Raji cell line were processed through MS/MS:
IFNγ can enhance expression of surface major histocompatibility complex (HLA) molecules and increase the processing and presentation of tumor-specific antigens, facilitating T-cell recognition and cytotoxicity. IFNγ also up-regulates many components of the antigen presenting pathway and induces a shift from constitutive proteasome subunits to immunoproteasome subunits, which have different catalytic activity in the proteasome, generating a different population of HLA-associated peptides. IFNγ treatment of the cell lines was used to increase the chance of expanding the immunopeptidome detectable by mass spectrometry.
a. Immunoprecipitation
HLA-Pan Class I (W6/32) columns were prepared using NHS-activated Sepharose 4 beads (GE Healthcare 17090601) and a coupling buffer of 0.2M sodium bicarbonate and 0.5M sodium chloride; they were washed with 0.1M Tris hydrochloride with a pH of 8.5, and 0.1M acetate buffer. Affinity purification was performed under gravity and the flow-through was captured for further analysis. 0.1M glycine (Sigma) pH 2.7 was used to elute bound HLA molecules under gravity (
Peptide and HLA fractions were placed on a SpeedVac Vacuum Concentrator (Thermo) for 2 hours. After the SpeedVac step, each sample was resuspended in 0.10% trifluoroacetic acid. The peptide fractions were purified further using a C-18 ZipTip® (Cat no: ZTC185096 Millipore). All samples were then analyzed with the Orbitrap Fusion™ Lumos™ Tribrid™ Mass Spectrometer (Thermo) for peptide sequencing.
b. Data Analysis
Raw data files from the Orbitrap Fusion™ Lumos™ (Thermo) LC/MS were searched with the PEAKS® Studio X (BSI) proteomics software against the human UniProt database and custom databases for proteins of interest, and by de novo sequencing.
iii. HLA-Monoallelic Immunopeptidomics
For the MS/MS data from HLA-I monoallelic cell lines, the peptides were downloaded from the supplementary table of Faridi, P. et al. Sci. Immunol. Vol 3, issue 28, pg 3947, October 12 (2018), incorporated herein by reference in its entirety. The data includes the expression of eight HLA-A alleles (A0101, A0203, A0204, A0207, A0301, A3101, A6802, A2402) and nine different HLA-B alleles (B5801, B5703, B5701, B4402, B5101, B0801, B1502, B2705, B0702). In total, there were more than 51,000 unique peptides.
2. Recapitulation of Hybridfinder
For ease of comparison to the described peptide source assignment workflow, the hybridfinder workflow was recapitulated as described in Faridi, P. et al. Sci. Immunol. Vol 3, issue 28, pg 3947, October 12 (2018), incorporated herein by reference in its entirety. First, each peptide is sought in the UniProt human reference proteome database. Peptides with identical matches are annotated as linear. For peptides with no linear matches, all possible splits of that peptide where the length of the smaller piece is longer than 1 amino acid were generated. Then, potential matches for each fragment were searched through the database. The peptide was annotated as cis-spliced if identical matches of both fragments were detected in a single protein. The matches can be reverse-ordered. Otherwise, if the matches are available in two distinct proteins, the peptide was annotated as trans-spliced. Peptides for which no split pairs match to any protein sequences are annotated as not available (N/A).
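The recapitulated logic can be illustrated with a short sketch. It is a simplified stand-in, not the hybridfinder code: the proteome is assumed to be a small in-memory dictionary of protein name to sequence, matching is by plain substring search, and the function names and toy sequences are hypothetical.

```python
def split_pairs(peptide, min_len=2):
    """All (left, right) splits in which both fragments have >= min_len residues."""
    return [(peptide[:i], peptide[i:])
            for i in range(min_len, len(peptide) - min_len + 1)]


def classify_hybridfinder_style(peptide, proteome):
    """Annotate a peptide as linear, cis-spliced, trans-spliced, or N/A.

    proteome: dict mapping protein name -> amino acid sequence (toy stand-in
    for the UniProt human reference proteome).
    """
    if any(peptide in seq for seq in proteome.values()):
        return "linear"
    trans_found = False
    for left, right in split_pairs(peptide):
        left_hits = {name for name, seq in proteome.items() if left in seq}
        right_hits = {name for name, seq in proteome.items() if right in seq}
        if left_hits & right_hits:
            # Both fragments occur in one protein, in either order.
            return "cis-spliced"
        if left_hits and right_hits:
            trans_found = True
    return "trans-spliced" if trans_found else "N/A"


# Toy sequences, invented for illustration only.
toy_proteome = {"P1": "MKTAYIAKQRQISFVK", "P2": "GSHMLEDPVDAFKQ"}
labels = [classify_hybridfinder_style(p, toy_proteome)
          for p in ("TAYIAKQR", "ISFVKTAY", "TAYIDPVD", "WWWWWWWW")]
```

Because every split with both fragments of length 2 or more is tried, even short fragments can match somewhere in a large proteome, which is why the spliced categories carry a high random hit rate.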
3. The Expanded Human Proteome Database
FASTA files of OpenProt (www.openprot.org) and UniProt (www.uniprot.org) reviewed and unreviewed human sequences, the latter of which also include protein sequences from some viruses that use humans as hosts (UniProt proteome version UP000005640, downloaded in May 2020), were combined. This database was expanded to include translated protein sequences from lncRNAs (NONCODE Version v5.19, downloaded in May 2020), miRNAs (last modified Mar. 10, 2018, downloaded in May 2020), and endogenous viral elements (gEVE database ORFs, downloaded in May 2020). This database is used when the workflow searches for linear human peptides and single-mismatched human peptides (steps 1 and 3), as well as in the search for cis- and trans-spliced peptides.
4. The Peptide Source Assignment Workflow
The random hit rate inherent in each source from which peptides in immunopeptidomics experiments can be found was measured using the simulated random datasets described above. The steps were then arranged in order of ascending random hit rate to construct the workflow. The steps applied to each de novo-sequenced peptide are as follows:
Step 1: Search for identical sequence matches in the expanded human proteome database (described above). Leucine (L) and isoleucine (I) have the same mass; therefore it is impossible to differentiate them in de novo sequencing. To account for this, for a given peptide containing I/L all permutations of I and L residues are considered. For example, for the peptide “ATTSLLHN (SEQ ID NO:1)” there are four possible permutations: ATTSLLHN (SEQ ID NO:1), ATTSLIHN (SEQ ID NO:2), ATTSILHN (SEQ ID NO:3), and ATTSIIHN (SEQ ID NO:4). If the algorithm finds an identical match (e.g., 100% identical) for any permutation, the peptide is annotated as “Linear”, and all possible protein sources of the peptide are included in the output. The algorithm need not progress to additional steps, e.g., continuing with step 2, since the match has been identified. Otherwise, if a match is not identified, the algorithm progresses to step 2.
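The I/L expansion in Step 1 can be sketched as below. The helper name is hypothetical; the example peptide is the one given above.

```python
from itertools import product


def il_permutations(peptide):
    """Return every variant of peptide with each I or L replaced by I and by L."""
    options = [("I", "L") if aa in "IL" else (aa,) for aa in peptide]
    return ["".join(variant) for variant in product(*options)]


variants = il_permutations("ATTSLLHN")
# Two ambiguous positions give 2**2 = 4 variants:
# ATTSIIHN, ATTSILHN, ATTSLIHN, ATTSLLHN
```

A peptide with n I/L positions yields 2**n variants, each of which would be searched against the database.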
Step 2: Search for an identical match in any of the six frames of the translated human genome using BLAT. The following commands are used:
If an identical match is found, that peptide is annotated as “Linear” and possible source sequences are included in the output. Otherwise the peptide is passed to step 3.
Step 3: Peptides are mapped to the expanded human proteome database, this time allowing one mismatch using this code: “blat -t=prot -q=prot -minScore=7 -stepSize=1 combined DB.processed.fasta Fasta_query output_blat_hits.psl” in a genomic location of BLAT hits analysis.
If a sequence with a single mismatch is found, the peptide is annotated as “one mismatch”. Otherwise the peptide is passed to Step 4.
Step 4: Sequences are mapped to other organisms using the NCBI BLAST tool. If any identical matches are found, the peptide is annotated as “LINEAR BLAST”. Otherwise the peptide is passed to Step 5.
Step 5: For the remaining peptides, the algorithm generates all possible splits of the peptide where the length of the smaller piece is larger than 1. Then it looks for matches of both fragments in all human sequence databases. If there is a match for both fragments in the same protein, the tool annotates the peptide as “cis-spliced”. Otherwise, if there are hits for both fragments in two different proteins, the tool annotates the peptide as “trans-spliced”. The remaining peptides, which do not have any matches, are assigned as not available (N/A).
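The stepwise structure above, in which a peptide stops at the first step that finds a match, can be sketched as a simple cascade. This is an illustrative skeleton under stated assumptions: the search functions are toy in-memory stand-ins (the actual workflow uses BLAT and BLAST), only two steps are wired in, and all names and sequences are hypothetical.

```python
def exact_hits(peptide, sequences):
    """Toy stand-in for an identical-match search."""
    return [s for s in sequences if peptide in s]


def one_mismatch_hits(peptide, sequences):
    """Toy stand-in for a single-mismatch search (exactly one substitution)."""
    n = len(peptide)
    hits = []
    for s in sequences:
        for i in range(len(s) - n + 1):
            if sum(a != b for a, b in zip(peptide, s[i:i + n])) == 1:
                hits.append(s)
                break
    return hits


def assign_source(peptide, steps):
    """Run the steps in order and stop at the first one that finds a match.

    steps: list of (label, search_function) pairs, assumed to be already
    ordered by ascending random hit rate.
    """
    for label, search in steps:
        hits = search(peptide)
        if hits:
            return label, hits
    return "N/A", []


toy_db = ["MKTAYIAKQRQISFVK"]  # invented sequence, for illustration only
steps = [
    ("Linear", lambda p: exact_hits(p, toy_db)),
    ("one mismatch", lambda p: one_mismatch_hits(p, toy_db)),
]
results = [assign_source(p, steps)[0] for p in ("TAYIAKQR", "TAYIAKQW", "WWWWWWWW")]
```

Keeping the order in a plain list makes the early-stopping behavior explicit: a later, more permissive step is only consulted when every earlier step has failed.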
5. Genomic Location of BLAT Hits Analysis
Analysis of the genomic locations of BLAT hits was performed using the annotatepeaks.pl script from the HOMER suite. Specifically:
Only basic annotations were considered for further analysis. To calculate the enrichment of genomic locations of peptides found in the OpenProt database with either an identical match (step 1) or with a single mismatch (step 3), a Fisher's exact test was performed to compare the number of peptides in each genomic annotation in the sample versus in the whole OpenProt database. For peptides that mapped to any translated region in the human genome (step 2), the enrichment p-value calculated by HOMER was used to assess over- or underrepresentation of each genomic annotation.
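For illustration, a two-sided Fisher's exact test on a 2x2 table can be computed with the standard library alone, as sketched below. The layout of the table (sample versus database, in versus out of a given annotation) is an assumption about how the comparison could be arranged, and the example counts are invented, not results from the experiments.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Assumed layout: a = sample peptides in the annotation, b = sample
    peptides outside it, c and d = the same counts for the database.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Invented counts, for illustration only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```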
6. Tools
Python, bedops, psl2bed, BLAT, BLAST, HOMER.
Any device/component described herein may include a computer 2201 as shown in
The bus 2213 may comprise one or more of several possible types of bus structures, such as a memory bus, memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The computer 2201 may operate on and/or comprise a variety of computer-readable media (e.g., non-transitory). Computer-readable media may be any available media that is accessible by the computer 2201 and may comprise non-transitory, volatile and/or non-volatile, and removable and/or non-removable media. The system memory 2212 has computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM). The system memory 2212 may store data such as mass spectrometry data 2207 and/or program modules such as operating system 2205 and query analysis software 2206 that are accessible to and/or are operated on by the one or more processors 2203. The system memory 2212 can further include some or all of the databases utilized by the workflow illustrated in
The computer 2201 may also comprise other removable/non-removable, volatile/non-volatile computer storage media. The mass storage device 2204 may provide non-volatile storage of computer code, computer-readable instructions, data structures, program modules, and other data for the computer 2201. The mass storage device 2204 may be, but is not limited to, a hard disk, a removable magnetic disk, a removable optical disk, magnetic cassettes or other magnetic storage devices, flash memory cards, CD-ROM, digital versatile disks (DVD) or other optical storage, random access memories (RAM), read-only memories (ROM), electrically erasable programmable read-only memory (EEPROM), and the like.
Any number of program modules may be stored on the mass storage device 2204. An operating system 2205 and query analysis software 2206 may be stored on the mass storage device 2204. One or more of the operating system 2205 and query analysis software 2206 (or some combination thereof) may comprise program modules and the query analysis software 2206. Mass spectrometry data 2207 may also be stored on the mass storage device 2204. Mass spectrometry data 2207 may be stored in any of one or more databases known in the art. The databases may be centralized or distributed across multiple locations within the network 2215. The mass storage device 2204 can further include some or all of the databases utilized by the workflow illustrated in
A user may enter commands and information into the computer 2201 via an input device (not shown). Such input devices comprise, but are not limited to, a keyboard, pointing device (e.g., a computer mouse, remote control), a microphone, a joystick, a scanner, tactile input devices such as gloves, and other body coverings, motion sensor, and the like. These and other input devices may be connected to the one or more processors 2203 via a human-machine interface 2202 that is coupled to the bus 2213, but may be connected by other interface and bus structures, such as a parallel port, game port, an IEEE 1394 Port (also known as a Firewire port), a serial port, network adapter 2208, and/or a universal serial bus (USB).
A display device 2211 may also be connected to the bus 2213 via an interface, such as a display adapter 2209. It is contemplated that the computer 2201 may have more than one display adapter 2209 and the computer 2201 may have more than one display device 2211. A display device 2211 may be a monitor, an LCD (Liquid Crystal Display), a light-emitting diode (LED) display, a television, a smart lens, smart glass, and/or a projector. In addition to the display device 2211, other output peripheral devices may comprise components such as speakers (not shown) and a printer (not shown) which may be connected to the computer 2201 via Input/Output Interface 2210. Any step and/or result of the methods may be output (or caused to be output) in any form to an output device. Such output may be any form of visual representation, including, but not limited to, textual, graphical, animation, audio, tactile, and the like. The display 2211 and computer 2201 may be part of one device, or separate devices.
The computer 2201 may operate in a networked environment using logical connections to one or more remote computing devices 2214a,b,c. A remote computing device 2214a,b,c may be a personal computer, computing station (e.g., workstation), portable computer (e.g., laptop, mobile phone, tablet device), smart device (e.g., smartphone, smartwatch, activity tracker, smart apparel, smart accessory), security and/or monitoring device, a server, a router, a network computer, a peer device, edge device or other common network nodes, and so on. Logical connections between the computer 2201 and a remote computing device 2214a,b,c may be made via a network 2215, such as a local area network (LAN) and/or a general wide area network (WAN). Such network connections may be through a network adapter 2208. A network adapter 2208 may be implemented in both wired and wireless environments. Such networking environments are conventional and commonplace in dwellings, offices, enterprise-wide computer networks, intranets, and the Internet.
Application programs and other executable program components such as the operating system 2205 are shown herein as discrete blocks, although it is recognized that such programs and components may reside at various times in different storage components of the computing device 2201, and are executed by the one or more processors 2203 of the computer 2201. An implementation of query analysis software 2206 may be stored on or sent across some form of computer-readable media. Any of the disclosed methods may be performed by processor-executable instructions embodied on computer-readable media.
In an embodiment, the query analysis software 2206 may be configured to execute some or all of the search steps 703, 705, 707, 709, 711, 713 illustrated in
In an embodiment, the query analysis software 2206 may be configured to perform a method 2300, shown in
The method 2300 may comprise, at 2304, determining, based on applying the plurality of simulated random queries to each source of a plurality of sources, a number of matches associated with each source.
In an embodiment, the method 2300 may comprise, at 2306, determining, based on the numbers of matches associated with each source, a false discovery rate associated with each source. In an embodiment, a function of the number of matches and a number of the plurality of simulated random queries may be determined. In an embodiment, a determination may be made by dividing the number of matches by a number of the plurality of simulated random queries. In an embodiment, determining, based on the numbers of matches associated with each source, the false discovery rate associated with each source may include a function of the number of matches and a number of the plurality of simulated random queries. In an embodiment, determining, based on the numbers of matches associated with each source, the false discovery rate associated with each source may include dividing the number of matches by a number of the plurality of simulated random queries.
The method 2300 may comprise, at 2308, generating, based on the false discovery rates, a query support data structure configured to facilitate application of a new query to the plurality of sources.
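Steps 2304 through 2308 can be sketched as follows. This is a minimal illustration, assuming each source is represented by a search function that returns a truthy value on a match and that the query support data structure is simply an ordered list of source names. The function names, toy sources, and toy queries are hypothetical.

```python
def random_hit_rates(random_queries, sources):
    """Fraction of simulated random queries that each source matches.

    sources: dict mapping source name -> search function (query -> truthy on match).
    """
    total = len(random_queries)
    return {name: sum(bool(search(q)) for q in random_queries) / total
            for name, search in sources.items()}


def build_query_support(rates):
    """Order source names by ascending rate; new queries are applied in this order."""
    return sorted(rates, key=rates.get)


# Toy sources: a strict one that rarely matches and a permissive one that often does.
toy_sources = {
    "permissive": lambda q: "A" in q,
    "strict": lambda q: q.startswith("AAA"),
}
toy_queries = ["AAAC", "CACC", "CCCC", "CCAC"]

rates = random_hit_rates(toy_queries, toy_sources)
search_order = build_query_support(rates)
```

With these toy inputs the strict source matches 1 of 4 random queries and the permissive source 3 of 4, so the strict source is searched first.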
In an embodiment, the query analysis software 2206 may be configured to perform a method 2400, shown in
The method 2400 may comprise, at 2404, applying, based on a query support data structure, the query to one or more sources of a plurality of sources. The query support data structure may indicate an order of the plurality of sources to apply the query. The order may be based on a false discovery rate associated with each source of the plurality of sources. The method 2400 may also include determining one or more permutations of the query. Applying, based on the query support data structure, the query to the one or more sources of the plurality of sources may include: applying each permutation of the one or more permutations of the query to the one or more sources of the plurality of sources; if an identical match to the one or more permutations of the query is found in a first source of the plurality of sources, discontinuing additional searches and applying a linear label to the one or more permutations of the query associated with the identical match; and assigning the one or more permutations of the query associated with the identical match as a correct query.
Applying the query to one or more sources of a plurality of sources may include: searching for an identical match to the query in a first source of the plurality of sources; and if an identical match to the query is found in the first source of the plurality of sources, discontinuing additional searches. The query result may include the identical match and the label associated with a source of the plurality of sources associated with the query result may include a linear label.
Applying the query to one or more sources of a plurality of sources may include: searching for an identical match to the query in a first source of the plurality of sources; and if an identical match to the one or more permutations of the query is found in the first source of the plurality of sources, discontinuing additional searches. The query result may include the identical match and the label associated with a source of the plurality of sources associated with the query result may include a linear label.
Applying the query to one or more sources of a plurality of sources may include: searching for an identical match to the query in any frame of a plurality of frames of a second source of the plurality of sources; and if an identical match to the query is found in any frame of a plurality of frames of the second source of the plurality of sources, discontinuing additional searches. The query result may include the identical match and the label associated with a source of the plurality of sources associated with the query result may include a linear label.
Applying the query to one or more sources of a plurality of sources may include: searching for a non-identical match to the query in a third source of the plurality of sources; and if a non-identical match to the query is found in the third source of the plurality of sources, discontinuing additional searches. The query result may include the non-identical match and the label associated with a source of the plurality of sources associated with the query result may include a mismatch label.
Applying the query to one or more sources of a plurality of sources may include: searching for a homologous match to the query in a fourth source of the plurality of sources; and if a homologous match to the query is found in the fourth source of the plurality of sources, discontinuing additional searches. The query result may include the homologous match and the label associated with a source of the plurality of sources associated with the query result may include a homologous label.
Applying the query to one or more sources of a plurality of sources may include: splitting the query into a plurality of sets of fragments; searching for each set of fragments in a fifth source of the plurality of sources; if a match for a set of fragments is found in the fifth source of the plurality of sources, discontinuing additional searches; and if a first match for a first fragment of the set of fragments and a second match for a second fragment of the set of fragments is found in the fifth source of the plurality of sources, discontinuing additional searches. The query result may include the match for the set of fragments and the label associated with a source of the plurality of sources associated with the query result may include a cis-spliced label. The query result may include the first match for the first fragment of the set of fragments and the second match for the second fragment of the set of fragments and the label associated with a source of the plurality of sources associated with the query result may include a trans-spliced label.
The method 2400 may comprise, at 2406, determining, based on a query result, a label associated with a source of the plurality of sources associated with the query result.
The method 2400 may comprise, at 2408, applying the label to the query. The method 2400 may also include determining, based on the label, a source of the query. The method 2400 may also include validating an output of a mass spectrometer system based on the source of the query.
In view of the described apparatuses, systems, and methods and variations thereof, herein below are described certain more particularly described embodiments of the invention. These particularly recited embodiments should not however be interpreted to have any limiting effect on any different claims containing different or more general teachings described herein, or that the “particular” embodiments are somehow limited in some way other than the inherent meanings of the language literally used therein.
Embodiment 1: A method of determining a putative source of a peptide sequence of a peptide, the method comprising: receiving the peptide sequence; and determining, based at least in part on one or more searches of the peptide sequence within one or more databases, the putative source associated with the peptide sequence, wherein each respective search of the one or more searches has a random hit rate that is based at least in part on a number of random sequences found by the respective search, and wherein the one or more searches are performed in order of increasing random hit rates until the putative source is determined.
Embodiment 2: The embodiment as in the embodiment 1, wherein the one or more databases comprises an expanded human proteome database, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 3: The embodiment as in the embodiment 2, wherein the expanded human proteome database comprises computer-readable representations of translations from micro RNAs.
Embodiment 4: The embodiment of any of embodiments 2-3, wherein the expanded human proteome database comprises computer-readable representations of translations from long non-coding RNAs.
Embodiment 5: The embodiment of any of embodiments 2-4, wherein the expanded human proteome database comprises computer-readable representations of translations of human endogenous retroviruses.
Embodiment 6: The embodiment of any of embodiments 2-5, wherein the one or more searches comprises a linear human proteome search for the peptide sequence within the expanded human proteome database, and wherein the putative source is a linear expanded human proteome source when the linear human proteome search for the peptide sequence within the expanded human proteome database finds the peptide sequence within the expanded human proteome database.
Embodiment 7: The embodiment as in the embodiment 6, further comprising: identifying, when the source is the linear expanded human proteome source, whether the peptide is putatively translated from messenger RNA or non-coding RNA.
Embodiment 8: The embodiment of any of embodiments 2-7, wherein the one or more databases comprises a human genome database, wherein the one or more searches comprises a linear human genome search of translations of the human genome database, and wherein the putative source is a linear genome source when the linear human genome search finds human genome sequence from which the peptide is putatively synthesized.
Embodiment 9: The embodiment as in the embodiment 8, wherein the linear human genome search excludes portions of the human genome from which the messenger RNA and the non-coding RNA of the expanded human proteome database are transcribed and includes remaining portions of the human genome, and wherein the linear human genome search comprises a search of six frame translations of the human genome.
Embodiment 10: The embodiment of any of embodiments 2-9, wherein the one or more searches comprises a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database, and wherein the putative source is a linear mismatch of the expanded human proteome when the linear mismatch search finds a peptide sequence having a mismatch to the peptide sequence within the expanded human proteome database.
Embodiment 11: The embodiment as in the embodiment 10, wherein the linear mismatch search is a search for peptide sequences having only a single mismatch to the peptide sequence.
Embodiment 12: The embodiment of any of embodiments 1-11, wherein the one or more databases comprises a non-endogenous proteome database comprising computer-readable representations of proteins translated from RNA from non-endogenous organisms and/or proteins synthesized by non-endogenous organisms, wherein the one or more searches comprises a linear non-endogenous search for the peptide sequence within the non-endogenous proteome database, and wherein the putative source is a linear non-endogenous proteome source when the linear non-endogenous search finds the peptide sequence within the non-endogenous proteome database.
Embodiment 13: The embodiment as in the embodiment 12, wherein the non-endogenous proteome database comprises a Basic Local Alignment Search Tool (BLAST) database.
Embodiment 14: The embodiment of any of embodiments 2-13, wherein the one or more searches comprises a cis-spliced search, within the expanded human proteome database, for peptide fragments that can be cis-spliced to match the peptide sequence, and wherein the source is a cis-spliced human proteome source when the cis-spliced search finds, within the expanded human proteome database, peptide fragments that can be cis-spliced to match the peptide sequence.
Embodiment 15: The embodiment of any of embodiments 2-14, wherein the one or more searches comprises a trans-spliced search, within the expanded human proteome database, for computer-readable representations of peptide fragments that can be trans-spliced to match the peptide sequence, and wherein the source is a trans-spliced human proteome source when the trans-spliced search finds, within the expanded human proteome database, computer-readable representations of peptide fragments that can be trans-spliced to match the peptide sequence.
Embodiment 16: The embodiment as in the embodiment 15, wherein the putative source is determined to be unidentified when the trans-spliced search does not find computer-readable representations of peptide fragments that can be trans-spliced to match the peptide sequence.
Embodiment 17: The embodiment of any of embodiments 2-16, wherein the one or more databases comprises a human genome database, and wherein the one or more searches comprise the following searches ordered sequentially in a workflow as follows: a linear human proteome search for the peptide sequence within the expanded human proteome database; a linear human genome search of translations of the human genome database; a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database; and a cis-spliced search, within the expanded human proteome database, for peptide fragments that can be cis-spliced to match the peptide sequence.
Embodiment 18: The embodiment as in the embodiment 17, wherein the one or more databases comprises a non-endogenous proteome database comprising computer-readable representations of proteins translated from RNA from non-endogenous organisms and/or proteins synthesized by non-endogenous organisms, wherein the one or more searches further comprises a linear non-endogenous search for the peptide sequence within the non-endogenous proteome database, and wherein the linear non-endogenous search is ordered sequentially in the workflow after the linear mismatch search and before the cis-spliced search.
Embodiment 19: The embodiment of any of embodiments 17-18, wherein the one or more searches further comprises a trans-spliced search, within the expanded human proteome database, for peptide fragments that can be trans-spliced to match the peptide sequence, and wherein the trans-spliced search is ordered sequentially in the workflow after the cis-spliced search.
Embodiment 20: The embodiment of any of embodiments 17-19, further comprising: halting advancement of the workflow to a subsequent search of the one or more searches when the putative source is determined for the peptide sequence.
Embodiment 21: The embodiment of any of embodiments 1-20, wherein the peptide sequence comprises at least one ambiguous residue, the method further comprising: generating a plurality of permutated peptide sequences each comprising a potential residue for each of the at least one ambiguous residue; determining, for each of the plurality of permutated peptide sequences, a respective potential source; and determining the putative source of the peptide sequence such that the putative source is a respective potential source.
Embodiment 22: The embodiment as in embodiment 21, wherein the potential residue for each of the at least one ambiguous residue comprises leucine and isoleucine.
Embodiment 23: The embodiment of any of embodiments 21-22, further comprising: determining a respective random hit rate for each of the respective potential sources such that the random hit rate increases with the number of random sequences found by a respective search of the one or more searches; and determining the putative source such that the respective random hit rate of the putative source is the lowest of the respective random hit rates for each of the potential sources.
Embodiment 24: The embodiment of any of embodiments 21-23, further comprising: identifying one or more likely permutated peptide sequences of the plurality of permutated peptide sequences such that each of the one or more likely permutated peptide sequences is associated with the putative source.
Embodiment 25: The embodiment of any of embodiments 1-24, wherein the peptide sequence is a de novo peptide sequence determined via mass spectrometry.
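By way of a non-limiting illustration of Embodiments 1 and 20-24, the following Python sketch runs an ordered list of search steps, halts at the first step that finds a match, and resolves ambiguous leucine/isoleucine residues by selecting the permutated sequence whose match comes from the earliest step. The sketch assumes that the steps are supplied already ordered by increasing random hit rate, that each step is a hypothetical callable returning True on a match, and that ambiguous residues are marked with the letter "J"; none of these names or conventions is taken from the embodiments themselves.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

# A search step is a (label, search function) pair; both are hypothetical.
SearchStep = Tuple[str, Callable[[str], bool]]


def expand_ambiguous(peptide: str, ambiguous: str = "J", options: str = "IL") -> List[str]:
    """Return every sequence obtained by resolving each ambiguous residue."""
    slots = [options if residue == ambiguous else residue for residue in peptide]
    return ["".join(choice) for choice in product(*slots)]


def assign_source(peptide: str, steps: Sequence[SearchStep]) -> Tuple[Optional[int], str]:
    """Run the steps in order and halt at the first one that finds a match."""
    for rank, (label, search) in enumerate(steps):
        if search(peptide):
            return rank, label  # do not advance to later steps
    return None, "unidentified"


def assign_with_ambiguity(peptide: str, steps: Sequence[SearchStep]) -> Tuple[str, List[str]]:
    """Pick the potential source found by the earliest (lowest-rate) step."""
    results = {p: assign_source(p, steps) for p in expand_ambiguous(peptide)}
    hits = {p: rank for p, (rank, _) in results.items() if rank is not None}
    if not hits:
        return "unidentified", []
    best_rank = min(hits.values())
    likely = [p for p, rank in hits.items() if rank == best_rank]
    return steps[best_rank][0], likely
```

For example, with a toy proteome string and a toy genome translation string standing in for the databases, a peptide "AYJAK" would be expanded to "AYIAK" and "AYLAK", and the label returned would be that of the earliest step matching either expansion.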
Embodiment 26: Non-transitory computer-readable medium configured to communicate with one or more processor(s) of a computational device, the non-transitory computer-readable medium including instructions thereon, that when executed by the processor(s), cause the computational device to: receive, as an input, a peptide sequence; determine, based at least in part on one or more searches of the peptide sequence within one or more databases, a putative source associated with the peptide sequence, wherein each respective search of the one or more searches has a random hit rate that is based at least in part on a number of random sequences found by the respective search, and wherein the one or more searches are performed in order of increasing random hit rates until the putative source is determined; and provide, as an output, the putative source.
Embodiment 27: The embodiment as in the embodiment 26, wherein the one or more databases comprises an expanded human proteome database, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 28: The embodiment as in the embodiment 27, wherein the expanded human proteome database comprises computer-readable representations of translations from micro RNAs.
Embodiment 29: The embodiment of any of embodiments 27-28, wherein the expanded human proteome database comprises computer-readable representations of translations from long non-coding RNAs.
Embodiment 30: The embodiment of any of embodiments 27-29, wherein the expanded human proteome database comprises computer-readable representations of translations of human endogenous retroviruses.
Embodiment 31: The embodiment of any of embodiments 27-30, wherein the one or more searches comprises a linear human proteome search for the peptide sequence within the expanded human proteome database, and wherein the putative source is a linear expanded human proteome source when the linear human proteome search for the peptide sequence within the expanded human proteome database finds the peptide sequence within the expanded human proteome database.
Embodiment 32: The embodiment as in the embodiment 31, wherein the instructions, when executed by the processor(s), cause the computational device to: identify, when the source is the linear expanded human proteome source, whether the peptide is putatively translated from messenger RNA or non-coding RNA.
Embodiment 33: The embodiment of any of embodiments 27-32, wherein the one or more databases comprises a human genome database, wherein the one or more searches comprises a linear human genome search of translations of the human genome database, and wherein the putative source is a linear genome source when the linear human genome search finds human genome sequence from which the peptide is putatively synthesized.
Embodiment 34: The embodiment as in the embodiment 33, wherein the linear human genome search excludes portions of the human genome from which the messenger RNA and the non-coding RNA of the expanded human proteome database are transcribed and includes remaining portions of the human genome.
Embodiment 35: The embodiment of any of embodiments 27-34, wherein the one or more searches comprises a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database, and wherein the putative source is a linear mismatch of the expanded human proteome when the linear mismatch search finds a peptide sequence having a mismatch to the peptide sequence within the expanded human proteome database.
Embodiment 36: The embodiment as in the embodiment 35, wherein the linear mismatch search is a search for peptide sequences having only a single mismatch to the peptide sequence.
Embodiment 37: The embodiment of any of embodiments 26-36, wherein the one or more databases comprises a non-endogenous proteome database comprising computer-readable representations of proteins translated from RNA from non-endogenous organisms and/or proteins synthesized by non-endogenous organisms, wherein the one or more searches comprises a linear non-endogenous search for the peptide sequence within the non-endogenous proteome database, and wherein the putative source is a linear non-endogenous proteome source when the linear non-endogenous search finds the peptide sequence within the non-endogenous proteome database.
Embodiment 38: The embodiment as in the embodiment 37, wherein the non-endogenous proteome database comprises a Basic Local Alignment Search Tool (BLAST) database.
Embodiment 39: The embodiment of any of embodiments 27-38, wherein the one or more searches comprises a cis-spliced search, within the expanded human proteome database, for peptide fragments that can be cis-spliced to match the peptide sequence, and wherein the source is a cis-spliced human proteome source when the cis-spliced search finds, within the expanded human proteome database, peptide fragments that can be cis-spliced to match the peptide sequence.
Embodiment 40: The embodiment of any of embodiments 27-39, wherein the one or more searches comprises a trans-spliced search, within the expanded human proteome database, for computer-readable representations of peptide fragments that can be trans-spliced to match the peptide sequence, and wherein the source is a trans-spliced human proteome source when the trans-spliced search finds, within the expanded human proteome database, computer-readable representations of peptide fragments that can be trans-spliced to match the peptide sequence.
Embodiment 41: The embodiment as in the embodiment 40, wherein the putative source is determined to be unidentified when the trans-spliced search does not find computer-readable representations of peptide fragments that can be trans-spliced to match the peptide sequence.
Embodiment 42: The embodiment of any of embodiments 27-41, wherein the one or more databases comprises a human genome database, and wherein the one or more searches comprise the following searches ordered sequentially in a workflow as follows: a linear human proteome search for the peptide sequence within the expanded human proteome database; a linear human genome search of translations of the human genome database; a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database; and a cis-spliced search, within the expanded human proteome database, for peptide fragments that can be cis-spliced to match the peptide sequence.
Embodiment 43: The embodiment as in the embodiment 42, wherein the one or more databases comprises a non-endogenous proteome database comprising computer-readable representations of proteins translated from RNA from non-endogenous organisms and/or proteins synthesized by non-endogenous organisms, wherein the one or more searches further comprises a linear non-endogenous search for the peptide sequence within the non-endogenous proteome database, and wherein the linear non-endogenous search is ordered sequentially in the workflow after the linear mismatch search and before the cis-spliced search.
Embodiment 44: The embodiment of any of embodiments 42-43, wherein the one or more searches further comprises a trans-spliced search, within the expanded human proteome database, for peptide fragments that can be trans-spliced to match the peptide sequence, and wherein the trans-spliced search is ordered sequentially in the workflow after the cis-spliced search.
Embodiment 45: The embodiment of any of embodiments 42-44, wherein the instructions, when executed by the processor(s), cause the computational device to: halt advancement of the workflow to a subsequent search of the one or more searches when the putative source is determined for the peptide sequence.
Embodiment 46: The embodiment of any of embodiments 26-45, wherein the peptide sequence comprises at least one ambiguous residue, and wherein the instructions, when executed by the processor(s), cause the computational device to: generate a plurality of permutated peptide sequences each comprising a potential residue for each of the at least one ambiguous residue; determine, for each of the plurality of permutated peptide sequences, a respective potential source; and determine the putative source of the peptide sequence such that the putative source is a respective potential source.
Embodiment 47: The embodiment as in the embodiment 46, wherein the potential residue for each of the at least one ambiguous residue comprises leucine and isoleucine.
Embodiment 48: The embodiment of any of embodiments 46-47, wherein the instructions, when executed by the processor(s), cause the computational device to: determine a respective random hit rate for each of the respective potential sources such that the random hit rate increases with the number of random sequences found by a respective search of the one or more searches; and determine the putative source such that the respective random hit rate of the putative source is the lowest of the respective random hit rates for each of the potential sources.
Embodiment 49: The embodiment of any of embodiments 46-48, wherein the instructions, when executed by the processor(s), cause the computational device to: identify one or more likely permutated peptide sequences of the plurality of permutated peptide sequences such that each of the one or more likely permutated peptide sequences is associated with the putative source.
Embodiment 50: The embodiment of any of embodiments 26-49, wherein the peptide sequence is a de novo peptide sequence determined via mass spectrometry.
Embodiment 51: A method of ordering a peptide source assignment workflow, the method comprising: generating a plurality of random peptide sequences; determining a plurality of peptide source search steps; searching for each of the plurality of random peptide sequences by each of the plurality of peptide source search steps; determining, for each of the plurality of peptide source search steps, a random hit rate for a respective search step of the plurality of peptide source search steps based at least in part on a number of the plurality of random peptide sequences found by the respective search step; and ordering the peptide source search steps in the peptide source assignment workflow from lowest random hit rate to highest random hit rate.
Embodiment 52: The embodiment as in the embodiment 51, wherein the random peptide sequences comprise random sequences uniformly sampling all amino acids.
Embodiment 53: The embodiment of any of embodiments 51-52, wherein the random peptide sequences comprise sequences with frequencies of amino acids matching those found in vertebrates.
Embodiment 54: The embodiment of any of embodiments 51-53, wherein each peptide of the random peptide sequences comprises a length of eight to fourteen amino acids.
Embodiment 55: The embodiment of any of embodiments 51-54, wherein each peptide of the random peptide sequences comprises a length of nine to fourteen amino acids, ten to fourteen amino acids, eleven to fourteen amino acids, twelve to fourteen amino acids, thirteen to fourteen amino acids, eight to thirteen amino acids, eight to twelve amino acids, eight to eleven amino acids, eight to ten amino acids, eight to nine amino acids, nine to thirteen amino acids, nine to twelve amino acids, nine to eleven amino acids, nine to ten amino acids, ten to thirteen amino acids, ten to twelve amino acids, ten to eleven amino acids, eleven to thirteen amino acids, eleven to twelve amino acids, or twelve to thirteen amino acids.
Embodiment 56: The embodiment of any of embodiments 51-55, wherein the plurality of peptide source search steps comprises a linear human proteome search for a peptide sequence within an expanded human proteome database, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 57: The embodiment as in the embodiment 56, wherein the expanded human proteome database comprises computer-readable representations of translations from micro RNAs.
Embodiment 58: The embodiment of any of embodiments 56-57, wherein the expanded human proteome database comprises computer-readable representations of translations from long non-coding RNAs.
Embodiment 59: The embodiment of any of embodiments 56-58, wherein the expanded human proteome database comprises computer-readable representations of translations of human endogenous retroviruses.
Embodiment 60: The embodiment of any of embodiments 56-59, wherein the plurality of peptide source search steps comprises a linear human genome search of translations of a human genome database.
Embodiment 61: The embodiment as in the embodiment 60, wherein the linear human genome search excludes portions of the human genome from which the messenger RNA and the non-coding RNA of the expanded human proteome database are transcribed and includes remaining portions of the human genome.
Embodiment 62: The embodiment of any of embodiments 51-61, wherein the plurality of peptide source search steps comprises a linear mismatch search for peptides having a mismatch to the peptide sequence within an expanded human proteome database, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 63: The embodiment of any of embodiments 51-62, wherein the plurality of peptide source search steps comprises a linear non-endogenous search for the peptide sequence within a non-endogenous proteome database, and wherein the non-endogenous proteome database comprises computer-readable representations of proteins translated from RNA from non-endogenous organisms and/or proteins synthesized by non-endogenous organisms.
Embodiment 64: The embodiment as in the embodiment 63, wherein the non-endogenous proteome database comprises a Basic Local Alignment Search Tool (BLAST) database.
Embodiment 65: The embodiment of any of embodiments 51-64, wherein the plurality of peptide source search steps comprises a cis-spliced search, within an expanded human proteome database, for peptide fragments that can be cis-spliced to match the peptide sequence, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 66: The embodiment of any of embodiments 51-65, wherein the plurality of peptide source search steps comprises a trans-spliced search, within an expanded human proteome database, for peptide fragments that can be trans-spliced to match the peptide sequence, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 67: The embodiment of any of embodiments 51-66, wherein the peptide source assignment workflow terminates with a peptide being not assigned when the peptide is not assigned a peptide source by any of the plurality of peptide source search steps.
Embodiment 68: The embodiment of any of embodiments 51-67, wherein the peptide source assignment workflow comprises the following searches ordered sequentially as follows: a linear human proteome search for the peptide sequence within the expanded human proteome database; a linear human genome search of translations of a human genome database; a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database; and a cis-spliced search, within the expanded human proteome database, for peptide fragments that can be cis-spliced to match the peptide sequence.
Embodiment 69: The embodiment as in the embodiment 68, wherein the peptide source assignment workflow comprises a linear non-endogenous search for the peptide sequence within a non-endogenous proteome database, and wherein the linear non-endogenous search is ordered sequentially within the peptide assignment workflow after the linear mismatch search and before the cis-spliced search.
Embodiment 70: The embodiment of any of the embodiments 68-69, wherein the peptide source assignment workflow comprises a trans-spliced search, within the expanded human proteome database, for peptide fragments that can be trans-spliced to match the peptide sequence, and wherein the trans-spliced search is ordered sequentially within the peptide assignment workflow after the cis-spliced search.
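By way of a non-limiting illustration of Embodiments 51-54, the following Python sketch generates random peptide sequences, measures the fraction of them found by each search step, and orders the steps from lowest to highest random hit rate. The sketch assumes that each search step is a hypothetical callable returning True on a match, that the random hit rate is the number of random peptides found divided by the number searched, and that an optional list of per-residue weights (for example, vertebrate amino acid frequencies supplied by the user) replaces uniform sampling; the function names, the fixed seed, and the twenty-letter alphabet ordering are assumptions of this sketch.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residues


def random_peptides(
    count: int,
    min_len: int = 8,
    max_len: int = 14,
    weights: Optional[Sequence[float]] = None,
    seed: int = 0,
) -> List[str]:
    """Generate random peptides; weights=None samples residues uniformly."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(AMINO_ACIDS, weights=weights, k=rng.randint(min_len, max_len)))
        for _ in range(count)
    ]


def random_hit_rate(search: Callable[[str], bool], peptides: Sequence[str]) -> float:
    """Fraction of the random peptides that the search step finds."""
    return sum(1 for peptide in peptides if search(peptide)) / len(peptides)


def order_workflow(
    steps: Dict[str, Callable[[str], bool]], peptides: Sequence[str]
) -> List[Tuple[str, float]]:
    """Return (step name, random hit rate) pairs, lowest rate first."""
    rates = {name: random_hit_rate(search, peptides) for name, search in steps.items()}
    return sorted(rates.items(), key=lambda item: item[1])
```

In this sketch, steps with equal random hit rates keep the order in which they were supplied; a different tie-breaking rule may be used.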
Embodiment 71: Non-transitory computer-readable medium configured to communicate with one or more processor(s) of a computational device, the non-transitory computer-readable medium including instructions thereon, that when executed by the processor(s), cause the computational device to: receive, as an input, a plurality of peptide source search steps; generate a plurality of random peptide sequences; search for each of the plurality of random peptide sequences by each of the plurality of peptide source search steps; determine, for each of the plurality of peptide source search steps, a random hit rate for a respective search step of the plurality of peptide source search steps based at least in part on a number of the plurality of random peptide sequences found by the respective search step; order the peptide source search steps in a peptide source assignment workflow from lowest random hit rate to highest random hit rate; and provide, as an output, the peptide source assignment workflow.
Embodiment 72: The embodiment as in the embodiment 71, wherein the random peptide sequences comprise random sequences uniformly sampling all amino acids.
Embodiment 73: The embodiment of any of embodiments 71-72, wherein the random peptide sequences comprise sequences with frequencies of amino acids matching those found in vertebrates.
Embodiment 74: The embodiment of any of embodiments 71-73, wherein each peptide of the random peptide sequences comprises a length of eight to fourteen amino acids.
Embodiment 75: The embodiment of any of embodiments 71-74, wherein each peptide of the random peptide sequences comprises a length of nine to fourteen amino acids, ten to fourteen amino acids, eleven to fourteen amino acids, twelve to fourteen amino acids, thirteen to fourteen amino acids, eight to thirteen amino acids, eight to twelve amino acids, eight to eleven amino acids, eight to ten amino acids, eight to nine amino acids, nine to thirteen amino acids, nine to twelve amino acids, nine to eleven amino acids, nine to ten amino acids, ten to thirteen amino acids, ten to twelve amino acids, ten to eleven amino acids, eleven to thirteen amino acids, eleven to twelve amino acids, or twelve to thirteen amino acids.
Embodiment 76: The embodiment of any of embodiments 71-75, wherein the plurality of peptide source search steps comprises a linear human proteome search for a peptide sequence within an expanded human proteome database, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 77: The embodiment as in the embodiment 76, wherein the expanded human proteome database comprises computer-readable representations of translations from micro RNAs.
Embodiment 78: The embodiment of any of embodiments 76-77, wherein the expanded human proteome database comprises computer-readable representations of translations from long non-coding RNAs.
Embodiment 79: The embodiment of any of embodiments 76-78, wherein the expanded human proteome database comprises computer-readable representations of translations of human endogenous retroviruses.
Embodiment 80: The embodiment of any of embodiments 76-79, wherein the plurality of peptide source search steps comprises a linear human genome search of translations of a human genome database.
Embodiment 81: The embodiment as in the embodiment 80, wherein the linear human genome search excludes portions of the human genome from which the messenger RNA and the non-coding RNA of the expanded human proteome database are transcribed and includes remaining portions of the human genome.
Embodiment 82: The embodiment of any of embodiments 71-81, wherein the plurality of peptide source search steps comprises a linear mismatch search for peptides having a mismatch to the peptide sequence within an expanded human proteome database, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 83: The embodiment of any of embodiments 71-82, wherein the plurality of peptide source search steps comprises a linear non-endogenous search for the peptide sequence within a non-endogenous proteome database, and wherein the non-endogenous proteome database comprises computer-readable representations of proteins translated from RNA from non-endogenous organisms and/or proteins synthesized by non-endogenous organisms.
Embodiment 84: The embodiment as in the embodiment 83, wherein the non-endogenous proteome database comprises a Basic Local Alignment Search Tool (BLAST) database.
Embodiment 85: The embodiment of any of embodiments 71-84, wherein the plurality of peptide source search steps comprises a cis-spliced search, within an expanded human proteome database, for peptide fragments that can be cis-spliced to match the peptide sequence, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 86: The embodiment of any of embodiments 71-85, wherein the plurality of peptide source search steps comprises a trans-spliced search, within an expanded human proteome database, for peptide fragments that can be trans-spliced to match the peptide sequence, and wherein the expanded human proteome database comprises computer-readable representations of translations from messenger ribonucleic acids (RNAs) and non-coding RNAs.
Embodiment 87: The embodiment of any of embodiments 71-86, wherein the peptide source assignment workflow terminates with a peptide being not assigned when the peptide is not assigned a peptide source by any of the plurality of peptide source search steps.
Embodiment 88: The embodiment of any of embodiments 71-87, wherein the peptide source assignment workflow comprises the following searches ordered sequentially as follows: a linear human proteome search for the peptide sequence within the expanded human proteome database; a linear human genome search of translations of a human genome database; a linear mismatch search for peptides having a mismatch to the peptide sequence within the expanded human proteome database; a linear non-endogenous search for the peptide sequence within a non-endogenous proteome database; a cis-spliced search, within the expanded human proteome database, for peptide fragments that can be cis-spliced to match the peptide sequence; and a trans-spliced search, within the expanded human proteome database, for peptide fragments that can be trans-spliced to match the peptide sequence.
Embodiment 89: The embodiment as in the embodiment 88, wherein the peptide source assignment workflow comprises a linear non-endogenous search for the peptide sequence within a non-endogenous proteome database, and wherein the linear non-endogenous search is ordered sequentially within the peptide source assignment workflow after the linear mismatch search and before the cis-spliced search.
Embodiment 90: The embodiment of any of the embodiments 88-89, wherein the peptide source assignment workflow comprises a trans-spliced search, within the expanded human proteome database, for peptide fragments that can be trans-spliced to match the peptide sequence, and wherein the trans-spliced search is ordered sequentially within the peptide source assignment workflow after the cis-spliced search.
Embodiment 91: A method comprising: generating a plurality of simulated random queries; determining, based on applying the plurality of simulated random queries to each source of a plurality of sources, a number of matches associated with each source; determining, based on the numbers of matches associated with each source, a false discovery rate associated with each source; and generating, based on the false discovery rates, a query support data structure configured to facilitate application of a new query to the plurality of sources.
Embodiment 92: The embodiment as in the embodiment 91, wherein generating the plurality of simulated random queries comprises at least one of: generating a plurality of uniform random queries; or generating a plurality of weighted random queries.
Embodiment 93: The embodiment of any of embodiments 91-92, wherein the plurality of simulated random queries comprises a plurality of simulated random text strings.
Embodiment 94: The embodiment of any of embodiments 91-93, wherein the plurality of simulated random queries comprises a plurality of simulated random peptide sequences.
Embodiment 95: The embodiment of any of embodiments 91-94, wherein determining, based on the numbers of matches associated with each source, the false discovery rate associated with each source comprises determining the false discovery rate as a function of the number of matches and a number of the plurality of simulated random queries.
Embodiment 96: The embodiment of any of embodiments 91-95, wherein determining, based on the numbers of matches associated with each source, the false discovery rate associated with each source comprises dividing the number of matches by a number of the plurality of simulated random queries.
Embodiment 97: A method comprising: receiving a query; applying, based on a query support data structure, the query to one or more sources of a plurality of sources; determining, based on a query result, a label associated with a source of the plurality of sources associated with the query result; and applying the label to the query.
Embodiment 98: The embodiment as in the embodiment 97, wherein the query comprises a text string.
Embodiment 99: The embodiment of any of embodiments 97-98, wherein the query comprises a peptide sequence.
Embodiment 100: The embodiment as in the embodiment 99, wherein receiving the query comprises receiving the peptide sequence from a mass spectrometer system.
Embodiment 101: The embodiment of any of embodiments 97-100, further comprising determining, via a mass spectrometer system, one or more amino acids of the peptide sequence.
Embodiment 102: The embodiment of any of embodiments 97-101, wherein the query support data structure indicates an order in which to apply the query to the plurality of sources, wherein the order is based on a false discovery rate associated with each source of the plurality of sources.
Embodiment 103: The embodiment of any of embodiments 97-102, further comprising determining one or more permutations of the query.
Embodiment 104: The embodiment as in the embodiment 103, wherein applying, based on the query support data structure, the query to the one or more sources of the plurality of sources comprises: applying each permutation of the one or more permutations of the query to the one or more sources of the plurality of sources; if an identical match to the one or more permutations of the query is found in a first source of the plurality of sources, discontinuing additional searches and applying a linear label to the one or more permutations of the query associated with the identical match; and assigning the one or more permutations of the query associated with the identical match as a correct query.
Embodiment 105: The embodiment of any of embodiments 97-104, wherein applying the query to one or more sources of a plurality of sources comprises: searching for an identical match to the query in a first source of the plurality of sources; and if an identical match to the query is found in the first source of the plurality of sources, discontinuing additional searches.
Embodiment 106: The embodiment as in the embodiment 105, wherein the query result comprises the identical match and wherein the label associated with a source of the plurality of sources associated with the query result comprises a linear label.
Embodiment 107: The embodiment of any of embodiments 97-106, wherein applying the query to one or more sources of a plurality of sources comprises: searching for an identical match to the query in a first source of the plurality of sources; and if an identical match to the one or more permutations of the query is found in the first source of the plurality of sources, discontinuing additional searches.
Embodiment 108: The embodiment as in the embodiment 107, wherein the query result comprises the identical match and wherein the label associated with a source of the plurality of sources associated with the query result comprises a linear label.
Embodiment 109: The embodiment as in the embodiment 107, wherein applying the query to one or more sources of a plurality of sources comprises: searching for an identical match to the query in any frame of a plurality of frames of a second source of the plurality of sources; and if an identical match to the query is found in any frame of a plurality of frames of the second source of the plurality of sources, discontinuing additional searches.
Embodiment 110: The embodiment as in the embodiment 109, wherein the query result comprises the identical match and wherein the label associated with a source of the plurality of sources associated with the query result comprises a linear label.
Embodiment 111: The embodiment as in the embodiment 109, wherein applying the query to one or more sources of a plurality of sources comprises: searching for a non-identical match to the query in a third source of the plurality of sources; and if a non-identical match to the query is found in the third source of the plurality of sources, discontinuing additional searches.
Embodiment 112: The embodiment as in the embodiment 111, wherein the query result comprises the non-identical match and wherein the label associated with a source of the plurality of sources associated with the query result comprises a mismatch label.
Embodiment 113: The embodiment as in the embodiment 111, wherein applying the query to one or more sources of a plurality of sources comprises: searching for a homologous match to the query in a fourth source of the plurality of sources; and if a homologous match to the query is found in the fourth source of the plurality of sources, discontinuing additional searches.
Embodiment 114: The embodiment as in the embodiment 113, wherein the query result comprises the homologous match and wherein the label associated with a source of the plurality of sources associated with the query result comprises a homologous label.
Embodiment 115: The embodiment as in the embodiment 113, wherein applying the query to one or more sources of a plurality of sources comprises: splitting the query into a plurality of sets of fragments; searching for each set of fragments in a fifth source of the plurality of sources; if a match for a set of fragments is found in the fifth source of the plurality of sources, discontinuing additional searches; and if a first match for a first fragment of the set of fragments and a second match for a second fragment of the set of fragments are found in the fifth source of the plurality of sources, discontinuing additional searches.
Embodiment 116: The embodiment as in the embodiment 115, wherein the query result comprises the match for the set of fragments and wherein the label associated with a source of the plurality of sources associated with the query result comprises a cis-spliced label.
Embodiment 117: The embodiment as in the embodiment 115, wherein the query result comprises the first match for the first fragment of the set of fragments and the second match for the second fragment of the set of fragments and wherein the label associated with a source of the plurality of sources associated with the query result comprises a trans-spliced label.
Embodiment 118: The embodiment of any of embodiments 97-117, further comprising determining, based on the label, a source of the query.
Embodiment 119: The embodiment as in the embodiment 118, further comprising, validating output of a mass spectrometer system based on the source of the query.
Embodiment 120: An apparatus comprising one or more processors and a memory storing processor-executable instructions that, when executed by the one or more processors, cause the apparatus to perform any of the Embodiments 91-119.
Embodiment 121: One or more non-transitory computer-readable media storing processor-executable instructions thereon that, when executed by a processor, cause the processor to perform any of the Embodiments 91-119.
Embodiment 122: A system comprising a computing device and a plurality of sources configured to perform any of the Embodiments 91-119.
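By way of non-limiting illustration only, the relationship between Embodiments 91-96 (estimating a rate per source from simulated random queries) and Embodiments 97-117 (applying a query to sources in rate order and stopping at the first match) might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any embodiment: the two-sequence `PROTEOME`, the function names, and the three simplified search functions are hypothetical, the cis-spliced search is reduced to a two-fragment check within one sequence, and the rate is taken as the number of matches divided by the number of random queries, as in Embodiment 96.

```python
import random
from typing import Callable, Dict, List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy "proteome"; a real database would be far larger.
PROTEOME = ["MKTAYIAKQRQISFVKSHFSRQ", "GSHMLEDPVDAFKQWERTY"]


def linear_search(query: str) -> bool:
    """Identical match: the query is a contiguous substring of a sequence."""
    return any(query in protein for protein in PROTEOME)


def mismatch_search(query: str) -> bool:
    """Match allowing at most one substituted residue (assumed tolerance)."""
    for protein in PROTEOME:
        for i in range(len(protein) - len(query) + 1):
            window = protein[i:i + len(query)]
            if sum(a != b for a, b in zip(query, window)) <= 1:
                return True
    return False


def cis_spliced_search(query: str) -> bool:
    """Simplified cis-splicing: two fragments found in the same sequence."""
    for protein in PROTEOME:
        for cut in range(1, len(query)):
            if query[:cut] in protein and query[cut:] in protein:
                return True
    return False


# Label -> search function; labels follow the linear / mismatch /
# cis-spliced labels named in the embodiments.
SEARCHES: Dict[str, Callable[[str], bool]] = {
    "linear": linear_search,
    "mismatch": mismatch_search,
    "cis-spliced": cis_spliced_search,
}


def random_peptides(n: int, length: int, rng: random.Random) -> List[str]:
    """Generate n uniform random peptide sequences of a fixed length."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n)]


def estimate_rates(searches: Dict[str, Callable[[str], bool]],
                   queries: List[str]) -> Dict[str, float]:
    """Rate per search = matches on random queries / number of queries."""
    return {label: sum(1 for q in queries if search(q)) / len(queries)
            for label, search in searches.items()}


def build_search_order(rates: Dict[str, float]) -> List[str]:
    """A simple 'query support data structure': labels by increasing rate."""
    return sorted(rates, key=rates.get)


def assign_label(query: str, order: List[str],
                 searches: Dict[str, Callable[[str], bool]]) -> Optional[str]:
    """Apply searches in order; stop at the first match, else unassigned."""
    for label in order:
        if searches[label](query):
            return label
    return None


rng = random.Random(0)  # fixed seed so the sketch is repeatable
queries = random_peptides(500, 5, rng)
rates = estimate_rates(SEARCHES, queries)
order = build_search_order(rates)

print(order)
print(assign_label("AKQRQ", order, SEARCHES))  # prints "linear"
print(assign_label("WWWWW", order, SEARCHES))  # prints "None" (not assigned)
```

In this sketch, any sequence found by the identical-match search is also found by the two more permissive searches, so the identical-match search cannot have the highest rate and is applied first. The relative order of the remaining searches depends on the simulated queries and the database contents.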
While specific configurations have been described, it is not intended that the scope be limited to the particular configurations set forth, as the configurations herein are intended in all respects to be possible configurations rather than restrictive.
Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its steps be performed in a specific order. Accordingly, where a method claim does not actually recite an order to be followed by its steps or it is not otherwise specifically stated in the claims or descriptions that the steps are to be limited to a specific order, it is in no way intended that an order be inferred, in any respect. This holds for any possible non-express basis for interpretation, including: matters of logic with respect to arrangement of steps or operational flow; plain meaning derived from grammatical organization or punctuation; and the number or type of configurations described in the specification.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the method and compositions described herein. Such equivalents are intended to be encompassed by the following claims.
This application claims priority to U.S. Provisional Application Ser. No. 63/159,879 filed on Mar. 11, 2021, and U.S. Provisional Application Ser. No. 63/159,880 filed Mar. 11, 2021, the contents of each of which are incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/020049 | 3/11/2022 | WO | |

Number | Date | Country |
---|---|---|
63159879 | Mar 2021 | US |
63159880 | Mar 2021 | US |