The present methods and systems generally relate to the biomedical field, and in particular to the subfields of computational biology and bioinformatics. More specifically, the invention provides an artificial intelligence algorithm that can identify aggregation-prone regions, particularly amyloid sequences, in a protein.
The amyloid cross-beta state is a polypeptide conformation that is adopted by 36 proteins or peptides associated with human protein deposition pathologies1. It also constitutes the structural core of a growing number of functional amyloids in both bacteria and eukaryotes2,3. Beyond these bona fide functional and pathological amyloids, it has been demonstrated that many, if not most, proteins can adopt an amyloid-like conformation upon unfolding/misfolding4. This has led to the notion that, just like the alpha-helix or beta-sheet, the amyloid state is a generic polypeptide backbone conformation, but also that amino acids have different propensities to adopt the amyloid conformation5. Initially, it was observed that amyloid-like aggregation correlates with hydrophobicity, beta-strand propensity, and (lack of) net charge6. This triggered the development of aggregation prediction algorithms that essentially evaluate the above biophysical propensities7,8. Others extended this to scaling residue propensities between protein folding and aggregation9,10. These algorithms confirmed the ubiquity of amyloid-like propensity in natural protein sequences, and particularly in globular proteins, as it was estimated that 15-20% of residues in a typical globular domain are within aggregation-prone regions (APRs)11,12. These APRs are sequence segments of six to seven amino acids in length on average and are mostly buried within the protein structure, where they constitute the hydrophobic core stabilizing tertiary protein structure13,15. On the other hand, the increasing identification of both yeast prions and functional amyloids clearly indicated that amyloid sequence space is not monolithic and that more polar/less aliphatic sequences represent important alternative populations of amyloid sequence space3. The limited sensitivity of the above-cited algorithms in identifying these other subpopulations confirmed the underestimated sequence versatility of the amyloid conformation.
Indeed, more recently, the role of amyloid-like sequences in proteins mediating liquid-liquid phase transitions has again demonstrated the ubiquity of the amyloid state in biological function and further erodes the image of the amyloid state as a predominantly disease- and/or toxicity-associated protein conformation16-18. Rather, this suggests that, like globular protein folding, amyloid assembly is a matter of kinetic and thermodynamic control that can be evolutionarily tuned by sequence variation and selection. Efforts to develop aggregation predictors that can identify a broader spectrum of amyloid sequences have increased over the years19. Such approaches focused on identifying position-specific patterns by reference to accumulated experimental data on APRs, or on using energy functions of cross-beta pairings23. Recently developed meta-predictors produce consensus outputs by combining previous methods in an attempt to boost performance24,25. Indirect structure-based methods were initially developed by considering secondary structure propensities26,27. Complementary studies extended this notion by suggesting that disease-related amyloids form β-strand-loop-β-strand motifs28. There remains, however, a need to develop reliable algorithms to detect amyloid sequences beyond their currently known boundaries.
In the present invention, we have used a machine learning approach to identify amyloid sequences in proteins. Specifically, the invention provides an algorithm, herein designated Cordax, which is an exhaustively trained regression model that leverages a substantial library of curated template structures combined with machine learning. Cordax not only detects APRs in proteins, but also predicts the structural topology, orientation and overall architecture of the resulting putative fibril core. To validate the accuracy of our predictions, we designed a screen of 96 newly predicted APRs and experimentally determined their aggregation properties. Using this approach, we identified less hydrophobic, polar and charged aggregation-prone sequences that increasingly uncouple solubility and amyloid propensity, closely resembling characteristics of phase-separation inducers. Clustering by t-Distributed Stochastic Neighbour Embedding reveals the heterogeneous substructure of amyloid sequence space, consisting of varying clusters corresponding to sequences compatible with globular structure, functional scaffolding amyloids, N/Q/Y-rich prions, helical peptides and intrinsically disordered sequences. Together, the structural exploration performed here demonstrates that the field has now gathered sufficient structural and sequence information to start classifying amyloids according to different structural and functional niches. Just as for globular proteins in the 1980s, this will allow both general and context-dependent structural rules to be fine-tuned, making it possible to manipulate and design amyloid structure and function.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn on scale for illustrative purposes.
The present disclosure relates generally to a machine learning engine, herein referred to as the Cordax algorithm (or in short Cordax), for the identification of amyloid core sequences present in a protein. The present disclosure also relates to a system (or apparatus) implementing the artificial intelligence (AI) platform.
Example embodiments will be described more fully hereinafter. It should be understood that such systems, computer readable media, and methods may be embodied in many different forms and should not be construed as limited to the example embodiments set forth herein. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the claims to those of ordinary skill in the art.
The term “machine learning” as used herein generally refers to a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning is a branch of AI focusing on systems that can learn from data, identify patterns, and make decisions with minimal human intervention.
As used herein, the term “full length native protein” refers to a protein that is in its native or natural state and unaltered by any denaturing agent such as heat, chemical mutation or enzymatic reactions. A wild-type protein would be considered a full-length native protein. The term full-length native protein sequence, as used herein, refers to the amino acid sequence found in the full-length native protein.
As used herein, “mutation” refers to a change in the amino acid sequence of a native protein. Mutations can be described by using the native sequence and then identifying the specific amino acids that have been changed. A “mutant” refers to the protein that contains the mutation. A full-length mutant sequence refers to the full amino acid sequence of the mutant protein, as opposed to a description of the mutant in terms of the amino acids that differ from the native protein.
Terms such as “first”, “second”, and “within” are used merely to distinguish one component (or part of a component or state of a component) from another. Such terms are not meant to denote a preference or a particular orientation and are not meant to limit embodiments of the disclosure. In the following detailed description of the example embodiments, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that embodiments of the disclosure may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
A user may be any person or entity that interacts with the database, the AI platform, or both. Examples of a user may include, but are not limited to, a principal investigator, a scientist, a post-doctoral candidate, a graduate student, or a pharmaceutical company, for example. There can be one or multiple users.
The number of amyloid structures in the protein databank has been steadily increasing over the last two decades. It has now achieved a number (>80) that was reached for globular proteins at the beginning of the 1980s and that then triggered the first developments of template-based modelling methods including homology-based and threading (or fold recognition) in an attempt to estimate the versatility of individual folds and discover novel folds in a more directed manner. In the present invention we provide a new algorithm, Cordax, which is an exhaustively trained regression model that leverages a substantial library of curated amyloid template structures combined with machine learning. Cordax uses a logistic regression approach to translate structural compatibility and interaction energies into sequence aggregation propensity and is therefore unconstrained by defined sequence tendencies, such as hydrophobicity or secondary structure preference that direct most sequence-based predictors. As a result, we have discovered unconventional amyloid-like sequences, including sequences with low aliphatic content, high net charge or sequences with low intrinsic structural propensities. Clustering amyloid sequences by t-SNE two-dimensional reduction revealed the substructure of amyloid sequence space. Apart from a large cluster corresponding to sequences found in the hydrophobic core of globular proteins, we also found clusters corresponding to surface-exposed amyloid sequences in globular proteins, small aliphatic functional amyloids, N/Q/Y prions, strongly helical and intrinsically disordered sequences which could be compatible with liquid-liquid phase responsive sequences. The present invention highlights the discovery of highly soluble, yet amyloid-forming, sequences and suggests that the largest portion of the remaining uncharted amyloid sequence space is hidden in this corner (see
Cordax provides a cost-effective and powerful complementary computational alternative that can be operated without the scientific expertise needed to apply intricate technical approaches. Apart from its function as an aggregation predictor, the tool is uniquely poised to provide detailed complementary structural information on the putative amyloid fibril architecture of identified aggregation-prone regions. Users can utilise the method to structurally characterise identified APRs by classifying their overall specific topological preferences, including β-strand directionality and key residue positions that are integral parts of the amyloid core. The latter information is imperative for efforts focused on understanding the underlying mechanisms that dictate amyloid-related diseases or the formation of functional amyloids, but can also have an immense impact on the design of applied nano-biomaterials64, targeted amyloid inducers65 or counteragents, following the increased interest in the development of structure-based inhibitors of aggregation61-63.
Accordingly, the present invention provides in a first embodiment a method for identifying at least one aggregation prone region (APR) present in a protein, the method comprising:
In a specific embodiment the querying of the machine learning engine (or algorithm which is an equivalent word) involves fragmenting said protein into hexapeptides using a sliding window process, followed by modelling said hexapeptides on the backbone of said second library, calculating the thermodynamic stability for each sequence using a Force Field and feeding the data into said logistic regression model.
In a specific embodiment the Force Field used is FoldX.
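The fragmentation step described above can be illustrated with a minimal Python sketch. It covers only the sliding-window split of a query sequence into hexapeptides; the function name `sliding_hexapeptides` and the example sequence are hypothetical, and the subsequent modelling on template backbones and the FoldX energy calculation are not reproduced here.

```python
def sliding_hexapeptides(sequence, window=6):
    """Return (start_index, fragment) pairs for every window-length fragment.

    Start indices are 0-based; a sequence shorter than the window yields [].
    """
    sequence = sequence.upper()
    return [
        (start, sequence[start:start + window])
        for start in range(len(sequence) - window + 1)
    ]


# Hypothetical 10-residue query: 10 - 6 + 1 = 5 overlapping hexapeptides.
fragments = sliding_hexapeptides("KLVFFAEDVG")
for start, fragment in fragments:
    print(start, fragment)
```

Each fragment would then be scored independently, so that a protein of length N contributes N - 5 hexapeptides to the downstream steps.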
In specific embodiments the invention provides a computer-readable storage medium which stores computer-executable instructions that, when executed by at least one processor, cause the processor to perform one of the methods described herein before in the embodiments.
In yet another embodiment the invention provides an apparatus comprising control circuitry configured to perform one of the methods described in the previous embodiments.
Systems of the disclosure can include an intranet-based computer system that is capable of communicating with various software. A computer system includes any type of computing device or communication device. Examples of such a system can include, but are not limited to, super computers, a processor array, distributed parallel system, a desktop computer with LAN, WAN, Internet or intranet access, a laptop computer with LAN, WAN, Internet or intranet access, a smart phone, a server, a server farm, an android device (or equivalent), a tablet, smartphones, and a personal digital assistant (PDA). Further, as discussed above, such a system can have corresponding software (e.g. user software, sensor device software). The software of one system can be a part of, or operate separately but in conjunction with, the software of another system.
Embodiments of the disclosure include a storage repository. The storage repository can be a persistent storage device (or set of devices) that stores software and data. Examples of a storage repository can include, but are not limited to, a hard drive, flash memory, some other form of solid-state data storage, or any suitable combination thereof. The storage repository can be located on multiple physical machines, each storing all or a portion of the database, AI platform, protocols, algorithms, or other stored data according to some example embodiments. Each storage unit or device can be physically located in the same or in a different geographic location. In embodiments, the storage repository may be stored locally, or on cloud-based services such as Amazon Web Services.
In one or more example embodiments, the storage repository stores one or more databases, AI Platforms, protocols, algorithms, and stored data. The protocols can include any of a number of communication protocols that are used to send, receive, or send and receive data between the processor, datastore, memory and the user. A protocol can be used for wired and/or wireless communication. Examples of protocols can include, but are not limited to, Modbus, PROFIBUS, Ethernet, and fiber optic.
Systems of the disclosure can include a hardware processor. The processor executes software, algorithms, and firmware in accordance with one or more example embodiments. The processor can be a central processing unit, a multi-core processing chip, a system on a chip (SoC), a multi-chip module including multiple multi-core processing chips, or other hardware processor in one or more example embodiments. The processor is known by other names, including but not limited to a computer processor, a microprocessor, and a multi-core processor. The processor can also be an array of processors.
In one or more example embodiments, the processor executes software instructions stored in memory. Such software instructions can include generating machine learning models, executing machine learning models, performing analysis on data received from the database, and so forth. The memory includes one or more cache memories, main memory, or any other suitable type of memory. The memory can include volatile or non-volatile memory.
The processing system can be in communication with a computerized data storage system which can be stored in the storage repository. The data storage system can include a non-relational or relational data store, such as a MySQL or other relational database. Other physical and logical database types could be used. The data store may be a database server, such as Microsoft SQL Server, Oracle, IBM DB2, SQLite, or any other database software, relational or otherwise. The data store may store the information identifying syntactical tags and any information required to operate on syntactical tags. In some embodiments, the processing system may use object-oriented programming and may store data in objects. In these embodiments, the processing system may use an object-relational mapper (ORM) to store the data objects in a relational database. The systems and methods described herein can be implemented using any number of physical data models. In one example embodiment, an RDBMS can be used. In those embodiments, tables in the RDBMS can include columns that represent coordinates. The tables can have pre-defined relationships between them. The tables can also have adjuncts associated with the coordinates.
In embodiments, the systems of the disclosure can include one or more I/O (input/output) devices that allow a user to enter commands and information into the system, and also allow information to be presented to the user or other components or devices. Examples of input devices include, but are not limited to, a keyboard, a cursor control device (such as a mouse), a microphone, a touchscreen, and a scanner. Examples of output devices include, but are not limited to, a display device (e.g., a display, a monitor, or projector), speakers, outputs to a lighting network (such as a DMX card), a printer, and a network card. For example, the input devices can be used to enter data on native proteins and mutation sequences and assays. The input devices can also be used to enter desired functional data for a protein. The output devices can be used to output analysis data and/or engineered protein sequences resulting from AI protein design.
Various techniques are described herein in the general context of software.
Generally, software includes routines, programs, objects, components, data structures, and so forth that perform particular tasks or implement particular abstract data types. An implementation of these modules and techniques can be stored on or transmitted across some form of computer readable media. Computer readable media is any available non-transitory medium or non-transitory media that is accessible by a computing device. By way of example, and not limitation, computer readable media includes computer storage media.
In embodiments, the AI Platform comprises a machine learning method, such as a neural network for effective protein function prediction. In some embodiments, the AI platform includes neural networks, genetic algorithms, decision trees, fuzzy logic, symbolic rules, gradient boosting, support vector machines, and other machine learning based systems. Pluralities and/or combinations of the above may also be used. In embodiments, the AI Platform can use ML frameworks such as Keras, Caffe, PyTorch, TensorFlow, the Microsoft Cognitive Toolkit, MXNet, Chainer, and Theano, with Python as the predominant implementation language. In embodiments, the AI platform will allow for agnostic integration with other algorithms (such as gradient boosting, SVM, Gaussian processes) and their respective frameworks (XGBoost, scikit-learn, GPy etc.) by separating data preparation from model creation and by using a NumPy data format common to all of these frameworks. In some embodiments, data preparation tools can be released as a Python package.
Embodiments of the disclosure use protein feature encodings to add physical or biological knowledge to amino acid sequences to create representations amenable to machine learning. As the choice of encoding varies based on the size and diversity of the input, as well as the task, several encoding methods can be implemented, allowing users to test and select the encodings most relevant to their problem. The AI Platform can include the following encodings, for example: one-hot, autoencoders, amino acid property encoders, learned BLOSUM/MSA evolutionary encodings, sequence mutation representation relative to WT, secondary structure/solvent accessible surface area encodings, learned AA embeddings, POOL, Phoenix, and/or structural/graph/topological encodings.
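As an illustration of the simplest encoding listed above, the following Python sketch shows one possible one-hot encoding of an amino acid sequence. It is a minimal example under stated assumptions: the alphabet ordering, the function name `one_hot` and the example heptapeptide are chosen for illustration only and are not specified by the disclosure.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a sequence as a list of 20-element 0/1 rows, one row per residue."""
    rows = []
    for residue in sequence.upper():
        if residue not in INDEX:
            raise ValueError(f"non-standard residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1  # single 1 at the residue's alphabet position
        rows.append(row)
    return rows


encoded = one_hot("GNNQQNY")  # 7 residues -> 7 rows of 20 columns
```

If a NumPy array is required by a downstream framework, the nested list can be converted with `numpy.asarray(encoded)`.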
The above-described embodiments of the present invention can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. It should be appreciated that any component or collection of components that perform the functions described above can be generically considered as one or more controllers that control the above-discussed functions. The one or more controllers can be implemented in numerous ways, such as with dedicated hardware, or with general purpose hardware (e.g., one or more processors) that is programmed using microcode or software to perform the functions recited above.
One or more processors may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks, or fiber optic networks.
One or more algorithms for controlling methods or processes provided herein may be embodied as a readable storage medium (or multiple readable media) (e.g., a computer memory, one or more floppy discs, compact discs (CD), optical discs, digital video disks (DVD), magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various methods or processes described herein.
In some embodiments, a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form. Such a computer readable storage medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the methods or processes described herein. As used herein, the term “computer-readable storage medium” encompasses only a computer-readable medium that can be considered to be a manufacture (e.g., article of manufacture) or a machine. Alternatively or additionally, methods or processes described herein may be embodied as a computer readable medium other than a computer-readable storage medium, such as a propagating signal.
The terms “program” or “software” are used herein in a generic sense to refer to any type of code or set of executable instructions that can be employed to program a computer or other processor to implement various aspects of the methods or processes described herein. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more programs that when executed perform a method or process described herein need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various procedures or operations.
Executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in computer-readable media in any suitable form. Non-limiting examples of data storage include structured, unstructured, localized, distributed, short-term and/or long term storage. Non-limiting examples of protocols that can be used for communicating data include proprietary and/or industry standard protocols (e.g., HTTP, HTML, XML, JSON, SQL, web services, text, spreadsheets, etc., or any combination thereof). For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that conveys relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags, or other mechanisms that establish relationship between data elements.
While several embodiments of the present invention have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the functions and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the present invention. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings of the present invention are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, the invention may be practiced otherwise than as specifically described and claimed. The present invention is directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present invention.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified unless clearly indicated to the contrary. Thus, as a non-limiting example, a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A without B (optionally including elements other than B); in another embodiment, to B without A (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e. “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedure, Section 2111.03.
1. General Overview of the Cordax Algorithm
In the present invention we have designed a novel structure-based amyloid core sequence prediction method that (a) leverages all structural information currently available, and (b) employs a machine learning element for optimal prediction performance. In a first step, a curated template library of amyloid core structures was built (see the Cordax library described in example 2 below). Similar to known prediction methods, we fixed on the hexapeptide as the unit of prediction. In order to determine the amyloid propensity of a query hexapeptide, we start by modelling its side chains on all the available amyloid template structures using the FoldX force field30, which yields a model and an associated free energy estimate (ΔG, kcal/mol) for each template. These free energies are then fed into a logistic regression model (see example 3), which is a simple statistical method relating a binary outcome to continuous variables. The prediction output of Cordax is twofold: first, the logistic regression predicts whether or not the segment is an amyloid core sequence; second, for the sequences predicted to be an amyloid core, the most likely amyloid core model is provided. For longer query sequences, a sliding window approach is adopted. Specific technical details of the pipeline are outlined below in the further examples.
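The scoring step of this overview can be sketched in Python as follows. This is a minimal illustration, not the Cordax implementation: the template identifiers, regression coefficients, bias and threshold are placeholder values, and the rule used here to select the most likely model (the template with the lowest free energy) is an assumption, since the selection criterion is not spelled out in this overview.

```python
import math


def predict_amyloid(delta_g, weights, bias, threshold=0.5):
    """Score one hexapeptide from its per-template free energies.

    delta_g: dict mapping template id -> free energy estimate (kcal/mol)
    weights: dict mapping template id -> regression coefficient (placeholders)
    Returns (probability, is_amyloid, best_template).
    """
    z = bias + sum(weights[t] * delta_g[t] for t in weights)
    probability = 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function
    is_amyloid = probability >= threshold
    # Assumption: report the template with the lowest free energy as the model.
    best_template = min(delta_g, key=delta_g.get) if is_amyloid else None
    return probability, is_amyloid, best_template


# Placeholder numbers for illustration only; they are not fitted values.
energies = {"template_A": -2.0, "template_B": 1.5, "template_C": -0.5}
weights = {"template_A": -0.8, "template_B": -0.3, "template_C": -0.6}
probability, is_amyloid, best = predict_amyloid(energies, weights, bias=-1.0)
```

In a real setting the coefficients would be obtained by fitting the regression to peptides with experimentally determined aggregation behaviour, and the function would be applied to every hexapeptide produced by the sliding window.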
2. Collection, Refinement and Characterisation of Fibril Structures for Machine Learning, Building of the Cordax Library
We isolated 78 short segment fibril core high resolution structures from the Protein Data Bank (see Table 1). Templates were grouped into 7 distinct topological classes out of 8 theoretically possible based on their overall structural properties, as previously proposed by Sawaya et al31. Briefly, topologies are defined by whether β-sheets have parallel versus antiparallel orientation, by the orientation of the strand faces that form the steric zipper (face-to-face versus face-to-back), and finally the orientation of both sheets towards each other and whether that results in identical or different fibril edges. This complexity was addressed by generating an ensemble of amyloid cores per structure using crystal contact information derived from the solved structures. Every template comprises two facing β-sheets, each composed of five successive β-strands. Since parallel architectures can share more than one homotypic packing interface, those structures were split into separate individual entries (
The amyloid interaction interfaces were analysed in detail following energy refinement by the FoldX force field30. During this step we identified and rejected 33 imperfect β-packing interfaces formed by β-strands that contribute less than three interacting residues, thus reducing the ensemble to 146 structures. Detailed analysis of the contributions of various energy components showed that these excluded β-packing interfaces have inefficient shape complementarity and low overall stability, stemming from a combination of weak electrostatic contributions, diminished van der Waals interactions and exposure of hydrophobic residues to the solvent (
3. Regression Model Training Using Peptide Sequences with Experimentally Determined Amyloid-Forming Properties
In previous work we synthesised and explored the aggregation potential of 940 peptide sequences derived from both functional and pathological amyloid-forming proteins, which were supplemented with additional data on 462 hexapeptides derived from other published sources to develop WALTZ-DB 2.032, the largest public comprehensive repository of experimentally defined amyloidogenic peptides. In total, 1402 hexapeptide sequences from WALTZ-DB were modelled on the 140 backbone structures of the Cordax library, leading to the generation of 196280 models. The thermodynamic stability of each model (ΔG, kcal/mol) was calculated using FoldX and fed into a logistic regression model (
4. Benchmarking Peptide and Regional Detection of Aggregation Propensity with the Cordax Algorithm
As an initial test of the prediction accuracy of the regression model, we performed leave-one-out cross-validation on the training dataset32 and performance metrics were determined on a peptide basis. Due to the extensive size of the dataset, comparison to other software was performed only with methods supporting multiple sequence input and a non-binary scoring function, since performances were compared using Receiver Operating Characteristic (ROC) analysis33. The ROC curves generated highlight that Cordax performance exceeds that of 8 state-of-the-art methods, which we applied using optimised options defined by the developers7,9,21-24,34. In detail, Cordax performs well above random, as shown by the highest total area under the curve (AUC) value of 0.87 (
5. Designed Aggregation Prone Peptide Nucleators Validate the Accuracy of Cordax Algorithm Predictions
In the interest of improving the current description of the familiar amyloidogenic protein dataset, we selected and synthesised a subset of 96 peptides corresponding to strong aggregation prone regions identified in these proteins by Cordax. Apart from prediction strength, the peptide screen was also selectively constructed to ensure broad sequence variability and a wide distribution across the proteins of the dataset, with a preference for longer entries defined by inadequate previous characterisation. Peptide sequences were cross-checked and filtered to exclude overlapping sequences with previously identified amyloid regions and WALTZ-DB (see Table 2). The remaining selection of 96 peptides was synthesized using standard solid phase synthesis and their amyloid-forming properties were initially examined using Thioflavin-T (Th-T) or pFTAA binding, following rotating incubation for 5 days at room temperature. The binding assays are complementary, as Th-T and pFTAA are oppositely charged molecules, which increases the amyloid identification rate by overcoming cases of dye-specific failure to bind to amyloid surfaces based on charge repulsion. Under these conditions, 66 peptides successfully bound the specific dyes (
6. Machine-Guided Structural Prediction Detects Highly Soluble Surface-Exposed Conformational Switches of Aggregation
The expanded amyloidogenic annotation of the protein dataset was supplemented with structural analysis of the newly identified aggregation prone regions. Out of 96 peptides designed and experimentally tested, 85 peptides were found to display evident amyloid-forming features, with more than half (55.3%) being predicted specifically by Cordax, contrary to shared predictions with sequence-based tools of high specificity (44.7%) (See Table 2). Pinpointing the location of the identified nucleators in parental protein folds (
7. Dimensionality Reduction Transformation Reveals that Cordax Infiltrates Uncharted Areas of Amyloid Sequence Space
To further explore the capabilities of our method, we composed a map of the known amyloid forming sequence space using t-distributed Stochastic Neighbour Embedding (t-SNE) for dimensionality reduction (
8. The Cordax Algorithm Predicts the Structural Layout and Overall Topology of Amyloid Fibril Cores
Due to restricted availability of experimentally determined structures not included in the Cordax library, we first analysed the information derived from cross-threading analysis in order to test the performance of the tool in predicting the structural architecture of aggregation prone stretches. Among 73 unique sequences corresponding to the structural library, Cordax was able to accurately assign the correct architecture to 63%, whereas 81% was identified with proper β-strand orientation (parallel/antiparallel) (
9. Cordax Pipeline—Summary
The Cordax algorithm receives a protein sequence in FASTA format as input, which is fragmented into hexapeptides using a sliding window process. Sequences are then threaded against the fragment library utilising FoldX and the derived free energies are translated into scoring values for every peptide window. An energetically fitted model is selected as the closest representative of the overall topology of the amyloid fibril core for each predicted window and is provided as output in standard PDB format to the users (
10. Datasets
Performance assessment of Cordax was carried out utilising two individual data sets for peptide and protein aggregation propensity detection. Further validation of the method was performed against an independent subset screen of 96 hexapeptide sequences.
WALTZ-DB 2.0 dataset: For peptide aggregation propensity, we used a dataset of 1402 non-redundant hexapeptides contained in the WALTZ-DB 2.0 repository32. This database is the largest currently available resource of experimentally characterized amyloidogenic peptides. It contains annotated peptide entries that are distributed in shorter subsets and extracted from literature22,23,67-69, in addition to peptides with experimentally determined amyloid-forming properties. As a result, it has been widely used as a validation set for several aggregation predicting tools21,23,67,70,71.
Reg33 dataset: Collected in 2013, this is currently a standard dataset for estimating the performance of aggregation propensity prediction in protein sequences25. It contains regional annotation of aggregating segments identified for 34 well-known amyloidogenic proteins. The annotation is assigned on a residue basis, thus containing 1260 residues in defined aggregation prone regions and 6472 residues located in non-aggregating segments.
Cordax validation dataset: This set consists of 96 hexapeptide segments derived from potentially mis-annotated non-amyloidogenic regions of the reg33 dataset that were predicted as aggregation prone segments after applying Cordax. Peptide segments were filtered for potential overlaps with the WALTZ-DB 2.0 set.
11. Comparative Analysis
Binary classification was utilized to determine performances of calculated aggregation propensities per hexapeptide fragment or per residue. As a result, predictions can be classified by comparison to experimental validation into true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), respectively. Performance is evaluated using the following metrics:
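The metric definitions themselves are not reproduced in this text. As an illustration only, the following Python sketch computes measures that are commonly derived from TP, TN, FP and FN counts. That these are the metrics used in the present evaluation is an assumption, and the function name is hypothetical.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard binary classification measures from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation
    }
```

The counts may be tallied per hexapeptide fragment or per residue, depending on the dataset being evaluated.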
12. Design of Variant Peptides of a New Aggregation Prone Region (APR) Identified in Apolipoprotein A-I
12.1 Design of Variant Peptides which can Inhibit Aggregation of ApoA-I
A number of naturally occurring mutations of human apolipoprotein A-I (ApoA-I; for a reference to this protein, see Frank P G and Marcel Y L (2000) J. Lipid Res. 41(6):853) have been associated with hereditary amyloidosis. Amyloidoses are a large group of heterogeneous diseases characterized by insoluble proteins inducing organ damage. Aggregation prone regions are regions critical for the aggregation of proteins able to form pathological aggregates. The Cordax algorithm of the invention was used to identify previously unknown aggregation prone regions (APRs) in apolipoprotein A-I. We identified the sequence LATVYV (SEQ ID NO: 172) present in the amino acid sequence of ApoA-I (corresponding to amino acids 38 to 43 in the protein sequence of ApoA-I) as a potential new APR.
Based on SEQ ID NO: 172 we explored the design of capping peptides. A capping peptide is a polypeptide which can inhibit the aggregation of a target protein. The term “capping peptide” is well known in the art. Typically, capping peptides have a length of between 5 and 10 amino acids and differ by one, two or three amino acid substitutions from a contiguous aggregation prone region (APR) naturally occurring in a target protein.
In building our method we reasoned that for a candidate peptide to qualify as a capping peptide, it should strongly bind to the axial end of a growing amyloid core but at the same time the peptide should introduce sufficient structural disruption which prohibits further elongation along the fibril axis. The latter is in contrast to a wild type (or normal) elongating/nucleating sequence. The method below is illustrated with variants having one amino acid difference as compared to the sequence of the wild type APR region. Our method to design a capping peptide hinges on the availability of the 3D-structure of the amyloid core of SEQ ID NO: 172, here this 3-D structure was modelled based on the Cordax algorithm. Starting from the predicted 3-D structure of the amyloid core structure, a forcefield algorithm was used to calculate the interaction energies between a list of candidate capping peptides (see further) and the 3-D amyloid core structure. In the present example we have used the FoldX force field to calculate the thermodynamic stability of the putative interactions.
The methodology starts by generating an in silico list of variants of the amino acid sequence of the amyloid core (SEQ ID NO: 172). Starting from the APR sequence, a list of variants is created in which each amino acid in the APR sequence is substituted by each of the 19 other amino acids. In a subsequent step, the candidate peptides (the in silico list of APR variants) are used for calculating the interaction energies. By plotting the interaction potential calculated through (1) on the x-axis and the potential from (2) on the y-axis we end up with a quadrant profile for each of the variant sequences (see
Thus the instant invention provides a method to obtain a set of candidate capping peptides binding to a target protein that forms pathological aggregates comprising the following steps:
Candidate capping peptide sequences are depicted in Table 6.
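As a non-limiting illustration of the variant generation step described above, the following Python sketch enumerates all single-substitution variants of an APR. The function name is hypothetical. For a hexapeptide such as LATVYV this gives 6 positions × 19 substitutions = 114 candidate sequences.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_substitution_variants(apr: str) -> List[str]:
    """Replace each position in turn by each of the 19 other amino acids."""
    variants = []
    for pos, wild_type in enumerate(apr):
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                variants.append(apr[:pos] + aa + apr[pos + 1:])
    return variants


candidates = single_substitution_variants("LATVYV")
```

Each entry in `candidates` differs from the wild type sequence at exactly one position, so the wild type itself never appears in the list.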
The Th-T kinetics (see
12.2 Variant Peptides to Induce Aggregation
In what can be described as the inverse experiment, we also designed peptides which can induce the aggregation of ApoA-I. Here a favorable variant sequence (in the bottom left quadrant) has a negative ΔG free energy for cross interaction with the three-dimensional structure of the APR core and also has a negative ΔG free energy for elongation with the three-dimensional structure of the APR core with a variant sequence bound to the axial end. The bottom left quadrant corresponds to sequence variants that are predicted to act as aggregation inducing peptides against the identified APR template structure. Table 7 depicts sequences of candidate peptides which can induce the aggregation of ApoA-I.
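The quadrant-based selection can be sketched in Python as follows, purely as an illustration. The two free energies are assumed to have been computed beforehand with a force field. The function name and labels are hypothetical. Treating a variant with negative cross-interaction energy but non-negative elongation energy as a capping candidate is an inference from the design rationale given in section 12.1 and is not a rule stated explicitly in this text.

```python
def classify_variant(dg_cross: float, dg_elongation: float) -> str:
    """Assign a variant to a quadrant from two free energies (kcal/mol).

    dg_cross:      variant binding to the axial end of the APR core
    dg_elongation: further elongation onto the core capped by the variant
    """
    if dg_cross < 0 and dg_elongation < 0:
        # Bottom left quadrant: binds and still allows elongation.
        return "inducer"
    if dg_cross < 0:
        # Assumed capping quadrant: binds but disfavours elongation.
        return "capper_candidate"
    # Variant is not predicted to bind the core favourably.
    return "weak_binder"
```

Applying `classify_variant` to every single-substitution variant of the APR partitions the candidate list into the groups from which inducing or capping peptides would be chosen.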
The Th-T kinetics (see
Materials and Methods
Peptide Synthesis
Peptides derived from the Cordax validation set were synthesized using an Intavis Multipep RSi solid phase peptide synthesis robot. Peptide purity (>90%) was evaluated using RP-HPLC purification protocols and peptides were stored as ether precipitates (−20° C.). Peptide stocks were initially treated with 1,1,1,3,3,3-hexafluoro-isopropanol (HFIP) (Merck), then dissolved in traces of dimethyl sulfoxide (DMSO) (Merck) (<5%), filtered through 0.2 μm filters and finally diluted in milli-Q water to reach a final concentration of 200 μM or up to 1 mM for dye-negative peptides. Dithiothreitol (DTT) (1 mM) was included in solutions of peptides containing cysteine or methionine residues. All peptides were incubated at room temperature for a period of 5 days on a rotating wheel.
Thioflavin-T and pFTAA Binding Assays
Amyloid aggregation was monitored using fluorescence spectroscopy binding assays. Th-T (Sigma) or pFTAA (Ebba Biotech AB) was added to half-area black 96-well microplates (Corning, USA) at a final concentration of 25 μM and 0.5 μM, respectively. Fluorescence intensity was measured in replicates (n=6) using a PolarStar Optima and a FluoStar Omega plate reader (BMG Labtech, Germany), equipped with an excitation filter at 440 nm and emission filters at 480 nm and 510 nm, respectively.
Transmission Electron Microscopy
Peptide solutions were incubated for 5 days at room temperature in order to form mature amyloid-like fibrils. Suspensions (5 μL) of each peptide solution were added on 400-mesh carbon-coated copper grids (Agar Scientific Ltd., England), following a glow-discharging step of 30 s to improve sample adsorption. Grids were washed with milli-Q water and negatively stained using uranyl acetate (2% w/v in milli-Q water). Grids were examined with a JEM-1400 120 kV transmission electron microscope (JEOL, Japan), operated at 80 keV.
Congo Red Staining
Droplets (10 μL) of peptide solutions containing mature amyloid fibrils were cast on glass slides and permitted to dry slowly in ambient conditions in order to form thin films. The films were stained with a Congo red (Sigma) solution (0.1% w/v) prepared in milli-Q water for 20 minutes. De-staining was performed with gradient ethanol solutions (70% to 90%).
Determination of Peptide Propensities
Surface exposure and secondary structure analyses were performed using the FoldX energy force field on the available crystal structures for acylphosphatase-2 (PDB ID:1APS), amphoterin (PDB ID:1CKT and 1HME), apolipoprotein-C2 (PDB ID:1I5J), α-synuclein (PDB ID:1XQ8), β2-microglobulin (PDB ID:1A1M), casein (PDB ID:6FS5), gelsolin (PDB ID:3FFN), Het-S (PDB ID:2WVN), kerato-epithelin (PDB ID:5NV6), lactoferrin (PDB ID:1CB6), prolactin (PDB ID:1RW5), major prion protein (PDB ID:1E1G), repA (PDB ID:1HKQ), serum amyloid alpha (PDB ID:4IP8), Sup35 (PDB ID:4CRN) and Ure2p (PDB ID:1HQO). Partition coefficients were calculated using PlogP, which specialises in peptides with blocked termini72. Structural alignment and visualisation were performed with the aid of YASARA73. Sequence similarities were calculated using the BLOSUM62 matrix currently available under the Biostrings R library. Correlation plots were generated using the ggpairs( ) function available under the GGally R library and ROC curves were calculated using ROCR.
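For illustration, the area under a ROC curve of the kind referred to above can be computed directly from labels and scores. The analysis described here used the ROCR R library. The Python sketch below is an independent, minimal equivalent based on the pairwise (Mann-Whitney) formulation, with a hypothetical function name.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative.

    labels: 1 for amyloid-forming, 0 for non-amyloid
    scores: non-binary prediction scores (higher means more amyloid-like)
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes. The pairwise loop is quadratic in the number of peptides, which is acceptable for datasets of a few thousand entries.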
Dimensionality Reduction Analysis
A defined amyloid-forming sequence space was constructed by merging the experimentally determined amyloid sequences of the 96-peptide screen, identified by Cordax, with the amyloid sequence content extracted from WALTZ-DB. Prior to t-SNE analysis, scoring outputs using Cordax, PASTA23, TANGO7 and WALTZ21 were calculated for each peptide entry. Peptide description was complemented with a 20-dimensional vector using the available R package Peptides. All data points were reduced and embedded in 2D-space using the Rtsne package, with perplexity (p=45), iteration steps (n=5000) and learning rate (default) defined based on the initial guidelines proposed by van der Maaten & Hinton74. UMAP reduction was performed using the R umap package, and three-dimensional PCA analysis was conducted using the pca3d R package and visualised with scatter3D.
Tables 1 to 5
Number | Date | Country | Kind |
---|---|---|---|
20176563.3 | May 2020 | EP | regional |
This application is a national phase entry under 35 U.S.C. § 371 of International Patent Application PCT/EP2021/063691, filed May 21, 2021, designating the United States of America and published in English as International Patent Publication WO 2021/239629 on Dec. 2, 2021, which claims the benefit under Article 8 of the Patent Cooperation Treaty to European Patent Application Serial No. 20176563.3, filed May 26, 2020, the entireties of which are hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/063691 | 5/21/2021 | WO |