The invention relates to an information management system for managing biochemical annotations and pathways, and more particularly to equipment and software products for automatic creation and identification of biochemical annotations and pathways. As used herein, ‘biochemical’ means biological with or without extensions to chemistry. Biochemical annotations classify biochemical entities into categories. For example, the Gene Ontology (GO) Consortium has defined ontologies for annotating gene products to molecular functions, biological processes and cellular components. In addition to the GO system, there are many other category systems, ontologies and controlled vocabularies which are used to annotate biochemical entities to particular categories, in order to describe the functions of the biochemical entities or the processes in which they participate. Biochemical pathways are used to model biochemical networks wherein biochemical entities interact with each other.
Biochemical annotations, such as the above-mentioned GO ontology, are based on textual definitions of categories, and they are typically processed manually. Interpretation of such textual definitions of categories requires a biology expert, which may prove to be a bottleneck in utilizing available information on annotations.
Commonly owned PCT publication WO2005/003999, which is incorporated herein by reference, discloses an exemplary system for modelling specific biochemical systems. While the prior art systems are good at modelling specific biochemical systems as textual categories or individual pathways, they exhibit shortcomings in exploiting similarities and common features between different biochemical systems. There are large amounts of textual information, available both on-line and in printed form, for verbally describing similarities and common features between different biochemical systems but known information systems are incapable of modelling them.
An object of the present invention is to provide equipment and software products for modelling biochemical systems such that the above shortcomings are alleviated. The object of the invention is achieved by equipment and software products which are characterized by what is stated in the independent claims. The preferred embodiments of the invention are disclosed in the dependent claims.
An aspect of the invention is an electronic information management system for managing biochemical information, the information management system comprising data structures for modelling:
This extension of pathway modelling from molecule level to higher component levels (such as cellular compartment, cell, tissue, organ, organism, individual, population, environment, or categories of these entities) makes it possible to utilize automatic molecule-level modelling frameworks (such as those presented in said commonly-owned PCT publication WO2005/003999), where connection information of pathways is used to generate ordinary differential equation models or flux balance models for higher-level biological systems. The above-mentioned data structures support generalizations of biochemical entities and their quantitative variables (eg concentration of cells, tissues, or the like), interactions and their quantitative variables (eg rate of interaction producing cells, tissues) and connections (eg connecting generalized entities to generalized interactions) and their quantitative variables (eg flux via product and substrate connections). This makes it possible to apply, to all biological systems, automatic modelling solutions similar to those available in prior art systems for chemical or biomolecular systems. To mention just two examples, it will be possible to use flux balance analysis in the study of the T-cell maturation process from prethymocytes through some characteristic middle steps to mature thymocytes, or in the steady state of production of epithelial cells when old skin is replaced by new.
A preferred embodiment of the IMS according to the invention further comprises a library of equivalent pathways of categories, wherein each equivalent pathway of a category comprises a set of connections which assigns the set of functions associated to the category to the biochemical entities associated to the category.
Another aspect of the invention is a computer program product, executable in a computer system. The computer program product comprises program code portions for creating the data structures according to claim 1. In other words, the computer program product according to the invention changes a conventional computer system into an IMS according to the invention.
In this IMS description, the references to biochemical entities, interactions or the like should be interpreted as references to data structures which model the biochemical entities, interactions, etc.
An IMS according to the invention is able to treat categories as building blocks of equivalent biochemical pathways.
According to an embodiment of the invention, the IMS further comprises an annotation logic for creating automatic annotations based on the library and specific instances of pathways. For example, the automatic annotations may be created based on pathway topology.
According to another embodiment of the invention, the IMS further comprises an instantiation logic for creating specific instances of pathways based on the library, and an input set of biochemical entities or annotations.
According to yet another embodiment of the invention, the IMS further comprises a generalization logic for creating new categories and/or annotations and/or general pathways based on an input set of specific instances of pathways.
Yet another embodiment of the invention relates to a consistency checker for checking consistency between the annotations, the specific pathways and the library, based on specific instances of pathways and/or general pathways. A benefit of the consistency checker is the ability to automatically check for inconsistencies between the generic and specific pathways and the annotations which define the categories. The annotation logic, instantiation logic, generalization logic and consistency checker may be implemented separately or in combination.
According to a further embodiment of the invention, at least one pathway comprises a hierarchical description of a biochemical entity and a hierarchical description of a location. A benefit of the hierarchical descriptions is the ability to describe biochemical entities and locations with as much detail as is required. The descriptions of biochemical entity and location may be built from a common set of biochemical components but the descriptions are independent from each other, which makes it possible to describe biochemical entities which are located in a non-native location.
Yet another preferred embodiment of the invention comprises means for storing and visualizing descriptions for the biochemical entities and locations in a variable description language (“VDL”). The variable description language comprises variable descriptions, each of which comprises one or more pairs of keyword and name but no line terminator. The pairing of keywords and names makes the VDL largely self-sufficient, or readily processable by computers. An extendible table of permissible keywords supports automatic checking of syntax and/or consistency, yet makes it possible to extend the VDL without programming skills.
In the following the invention will be described in greater detail by means of preferred embodiments with reference to the attached drawings, in which:
Components are basic building elements of biochemical systems, such as molecules, cellular compartments, cells (cell types), tissues, organs, organisms, individuals, populations and environments. Component data, which is denoted by reference numeral 202, describes the static properties of components, such as structural or functional features; detected, constant attributes and/or characteristic features. For example, carbon dioxide (CO2) is a component that may have component data. There may also be some variable attributes which do not alter the identity of a biochemical entity.
System data, denoted by reference numeral 204, describes how components are connected to form biochemical systems. The system data 204 also includes the kinetic laws of interaction rates depending on relevant state data, denoted by reference numeral 206. Interactions are transformations in which substrates are converted to products. If a substrate and a product are in different locations, the locations have a common interaction that transports substrates from one location to another as products.
There are connections between interactions and other components. It is advantageous to classify connections into categories which include substrates, products, controllers and outcomes.
In the example shown in
There is also a product type connection between the molecule M[x] and interaction I[1]. A product type connection means that the biochemical entity or the category at the terminating end of the connection is produced in the interaction at the originating end of the connection.
A controller type connection is a third type of connection, an example of which is the connection from the molecule M[x] to interaction I[3]. A controller type connection means that the biochemical entity or the category at the originating end of the connection (here: M[x]) controls the interaction (eg, its rate) at the terminating end of the connection (here: I[3]).
A fourth type of connection, namely an outcome type connection, means that the biochemical entity or the category at the originating end of the connection (here: M[x]) is modified in terms of attributes in the interaction at the terminating end of the connection (here: I[4]).
A connection may have an associated stoichiometric coefficient to describe kinetic laws (quantitative relations between substrates and products). If the kinetic laws are missing, interaction rates are unknown variables.
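By way of a non-limiting illustration, the four connection types and their stoichiometric coefficients may be sketched as a small data structure from which the net production rate of each entity is computed. The class name Connection, the function net_rates and the entities A, B, C and E are hypothetical and do not appear in the drawings; the sketch assumes that only substrate and product connections carry flux.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Connection:
    entity: str                # biochemical entity or category
    interaction: str           # interaction the entity is connected to
    kind: str                  # "substrate", "product", "controller" or "outcome"
    coefficient: float = 1.0   # stoichiometric coefficient

def net_rates(connections, rates):
    """Net production rate of each entity, given a rate for each interaction."""
    net = {}
    for c in connections:
        if c.kind == "substrate":
            sign = -1.0        # substrates are consumed
        elif c.kind == "product":
            sign = 1.0         # products are produced
        else:
            continue           # assumption: controllers and outcomes carry no flux
        net[c.entity] = net.get(c.entity, 0.0) + sign * c.coefficient * rates[c.interaction]
    return net

# Hypothetical two-step chain: A -> 2 B (controlled by E), then B -> C.
connections = [
    Connection("A", "I[1]", "substrate"),
    Connection("B", "I[1]", "product", 2.0),
    Connection("B", "I[2]", "substrate"),
    Connection("C", "I[2]", "product"),
    Connection("E", "I[1]", "controller"),
]
rates = {"I[1]": 1.0, "I[2]": 2.0}
balance = net_rates(connections, rates)
```

With these illustrative rates the intermediate B is produced and consumed at the same rate, which is the kind of steady-state condition a flux balance model imposes.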
Reference numeral 206 collectively denotes state data. There are quantitative and qualitative variables, such as count, concentration, mass, etc., associated to biochemical entities. Quantity attributes are functions of flux rates via product and substrate connections. A representative quantity attribute describes a flux rate of an interaction which transforms a substrate into a product at a certain rate. Quality attributes are functions of outcomes. A representative quality attribute describes the growth of a cell, in which the size of the cell increases but no (new) products are produced. Such variables can be elements of a system's state, which may be described by a set of state data, such as a state vector. State data describes the values of these variables in time and space. For example:
V[concentration]U[mol/l]M[CO2]Ts[2005.06.22 15:00:00]L[my_location]=1.5
This is an expression of a variable (concentration) expressed in units (mol/l) of molecule CO2 at time stamp 22 June 2005 at 15:00 in a location called “my_location”. The value of the variable is 1.5. Such variables are preferably expressed in a systematic variable description language (VDL), which will be further described in connection with
Space can be a discrete location, eg “my_location”, which may be specified in terms of an environment, population, individual, organism, organ, tissue, cell type, or cellular compartment. Some of these location-specifying elements may be inapplicable or may be left unused when specifying the location. In addition to specifying location information based on biochemical elements, the location information can be specified spatially, by using a reference coordinate system. For example:
V[concentration]U[mol/l]M[CO2]Ts[2005.06.22 15:00:00]L[my_location]X[0.5]Y[0.2]Z[0.5]=1.5
Extensible Markup Language (XML) is one example of an extendible language that could, in principle, be used to describe biochemical variables. XML expressions are rather easily interpretable by computers. However, XML expressions tend to be very long, which makes them poorly readable to humans. Accordingly, there is a need for an extendible VDL that is more compact than XML and more easily readable to both humans and computers.
The idea of an extendible VDL is that the allowable variable expressions are “free but not chaotic”. To put this idea more formally, we can say that the IMS should only permit predetermined variables but the set of predetermined variables should be extendible without programming skills. For example, if a syntax check to be performed on the variable expressions is firmly coded in a syntax check routine, any new variable expression requires reprogramming. An optimal compromise between rigid order and chaos can be implemented by storing permissible variable keywords in a data structure, such as a data table or file, that is modifiable without programming. Normal access grant techniques can be employed to determine which users are authorized to add new permissible variable keywords.
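The principle that the set of permissible keywords is data rather than program code may be sketched as follows. This is a minimal sketch under stated assumptions: the keyword set is hard-coded here only for brevity, the keyword Q is a made-up extension, and the names PERMISSIBLE_KEYWORDS and unknown_keywords are hypothetical.

```python
import re

# Hypothetical contents of the keyword table. In an IMS this would be read
# from an editable table or file rather than written into the program.
PERMISSIBLE_KEYWORDS = {"V", "U", "M", "Ts", "T", "L", "X", "Y", "Z"}

# A keyword followed by a name in square brackets (assumed keyword alphabet).
PAIR = re.compile(r"([A-Za-z]+)\[([^\[\]]*)\]")

def unknown_keywords(description, permitted):
    """Return the keywords used in a description that are not in the table."""
    return [kw for kw, _name in PAIR.findall(description) if kw not in permitted]

expr = "V[concentration]U[mol/l]M[CO2]Q[foo]"
before = unknown_keywords(expr, PERMISSIBLE_KEYWORDS)

# Extending the language is a change to the table, not to the checking routine.
extended = PERMISSIBLE_KEYWORDS | {"Q"}
after = unknown_keywords(expr, extended)
```

The checking routine itself is unchanged between the two calls; only the table differs.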
As regards the syntax of the language, a variable description may comprise an arbitrary number of keyword-name pairs 31. But an arbitrary combination of pairs 31, such as a concentration of time, may not be semantically meaningful.
The T and Ts keywords implement the relative (stopwatch) time and absolute (calendar) time, respectively. A slight disadvantage of expressing time as a combination of relative and absolute time is that each point of time has a theoretically infinite set of equivalent expressions. For example, “Ts[2002-11-26 18:00:30]” and “Ts[2002-11-26 18:00:00]T[00:00:30]” are equivalent. Accordingly, there is preferably a search logic that processes the expressions of time in a meaningful manner.
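One possible way for a search logic to treat equivalent time expressions alike is to resolve each expression to a single absolute point of time before comparing. The sketch below assumes the dash-separated date format of the two examples above and an hh:mm:ss relative time; the function name absolute_time is hypothetical.

```python
import re
from datetime import datetime, timedelta

def absolute_time(description):
    """Resolve Ts[...] plus an optional T[...] offset to one datetime."""
    stamp = re.search(r"Ts\[([^\]]*)\]", description)
    if stamp is None:
        raise ValueError("no Ts[...] time stamp in description")
    moment = datetime.strptime(stamp.group(1), "%Y-%m-%d %H:%M:%S")
    # The lookbehind keeps a longer keyword ending in T from being mistaken for T.
    offset = re.search(r"(?<![A-Za-z])T\[([^\]]*)\]", description)
    if offset is not None:
        hours, minutes, seconds = (int(part) for part in offset.group(1).split(":"))
        moment += timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return moment

a = absolute_time("Ts[2002-11-26 18:00:30]")
b = absolute_time("Ts[2002-11-26 18:00:00]T[00:00:30]")
```

Both expressions resolve to the same point of time, so a search on either form can find records stored in the other.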
By storing an entry for each permissible keyword in the table 38 within the IMS, it is possible to force an automatic syntax check on variables to be entered, as shown in
The syntax of the preferred VDL may be formally expressed as follows:
The purpose of explicit delimiters, such as “[” and “]” around the name, is to permit any characters within the name, including spaces (but excluding the delimiters, of course).
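A reader for variable descriptions of this form may be sketched as follows. The sketch assumes that keywords consist of letters only and that an optional value follows an equals sign after the last pair; the function name parse_variable is hypothetical and the code is not a statement of the formal syntax.

```python
import re

# Keyword, then a name between "[" and "]"; the name may hold any
# characters except the delimiters themselves.
PAIR = re.compile(r"([A-Za-z]+)\[([^\[\]]*)\]")

def parse_variable(text):
    """Split 'K1[name1]K2[name2]...=value' into a list of pairs and a value."""
    pairs, pos = [], 0
    while True:
        match = PAIR.match(text, pos)
        if match is None:
            break
        pairs.append(match.groups())
        pos = match.end()
    rest = text[pos:]
    if not pairs or (rest and not rest.startswith("=")):
        raise ValueError(f"malformed variable description at position {pos}")
    value = float(rest[1:]) if rest else None
    return pairs, value

pairs, value = parse_variable(
    "V[concentration]U[mol/l]M[CO2]Ts[2005.06.22 15:00:00]"
    "L[my_location]X[0.5]Y[0.2]Z[0.5]=1.5"
)
```

Because the name is delimited explicitly, the space inside the time stamp does not end the pair.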
A preferred set of keywords 38 comprises three kinds of keywords: what, where and when. The “what” keywords, such as variable, unit, biochemical entity, interaction, etc., indicate what was or will be observed. The “where” keywords, such as sample, population, individual, location, etc., indicate where the observation was or will be made. The “when” keywords, such as time or time stamp, indicate the time of the observation. The “what”, “where” and “when” keywords are separate and independent of one another, which makes it possible to describe the location of a biochemical entity independently of its function, for example.
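The independence of the three kinds of keywords may be illustrated by splitting the pairs of a variable description into “what”, “where” and “when” groups. The assignment of the one- and two-letter keywords to kinds shown here is an assumption made for the sketch, as are the names KEYWORD_KIND and split_by_kind.

```python
# Hypothetical assignment of keywords to the three kinds.
KEYWORD_KIND = {
    "V": "what", "U": "what", "M": "what",
    "L": "where", "X": "where", "Y": "where", "Z": "where",
    "T": "when", "Ts": "when",
}

def split_by_kind(pairs):
    """Group (keyword, name) pairs into what, where and when."""
    groups = {"what": [], "where": [], "when": []}
    for keyword, name in pairs:
        groups[KEYWORD_KIND[keyword]].append((keyword, name))
    return groups

pairs = [
    ("V", "concentration"), ("U", "mol/l"), ("M", "CO2"),
    ("Ts", "2005.06.22 15:00:00"), ("L", "my_location"),
]
groups = split_by_kind(pairs)
```

A query can then constrain one group, for example the location, without saying anything about the other two.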
In the set of permissible keywords 38 shown in
A key feature of the VDL described in connection with
Reference numeral 40 denotes a set of components for describing a hierarchical location. The outmost component of the set of components 40 is called an environment. The environment may be the natural environment of a sample population or an individual, or it may determine the conditions of experiments. Environment can be registered as a component of a location. In general, the description of an environment may contain all the component classes smaller than the environment, such as populations, individuals, organisms, organs, tissues, cells, cellular compartments and molecules. If relevant, there can be progressively smaller location components hierarchically inside others.
A description of a location can be modelled to hold any set of relevant components from the following hierarchical levels of location: environment, population, individual, organism, organ, tissue, cell type and cellular compartment. Molecule classes are the most basic components that can be located in all upper level discrete locations. These levels correspond to main classes of biochemical entities. There may be hierarchical categories of biochemical entities at each main level of components. Each location instance specifies relevant instances of relevant hierarchical levels. Reference numeral 41 denotes an instance of a hierarchical location which is expressed in terms of the set of components 40. Reference numeral 42 is an even more specific location instance which further defines the location 41 by a three-dimensional coordinate system {X, Y, Z}.
Comparability of different locations is supported by standardized main levels of location concept and available ontologies at least for some of the levels.
The hierarchical location information provides certain advantages. For example, a location information may be arbitrarily specific, down to spatial coordinates within a cell, yet searchable by queries which express the location in any hierarchical level, such as “heart”, “human” or “human heart”. In other words, the hierarchical location information can be seen as a mechanism for zooming in and out within the component structures. Component data, system data and state data can be applied at all different levels of systems.
In addition to the function categories, there may be location categories and/or process categories. Location categories indicate where the entities associated with the category are located or what they are part of. Process categories indicate processes in which the entities associated with the category participate.
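The zooming behaviour of a hierarchical location may be sketched by matching a stored location against a query that names only some of the levels. The dictionary representation and the names LEVELS and matches are assumptions of the sketch; the stored values are illustrative.

```python
LEVELS = ("environment", "population", "individual", "organism",
          "organ", "tissue", "cell type", "cellular compartment")

def matches(location, query):
    """True if every level named in the query has the same value in the location."""
    for level in query:
        if level not in LEVELS:
            raise ValueError(f"unknown location level: {level}")
    return all(location.get(level) == value for level, value in query.items())

# A specific location, down to coordinates within a cellular compartment.
stored = {"organism": "human", "organ": "heart",
          "cellular compartment": "nucleus", "X": 0.5, "Y": 0.2, "Z": 0.5}

hit_heart = matches(stored, {"organ": "heart"})
hit_human_heart = matches(stored, {"organism": "human", "organ": "heart"})
hit_mouse = matches(stored, {"organism": "mouse"})
```

A coarse query such as “heart” finds the record even though the record itself is specified down to spatial coordinates.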
In the embodiment shown in
There is also a set of annotations 514. Each annotation has association relations, denoted by reference numeral 516, between a biochemical entity 518 and a category. Each biochemical entity 518 can be described by a hierarchy 520 of specifiers 521-529, whereby the biochemical entities can be described at any desired level of detail. For example, if the specifiers organism 524 and organ 525 are present, the biochemical entity can be a human heart or a feline eye. But further specifiers can be added to the hierarchy 520 to describe the biochemical entity in terms of a specific environment 521, population 522 or individual 523, or down to a detail level of a specific molecule 529.
The set of annotations 514 is collectively capable of forming a many-to-many relationship between the set of biochemical entities 518 and the set of categories 508. Such many-to-many relationships are shown in
The data structure shown in
Second, each connection 604 has an associated type element 610. The set of type values indicates the type of the connection. The set of type elements 612 includes at least substrate, product, outcome and controller. These types were previously described in connection with
Third, the biochemical entities 616 are described as hierarchies 618 which are composed of components, collectively denoted by reference numeral 620. A benefit of the hierarchical description of biochemical entities is the ability to describe the validity of pathways at any level of detail. For example, some pathways may be valid for any animals, while some are valid for only a specific organ or a specific individual.
Fourth, the pathway 602 has a relation to a specific location information 624. A location 624, which is separate from the biochemical entity 616, makes it possible to describe biochemical systems in which a biochemical entity is transferred to a location different from its native or original location. The location information may also comprise a hierarchy 626 composed of the components 620. But although the biochemical entity description 616 and the location hierarchy 626 are both hierarchical descriptions composed of the components 620, they are separate information structures, whereby the pathway shown in
Finally, not only biochemical entities 616 but also categories 606 are connected to interactions 608 by connection data elements 604. A benefit of this feature is that the pathways can be more generic. For example, this feature saves memory. If each of a number N molecules is capable of acting as a biochemical entity 616 in a pathway 602, there is no need to store N separate pathways. Instead, each of the N molecules is associated to a category 606, which is then used as a building block in the pathway 602.
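The saving obtained by using a category as a building block may be sketched as follows: one generic pathway is stored, and a specific pathway for any annotated molecule is derived on demand. The category name C, the entities M[1] to M[3], S and P, and the function instantiate are all hypothetical.

```python
# Category -> entities annotated to it (hypothetical annotations).
annotations = {"C": ["M[1]", "M[2]", "M[3]"]}

# One generic pathway stored once: (entity or category, connection type, interaction).
generic_pathway = [
    ("C", "controller", "I[1]"),
    ("S", "substrate", "I[1]"),
    ("P", "product", "I[1]"),
]

def instantiate(pathway, category, entity):
    """Derive a specific pathway by substituting one annotated entity for the category."""
    return [(entity if node == category else node, kind, interaction)
            for node, kind, interaction in pathway]

# N specific pathways are derived from the single stored generic pathway.
specific = [instantiate(generic_pathway, "C", e) for e in annotations["C"]]
```

Only the generic pathway and the N annotations need to be stored; the N specific pathways are computed when required.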
In addition to the above-described data elements, the data structure 600 describing a pathway may also include state data, which is collectively denoted by reference numeral 628. State data was previously described in connection with
For special high-volume cases, the interpretation logic may be automated.
In step 814 the interpretation logic produces the connections of pathways. In step 816 it creates the relevant library records.
In step 908 the interpretation logic identifies the connection types of the biochemical entities which are to be annotated to the present category. In steps 911, 912, 913 and 914, the connection type is respectively prepared as substrate, product, outcome or controller. In step 916 the interpretation logic complements the pathway with relevant connections and types between the category and the interaction (see item 612 in
FIGS. 10 to 13 illustrate the operation of various automation logics, namely annotation logic, instantiation logic, generalization logic and a consistency checker, when these logics are seen as “black boxes”. Flowcharts for implementing exemplary embodiments of these logics will be described later, in connection with FIGS. 14 to 17.
While each of the annotation logic 1000, instantiation logic 1100 and generalization logic 1200 is usable on its own, a combination of all three logics is particularly advantageous. In addition to these three logics, an advantageous embodiment of an information management system also comprises a consistency checker 1300, an embodiment of which is shown in
It should be understood that FIGS. 10 to 13 are simplified and only serve to illustrate the operation of these logics. In real-life situations, the general and specific pathways are typically much more complex than the simplified drawings shown in FIGS. 10 to 13. They also contain a far greater number of connections of various types which connect virtually any kinds of biochemical entities to any interactions. In addition to the inputs shown, the logics typically have a user interface via which a user may specify what operations to perform, what the input data set is, and so on.
In step 1404 the annotation logic identifies specific biochemical entities that appear to be valid replacements for the category. In step 1405 the annotation logic creates an annotation to the category for each identified biochemical entity (see item 510 in
When all connections have been processed the logic proceeds to step 1607 in which the logic creates a new functional category for the different entities having a controller type connection to interactions whose similarity meets a predetermined criterion. The new pathway in the new functional category is a generalization of the interactions and connections having similarity descriptors. For example, in case of
If full similarity is required, the new functional category is created only for interactions in which similar substrates are converted to similar products. If partial similarity is sufficient, the new functional category is created for interactions in which the combination of substrates and products differs in some respects.
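The difference between full and partial similarity may be sketched with a numeric similarity measure and a threshold. The use of a Jaccard overlap of substrate and product sets is an assumption made for illustration and is not stated above as the predetermined criterion; the names similarity and generalize and the sample interactions are likewise hypothetical.

```python
def similarity(a, b):
    """Jaccard overlap of the substrate and product sets (an assumed measure)."""
    sa = {("s", x) for x in a["substrates"]} | {("p", x) for x in a["products"]}
    sb = {("s", x) for x in b["substrates"]} | {("p", x) for x in b["products"]}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def generalize(interactions, threshold=1.0):
    """Collect controllers of sufficiently similar interactions into candidate categories."""
    groups = []
    for inter in interactions:
        for group in groups:
            if similarity(group["reference"], inter) >= threshold:
                group["members"].update(inter["controllers"])
                break
        else:
            groups.append({"reference": inter, "members": set(inter["controllers"])})
    # A new category is only worthwhile if it generalizes more than one entity.
    return [g["members"] for g in groups if len(g["members"]) > 1]

interactions = [
    {"substrates": {"A"}, "products": {"B"}, "controllers": {"E1"}},
    {"substrates": {"A"}, "products": {"B"}, "controllers": {"E2"}},
    {"substrates": {"A"}, "products": {"B", "C"}, "controllers": {"E3"}},
]
full = generalize(interactions)          # threshold 1.0: identical sets required
partial = generalize(interactions, 0.6)  # some difference tolerated
```

With full similarity only E1 and E2 share a new category; relaxing the threshold admits E3, whose interaction yields an additional product.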
In step 1701 the consistency checker receives an input which identifies a category. In step 1702 the consistency checker searches a general pathway from the pathway library, based on the category identification. Step 1703 is a test to check if a matching general pathway is found. If not, the consistency checker proceeds to step 1711 for reporting a missing category. Otherwise the consistency checker searches through the stored entities for annotations of the category. The test in step 1705 checks if a matching entity is found. If not, the category is reported as empty in step 1712. In step 1706 the consistency checker searches for specific pathways which contain the entity found in step 1705. If none are found, the missing entity is reported in step 1713. Otherwise the consistency checker proceeds to step 1710 for reporting that the category has a formal library description and that the annotated entities have consistent specific pathways.
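The sequence of tests described above may be sketched as a single function. The representation of the library and annotations as dictionaries and of specific pathways as sets of entity names is an assumption of the sketch, as are the function name check_category and the sample data.

```python
def check_category(category, library, annotations, specific_pathways):
    """Return a (status, details) report for one category."""
    if category not in library:                 # steps 1702 and 1703
        return ("missing category", [])         # step 1711
    entities = annotations.get(category, [])
    if not entities:                            # step 1705
        return ("empty category", [])           # step 1712
    missing = [e for e in entities              # step 1706
               if not any(e in pathway for pathway in specific_pathways)]
    if missing:
        return ("missing entity", missing)      # step 1713
    return ("consistent", [])                   # step 1710

# Hypothetical contents of the library, annotations and specific pathways.
library = {"binding": "general pathway 1850",
           "transporter activity": "general pathway 1840"}
annotations = {"binding": ["M1", "M2"]}
specific_pathways = [{"M1", "M3"}]

report_binding = check_category("binding", library, annotations, specific_pathways)
report_transport = check_category("transporter activity", library, annotations, specific_pathways)
report_unknown = check_category("catalytic activity", library, annotations, specific_pathways)
report_ok = check_category("binding", library, annotations, [{"M1", "M2"}])
```

Each of the four reporting steps corresponds to exactly one return statement, which makes the flow easy to compare with the description.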
Reference numerals 1810A and 1810B denote, respectively, a GO identifier and a plaintext description for “transporter activity” which is another example of “molecular function”. Reference numerals 1812A and 1812B denote a GO identifier and a plaintext description for “binding”.
Finally, reference numerals 1814A and 1814B denote a GO identifier and a plaintext description for “toll binding”. The definition for toll binding 1814B is interesting in that it is a subclass of both transporter activity 1810B and binding 1812B. This means that the definition for toll binding 1814B inherits features from two parents. This is possible because of the category binders 502 shown in
One of the exemplary molecular functions shown in the data structure 1800 was “catalytic activity”, denoted by reference numerals 1804A and 1804B. The GO definition for this function is: “Catalysis of a biochemical reaction at physiological temperatures. In biologically catalysed reactions, the reactants are known as substrates, and the catalysts are naturally occurring macromolecular substances known as enzymes. Enzymes possess specific binding sites for substrates, and are usually composed wholly or largely of protein, but RNA that has catalytic activity (ribozyme) is often also regarded as enzymatic.”
The above-mentioned verbal definition can be easily stored in the IMS database for the benefit of human users, but its meaning is incomprehensible to current computers. Reference numeral 1820 denotes an equivalent pathway for providing a formal definition for the “catalytic activity” in terms of data structures 600 (see
Another exemplary molecular function shown in the data structure 1800 was “adenylate cyclase activity”, denoted by reference numerals 1806A and 1806B. The GO definition for this function is: “Catalysis of the reaction: ATP=3′,5′-cyclic AMP+diphosphate”. Reference numeral 1830 denotes a pathway for providing a formal definition for the above definition for adenylate cyclase activity. In the pathway definition 1830, reference numeral 1831 denotes the interaction, reference numeral 1832 the controller function (in this example: catalysis), reference numeral 1833 denotes the substrate (ATP molecule), while reference numerals 1834 and 1835 denote the two products of the interaction, namely 3′,5′-cyclic AMP and diphosphate.
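As a minimal sketch, the pathway definition 1830 may be held as a list of typed connections from which the textual reaction of the GO definition can be regenerated. The dictionary layout, the function name reaction_equation and the labelling of the controller node by the category name are assumptions of the sketch.

```python
pathway_1830 = {
    "interaction": "1831",
    "connections": [
        # Assumption: the controller node is labelled with the category name.
        ("adenylate cyclase activity", "controller"),   # 1832
        ("ATP", "substrate"),                           # 1833
        ("3',5'-cyclic AMP", "product"),                # 1834
        ("diphosphate", "product"),                     # 1835
    ],
}

def reaction_equation(pathway):
    """Render the substrate and product connections as 'substrates = products'."""
    substrates = [n for n, kind in pathway["connections"] if kind == "substrate"]
    products = [n for n, kind in pathway["connections"] if kind == "product"]
    return " + ".join(substrates) + " = " + " + ".join(products)

equation = reaction_equation(pathway_1830)
```

Unlike the plaintext definition, the connection list also states explicitly which node acts as the controller, so a program can use it without interpreting natural language.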
Reference numeral 1840 denotes a pathway definition for the term “transporter activity”. The pathway definition 1840 is analogous to the pathway definition 1820, and a detailed description is omitted.
Reference numeral 1850 denotes a pathway definition for the term “binding”. Reference numeral 1851 denotes the interaction, reference numerals 1852 and 1853 denote the two substrate connections to the interaction 1851, while reference numeral 1854 denotes the product of the interaction.
Reference numeral 1860 denotes a pathway definition for the term “toll binding”. The GO definition for this term is: “Interacting selectively with the Toll protein, a transmembrane receptor”. In the pathway definition 1860, reference numeral 1861 denotes the interaction. The interaction 1861 has two substrate-type connections 1862 and 1864. The latter substrate-type connection leads from category “toll_binding” 1863, which also has a relation to location “transmembrane”. It is worth noting that the category “toll_binding” 1863 has a dual role in the pathway 1860 because the category 1863 also has a controller-type connection 1865 to the interaction 1861. The interaction 1861 has two product-type connections. Reference numeral 1866 denotes a product-type connection to category “product”, while reference numeral 1867 denotes the other product-type connection to category “bound receptor”, which has a relation to location “transmembrane”.
The set of state data 1920 for the pathway definition of “cell growth” comprises one boundary condition which states that variable 1921 (cell size at time T1) must be larger than variable 1922 (cell size at time T2) if variable 1923 (time T1) is larger than variable 1924 (time T2).
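A boundary condition of this kind may be sketched as a predicate over two observations of cell size. The function name growth_condition_holds and the numeric values are hypothetical; the two observations are simply called a and b.

```python
def growth_condition_holds(size_a, size_b, time_a, time_b):
    """If observation a is later than observation b, the cell must be larger at a."""
    if time_a > time_b:
        return size_a > size_b
    return True  # the condition places no constraint otherwise

grew = growth_condition_holds(12.0, 10.0, time_a=5, time_b=3)
shrank = growth_condition_holds(9.0, 10.0, time_a=5, time_b=3)
unconstrained = growth_condition_holds(9.0, 10.0, time_a=3, time_b=5)
```

State data that violates the predicate would not be consistent with the “cell growth” pathway definition.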
This definition reveals much about the nucleus but very little about the gene products which can be annotated to this category as “nucleus” gene products, if the GO guidelines are to be followed. While the above verbal definition describes several processes, they all take place outside the nucleus. Hence, a pathway definition 2000 for “nucleus” only contains a dummy product-type connection 2001 from an unspecified interaction to category “nucleus” 2002. This interpretation utilises a common part which applies to all biochemical entities annotated to this category. The common part is that all such biochemical entities are located in the nucleus. The pathway definition 2000 can be used to describe the structure and functionality of the nucleus itself. The simple pathway definition 2000 demonstrates the fact that the pathway definitions according to the invention tolerate pathways of which very little is known.
Finally,
It will be apparent to a person skilled in the art that, as the technology advances, the inventive concept can be implemented in various ways. The invention and its embodiments are not limited to the examples described above but may vary within the scope of the claims.
Acronyms
Number | Date | Country | Kind |
---|---|---|---|
20055510 | Sep 2005 | FI | national |
20055547 | Oct 2005 | FI | national |