The invention relates to an information management system (“IMS” for short) for managing biochemical information. More particularly, the invention relates to an IMS specially adapted to describe biochemical pathways.
Biological research produces tremendous amounts of data at a rate never before seen in any discipline of science. A general problem underlying the invention relates to the difficulties in organizing vast amounts of rapidly-varying information. An IMS can be free-form or structured. A well-known example of a free-form IMS is a local-area network of a research institute, in which information producers (researchers or the like) can enter information in an arbitrary format, using any of the commonly-available or proprietary application programs, such as word processors, spreadsheets, databases etc. A structured IMS means a system with system-wide rules for storing information in a unified database.
A specific problem underlying the invention relates to the fact that biochemical information is not valid everywhere. When a new piece of biochemical information is obtained, we don't know how widely that piece of information can be generalized. If a new reaction is discovered in a cultivation of murine cells of a specific type, we don't know if that reaction is able to describe other cell types.
An object of the present invention is to provide an information management system (later abbreviated as “IMS”) for solving the above problem. In other words, the object of the invention is to provide an IMS for storing biochemical information such that there is a systematic way to describe where each piece of information is valid. The object of the invention is achieved by an IMS which is characterized by what is stated in the independent claims. The preferred embodiments of the invention are disclosed in the dependent claims.
The problem is solved by providing an IMS with an explicit data element for location information. The data element for location information is preferably hierarchical. A preferred hierarchy comprises five levels of increasing detail: organism—organ—tissue—cell type—cellular compartment.
A further preferred embodiment increases the level of detail by storing a sixth level of hierarchy, namely a spatial point within a cellular compartment. Expressing a spatial point within a cell or cellular compartment is no trivial task because the shape of cells varies. Some cells resemble a sphere, some look like a brick, etc. For such cells, a polar or Cartesian coordinate system, respectively, can be used. But a simple polar or Cartesian coordinate system is clearly insufficient for nerve cells whose shape is extremely complex. Accordingly, the IMS preferably stores several spatial reference models, and the spatial point is expressed as a relevant area of a specific reference model. The location information may even be a combination of a specific reference model, an area within the specific model plus a coordinate set within that area.
Because the location information is hierarchical, the IMS tolerates incomplete information, as opposed to some systems that store the location implicitly, as part of the name of each biochemical entity, such as “murine_P53”. When a new piece of biochemical information is obtained, we can store the location information that matches the experiment from which the piece of information was obtained. Later, when more information is obtained, the location information can be further generalized or specified.
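The tolerance for incomplete location information can be illustrated with a minimal sketch. The class name, field names and example values below are assumptions made for illustration only; the sketch merely shows how unspecified hierarchy levels might act as wildcards when two locations are compared.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical level names, ordered from coarsest to most detailed.
LEVELS = ("organism", "organ", "tissue", "cell_type", "compartment")

@dataclass(frozen=True)
class Location:
    """A location in which any level may be left unspecified (None)."""
    organism: Optional[str] = None
    organ: Optional[str] = None
    tissue: Optional[str] = None
    cell_type: Optional[str] = None
    compartment: Optional[str] = None

    def covers(self, other: "Location") -> bool:
        """True if every level specified here has the same value in `other`.
        Levels left as None are treated as wildcards."""
        for level in LEVELS:
            mine = getattr(self, level)
            if mine is not None and mine != getattr(other, level):
                return False
        return True

# Illustrative values only.
murine_general = Location(organism="mouse")
murine_liver_cell = Location(organism="mouse", organ="liver", cell_type="hepatocyte")
human_liver = Location(organism="human", organ="liver")
```

In this sketch a piece of information recorded for `murine_liver_cell` can later be generalized by replacing its location with `murine_general`, or specified further by filling in the tissue level, without renaming any biochemical entity.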
An IMS according to the invention is preferably capable of storing information about populations, individuals, reagents or samples of other biomaterials (anything that can be studied as a biological/biochemical system or its component).
The IMS preferably comprises an experiment database. An experiment can be a real-life experiment (“wet lab”) or a simulated experiment (“in-silico”).
According to a preferred embodiment of the invention, both experiment types produce data sets, such that each data set comprises:
Numerical values of each experiment are preferably stored, as scalar numbers, in a variable value matrix having a row-column organization. Such row-column matrixes can be further processed with a wide variety of off-the-shelf or proprietary application programs. There are separate row and column description lists for describing, respectively, the meaning of the rows and columns in the variable value matrix. A separate fixed dimension description describes the fixed dimensions that are common to all values in the variable value matrix. The row and column description lists, as well as the fixed dimension description, are written in a variable description language in order to link arbitrary variable values to the structured information of the IMS.
A benefit achieved by the use of the variable description language (=VDL) is that the IMS is largely self-sufficient. Little or no external information is needed to interpret the numerical values. Also, it is a relatively straightforward task to force an automated syntax check on the variable expressions. An essential feature of the VDL is that it permits the description of variables in varying detail level. For example, the VDL may describe a variable in terms of biomaterial (population—individual—sample; organism—organ—tissue, cell type, etc.), physical quantities and time, but we may omit details that are not essential to our current context.
XML (eXtensible Markup Language) is a well-known example of a language that can be used as a variable description language. A problem with XML is, however, that it is intended to describe virtually any structured information, which results in lengthy expressions that are poorly readable by humans. Accordingly, a preferred embodiment of the invention relates to a variable description language that is better suited to describing biological variables than XML is. Also, expressions in XML and its biological or mathematical variants, such as SBML (Systems Biology Markup Language) or CellML (Cell Markup Language) or MathML (Mathematical Markup Language), are generally too long or complex to serve as self-documenting symbols for describing biological variables in mathematical models. Accordingly, a further preferred embodiment of the invention comprises a compact but extendible VDL that overcomes the problems of XML and its variants.
A benefit achieved by storing the numerical values as a scalar matrix is that the matrix can be analyzed with many commercially available data-mining tools, such as self-organizing maps or other clustering algorithms, that do not readily process dimensioned values. Accordingly, the row and column descriptions are stored separately. A benefit achieved by the use of a third list, namely the fixed dimension description, is that dimensions common to rows and columns need not be duplicated in the row and column description lists.
The processing speed of the IMS can be increased by storing each data set (each data set comprising a variable value matrix, row and column description lists and a fixed dimension description) as a container for data, and storing only an address or identifier of the container in a database. Assuming that SQL (structured query language) or other database queries are used to retrieve the data sets, the single-container approach dramatically reduces the number of individual data items to be processed by SQL queries. When individual data elements are needed, the entire container can be processed with a suitable tool, such as a spreadsheet or flat-file database system.
According to another preferred embodiment of the invention, the IMS further comprises a biochemical entity database containing objects or tables. The variable description language comprises variable descriptions, each variable description comprising one or more pairs of keyword and name. For each object or table of the biochemical entity database, there is a keyword that references that object or table. This embodiment facilitates automated syntax or other checks made to information to be stored.
A further advantage of the data sets as described herein is good support for well-defined contexts. A context defines the scope of an experiment, either wet-lab or in-silico. Each context is defined in terms of biomaterials, variables and time.
In the following the invention will be described in greater detail by means of preferred embodiments with reference to the attached drawings, in which
The server (or set of servers) S also comprises various data processing tools for data analysis, visualization, data mining, etc. A benefit of storing the data sets as containers in a row-column organization (instead of addressing each data item separately by SQL queries) is that such data sets of rows and columns can easily be processed with commercially available analysis or visualization tools. Before describing embodiments for the actual invention, i.e., the IMS for managing workflows and software tools, preferred embodiments for describing biochemical data will be described in connection with FIGS. 2 to 11B. Detailed embodiments of the IMS for managing workflows and software tools will be described in connection with
Data Sets
Data sets 202 describe the numerical values stored in the IMS. Each data set is comprised of a variable set, biomaterial information and time organized in
The variable description language binds syntactical elements and semantic objects of the information model together, by describing what is quantified in terms of variables (eg count, mass, concentration), units (eg pieces, kg, mol/l), biochemical entities (eg specific transcript, specific protein, specific compound) and a location where the quantification is valid (eg human—eyelid_epith_nuc) in a multi-level location hierarchy of biomaterials (eg environment, population, individual, reagent, sample, organism, organ, tissue, cell type) and relevant expressions of time when the quantification is valid.
Note that there are many-to-many relationships from the base variables/units section 204 and the time section 206 to the data set section 202. This means that each data set 202 typically comprises one or more base variable/units and one or more time expressions. There is a many-to-many relationship between the data set section 202 and the experiments section 208, which means that each data set 202 relates to one or more experiments 208, and each experiment relates to one or more data sets 202. A preferred implementation of the data sets section will be further described in connection with
The base variables/units section 204 describes the base variables and units used in the IMS. In a simple implementation, each base variable record comprises a unit field, which means that each base variable (eg mass) can be expressed in one unit only (eg kilograms). In a more flexible embodiment, the units are stored in a separate table, which permits expressing base variables in multiple units, such as kilograms or pounds.
Base variables are variables that can be used as such, or they can be combined to form more complex variables, such as the concentration of a compound in a specific sample at a specific point of time.
The time section 206 stores the time components of the data sets 202. Preferably, the time component of a data set comprises a relative (stopwatch) time and absolute (calendar) time. For example, the relative time can be used to describe the speed with which chemical reactions take place. There are also valid reasons for storing absolute time information along with each data set. The absolute time indicates when, in calendar time, the corresponding event took place. Such absolute time information can be used for calculating relative time between any experimental events. It can also be used for troubleshooting purposes. For example, if a faulty instrument is detected at a certain time, experiments made with that instrument prior to the detection of the fault should be checked.
The experiments section 208 stores all experiments known to the IMS. There are two major experiment types, commonly called wet-lab and in-silico. But as seen from the point of view of the data sets 202, all experiments look the same. The experiments section 208 acts as a bridge between the data sets 202 and the two major experiment types. In addition to experiments already carried out, the experiments section 208 can be used to store future experiments. Preferred object-based implementations of experiments will be described in connection with
The biomaterial section 210 stores information about populations, individuals, reagents or samples of other biomaterials (anything that can be studied as a biochemical system or its component) in the IMS. Preferably, the biomaterials are described in data sets 202, by using the VDL to describe each biomaterial hierarchically, or in varying detail level, such as in terms of population, individual, reagent and sample. A preferred object-based implementation of the biomaterials section 210 will be described in connection with
While the biomaterial section 210 describes real-world biomaterials, the pathway section 212 describes theoretical models of biomaterials. Biochemical pathways are somewhat analogous to circuit diagrams of electronic circuits. There are several ways to describe pathways in an IMS, but
The biochemical entities are stored in a biochemical entity section 218. In the example shown in
A database reference section 220 acts as a bridge to external databases. Each database reference in section 220 is a relation between an internal biochemical entity 218 and an entity of an external database, such as a specific probe set of Affymetrix, Inc.
The interactions section 222 stores interactions, including reactions, between the various biochemical entities. The kinetic law section 224 describes kinetic laws (hypothetical or experimentally verified) that affect the interactions. Preferred and more detailed implementations of pathways will be described in connection with
According to a preferred embodiment of the invention, the IMS also stores multi-level location information 214. The multi-level location information is referenced by the biomaterial section 210 and the pathway section 212. For instance, as regards information relating to biomaterials, the organization shown in
According to a further preferred embodiment of the invention, the location information can also comprise spatial information 214-6, such as a spatial point within the most detailed location in the organism-to-cell hierarchy. If the most detailed location indicates a specific cell or cellular compartment, the spatial point may further specify that information in terms of relative spatial coordinates. Depending on cell type, the spatial coordinates may be Cartesian or polar coordinates. Spatial points will be further discussed in connection with
In addition to the six levels of location hierarchy shown in
A benefit of this kind of location information is an improved and systematic way to compare locations of samples and locations of theoretical constructs like pathways that need to be verified by relevant measurement results.
The multi-level location hierarchy shown in
Variable Description Language
eXtensible Markup Language (XML) is one example of an extendible language that could, in principle, be used to describe biochemical variables. XML expressions are rather easily interpretable by computers. However, XML expressions tend to be very long, which makes them poorly readable to humans. Accordingly, there is a need for an extendible VDL that is more compact and more easily readable to humans and computers than XML is.
The idea of an extendible VDL is that the allowable variable expressions are “free but not chaotic”. To put this idea more formally, we can say that the IMS should only permit predetermined variables but the set of predetermined variables should be extendible without programming skills. For example, if a syntax check to be performed on the variable expressions is firmly coded in a syntax check routine, any new variable expression requires reprogramming. An optimal compromise between rigid order and chaos can be implemented by storing permissible variable keywords in a data structure, such as a data table or file, that is modifiable without programming. Normal access grant techniques can be employed to determine which users are authorized to add new permissible variable keywords.
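As a hedged illustration of keeping the permissible keywords in data rather than in code, the sketch below loads a keyword table from CSV text. The table contents, column names and function name are assumptions; the keywords themselves are taken from expressions used elsewhere in this description, and their meanings are given only as a plausible reading.

```python
import csv
import io

# Hypothetical contents of the keyword table. In practice this would be a
# file or database table that authorized users can edit without programming.
KEYWORD_TABLE_CSV = """keyword,meaning
V,variable
C,compound
P,protein
I,interaction
T,relative time
Ts,absolute time
L,location
"""

def load_keywords(text: str) -> dict:
    """Read keyword -> meaning pairs from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["keyword"]: row["meaning"] for row in reader}

keywords = load_keywords(KEYWORD_TABLE_CSV)

# Extending the language is a change to the data, not to the program.
extended = load_keywords(KEYWORD_TABLE_CSV + "Tr,transcript\n")
```

With this arrangement a syntax check routine only needs to consult the loaded table, so adding the `Tr` keyword requires no reprogramming.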
As regards the syntax of the language, a variable description may comprise an arbitrary number of keyword-name pairs 31. But an arbitrary combination of pairs 31, such as a concentration of time, may not be semantically meaningful.
The T and Ts keywords implement the relative (stopwatch) time and absolute (calendar) time, respectively. A slight disadvantage of expressing time as a combination of relative and absolute time is that each point of time has a theoretically infinite set of equivalent expressions. For example, “Ts[2002-11-26 18:00:30]” and “Ts[2002-11-26 18:00:00]T[00:30]” are equivalent. Accordingly, there is preferably a search logic that processes the expressions of time in a meaningful manner.
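One possible form of such search logic is to reduce every combination of absolute and relative time to a single point of time before comparison. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the absolute time is assumed to be written as year-month-day hours:minutes:seconds, and the relative time is assumed to be minutes:seconds.

```python
from datetime import datetime, timedelta
from typing import Optional

def resolve_time(ts: str, t: Optional[str] = None) -> datetime:
    """Combine an absolute time stamp with an optional relative offset.

    Assumes ts is 'YYYY-MM-DD HH:MM:SS' and t is 'MM:SS'.
    """
    absolute = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    if t is None:
        return absolute
    minutes, seconds = (int(part) for part in t.split(":"))
    return absolute + timedelta(minutes=minutes, seconds=seconds)

# Two equivalent expressions of the same point of time.
a = resolve_time("2002-11-26 18:00:30")
b = resolve_time("2002-11-26 18:00:00", "00:30")
```

Comparing the resolved values instead of the raw expressions lets a search treat all equivalent expressions of a point of time as equal.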
By storing an entry for each permissible keyword in the table 38 within the IMS, it is possible to force an automatic syntax check on variables to be entered, as will be shown in
The syntax of the preferred VDL may be formally expressed as follows:
<variable description>::=<keyword>“[”<name>“]”{{separator}<keyword>“[”<name>“]”}<end>
<keyword>::=<one of predetermined keywords, see eg table 38>
<name>::=<character string>|“*” for any name in a relevant data table
The purpose of explicit delimiters, such as “[” and “]” around the name is to permit any characters within the name, including spaces (but excluding the delimiters, of course).
A preferred set of keywords 38 comprises three kinds of keywords: what, where and when. The “what” keywords, such as variable, unit, biochemical entity, interaction, etc., indicate what was or will be observed. The “where” keywords, such as sample, population, individual, location, etc., indicate where the observation was or will be made. The “when” keywords, such as time or time stamp, indicate the time of the observation.
After the opening delimiter, any characters except a closing delimiter are accepted as parts of the name, and the state machine remains in the second intermediate state 306. Only a premature ending of the variable expression causes a transition to an error state 312. A closing delimiter causes a transition to a third intermediate state 308, in which one keyword/name pair has been validly detected. A valid separator character causes a return to the first intermediate state 304. Detecting the end of the variable expression causes a transition to “OK” state 310 in which the variable expression is deemed syntactically correct.
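The state transitions described above can be sketched as a small state machine. This is an illustrative sketch only: the state names, the keyword subset and the choice of a space as separator are assumptions, and the separator is treated as optional because expressions elsewhere in this description concatenate keyword/name pairs directly.

```python
# Illustrative subset of permissible keywords; the real set is in the keyword table.
ALLOWED_KEYWORDS = {"V", "C", "P", "G", "I", "Tr", "T", "Ts", "L"}
SEPARATORS = {" "}  # assumed separator character

def check_syntax(expression: str) -> bool:
    """Return True if the expression is a valid sequence of keyword[name] pairs."""
    state = "keyword"   # first intermediate state: collecting a keyword
    keyword = ""
    for ch in expression:
        if state == "keyword":
            if ch == "[":
                if keyword not in ALLOWED_KEYWORDS:
                    return False          # unknown keyword: error state
                state = "name"            # second intermediate state
            elif ch == "]":
                return False              # closing delimiter without an opening one
            else:
                keyword += ch
        elif state == "name":
            if ch == "]":
                state = "pair_done"       # third intermediate state
            # any other character is accepted as part of the name
        else:  # state == "pair_done"
            if ch in "[]":
                return False
            # a separator starts a fresh keyword; any other character begins one
            keyword = "" if ch in SEPARATORS else ch
            state = "keyword"
    # Ending anywhere except after a complete pair is a premature ending.
    return state == "pair_done"
```

For example, `check_syntax("V[concentration]C[GTP]")` is accepted, whereas `check_syntax("V[concentration")` ends prematurely inside a name and is rejected.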
Note that regardless of the language of humans using the IMS, it is beneficial to agree on one language for the variable expressions. Alternatively, the IMS may comprise a translation system to translate the variable expressions to various human languages.
The VDL substantially as described above is well-defined because only expressions that pass the syntax check shown in
Data Contexts
a), b) and c) are projections of d) which is the richest representation of the system. All data in the IMS exists in a three-dimensional context space that has relations to:
Reference numeral 500 generally denotes the N+2 dimensional context space having one axis for each of variables (N), biomaterials and time. A very detailed variable expression 510 specifies a variable (concentration of mannose in moles/l), biomaterial (population abcd1234) and a timestamp (10 Jun. 2003 at 12:30). The value of the variable is 1.3 moles/l. Since the variable expression 510 specifies all the coordinates in the context space, it is represented by a point 511 in the context space 500.
The next variable expression 520 is less detailed in that it does not specify time. Accordingly, the variable expression 520 is represented by a function 521 of time in the context space 500.
The third variable expression 530 does specify time but not biomaterial. Accordingly, it is represented by a distribution 531 of all biomaterials belonging to the experiment at the specified time.
The fourth variable expression 540 specifies neither time nor biomaterial. It is represented by a set 541 of functions of time and a set 542 of distributions for the various biomaterials.
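The four cases above can be summarized by a small classification routine. The function name and the returned labels are assumptions chosen for illustration; the sketch only shows how the geometric interpretation follows from which coordinates a variable expression specifies.

```python
from typing import Optional

def context_shape(biomaterial: Optional[str], time: Optional[str]) -> str:
    """Describe how a variable expression is represented in the context space,
    depending on which of biomaterial and time it specifies."""
    if biomaterial is not None and time is not None:
        return "point"
    if biomaterial is not None:
        return "function of time"
    if time is not None:
        return "distribution over biomaterials"
    return "set of functions of time and distributions over biomaterials"

# The four kinds of expression discussed in the text.
shape_510 = context_shape("abcd1234", "2003-06-10 12:30")
shape_520 = context_shape("abcd1234", None)
shape_530 = context_shape(None, "2003-06-10 12:30")
shape_540 = context_shape(None, None)
```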
By means of the various expressions made possible by the variable description language and suitably-organized data sets (to be described next), researchers have virtually unlimited possibilities to study the time-state space of a biochemical system as a multidimensional stochastic process. The probabilistic aspects of the system are based on the event space of relevant biomaterials, and the dynamic aspects are based on the time-space. Biomaterial data and time can be registered when the relevant experiments are documented.
All quantitative measurements, data analyses, models and simulation results can be reused in new analysis techniques to find relevant background information, such as phenotypes of measured biomaterials when the data needs to be interpreted for various applications.
Data Sets
The division of each data set (eg data set 610) into four different components (the matrixes 611 to 614) can be implemented so that each matrix 611 to 614 is a separately addressable data structure, such as a file in the computer's file system. Alternatively, the variable value matrix can be stored in a single addressable data structure, while the remaining three matrixes (the fixed dimension description and the row/column descriptors) can be stored in a second data structure, such as a single file with headings “common”, “rows” and “columns”. A key element here is the fact that the variable value matrix is stored in a separate data structure because it is the component of the data set that holds the actual numerical values. If the numerical values are stored in a separately addressable data structure, such as a file or table, it can be easily processed by various data processing applications, such as data mining or the like. Another benefit is that the individual data elements that make up the various matrixes need not be processed by SQL queries. An SQL query only retrieves an address or other identifier of a data set but not the individual data elements, such as the numbers and descriptions within the matrixes 611 to 614.
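The container approach can be sketched as follows. This is a minimal illustration under stated assumptions: the table name, file names, function names and numerical values are hypothetical, the value matrix is written as a CSV file, the three descriptions are written to a second file, and an in-memory SQLite database stands in for the IMS database.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

def store_data_set(db, directory, name, values, rows, columns, common):
    """Write the value matrix and its descriptions to files; index only the path."""
    folder = Path(directory) / name
    folder.mkdir()
    with open(folder / "values.csv", "w", newline="") as f:
        csv.writer(f).writerows(values)            # scalar numbers only
    (folder / "descriptions.json").write_text(
        json.dumps({"common": common, "rows": rows, "columns": columns}))
    db.execute("INSERT INTO data_sets (name, path) VALUES (?, ?)", (name, str(folder)))

def load_values(db, name):
    """The SQL query returns only the container address; the matrix is read from file."""
    (path,) = db.execute(
        "SELECT path FROM data_sets WHERE name = ?", (name,)).fetchone()
    with open(Path(path) / "values.csv", newline="") as f:
        return [[float(x) for x in row] for row in csv.reader(f)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data_sets (name TEXT PRIMARY KEY, path TEXT)")
workdir = tempfile.mkdtemp()

# Illustrative numbers, not measurement results.
store_data_set(db, workdir, "set610", [[1.3, 2.1], [0.7, 1.9]],
               rows=["T[00:00]", "T[00:30]"],
               columns=["C[GTP]", "C[GDP-D-mannose]"],
               common="V[concentration]")
values = load_values(db, "set610")
```

The database holds one row per data set regardless of the size of the matrix, and the CSV file can be opened directly by spreadsheet or data-mining tools.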
In the example of
The matrixes 630 and 634 shown in
Pathways
As shown in
In an object-based implementation, the biochemical pathway model is based on three categories of objects: biochemical entities (molecules) 218, interactions (chemical reactions, transcription, translation, assembly, disassembly, translocation, etc) 222, and connections 216 between the biochemical entities and interactions for a pathway. The idea is to separate these three objects in order to use them with their own attributes and to use the connection to hold the role (such as substrate, product, activator or inhibitor) and stoichiometric coefficients of each biochemical entity in each interaction that takes place in a particular biochemical network. A benefit of this approach is the clarity of the explicit model and easy synchronization when several users are modifying the same pathway connection by connection. The user interface logic can be designed to provide easily understandable visualizations of the pathways, as will be shown in connection with
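The separation into three categories of objects might be sketched as shown below. The class and attribute names are assumptions for illustration; the entity and interaction names are taken from the example used later in this description, and the stoichiometric coefficient of the substrate is assumed to be 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    kind: str            # eg "compound", "protein", "transcript"

@dataclass(frozen=True)
class Interaction:
    name: str

@dataclass(frozen=True)
class Connection:
    """Holds the role and stoichiometric coefficient of one entity in one interaction."""
    entity: Entity
    interaction: Interaction
    role: str            # eg "substrate", "product", "activator", "inhibitor"
    stoichiometry: float = 1.0

gtp = Entity("GTP", "compound")
gdp_mannose = Entity("GDP-D-mannose", "compound")
reaction = Interaction("EC2.7.7.14_PSA1")

# A pathway is then a collection of connections.
pathway = [
    Connection(gtp, reaction, "substrate"),        # coefficient assumed to be 1
    Connection(gdp_mannose, reaction, "product"),
]

def entities_with_role(connections, interaction, role):
    """Names of the entities that play the given role in the given interaction."""
    return [c.entity.name for c in connections
            if c.interaction == interaction and c.role == role]
```

Because the role and the coefficient live on the connection, the same entity can be a product of one interaction and a substrate of another without any change to the entity object itself.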
The kinetic law section 224 describes theoretical or experimental kinetic laws that affect the interactions. For example, a flux from a substrate to a chemical reaction can be expressed by the following formula:
V=Vmax•[S]•[E]/(K+[S])
wherein V is the flux rate of the substrate, Vmax and K are constants, [S] is the substrate concentration and [E] is the enzyme concentration. The reaction rate through the interaction can be calculated by dividing the flux by the stoichiometric coefficient of the substrate. Conversely, each kinetic law represents the reaction rate of an interaction, whereby any particular flux can be calculated by multiplying the reaction rate by the stoichiometric coefficients of the particular connections. The above kinetic law as the reaction rate of interaction EC2.7.7.14_PSA1 in
V[rate]I[EC2.7.7.14_PSA1]=Vmax•V[concentration]C[GTP]•V[concentration]P[PSA1]/(K+V[concentration]C[GTP])
The flux from interaction EC2.7.7.14_PSA1 to compound GDP-D-mannose can be expressed in VDL as follows:
V[flux]I[EC2.7.7.14_PSA1]C[GDP-D-mannose]=
c1•V[rate]I[EC2.7.7.14_PSA1]=Vmax•V[concentration]C[GTP]•V[concentration]P[PSA1]/(K+V[concentration]C[GTP]),
where c1 is the stoichiometric coefficient of the connection from interaction EC2.7.7.14_PSA1 to compound GDP-D-mannose and c1=1. In the above example, the kinetic law is a continuous function of variables V[concentration]C[GTP] and V[concentration]P[PSA1]. In addition, a proper description of some pathways requires discontinuous kinetic laws.
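The continuous kinetic law and the flux derived from it can be evaluated numerically as in the following sketch. The function names and all numerical values are illustrative assumptions; only the form of the law and the coefficient c1=1 come from the text.

```python
def reaction_rate(vmax, k, substrate, enzyme):
    """Reaction rate of the form Vmax * [S] * [E] / (K + [S])."""
    return vmax * substrate * enzyme / (k + substrate)

def flux(coefficient, rate):
    """Flux of a connection: stoichiometric coefficient times reaction rate."""
    return coefficient * rate

# Hypothetical variable values keyed by their VDL expressions.
variables = {
    "V[concentration]C[GTP]": 2.0,
    "V[concentration]P[PSA1]": 0.5,
}

rate = reaction_rate(vmax=4.0, k=2.0,
                     substrate=variables["V[concentration]C[GTP]"],
                     enzyme=variables["V[concentration]P[PSA1]"])
flux_to_gdp_mannose = flux(1.0, rate)   # c1 = 1 as stated in the text
```

Using the VDL expressions as dictionary keys shows how the variable names in a kinetic law can be bound directly to stored values.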
The kinetic law as the reaction rate of interaction X in
V[rate]I[X]=
k IF V[count]G[A]>0 AND V[count]P[B]>0 AND V[count]C[RNA]>0 ELSE 0
The flux from interaction X to transcript mRNA can be expressed in VDL as follows:
V[flux]I[X]Tr[mRNA]=
c2•V[rate]I[X]=
k IF V[count]G[A]>0 AND V[count]P[B]>0 AND V[count]C[RNA]>0 ELSE 0
where c2 is the stoichiometric coefficient of the connection from interaction X to transcript mRNA and c2=1.
Let the flux from interaction Y to compound RNA in
V[flux]I[Y]C[RNA]=
c3•V[rate]I[Y]=c3•k2•V[count]Tr[mRNA]
where c3 is the stoichiometric coefficient of the connection from interaction Y to compound RNA and k2 is another constant of this kinetic law.
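The discontinuous law for interaction X and the flux from interaction Y can be evaluated as in the sketch below. The function names and the numerical values are assumptions for illustration; the conditions and the form of the expressions follow the VDL expressions given above.

```python
def rate_x(counts, k):
    """k if the counts of G[A], P[B] and C[RNA] are all positive, otherwise 0."""
    present = all(counts[name] > 0 for name in ("G[A]", "P[B]", "C[RNA]"))
    return k if present else 0.0

def flux_y_to_rna(counts, c3, k2):
    """Flux from interaction Y to compound RNA: c3 * k2 * count of Tr[mRNA]."""
    return c3 * k2 * counts["Tr[mRNA]"]

# Illustrative counts only.
counts = {"G[A]": 1, "P[B]": 5, "C[RNA]": 100, "Tr[mRNA]": 4}
active = rate_x(counts, k=2.0)
inactive = rate_x({**counts, "P[B]": 0}, k=2.0)
rna_flux = flux_y_to_rna(counts, c3=1.0, k2=0.5)
```

The rate of interaction X jumps between 0 and k as soon as any one of the three counts reaches zero, which is what makes the law discontinuous.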
Each variable represented in the kinetic laws may be specified with a particular location L[ . . . ] if the concentration or count of a biochemical entity depends on a particular location.
A biochemical network may not be valid everywhere. In other words, the network is typically location-dependent. That is why there are relations between pathways 212 and biologically relevant discrete locations 214, as shown in
A complex pathway can contain other pathways 700. In order to connect different pathways 700 together, the model supports pathway connections 702, each of which has up to five relations which will be described in connection with
Pathway A, denoted by reference sign 711, is a main pathway to pathways B and C, denoted by reference signs 712 and 713, respectively. The pathways 711 to 713 are basically similar to the pathway 700 described above. There are two pathway connections 720 and 730 that couple the pathways B and C, 712 and 713, to the main pathway A, 711. For instance, pathway connection 720 has a main-pathway relation 721 to pathway A, 711; a from-pathway relation 722 to pathway B, 712; and a to-pathway relation 723 to pathway C, 713. In addition, it has common-entity relations 724, 725 to pathways B 712 and C 713. In plain language, the common-entity relations 724, 725 mean that pathways B and C share the biological entity indicated by the relations 724, 725.
The other pathway connection 730 has both main-pathway and from-pathway relations to pathway A 711, and a to-pathway relation to pathway C, 713. In addition, it has common-interaction relations 734, 735 to pathways B, 712 and C, 713. This means that pathways B and C share the interaction indicated by the relations 734, 735.
The pathway model described above supports incomplete pathway models that can be built gradually, along with increasing knowledge. Researchers can select detail levels as needed. Some pathways may be described in a relatively coarse manner. Other pathways may be described down to kinetic laws and/or spatial coordinates. The model also supports incomplete information from existing gene sequence databases. For example, some pathway descriptions may describe gene transcription and translation separately, while others treat them as one combined interaction. Each amino acid may be treated separately or all amino acids may be combined to one entity called amino acids.
The pathway model also supports automatic modelling processes. Node equations can be generated automatically for time derivatives of concentrations of each biochemical entity when relevant kinetic laws are available for each interaction. As a special case, stoichiometric balance equations can be automatically generated for flux balance analyses. The pathway model also supports automatic end-to-end workflows, including extraction of measurement data via modelling, inclusion of additional constraints and solving of equation groups, up to various data analyses and potential automatic annotations.
Automatic pathway modelling can be based on pathway topology data, the VDL expressions that are used to describe variable names, the applicable kinetic laws and mathematical or logical operators and functions. Parameters not known precisely can be estimated or inferred from the measurement data. Default units can be used in order to simplify variable description language expressions.
If the kinetic laws are continuous functions of VDL variables, the quantitative variables (eg concentration) of biochemical entities can be modelled as ordinary differential equations of these quantitative variables. The ordinary differential equations are formed by setting a time derivative of the quantitative variable of each biochemical entity equal to the sum of fluxes coming from all interactions connected to the biochemical entity and subtracting all the outgoing fluxes from the biochemical entity to all interactions connected to the biochemical entity.
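The automatic formation of the differential equations can be sketched as follows. The pathway, the kinetic laws and all numbers are hypothetical; the sketch only illustrates that the time derivative of each entity is the sum of incoming fluxes minus the sum of outgoing fluxes, with each flux being a stoichiometric coefficient times a reaction rate.

```python
from collections import defaultdict

# (entity, interaction, role, stoichiometric coefficient); illustrative pathway.
connections = [
    ("A", "r1", "substrate", 1.0),
    ("B", "r1", "product", 1.0),
    ("B", "r2", "substrate", 1.0),
    ("C", "r2", "product", 2.0),
]

# Hypothetical continuous kinetic laws: reaction rate as a function of concentrations.
kinetic_laws = {
    "r1": lambda c: 0.5 * c["A"],
    "r2": lambda c: 0.25 * c["B"],
}

# Products receive flux from an interaction; substrates lose flux to it.
SIGN = {"product": 1.0, "substrate": -1.0}

def derivatives(concentrations):
    """Time derivative of each entity: incoming fluxes minus outgoing fluxes."""
    rates = {name: law(concentrations) for name, law in kinetic_laws.items()}
    dcdt = defaultdict(float)
    for entity, interaction, role, coefficient in connections:
        sign = SIGN.get(role)
        if sign is None:
            continue                      # eg activators carry no flux
        dcdt[entity] += sign * coefficient * rates[interaction]
    return dict(dcdt)

result = derivatives({"A": 4.0, "B": 2.0, "C": 0.0})
```

The right-hand sides produced this way can be passed to any ordinary differential equation solver together with initial conditions.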
On the other hand, if the kinetic laws are discontinuous functions of VDL variables, the quantitative variables (eg concentration or count) of biochemical entities can be modelled as difference equations of these quantitative variables. The difference equations are formed by setting the difference of the quantitative variable of each biochemical entity in two time points equal to the sum of the incoming quantities from all interactions connected to the biochemical entity and subtracting all the outgoing quantities from the biochemical entity to all interactions connected to the biochemical entity in the time interval between the time points of the difference.
If there are both continuous and discontinuous kinetic laws associated with the interactions that connect to a biochemical entity, a difference equation is written for the biochemical entity such that continuous or discontinuous fluxes are added or subtracted depending on the direction of each connection.
In this way a complete “hybrid” equation system can be generated for simulation purposes with given initial or boundary conditions. Initial conditions and boundary conditions can be represented by the data sets described above (see
In the differential and difference equations described above, the biochemical entity-specific fluxes can be replaced by reaction rates multiplied by stoichiometric coefficients.
In a static case, the derivatives and differences are zeros. This leads to a flux balance model with a set of algebraic equations of reaction rate variables (kinetic laws are not needed), wherein the set of algebraic equations describe the feasible set of the reaction rates of specific interactions.
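The static case can be illustrated by building a stoichiometric matrix from the connections and testing whether a candidate vector of reaction rates lies in the feasible set. The pathway, the function names and the numbers are assumptions for illustration, and only internal entities are balanced in this sketch.

```python
# (entity, interaction, role, stoichiometric coefficient); illustrative pathway.
connections = [
    ("B", "r1", "product", 1.0),
    ("B", "r2", "substrate", 1.0),
    ("C", "r2", "product", 2.0),
    ("C", "r3", "substrate", 1.0),
]
entities = ["B", "C"]
interactions = ["r1", "r2", "r3"]

def stoichiometric_matrix(connections, entities, interactions):
    """One row per entity, one column per interaction; products positive, substrates negative."""
    matrix = [[0.0] * len(interactions) for _ in entities]
    for entity, interaction, role, coefficient in connections:
        sign = {"product": 1.0, "substrate": -1.0}.get(role, 0.0)
        matrix[entities.index(entity)][interactions.index(interaction)] += sign * coefficient
    return matrix

def is_balanced(matrix, rates, tolerance=1e-9):
    """True if every entity's net flux is zero for the given reaction rates."""
    return all(abs(sum(s * v for s, v in zip(row, rates))) <= tolerance
               for row in matrix)

S = stoichiometric_matrix(connections, entities, interactions)
feasible = is_balanced(S, [1.0, 1.0, 2.0])
infeasible = is_balanced(S, [1.0, 1.0, 1.0])
```

No kinetic laws appear in this sketch, which reflects the statement that the flux balance model constrains the reaction rates through stoichiometry alone.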
Users can provide their objective functions and additional constraints or measurement results that limit the feasible set of solutions.
Yet another preferred feature is the capability to model noise in a flux-balance analysis. We can add artificial noise variables that need to be minimized in the objective function. The noise variables are given in the data sets described above. This helps to tolerate inaccurate measurements with reasonable results.
The model described herein also supports visualization of pathway solutions (active constraints). In the general case, the modelling leads to a hybrid equation model where kinetic laws are needed. They can be accumulated in the database in different ways but there may be some default laws that can be used as needed. In the general equations, interaction-specific reaction rates are replaced by kinetic laws, such as Michaelis-Menten laws, that contain concentrations of enzymes and substrates.
The equations can be converted to the form:
There are alternative implementations. For example, instead of the substitution made above, we can calculate kinetic laws separately and substitute the numeric values to specific reaction rates iteratively.
A benefit of such a structured pathway model, wherein the pathway elements are associated with interaction data, such as interaction type and/or stoichiometric coefficients and/or location, is that flux rate equations, such as the equations described above, can be generated by an automatic modelling process, which greatly facilitates computer-aided simulation of biochemical pathways. Because each kinetic law has a database relation to an interaction and each interaction relates, via a specific connection, to a biochemical entity, the modelling process can automatically combine all kinetic laws that describe the creation or consumption of a specific biochemical entity and thereby automatically generate flux-balance equations according to the above-described examples.
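The automatic modelling process can be sketched as a join from kinetic laws via interactions and connections to biochemical entities. The Python fragment below is a simplified illustration that emits one symbolic rate equation per entity; the record layouts, the example laws and the function name `generate_equations` are assumptions, not the actual database schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical kinetic laws keyed by interaction.
kinetic_laws: Dict[str, str] = {
    "I1": "k1*[A]",
    "I2": "Vmax*[B]/(Km+[B])",
}

# Hypothetical connections: (entity, interaction, signed stoichiometric coefficient).
connections: List[Tuple[str, str, int]] = [
    ("A", "I1", -1),
    ("B", "I1", 1),
    ("B", "I2", -1),
    ("C", "I2", 1),
]


def generate_equations(connections: List[Tuple[str, str, int]],
                       kinetic_laws: Dict[str, str]) -> Dict[str, str]:
    """Combine all kinetic laws that create or consume each entity."""
    terms = defaultdict(list)
    for entity, interaction, coeff in connections:
        sign = "+" if coeff > 0 else "-"
        factor = "" if abs(coeff) == 1 else f"{abs(coeff)}*"
        terms[entity].append(f"{sign} {factor}({kinetic_laws[interaction]})")
    return {entity: f"d[{entity}]/dt = " + " ".join(parts)
            for entity, parts in terms.items()}


equations = generate_equations(connections, kinetic_laws)
```

Because the equations are derived purely from stored relations, adding a connection or changing a kinetic law changes the generated equations without any manual editing.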
Another benefit of such a structured pathway model is that hierarchical pathways can be interpreted by computers. For instance, the user interface logic may be able to provide easily understandable visualizations of the hierarchical pathways as will be shown in connection with
Also, measured or controlled variables can be visualized and localized on relevant biochemical entities. For example, reference numeral 881 denotes the concentration of a biochemical entity, reference numeral 882 denotes the reaction rate of an interaction and reference numeral 883 denotes the flux of a connection.
The precise roles of connections, kinetic laws associated with interactions and the biologically relevant location of each pathway provide improvements over prior art pathway models. For instance, a model as shown in
This technique supports graphical representations of measurement results on displayed pathways as well. The measured variables can be correlated to the details of a graphical pathway representation based on the names of the objects.
Note that the data base structure denoted by reference numerals 200 and 700 (
Experiments
The IMS preferably comprises an experiment project manager. A project comprises one or more experiments, such as sampling, treatment, perturbation, feeding, cultivation, manipulation, purification, cloning or other combining, separation, measurement, classification, documentation, or in-silico workflows.
A benefit of an experiment project manager is that all the measurement results or controlled conditions or perturbations (“what”), biomaterials and locations in biomaterials (“where”) and timing of relevant experiments (“when”) and methods (“how”) can be registered for the interpretation of the experiment data. Another benefit comes from the possibility to utilize the variable description language when storing experiment data as data sets explained earlier.
The experiment output 920 connects relevant output, such as a biomaterial 922 (eg population, individual, reagent or sample) or a data entity 924 (eg measurement results, documents, classification results or other results) to the experiment, along with relevant time information. For instance, if the input comprises a specific sample of a biomaterial, the experiment may produce a differently-numbered sample of the same organism. In addition, the experiment output 920 may comprise results in the form of various data entities (such as the data sets shown in
Data traceability will be improved by the fact that the experiment input 914 and experiment output 920 have a relevant time, as denoted by items 915 and 921 respectively. The times 915, 921 indicate times when the relevant biochemical event, such as sample taking, perturbation, or the like, took place. Data traceability will be further described in connection with
An experiment also has a target 930, which is typically a biomaterial 932 (eg population, individual, reagent or sample), but the target of in-silico experiments may be a data entity 934.
The method entity 910 has a relation to a method description 912 that describes the method. The loop next to the method description 912 means that a method description may refer to other method descriptions.
The experiment input 914 and experiment output 920 are either specific biomaterials 916, 922 or data entities 918, 924, which are the same data elements as the corresponding elements in
Because the biochemical information (reference numeral 200 in
The experiment project manager preferably comprises a project editor having a user interface that supports project management functionality for project activities. That gives all the benefits of standard project management that are useful in systems biochemical projects as well.
A preferred implementation of the project editor is able to trace all biomaterials, their samples and all the data through the various experiments including wet-lab operations and in-silico data processing.
An experiment project can be represented as a network of experiment activities, target biomaterials and input or output deliverables that are biomaterials or data entities.
In terms of complexity,
In case of sampling the input section indicates a biomaterial to be sampled, and the output section indicates a specific sample. In case of sample manipulation the input section indicates a sample to be manipulated and the output section indicates the manipulated sample. In a combination experiment the input section indicates several samples to be combined and the output section indicates the combined, identified sample. Conversely, in a separation experiment the input section indicates a sample to be separated and the output section indicates several separated, identified samples. In a measurement experiment the input section indicates a sample to be measured and the output section is a data entity containing the measurement results. In a classification experiment the input section indicates a sample to be classified and the output section indicates a phenotype and/or genotype. In a cultivation experiment the input and output sections indicate a specific population, and the equipment section may comprise identities of the cultivation vessels.
In order to describe complex experiments, there may be experiment binders (not shown separately) that combine several experiments in a manner which is somewhat analogous to the way the pathway connections 700, 720, 730 combine various pathways.
If the project plan shown in
Assume that a researcher wishes to obtain four data sets, namely perturbation data 952 that describes a set of perturbations to be entered into a population 966 and sampled measurement data 954A-954C from the population 966. The population 966, labelled Po[popula] and specified in the data sets 952 and 954A-954C, is an instance of a biomaterial experiment target 932 and 930 (see
In this way, experiment targets 930 and intermediate experiments 904 and their inputs 914 and outputs 920 with required timing 915 and 921 can be determined by the information of data sets 952 and 954A-954C and predefined methods 910 and method descriptions 912 when variable data of data sets are mapped into methods in method descriptions 912.
The problem faced by the logic for creating automatic project plans is how to determine the intermediate steps from data sets 954A-954C to the population 966. The logic is based on the idea that in a typical research facility, any type of measurement data can only be created by a limited set of measurement methods. Assume that the first data set 954A contains data for which there is only one method description 912 (see
Furthermore, the logic can also infer advantageous time stamps for the acts of the project plan. As shown in
Biomaterial Descriptions
A loop 1010 under the organism element 214-1 means that the organism is preferably described in a taxonomical description. The bottom half of
The variable description language described in connection with
A benefit of this kind of location information is an improved and systematic way to compare locations of samples and locations of theoretical constructs like pathways that need to be verified by relevant measurement results.
Another advantage gained by storing the biomaterials section substantially as shown in
Data Traceability
Data traceability is based on the time information 915 and 921 associated with experiment inputs and outputs 914 and 920, respectively (see
At time 12 two further samples are obtained from sample 4. As shown by arrow 1108, sample 25 is obtained from sample 4 by separating the nuclei. Reference numeral 1112 denotes an observation (measurement) of sample 25, namely the concentration of protein P53, which in this example is shown as 4.95.
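Tracing of this kind can be sketched as walking backwards through time-stamped derivation records. In the Python fragment below, only the step from sample 4 to sample 25 at time 12 is taken from the example above; the origin of sample 4, its time stamp and all class and function names are hypothetical, and the records are assumed to contain no cycles.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Derivation:
    output: str   # biomaterial produced by the experiment
    source: str   # biomaterial used as experiment input
    time: int     # time of the relevant biochemical event
    method: str   # how the output was obtained


def trace(sample: str, derivations: List[Derivation]) -> List[Derivation]:
    """Follow derivation records from a sample back to its origin."""
    by_output = {d.output: d for d in derivations}
    chain = []
    current = sample
    while current in by_output:
        step = by_output[current]
        chain.append(step)
        current = step.source
    return chain


derivations = [
    # Hypothetical: origin and time of sample 4 are assumed for illustration.
    Derivation("sample 4", "population", 0, "sampling"),
    # From the example: sample 25 is obtained from sample 4 at time 12.
    Derivation("sample 25", "sample 4", 12, "separation of nuclei"),
]
lineage = trace("sample 25", derivations)
```

An observation such as the P53 concentration measured from sample 25 can then be related to every earlier biomaterial and event in the returned chain.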
Showing images such as those contained in
It should be understood that real-life cases can be far more complex than what can reasonably be shown on one drawing page. Thus
Workflow Descriptions
Tools are defined in terms of tool name, category, description, source, pre-tag, executable, inputs, outputs and service object class (if not the default). This information is stored in a tool table or database 1208.
An input definition includes pre-tag, id number, name, description, data entity type, post-tag, command line order, optional-status (mandatory or optional). This information is stored into the tool input binder 1210 or tool output binder 1212. In a real-life implementation, it is convenient to store the tool 1208, the tool input binder 1210 and tool output binder 1212 in a single disk file, an example of which is shown in
The data entity types are defined to the system in terms of data entity type name, description, data category (eg file, directory with subdirectories and files, data set, database, etc). Several data entity types may belong to the same category but have different syntax or semantics, and they consequently belong to different data entity types under the compatibility rules of existing tools. This information is stored in data entity type 1214. Tool server binder 1224 indicates a tool server 1222 in which the tool can be executed. If there is only one tool server 1222, the tool server binder 1224 can be omitted.
Typed data entities are used to control the compatibility of different tools that may or may not be compatible. This makes it possible to develop a user interface in which the system assists users to create meaningful workflows without prior knowledge about the details of each tool.
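The following Python sketch illustrates how typed input definitions could be used both to check compatibility and to assemble a command line. The interpretation of the pre-tag and post-tag as text placed directly before and after the data entity on the command line is an assumption, as are all class names, field names and example values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolInput:
    name: str
    entity_type: str      # required data entity type
    pre_tag: str          # assumed: text emitted before the entity
    post_tag: str         # assumed: text emitted after the entity
    order: int            # command line order
    optional: bool = False


@dataclass
class DataEntity:
    name: str
    entity_type: str
    path: str


def build_arguments(inputs: List[ToolInput],
                    bound: Dict[str, DataEntity]) -> List[str]:
    """Type-check bound entities and return arguments in command line order."""
    args = []
    for spec in sorted(inputs, key=lambda s: s.order):
        entity = bound.get(spec.name)
        if entity is None:
            if spec.optional:
                continue
            raise ValueError(f"mandatory input '{spec.name}' is not connected")
        if entity.entity_type != spec.entity_type:
            raise TypeError(f"input '{spec.name}' expects {spec.entity_type}, "
                            f"got {entity.entity_type}")
        args.append(f"{spec.pre_tag}{entity.path}{spec.post_tag}")
    return args


inputs = [
    ToolInput("reference", "fasta", "--ref=", "", order=2),
    ToolInput("reads", "fasta", "--in=", "", order=1),
    ToolInput("mask", "bed", "--mask=", "", order=3, optional=True),
]
bound = {
    "reads": DataEntity("reads", "fasta", "reads.fasta"),
    "reference": DataEntity("reference", "fasta", "genome.fasta"),
}
arguments = build_arguments(inputs, bound)
```

Because the check runs when an entity is connected rather than when the tool is executed, a user interface can reject an incompatible connection immediately.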
The data entity instances containing user data are stored in data entity 1216. When workflows are built the relevant data entities are connected to relevant tool inputs through workflow inputs 1204 or workflow outputs 1206. Reference numeral 1200 generally denotes the various data entities, which in real-life situations constitute actual instances of input or output data.
Each tool server 1222 comprises an executor and a service object that is able to call any standalone tool installed on the tool server. The executor manages executing all the relevant tools of a workflow with relevant data entities through a standardized service object. The service object provides a common interface for the executor to run any standalone software tool. Tool-specific information can be described in an XML file that is used to initialize metadata for each tool in the tool database (item 1208 in
A workflow/tool manager as shown in
Note that
As shown in
The embodiment shown in
One enhancement consists of the fact that the hierarchical workflow 1202, 1203 of
Another enhancement consists of the fact that the work input 1204′ and work output 1206′ are not connected to a data entity 1216 directly but via a data entity list 1226 which, in turn, is connected to the data entity 1216 via a data entity-to-list binder 1228. A benefit of this enhancement is that a work's input or output can comprise lists of data entities. This simplifies end-user actions when multiple data entities are to be processed similarly. Technically speaking, the data entity list 1226 specifies several data entities as an input 1204′ or output 1206′ of a work, such that each data entity in the list is processed by a tool 1208 separately but in a coordinated manner.
A third enhancement is a structured-data-entity-type binder 1230 for processing structured data entities, such as the data sets 610 and 620 shown in
Moreover, each tool 1208 may have associated options 1238 and/or exit codes 1239. The options 1238 may be used to enter various parameters to the software tools, as is well known in connection with script file processing. The options 1238 will be further discussed in connection with
Yet another optional enhancement shown in
The elements in FIGS. 13 relate to those in
The parent workflow being edited is an instance of workflow class 1202. The arrows 1356, 1364, etc., created by the graphical user interface in response to user input, represent instances of a work or workflow input 1204′, 1204. These arrows connect a data entity as an input to a work that will be done by executing the tool when the workflow is executed. The relevant tool is indicated with a “tool” type icon, such as icon 1354. The tool input binders 1210 enable type checking of each connected instance of a data entity. The arrows 1360 represent instances of a work or workflow output 1206, 1206′. These arrows connect a data entity as an output from a work that will be done by executing the tool when the workflow is executed. The relevant tool is indicated with a “tool” type icon. The tool output binders 1212 enable type checking of each connected instance of a data entity.
A benefit of this implementation is that the well-defined type definition shown in
Again, abstract concepts, such as child workflow and workflow input, workflow output, work input and work output are hidden from the users of the graphical user interface, but more concrete elements, such as data entities, tools, tool inputs and tool outputs are visualized to users as intuitive icons and arrows.
In case of quantitative data, the data entities 1216, 1352, etc. are preferably organized as data sets 610, 620, and more particularly as variable value matrixes 614, 624, that were described in connection with
The graphical user interface preferably employs a technique known as “drag and drop”, but in a novel way. In conventional graphical user interfaces, the drag and drop technique works such that if a user drags an icon of a disk file on top of a software tool's icon, the operating system interprets this user input as an instruction to open the specified disk file with the specified software tool. But the present invention preferably uses the drag and drop technique such that the specified disk file (or any other data entity) is not immediately processed by the specified tool. Instead, the interconnection of a data entity to a software tool is saved in the workflow being created or updated. Use of the familiar drag and drop metaphor to create saved workflows (instead of triggering ad-hoc actions) provides several benefits. For example, the saved workflows can be easily repeated, with or without modifications, instead of recreating each workflow entirely. Another benefit is that the saved workflows support tracing of workflows.
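The difference between triggering an ad-hoc action and saving a connection can be sketched in Python as follows. The `Workflow` class, its method names and the `runner` callback are hypothetical and only illustrate the principle that a drop is recorded first and executed later.

```python
from typing import Callable, List, Tuple


class Workflow:
    """Stores entity-to-tool connections instead of executing them at once."""

    def __init__(self) -> None:
        self.connections: List[Tuple[str, str]] = []

    def drop(self, data_entity: str, tool: str) -> None:
        # A drag-and-drop gesture only records the interconnection.
        self.connections.append((data_entity, tool))

    def execute(self, runner: Callable[[str, str], str]) -> List[str]:
        # Saved connections are processed only when the workflow is run.
        return [runner(tool, entity) for entity, tool in self.connections]


calls: List[Tuple[str, str]] = []


def fake_runner(tool: str, entity: str) -> str:
    """Stand-in for a tool server; records the call and returns a label."""
    calls.append((tool, entity))
    return f"{tool}({entity})"


workflow = Workflow()
workflow.drop("sample_data", "normalize")
calls_after_drop = list(calls)           # nothing has been executed yet
first_run = workflow.execute(fake_runner)
second_run = workflow.execute(fake_runner)  # the saved workflow is repeatable
```

The saved list of connections is also what makes later tracing possible, since it documents which data entity was fed to which tool.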
Dedicated tool input and output binders make it possible to use virtually any third-party data processing tools. The integration of new, legacy or third-party tools is made easy and systematic.
The systematic concept of workflows hides the proprietary interfaces of third-party tools and substitutes the proprietary interfaces with a common graphical user interface of the IMS. Thus users can use the functions of a common graphical user interface to prepare, execute, monitor and view workflows and their data entities. In addition, such a systematic workflow concept supports systematic and complete documentation, easy reusability and automatic execution.
The concept of data entity provides a general possibility to experiment with any data. However, the concept of data entity type makes it possible to understand, identify and control the compatibility of different tools. Organization of quantitative data as data sets, each of which comprises a dimensionless variable value matrix, provides maximal compatibility between the data sets and software tools from third parties, because the tools do not have to separate data from dimensions or data descriptors.
Because of the graphical interface, researchers with biochemical expertise can easily connect the biologically relevant data entities to or from available inputs or outputs and get immediate visual feedback. Inexperienced users can reuse existing workflows to repeat standard workflows merely by changing the input data entities. The requirement to learn the syntactic and semantic details of each specific tool's command line can be delegated to technically-qualified persons who integrate new tools to the system. This benefit stems from the separation of the tool definitions from the workflow creation. Biochemical experts can concentrate on workflow creation (defined in terms of data entities, works, workflows, work inputs, workflow inputs, work outputs, workflow outputs), while the tool definitions (tools, tool input binders, tool output binders, options, exit codes) are delegated to information-technology experts.
Automatic Population of Pathways from a Gene Sequence Database
An IMS having a pathway model substantially as described in connection with
Thus an IMS with a pathway model as described above, primarily in connection with
As used herein, biology's central dogma means the current scientific view of microbiological processes, and more particularly, transcription of specific genes into specific transcripts and translation of specific transcripts into specific proteins. But systematic pathways with detailed biological central dogma information simply do not exist. Such pathways would be a reasonable starting point when building a realistic gene regulation network based on genes, transcripts and proteins. Prior art pathways only contain partial information (such as genes connected together if a product of one gene is a known regulator of another gene). Relationships of genes, transcripts and proteins are largely not described in machine-readable pathways. One explanation is that transcripts are not systematically identified and, consequently, they are not easily presented as elements of interactions in pathways. Creation of large pathways is also hampered by several problems, such as naming, pathway modelling and scalability. Pathways according to the central dogma tend to be complex, and it is far from trivial to realize that pathways of such complexity can be adequately modelled at all.
This embodiment takes well-identified genes from any typical DNA sequence database that contains identified genes with their DNA sequences. This input data does not include explicit pathway data, such as interactions, which may explain why the potential of the hidden pathway information in the DNA sequence database has been ignored so far. A typical DNA sequence database provides annotations of coding areas of each gene, each of which indicates a specific part of the DNA sequence known to code a part of a transcript and/or part of a protein. Some DNA sequence databases are available in specific flat file formats or in XML format, containing so-called feature tables or FT lines for specific keyword annotations (eg “CDS” for coding area/sequence) and a field that indicates sequential location of the annotated feature. Typically there are database references for genes and sometimes for proteins as well.
A gene can be identified objectively by its DNA sequence and its place on a chromosome or other genomic molecule carrying genes, and subjectively by various names and database references.
A transcript can be identified objectively by its RNA sequence that is derived from the DNA sequence of the relevant gene. Messenger RNAs contain the RNA sequence that has been derived from the protein coding areas of the DNA sequence of the relevant gene. Each relevant transcript needs to be named. It can be named by the relevant gene if there are no other gene products; otherwise it can be named by the gene and the protein it codes.
Three consecutive bases of an RNA sequence code one amino acid for the sequence of a protein. This means that one messenger RNA codes one protein that can be identified objectively by its amino acid sequence or subjectively by its several names or database references. The similarity of biochemical entities needs to be checked based on objective identification data. The names of biochemical entities must be used consistently in all applications that process the pathways.
This embodiment combines a pathway model, a logic for modifying and checking network topology of pathways and a management of objective and subjective identifications of biochemical entities (at least for genes, transcripts and proteins) based on gene sequence data and a database reference data structure that holds the consistently used name of a biochemical entity together with the database name, the id_name used in the database and an id_string containing a subjective identification of the biochemical entity. The sequence data and subjective identifications are taken from a gene sequence database that has no explicit interaction or pathway data.
In typical gene sequence databases, there are line identifiers, keywords, and sequential location or qualifier information for feature annotations. Although there are many different identifiers, keywords and qualifiers, it is possible to utilize some general commonalities.
For example, EMBL sequence database has feature tables as follows:
There are FT lines (feature table) having CDS (coding sequence) keywords indicating coding area and specific qualifiers that provide various database references to genes (/gene=“THBS3”) and their proteins (db_xref=“SWISS-PROT:P49746”). This means that the gene identified by THBS3 has a protein product identified by “SWISS-PROT:P49746” and there must be an mRNA between the gene and the protein. Names need to be converted to the recommended names (see the name tables 226 in
Let us assume that there are features annotated to have gene G1 (denoted by reference numeral 1402) with splice variant products P1, P2 and P3 (reference numerals 1442, 1444 and 1446). In such a case, an automatic population routine can infer that there must be three splice variant mRNAs, namely Tr1=mRNA from G1 to P1, Tr2=mRNA from G1 to P2, and Tr3=mRNA from G1 to P3. These splice variant mRNAs are denoted by reference numerals 1422, 1424 and 1426.
Let us further assume that there is a feature annotated to have gene G2, 1408 with one product P4, 1448. Then the automatic population routine can infer that there must be one mRNA, namely Tr4=mRNA, 1428, from G2 to P4.
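The inference described in the two preceding paragraphs can be sketched in Python as follows. The function name `infer_transcripts` and the running numbering of transcripts are illustrative assumptions; the gene and protein labels are those of the example.

```python
from typing import Dict, List, Tuple


def infer_transcripts(gene_products: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Infer one mRNA per (gene, protein) pair.

    Returns (gene, transcript, protein) triples, from which transcription
    (gene to transcript) and translation (transcript to protein)
    interactions of a skeleton pathway could be created.
    """
    skeleton = []
    counter = 0
    for gene, proteins in gene_products.items():
        for protein in proteins:
            counter += 1
            skeleton.append((gene, f"Tr{counter}", protein))
    return skeleton


# G1 has three splice variant products, G2 has one product.
skeleton = infer_transcripts({"G1": ["P1", "P2", "P3"], "G2": ["P4"]})
```

Each triple corresponds to a transcript that is not annotated in the sequence database but must exist between the gene and its protein product.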
Based on the above information, a skeleton pathway such as the one shown in
Initially, the transcription interactions can be mechanically completed with ribonucleotide substrates, and afterwards with known transcription factors. The translation interaction can be completed with amino acids and ribosome. The interactions are not yet complete but RNA sequence databases can be used to form translation interactions if there are annotated features with an identified mRNA and a protein.
In terms of hardware and software, the IMS needs access to external databases. Many databases can be accessed with an ordinary Internet browser. Accordingly, the automatic population software needs to emulate an Internet browser or otherwise output compatible commands. In addition, the IMS needs a parsing logic and information on how the output of each database is arranged.
Coding sequence annotation (TRUE/FALSE)
Start point of exon (integer)
End point of exon (integer)
DNA sequence from start_point to end_point (string of acgt)
Database reference of gene (eg based on EMBL /gene qualifier)
Database reference of protein (eg based on EMBL db_xref)
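The record fields listed above can be sketched as a Python data class, together with a helper that collects the protein references found for each gene. The class name, field names, coordinates and sequence strings are hypothetical; only the THBS3 and SWISS-PROT:P49746 references are taken from the EMBL example in this description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExonRecord:
    is_coding: bool      # coding sequence annotation (TRUE/FALSE)
    start_point: int     # start point of exon
    end_point: int       # end point of exon
    dna_sequence: str    # DNA sequence from start_point to end_point
    gene_ref: str        # database reference of gene
    protein_ref: str     # database reference of protein


def products_by_gene(records: List[ExonRecord]) -> Dict[str, List[str]]:
    """Collect the distinct protein references of each gene from coding exons."""
    result: Dict[str, List[str]] = {}
    for record in records:
        if not record.is_coding:
            continue
        proteins = result.setdefault(record.gene_ref, [])
        if record.protein_ref not in proteins:
            proteins.append(record.protein_ref)
    return result


# Coordinates and sequences below are made up for illustration.
records = [
    ExonRecord(True, 1, 6, "atggct", "THBS3", "SWISS-PROT:P49746"),
    ExonRecord(True, 20, 25, "ggctaa", "THBS3", "SWISS-PROT:P49746"),
    ExonRecord(False, 30, 35, "ttttaa", "THBS3", ""),
]
gene_products = products_by_gene(records)
```

The database references in the result would still need to be translated to recommended names before being used in a pathway.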
In step 1453 the logic searches for the next gene from the exon records. If none is found, the process ends. In step 1455 the logic translates the database reference to a gene name via a database reference table (not shown separately). In step 1456 the logic searches for the next protein from the exon records related to the gene. If no proteins are found, the logic proceeds to step 1470. In step 1458, if no more proteins are found, the logic returns to step 1453. In step 1459 the logic translates the database reference to a protein name via a database reference table (not shown separately).
In step 1460 the logic checks if there are any transcripts connected between this gene and this protein in the pathway, such that the gene controls a transcription interaction AND the transcription interaction produces a transcript AND the transcript controls a translation interaction AND the translation interaction produces the protein. In step 1461, if any are found, the logic returns to step 1456. In steps 1462 to 1467, the logic creates pathway information as follows:
In step 1468, some other biochemical entities (eg amino acids and ribosome) may optionally be connected to transcription and translation. Then the logic returns to step 1453. The steps shown in
Spatial Reference Models
There are many cell types for which a simple Cartesian or polar coordinate system is insufficient. For example, stem cells are directional, which means that they have a front end and a back end. Nerve cells are even more complex. Accordingly, the IMS preferably comprises several spatial reference models, and the spatial point is expressed as a combination of a reference model and an area within the reference model.
Reference model 1510 is based on a division of a cell to several areas. The number of areas should be selected such that a piece of biochemical information is valid throughout the area. Reference model 1510 is suitable for a compact directional cell, such as a stem cell. The model 1510 is directional but rotationally symmetric. It has a front end area 1511, a rear end area 1516, a nucleus area 1514 and various intermediate areas 1512, 1513 and 1515. The front and rear ends can be selected relative to some gradient, such as a decreasing concentration of a compound.
Reference model 1520 is an example of modelling the topology of a nerve cell. It has a nucleus area 1521, various parts 1522, 1523 around the nucleus, a soma area 1524, an axon area 1525, etc. Normalized spatial coordinates can be used to increase detail level still further, if necessary. For instance, a point at the outer surface of an axon at its midpoint length-wise can be expressed {1520, 1525, (0.5, 1)}, wherein 1520 indicates the reference model, 1525 indicates the area within the reference model, 0.5 is a normalized length-wise coordinate along the axon and 1 means 100% of the radius along the cross section of the axon.
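A spatial point of the form {reference model, area, normalized coordinates} can be sketched as a small Python data class. The class and field names are assumptions; the example values are those of the axon example above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SpatialPoint:
    model: int                       # reference model, eg 1520
    area: int                        # area within the reference model, eg 1525
    coords: Tuple[float, ...] = ()   # optional normalized coordinates

    def __post_init__(self) -> None:
        # Normalized coordinates are assumed to lie between 0 and 1.
        if any(not 0.0 <= c <= 1.0 for c in self.coords):
            raise ValueError("normalized coordinates must be within [0, 1]")


# Outer surface of the axon at its length-wise midpoint.
axon_surface_midpoint = SpatialPoint(1520, 1525, (0.5, 1.0))
# An area without further detail can omit the coordinates.
nucleus_area = SpatialPoint(1520, 1521)
```

Keeping the reference model inside the point makes two points comparable only when they refer to the same model.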
Pattern Matching
G[*] activates I[*] produces Tr[*] activates I[*] produces P[*] inhibits @3
This example comprises two special symbols. The asterisks “*”, denoted by reference signs 1652A, are wildcard expressions that match any character string. Such wildcard characters are well known in the field of information technology, but the use of such wildcard characters is only possible by virtue of the systematic way of storing biochemical information. The last term “@3”, denoted by reference sign 1652B, is another special character and means the third term in the search criterion 1652, ie, the interaction I[*], which is activated (=second term) by any gene G[*] (=first term). The fact that the pattern-matching logic 1650 can process special terms like “@3” 1652B that refer to a previous term in the search criterion 1652, enables the pattern-matching logic 1650 to retrieve pathways that contain loops.
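The handling of the two special symbols can be illustrated with a simplified Python sketch that matches a search criterion against a single linear chain of pathway terms. This is not the pattern-matching logic 1650 itself, which searches whole pathways; the function names and the example chains are hypothetical.

```python
import re
from typing import List


def term_matches(term: str, value: str) -> bool:
    """Match one criterion term, treating '*' as a wildcard for any string."""
    regex = re.escape(term).replace(r"\*", ".*")
    return re.fullmatch(regex, value) is not None


def chain_matches(criterion: str, chain: List[str]) -> bool:
    """Match a whitespace-separated criterion against a chain of terms.

    A term of the form '@n' must equal the chain element that was matched
    by the n-th term of the criterion (1-based), which expresses a loop.
    """
    terms = criterion.split()
    if len(terms) != len(chain):
        return False
    for term, value in zip(terms, chain):
        if term.startswith("@"):
            if chain[int(term[1:]) - 1] != value:
                return False
        elif not term_matches(term, value):
            return False
    return True


criterion = "G[*] activates I[*] produces Tr[*] activates I[*] produces P[*] inhibits @3"
loop = ["G[a]", "activates", "I[1]", "produces", "Tr[a]", "activates",
        "I[2]", "produces", "P[a]", "inhibits", "I[1]"]
no_loop = loop[:-1] + ["I[2]"]
loop_found = chain_matches(criterion, loop)
no_loop_found = chain_matches(criterion, no_loop)
```

In the first chain the protein inhibits the same interaction that the gene activates, so the back-reference is satisfied; in the second chain it inhibits a different interaction and the match fails.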
In addition to the search criterion 1652 that may comprise wildcards, the pattern-matching logic 1650 may have another input 1654 that indicates a list of potential pathways. The list may be an explicit list of specific pathways, or it may be an implicit list expressed as further search criteria based on elements of the pathway model (for potential search criteria, see
For example, the pattern-matching logic 1650 can be implemented as a recursive tree-search algorithm 1670 as shown in
As regards realization of step 1682, in which tree structures are constructed from the pathway under test, tree-search algorithms are disclosed in programming literature. In a normal tree-search algorithm, loops are normally not allowed, but in step 1682 a loop is allowed if that loop matches a loop in the search criterion 1652.
The example shown in
In the embodiment shown in
The object classes of the connections (gene, transcript, . . . ) are as follows:
When the query 1690 is processed, its result set indicates the pathways that meet the above criteria. In the retrieved pathways the pattern (motif) 1660 is easy to localize as soon as the five connections have been identified by means of their id fields.
Generation of the search criteria contains the following steps:
If some of the entities in the pathway motif have been identified by a name of its own or by a GO class, the generation of the SQL query involves further conditions, wherein the name of the entity or the GO class connected by the annotation restricts entries to the result set.
Such a topological pattern matching by relatively simple database queries is greatly facilitated by the systematic pathway model described in connection with
It is readily apparent to a person skilled in the art that, as the technology advances, the inventive concept can be implemented in various ways. The invention and its embodiments are not limited to the examples described above but may vary within the scope of the claims.
Acronyms
IMS: Information Management System
VDL: Variable Description Language
SQL: Structured Query Language
XML: Extensible Markup Language
Number | Date | Country | Kind |
---|---|---|---|
20031027 | Jul 2003 | FI | national |