Microarray technology has grown from modest beginnings to the present day, where expression profiling of whole genomes is routine. However, high throughput gene expression profiling presents a unique difficulty in the need to identify and distinguish significant changes in gene expression from among the tens of thousands of genes that can be assayed simultaneously. Indeed, analysis of high throughput data in the context of disease processes can be a daunting task. Statistical algorithms such as Significance Analysis of Microarrays (SAM) and hierarchical clustering have been developed to help facilitate analysis of gene expression data from microarrays.
The SAM algorithm assigns a score to each gene represented on a microarray on the basis of change in gene expression relative to the standard deviation of repeated measurements, see Tusher et al., “Significance analysis of microarrays applied to the ionizing radiation response”, 5116-5121, PNAS, Apr. 24, 2001, vol. 98, no. 9, which is hereby incorporated herein, in its entirety, by reference thereto. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). However, a list of significantly regulated genes does not provide much context to the biologist studying a disease.
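As an illustration of the scoring idea described above, the following is a minimal sketch, not the SAM software itself, of a per-gene score that divides the change in mean expression between two conditions by a pooled standard-deviation term. The function name `d_score`, the two-group layout, and the small constant `s0` are assumptions made for this example; the permutation-based FDR estimate is not shown.

```python
import math


def d_score(group_a, group_b, s0=0.1):
    """Score one gene: change in mean expression relative to the spread
    of the repeated measurements.

    group_a, group_b: repeated measurements under two conditions.
    s0: small positive constant that keeps the score bounded when the
        measured spread is close to zero (0.1 is an arbitrary,
        illustrative value, not a recommended setting).
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Pooled sum of squared deviations from each group's own mean.
    ss = sum((x - mean_a) ** 2 for x in group_a)
    ss += sum((x - mean_b) ** 2 for x in group_b)
    scale = (1.0 / n_a + 1.0 / n_b) / (n_a + n_b - 2)
    s = math.sqrt(scale * ss)
    return (mean_b - mean_a) / (s + s0)


# Hypothetical measurements for one gene: control vs. treated.
control = [1.0, 1.2, 0.8]
treated = [3.0, 3.1, 2.9]
print(d_score(control, treated))
```

A gene whose score exceeds a chosen threshold in absolute value would then be a candidate for the list of significantly regulated genes.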
Hierarchical clustering applies statistical algorithms to group genes according to similarity among gene expression patterns, where similarity values are typically calculated by Euclidean distance or correlation coefficient, e.g., see Larkin et al., “Cardiac transcriptional response to acute and chronic angiotensin II treatments”, Physiol Genomics, 18: 152-166, 2004, which is hereby incorporated herein, in its entirety, by reference thereto. Hierarchical clustering techniques do not provide context to the disease or phenomenon being studied, but are useful in identifying and distinguishing sets of statistically significant genes.
Other approaches have included conducting studies using other analytical approaches in combination with SAM statistics. In particular, an article by Lopes et al., “Pathophysiology of plaque instability: insights at the genomic level”, Prog Cardiovasc Dis 44: 323-328, 2002, which is incorporated herein, in its entirety, by reference thereto, discusses the importance of identification of gene groupings towards developing an understanding of the causes and risks for atherosclerosis.
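The two similarity measures named above can be sketched as follows. This is only an illustration of the measures themselves, with hypothetical function names and expression profiles; it does not perform the clustering step.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pearson_correlation(p, q):
    """Pearson correlation coefficient between two expression profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)


# Two hypothetical genes whose expression rises in the same pattern
# but at different magnitudes.
gene_x = [1.0, 2.0, 3.0]
gene_y = [2.0, 4.0, 6.0]
print(euclidean_distance(gene_x, gene_y))
print(pearson_correlation(gene_x, gene_y))
```

The choice of measure matters: profiles with the same shape but different magnitudes are perfectly correlated yet separated by a nonzero Euclidean distance.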
Although hierarchical clustering has been used as a pathway discovery tool (changes in expression of genes in activated networks would be expected to correlate, see Johnson et al., “Genomic profiles and predictive biological networks in oxidant-induced atherogenesis”, Physiol Genomics 13: 263-275, 2003, which is incorporated herein, in its entirety, by reference thereto) this ignores, among other things, the fact that some proteins are not transcriptionally regulated.
PathwayAssist, a commercially available pathway discovery program (Ariadne Genomics, http://www.ariadnegenomics.com/products/pathway.html) may be used to develop a pathway based upon genes identified as significant by any of the techniques described above. Although this program offers functionality as a pathway discovery tool, it lacks both objectivity and any form of mathematical expression of the connectedness of the genes plotted in the pathway that it generates.
More powerful tools and approaches are needed to provide context to high throughput data as it relates to a disease or other condition being studied, and for which the experiments that generated the high throughput data were conducted.
Methods, systems and computer readable media for network-based identification of significant molecules, for which at least one biological network is provided to include significant molecules to be identified. A node is identified in the network. A member-specific sub-network containing nodes connected to the identified node is identified for L levels of nearest neighbors, wherein L is a positive integer, and a connectivity score is calculated for the molecule represented by the identified node based on significance scores of each node contained in the member-specific sub-network. These steps are repeated for other nodes in the network.
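The steps summarized above can be sketched as a breadth-first traversal limited to L levels, followed by a score computed over the nodes reached. The adjacency-dictionary representation and the function names are assumptions for this example. The scoring formula is also an assumption: this passage states only that the connectivity score is based on the significance scores of the nodes in the member-specific sub-network, so the sketch simply sums their absolute values and counts the identified node itself.

```python
from collections import deque


def member_subnetwork(adjacency, start, levels):
    """Return the set of nodes within `levels` hops of `start`,
    including `start` itself (an assumption of this sketch)."""
    if levels < 1:
        raise ValueError("levels (L) must be a positive integer")
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == levels:
            continue  # do not expand beyond L levels of neighbors
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def connectivity_score(adjacency, significance, start, levels):
    """Assumed scoring: sum of absolute significance scores over the
    member-specific sub-network; nodes without a score count as zero."""
    nodes = member_subnetwork(adjacency, start, levels)
    return sum(abs(significance.get(n, 0.0)) for n in nodes)


# Hypothetical four-node chain A - B - C - D.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = {"A": 1.0, "B": -2.0, "C": 0.5}

# Repeat the calculation for every node in the network.
ranking = {n: connectivity_score(network, scores, n, 2) for n in network}
print(ranking)
```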
Methods, systems and computer readable media for network-based identification of significant molecules, for which at least one biological network is provided to include significant molecules to be identified, a data set including data values characterizing molecules experimented on is provided, and an interesting list of molecules is provided as a subset of the molecules from the dataset, the interesting list including significance scores for the molecules in the list. Such identification includes identifying a node in the network; identifying a member-specific sub-network containing nodes connected to the identified node for L levels of nearest neighbors, wherein L is a positive integer; extracting the member-specific sub-network from the network; and repeating the steps of identifying a node, identifying a member-specific sub-network and extracting the member-specific sub-network from the network for each of the other nodes in the network that corresponds to a molecule in the interesting list.
These and other advantages and features of the invention will become apparent to those persons skilled in the art upon reading the details of the methods, systems and computer readable media as more fully described below.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Before the present methods, systems and computer readable media are described, it is to be understood that this invention is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.
It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a sub-network” includes a plurality of such sub-networks and reference to “the node” includes reference to one or more nodes and equivalents thereof known to those skilled in the art, and so forth.
The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
An “interesting list” refers to a list of molecules for a disease process and/or condition under study associated with some high throughput data that have been determined to be significantly differentially regulated relative to other molecules represented in a population of high throughput data, which may be high throughput data from gene expression, location analysis, proteomic and/or metabolomic studies.
“Nexus genes” refer to potentially regulatory molecules that may or may not be members of an “interesting list” of molecules and that are associated with a number of other molecules (at least some of which are members of the interesting list) in a biological diagram or biological network.
The term “local format” or “local formatting” refers to a common format into which knowledge extracted from textual documents, biological data and biological diagrams can all be converted so that the knowledge can be interchangeably used in any and all of the types of sources mentioned. The local format may be a computing language, grammar or Boolean representation of the information which can capture the ways in which the information in the three categories is represented. The local format thus refers to a restricted grammar/language used to represent extracted semantic information from diagrams, text, experimental data, etc., so that all of the extracted information is in the same format and may be easily exchanged and used together. The local format can be used to link information from diverse categories, and this may be carried out automatically. The information that results in the local format can then be used as a precursor for application tools provided to compare experimental data with existing textual data and biological models, as well as with any textual data or biological models that the user may supply, for example.
The term “biological diagram”, “biological model” or “pathway”, as used herein, refers to any graphical image, stored in any type of format (e.g., GIF, JPG, TIFF, BMP, etc.) which contains depictions of concepts found in biology. Biological diagrams include, but are not limited to, pathway diagrams, cellular networks, signal transduction pathways, regulatory pathways, metabolic pathways, protein-protein interactions, interactions between molecules, compounds, or drugs, and the like. A “biological network” refers to a graph representation (which may also include text, and other information) wherein biological entities and the interrelationships between them are represented as diagrammatic nodes and links, respectively. Examples of biological networks include, but are not limited to, pathways and protein-protein interaction maps. A “pathway” refers to an ordered sequence of interactions in a biological network. An example of a pathway is a cascade of signaling events, such as the wnt/beta-catenin pathway, which represents the ordered sequence of interactions in a cell as a result of an outside stimulus, in this case, the binding of the wnt ligand to a receptor on the membrane of the cell. The terms “pathway” and “biological network” are sometimes used interchangeably in the art.
A “biological concept” or “concept” refers to any concept from the biological domain that can be described using one or more nouns according to the techniques described herein.
A “relationship” or “relation” refers to any concept which can link or “relate” at least two biological concepts together. A relationship may include multiple nouns and verbs.
An “entity” or “item” is defined herein as a subject of interest that a researcher is endeavoring to learn more about, and may also be referred to as a biological concept, i.e., “entities” are a subset of “concepts”. For example, an entity or item may be one or more genes, proteins, molecules, ligands, diseases, drugs or other compounds, textual or other semantic description of the foregoing, or combinations of any or all of the foregoing, but is not limited to these specific examples.
An “interaction” as used herein, refers to some association relating two or more entities. Co-occurrence of entities in an interaction implies that there exists some relationship between those entities. Entities may play a number of roles within an interaction. The structure of roles in an interaction determines the nature of the relationship(s) amongst the various entities that fill those roles. Interactions may be considered a subset of relationships.
A “node” refers to an entity, which also may be referred to as a “noun” (in a local format, for example). Thus, when data is converted to a local format nodes are selected as the “nouns” for the local format to build a grammar, language or Boolean logic.
A “link” refers to a relationship or action that occurs between entities or nodes (nouns) and may also be referred to as a “verb” (in a local format, for example). Verbs are identified for use in the local format to construct a grammar, language or Boolean logic. Examples of verbs, but not limited to these, include up-regulation, down-regulation, inhibition, promotion, bind, cleave and status of genes, protein-protein interactions, drug actions and reactions, etc.
“Phosphorylation” refers to the addition of phosphate groups to hydroxyl groups on proteins (side chains S, T or Y) catalysed by a protein kinase (often specific) with ATP as phosphate donor. Activity of proteins is often regulated by phosphorylation. Phosphorylation is one type of post-translational protein modification mechanism.
“Activated” refers to the state of a biochemical entity wherein it is enabled for performing its function.
“Inhibited” is used to refer to the state of a biochemical entity wherein it is wholly or partially disabled or deactivated for performing its function.
“Up-regulated” refers to a state of a gene wherein its production of corresponding RNA (ribonucleic acid) transcript is significantly higher than in a reference condition.
“Down-regulated” refers to a state of a gene wherein its production of corresponding RNA transcript is significantly lower than in a reference condition.
A “co-factor” is an inorganic ion or another enzyme that is required for an enzyme's activity.
A “rule” refers to a procedure that can be run using data related to stencils, nodes, and links. Rules can be declarative assertions that can be computationally verified, for example “an enzyme must be a protein”, or they can be arbitrary procedures that can be computationally executed using data related to stencils, nodes, and links, for example “if there is a relation such that entity A activates entity B, and if A is in state activated, then set B in state activated”.
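The two example rules quoted above can be sketched as follows. The dictionary layout for nodes, the tuple layout for links, and the function names are hypothetical. The second function also repeats the activation rule until nothing changes, which is a choice made for this sketch rather than something the definition requires.

```python
def enzymes_that_are_not_proteins(nodes):
    """Declarative rule "an enzyme must be a protein": return the names
    of nodes that violate the assertion."""
    return [
        name
        for name, attrs in nodes.items()
        if attrs.get("is_enzyme") and attrs.get("type") != "protein"
    ]


def propagate_activation(nodes, links):
    """Procedural rule: if A activates B and A is in state activated,
    then set B in state activated. Applied repeatedly until stable."""
    changed = True
    while changed:
        changed = False
        for source, verb, target in links:
            if (
                verb == "activates"
                and nodes[source].get("state") == "activated"
                and nodes[target].get("state") != "activated"
            ):
                nodes[target]["state"] = "activated"
                changed = True
    return nodes


# Hypothetical entities and links.
nodes = {"A": {"state": "activated"}, "B": {}, "C": {}}
links = [("B", "activates", "C"), ("A", "activates", "B")]
print(propagate_activation(nodes, links))
```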
A “stencil” refers to a diagrammatic representation which may contain one or more biological concepts, entities, times, interactions, relationships and descriptions (generally, although not necessarily, graphic descriptions) of how these interact. Stencils function similarly to macros in Microsoft Word or Excel, with respect to their functionality for generating more than one node or link at a time when constructing a biological diagram. Stencils may be comprised of graphical elements, such as shapes (e.g. rectangles, ovals), lines, arcs, arrows, and/or text. These elements have biological semantics; that is, elements represent types of biological entities, such as genes, proteins, RNA, metabolites, compounds, drugs, complexes, cell, tissue, organisms, biological relationship, disease, or the like.
A “database” refers to a collection of data arranged for ease and speed of search and retrieval. This term refers to an electronic database system (such as an Oracle database) that would typically be described in computer science literature. Further this term refers to other sources of biological knowledge including textual documents, biological diagrams, experimental results, handwritten notes or drawings, or a collection of these.
A “biopolymer” is a polymer of one or more types of repeating units. Biopolymers are typically found in biological systems and particularly include polysaccharides (such as carbohydrates), and peptides (which term is used to include polypeptides and proteins) and polynucleotides as well as their analogs such as those compounds composed of or containing amino acid analogs or non-amino acid groups, or nucleotide analogs or non-nucleotide groups. This includes polynucleotides in which the conventional backbone has been replaced with a non-naturally occurring or synthetic backbone, and nucleic acids (or synthetic or naturally occurring analogs) in which one or more of the conventional bases has been replaced with a group (natural or synthetic) capable of participating in Watson-Crick type hydrogen bonding interactions. Polynucleotides include single or multiple stranded configurations, where one or more of the strands may or may not be completely aligned with another.
A “nucleotide” refers to a sub-unit of a nucleic acid and has a phosphate group, a 5 carbon sugar and a nitrogen containing base, as well as functional analogs (whether synthetic or naturally occurring) of such sub-units which in the polymer form (as a polynucleotide) can hybridize with naturally occurring polynucleotides in a sequence specific manner analogous to that of two naturally occurring polynucleotides. For example, a “biopolymer” includes DNA (including cDNA), RNA, oligonucleotides, and PNA (peptide nucleic acid) and other polynucleotides, regardless of the source. An “oligonucleotide” generally refers to a nucleotide multimer of about 10 to 100 nucleotides in length, while a “polynucleotide” includes a nucleotide multimer having any number of nucleotides. A “biomonomer” references a single unit, which can be linked with the same or other biomonomers to form a biopolymer (for example, a single amino acid or nucleotide with two linking groups one or both of which may have removable protecting groups).
A “chemical array”, “array” or “microarray”, unless a contrary intention appears, includes any one-, two- or three-dimensional arrangement of addressable regions bearing a particular chemical moiety or moieties (for example, biopolymers such as polynucleotide sequences) associated with that region. An array is “addressable” in that it has multiple regions of different moieties (for example, different polynucleotide sequences) such that a region (a “feature” or “spot” of the array) at a particular predetermined location (an “address”) on the array will detect a particular target or class of targets (although a feature may incidentally detect non-targets of that feature). Array features are typically, but need not be, separated by intervening spaces. In the case of an array, the “target” will be referenced as a moiety in a mobile phase (typically fluid), to be detected by probes (“target probes”) which are bound to the substrate at the various regions. However, either of the “target” or “target probes” may be the one which is to be evaluated by the other (thus, either one could be an unknown mixture of polynucleotides to be evaluated by binding with the other). An “array layout” refers to one or more characteristics of the features, such as feature positioning on the substrate, one or more feature dimensions, and an indication of a moiety at a given location. “Hybridizing” and “binding”, with respect to polynucleotides, are used interchangeably. A “pulse jet” is a device which can dispense drops in the formation of an array. Pulse jets operate by delivering a pulse of pressure to liquid adjacent an outlet or orifice such that a drop will be dispensed therefrom (for example, by a piezoelectric or thermoelectric element positioned in a same chamber as the orifice).
When one item is indicated as being “remote” from another, what is meant is that the two items are at least in different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart.
“Communicating” information references transmitting the data representing that information as electrical signals over a suitable communication channel (for example, a private or public network). “Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data.
A “processor” references any hardware and/or software combination which will perform the functions required of it. For example, any processor herein may be a programmable digital microprocessor such as available in the form of a mainframe, server, or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic or optical disk may carry the programming, and can be read by a suitable disk reader communicating with each processor at its corresponding station.
“May” means optionally.
Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as the recited order of events.
A pathways-based approach to analysis of high throughput data as described herein may provide context for identifying significant therapeutics from among a large list of significantly regulated genes or other high throughput data determined to be significantly differentiated from a larger total population of that high throughput data. Identification of networks of interactions between significant genes represented by the significant high throughput data may provide crucial information for complex diseases where multiple genes and the environment interact. Described herein are methods and systems-based approaches to studying complex diseases in terms of gene-gene interactions among significantly regulated genes. Further, highly connected genes may be identified, referred to as “nexus” genes, which may be considered attractive candidates for therapeutic targeting.
A pathways-based approach can account for the fact that some proteins are not transcriptionally regulated and, at the same time, take account of prior knowledge by expanding the context beyond the genes and gene changes in the current experiment. A more comprehensive analysis of this type is particularly suited for complex diseases, where genes and the environment interact. It is not realistic, for example, to attempt to understand the inner workings of an automobile simply by disassembling it into its various parts, determining vital components and choosing to study those individual components. In the same way, analysis of a complex disease should be conducted with a more systems-based approach that allows for in-depth study of gene-gene interactions, and gives prominence to interactions among genes known to be differentially modulated in disease progression.
Described are systems, methods and computer readable media for analyzing disease processes in terms of gene-gene interactions and/or for identification of highly connected genes as potential therapeutic targets. Input for these analysis techniques may be high throughput data from gene expression, genotyping, location analysis, proteomic and/or metabolomic studies from which significantly differentially regulated molecules for the disease process have been identified. A list of such significantly differentially regulated molecules is referred to herein as an “interesting list”.
Each of the molecules in the interesting list has a score associated with it, which represents its significance. The score is referred to as the “significance score”. For example, the score can be based on the d-scores associated with SAM analysis or the ranking of molecules in terms of the relative differential expression (most differentially expressed molecules have lower ranks, hence higher scores). A comprehensive network of molecular interactions involving molecules in the interesting list may be constructed from any one or combination of the following: (i) language parsing of published literature; (ii) merging of existing pathway databases, metabolic reaction databases, protein-protein interaction databases; (iii) manually created network maps; and (iv) automatic network generation from experimental data.
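The ranking-based alternative mentioned above can be sketched as follows. The specific mapping from rank to score (score = N - rank + 1, so the most differentially expressed of N molecules scores N) is an assumption chosen for illustration; the text states only that lower ranks correspond to higher scores. The function name and input layout are likewise hypothetical.

```python
def rank_based_scores(differential_expression):
    """Map each molecule to a significance score derived from its rank.

    differential_expression: dict of molecule -> signed change.
    Molecules are ranked by absolute change, largest first (rank 1).
    Ties keep their input order because Python's sort is stable.
    """
    ordered = sorted(
        differential_expression,
        key=lambda m: abs(differential_expression[m]),
        reverse=True,
    )
    n = len(ordered)
    # Position 0 is rank 1 and receives the highest score, n.
    return {molecule: n - position for position, molecule in enumerate(ordered)}


# Hypothetical fold-change values.
changes = {"g1": -3.0, "g2": 0.5, "g3": 1.2}
print(rank_based_scores(changes))
```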
A method and system for knowledge extraction is described in co-pending, commonly owned application Ser. No. 10/154,524 titled “System and Method for Extracting Pre-Existing Data from Multiple Formats and Representing Data in a Common Format for Making Overlays”, filed on May 22, 2002. Application Ser. No. 10/154,524 is hereby incorporated herein, in its entirety, by reference thereto. Further, a method and system for using local user context to extract relevant knowledge is described in co-pending and commonly assigned application Ser. No. 10/155,304 filed May 22, 2002 and titled “System, Tools and Methods to Facilitate Identification and Organization of New Information Based on Context of User's Existing Information”. Application Ser. No. 10/155,304 is hereby incorporated herein, in its entirety, by reference thereto. Described are methods and systems wherein automated text mining techniques are used to extract “nouns” (e.g. biological entities) and “verbs” (e.g. relationships) from sentences in scientific text. Thus, knowledge extraction from scientific literature, e.g. via text mining, can identify biological entities that are involved in a relationship, for example a promotion interaction involving two genes. The resulting interpretation is represented in a restricted grammar, referred to as “local format”.
Co-pending and commonly owned application Ser. No. 10/642,376 filed Aug. 14, 2003 and titled “System, Tools and Method for Viewing Textual Documents, Extracting Knowledge Therefrom and Converting the Knowledge into Other Forms of Representation of the Knowledge” describes conversion of text to the local format using an interactive text viewing tool. This tool can automatically identify and extract entities and relationships found in a passage of text, and then provide an interface by which a user can interactively refine and disambiguate the extracted knowledge, which the present invention converts to a local format, thereby greatly improving the accuracy and reliability of the knowledge generated as a result of the process. The local format serves as a structured way for the user to review and encode the relevant knowledge contained in scientific text. It also serves as a biological object model that can be manipulated by other computational tools. Application Ser. No. 10/642,376 is hereby incorporated herein, in its entirety, by reference thereto.
Co-pending and commonly assigned application Ser. No. 10/641,492 filed Aug. 14, 2003 and titled “Method and System for Importing, Creating and/or Manipulating Biological Diagrams” discloses systems and methods for mapping biological concepts and relationships to regions, on graphical images that have biological semantic meaning, where those concepts and relationships are located. Such superimposition allows researchers to examine their data of interest in the form that they prefer (e.g., native data format, text format or graphical format) in the context of previously defined knowledge which is represented by the diagram. Moreover, such an overlay can allow for easy understanding of data with respect to a static model represented by the diagram.
Biological diagrams may be generated from a variety of input formats. The system may import graph data structures from pre-existing databases, for example. Separate import modules may serve on a database-specific basis to allow a biological diagram to be created given information in the format of each such specific database. A collection of local format objects may be imported to the system to construct a biological diagram. Diagrams created and/or imported by the present system may be saved and loaded.
Another functionality provided is the ability to import static graphical images and convert them to interactive biological diagrams. For example, a system may process an image of a biological diagram and determine a mapping to the coordinates of biological concepts found in the graphic. As noted above, the system can process diagrams from virtually any source. Examples of such sources include, but are by no means limited to: Boehringer-Mannheim charts, Kyoto Encyclopedia of Genes and Genomes (KEGG), and directed acyclic graphs of the Gene Ontology (GO) classification scheme. The system may also simultaneously make use of a combination of diagrams from a single source or a combination of sources. Further details and capabilities of the above-described systems and methods are found in application Ser. No. 10/641,492, which is hereby incorporated herein, in its entirety, by reference thereto.
Co-pending, commonly assigned application Ser. No. 10/784,523 filed on Feb. 23, 2004 and titled “System, Tools and Method for Constructing Interactive Biological Diagrams” provides a visual grammar, to accompany the local format, and to represent interrelationships amongst biological entities and activities. The visual grammar is based upon a library of stencils that graphically represent common types of biological entities and connections between them. The present invention also provides lightweight software tools for composing and editing the stencils, as well as tools for linking the elements of stencils, and their values, to other data elements, datasets, and the local format. Stencils may be comprised of graphical elements, such as shapes (e.g. rectangles, ovals), lines, arcs, arrows, and text. These elements have biological semantics; that is, elements represent types of biological entities, such as genes, proteins, RNA, metabolites, compounds, drugs, complexes, cell, tissue, organisms, biological relationship, disease, or the like.
The biological semantics facilitate linking of the stencils with other forms of biological data. Further, stencils represent composites of biological activity, and therefore may function like “macros” for easier and more rapid building of biological diagrams. Stencils permit two-way interactions between textual documents and diagrams, or between diagrams and other forms of data such as experimental data, for example. Further stencils support user-controlled graphical exploration of alternatives, such as alternatives to pre-existing diagrams. Stencils may be used collaboratively among multiple users, whether by providing a blank set of stencils as a starter template, sharing of filled-in stencils, collaboratively filling in stencils, or any combination of these. Further details about stencils and systems for building biological diagrams are found in application Ser. No. 10/784,523, which is hereby incorporated herein, in its entirety, by reference thereto.
Based upon at least one dataset produced by a gene expression, genotyping, location analysis, proteomic or metabolomic study (the invention is particularly well-suited to datasets produced by high throughput techniques) and an interesting list of members of the at least one dataset that have been determined to be differentiated from the remainder of the population of the dataset(s), in addition to a biological diagram that models interactions between the members included in the dataset(s), the present invention further processes this information to provide more contextual meaning of the data as it relates to a disease or other subject of the study being conducted. As noted, the biological diagram may be a pre-existing diagram that models interactions between the members (or concepts) in the data, or may be constructed from any one or combination of the following: (i) language parsing of published literature; (ii) merging of existing pathway databases, metabolic reaction databases, protein-protein interaction databases; (iii) manually created network maps; (iv) automatic network generation from experimental data; and (v) modification of a pre-existing diagram using any of the previously mentioned sources.
A sub-network may be extracted from the biological diagram mentioned above, such that all the nodes are members of the “interesting list”, which forms what is referred to as an “interesting sub-network”. Another method may extract a sub-network such that all nodes of the sub-network are part of the microarray (or the given high throughput experimental data set). Such a network is referred to as a “data sub-network”.
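The two extractions described above may be sketched as follows. This is a minimal illustration only: it assumes the biological diagram is available as a list of undirected links, and that extraction keeps every link whose two endpoints both belong to the chosen member set. The function name `induced_subnetwork` and the toy node names are hypothetical and do not come from the described system.

```python
def induced_subnetwork(edges, keep):
    """Return (nodes, edges) of the sub-network restricted to `keep`.

    Assumption: a link survives only if both of its endpoints are kept.
    """
    keep = set(keep)
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    nodes_in_diagram = {n for edge in edges for n in edge}
    sub_nodes = keep & nodes_in_diagram
    return sub_nodes, sub_edges


# Hypothetical toy diagram; node names are placeholders.
diagram_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
array_members = {"A", "B", "C", "D"}   # members present in the dataset
interesting_list = {"A", "B", "D"}     # members flagged as significant

# "Interesting sub-network": every node is on the interesting list.
interesting_nodes, interesting_edges = induced_subnetwork(diagram_edges, interesting_list)

# "Data sub-network": every node is part of the experimental dataset.
data_nodes, data_edges = induced_subnetwork(diagram_edges, array_members)
```

In this toy case node D remains in the interesting sub-network without any links, which illustrates how an extracted sub-network may contain disconnected pieces.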
Based on any of the entire network 100, the interesting sub-network 110I or the data sub-network 110D, as described above, a connectivity analysis may next be performed to rank members of the network according to connectivity scores. Note that there may be multiple disconnected sub-graphs in an extracted interesting sub-network 110I or data sub-network 110D. Further, the original biological diagram/network 100 may have multiple disconnected sub-diagrams/networks. Neither of these situations impacts the processing described herein, however. Whichever network is used as a basis for computing connectivity scores, each node in that network is assigned a significance score for use in computing connectivity scores.
All nodes of the interesting sub-network 110I already have assigned significance scores, as provided in the interesting list. For example, SAM or some other known statistical algorithm may be used to calculate the significance scores. When using SAM, one or two threshold values may be set for calculating the significance scores. For example, a single threshold may be set, and data members whose significance scores have absolute values that exceed the threshold value are assigned to the interesting list. Members having significance scores, the absolute values of which do not exceed the threshold, are simply assigned a significance score of zero in this case. Similarly, two threshold values may be set, a positive threshold value and a negative threshold value, the absolute values of which do not have to be equal. In this case, those members with negative significance scores need to have a significance score less than the negative threshold value to make the interesting list, and those members with positive significance scores need to have a significance score greater than the positive threshold value to make the interesting list. All other members are assigned significance scores of zero.
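The single-threshold and two-threshold rules may be illustrated with a short sketch. It assumes the significance scores (e.g., d-scores from SAM) have already been computed elsewhere; the function name `threshold_scores`, the member names and the numeric values are hypothetical.

```python
def threshold_scores(scores, pos_threshold, neg_threshold=None):
    """Zero out scores that do not pass the threshold(s).

    With one threshold, a member passes when |score| > pos_threshold.
    With two thresholds, positive scores must exceed pos_threshold and
    negative scores must be less than neg_threshold (a negative number).
    """
    if neg_threshold is None:
        neg_threshold = -pos_threshold
    result = {}
    for member, score in scores.items():
        passes = score > pos_threshold or score < neg_threshold
        result[member] = score if passes else 0.0
    return result


# Hypothetical significance scores.
scores = {"g1": 3.2, "g2": -1.0, "g3": -2.5, "g4": 0.4}

single = threshold_scores(scores, 2.0)        # one symmetric threshold
double = threshold_scores(scores, 2.0, -3.0)  # asymmetric thresholds

# Members with a non-zero score after thresholding form the interesting list.
interesting = [m for m, s in single.items() if s != 0.0]
```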
Alternatively, significance scores may be calculated for all members of the data set, rather than assigning significance scores of zero to those members not on the interesting list. When using the full biological diagram 100, those nodes that are not members of the dataset are assigned a significance score of zero regardless of the method used to assign significance scores to members of the dataset.
A connectivity score is computed for each member in the network 100, interesting sub-network 110I or data sub-network 110D based on identifying the links that its representative node has in the network or sub-network and by identifying the members that the node under examination links to. For example, for each member of the network 100, interesting sub-network 110I or data sub-network 110D, all its neighbors up to a pre-defined and user-modifiable distance level may be extracted. Neighbors may be limited to direct interactions with other members in the network 100, interesting sub-network 110I or data sub-network 110D, or may also include indirect interactions, and this is determined by the user-modifiable distance level at the time of the connectivity score computation. Any node directly interacting with the node being currently examined/analyzed is its first neighbor (distance=1). A member that is a first neighbor of a first neighbor of the node being currently analyzed, but is neither that node itself nor one of its first neighbors, is a second neighbor of the node being currently analyzed (distance=2), and so on.
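Neighbor extraction up to a given distance level may be sketched as a breadth-first traversal. This is one possible approach, not necessarily the one used in the described system; it assumes links are undirected and stored as an adjacency mapping, and the name `neighbors_within` is hypothetical.

```python
from collections import deque


def neighbors_within(adjacency, start, max_level):
    """Map each node within `max_level` links of `start` to its distance.

    The start node is included at distance 0.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_level:
            continue  # do not expand beyond the requested level
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical toy network: A-B, A-C, B-D, D-E.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}

first_neighbors = neighbors_within(adjacency, "A", 1)
second_neighbors = neighbors_within(adjacency, "A", 2)
```

Because breadth-first traversal reaches each node by a shortest path first, a node that is both a first neighbor and a neighbor of a first neighbor is recorded only at distance 1.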
For the current node being analyzed, e.g., node A in
There are well-defined and currently available functions that may be applied to accomplish weighting, including, but not limited to: inverse of distance, exponential, etc. For weighting by inverse of distance, the weighting factor for node “i”, referred to as “W(i)”, is given by: W(i)=1/distance(i,A), wherein distance(i,A) is the distance of node i from node A, when node A is the node for which a connectivity score is being computed. In this case, node A is assigned a weighting value of one (i.e., W(A)=1) as the inverse distance is not defined for node A, since the distance of A from A is zero. Exponential weighting values may be calculated by Wexp(i)=exp(−distance(i,A)), values of which, like the previously mentioned calculations, decrease with increasing distance. Thus, the weighting value applied to A itself using this approach is also 1, i.e., Wexp(A)=e^0=1. Regardless of which weighting formula is applied, each resulting connectivity score may be normalized by dividing it by the sum of all the weights of the nodes considered for calculation of that connectivity score. For example, a connectivity score for node A may be defined as:
where the variable “i” represents the nodes in the neighborhood considered for calculation of the connectivity score, and “n” is the total number of nodes considered. As noted earlier, the neighborhood may be defined to include only direct interactions (first neighbors) or indirect interactions (e.g., up to and including second neighbors, where L=2, up to and including third neighbors when L=3, etc.). Note that node A is always considered to be a neighbor of node A, regardless of the value of L.
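A weighted, normalized connectivity score of the kind described may be sketched as follows. The sketch assumes the score is the weighted sum of the absolute significance scores of the nodes in the neighborhood, divided by the sum of the weights; the use of absolute values is an assumption made here for illustration. The function names and numeric values are hypothetical.

```python
import math


def inverse_distance_weight(d):
    """W(i) = 1/distance(i, A), with W(A) = 1 for the node itself."""
    return 1.0 if d == 0 else 1.0 / d


def exponential_weight(d):
    """Wexp(i) = exp(-distance(i, A)), so Wexp(A) = 1."""
    return math.exp(-d)


def connectivity_score(distances, significance, weight=inverse_distance_weight):
    """Weighted average of |significance| over the neighborhood.

    `distances` maps each neighborhood node (including node A at 0)
    to its distance from node A. Nodes without a score count as zero.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for node, d in distances.items():
        w = weight(d)
        weighted_sum += w * abs(significance.get(node, 0.0))
        weight_total += w
    return weighted_sum / weight_total


# Hypothetical neighborhood of node A with L=2.
distances = {"A": 0, "B": 1, "C": 1, "D": 2}
significance = {"A": 4.0, "B": -2.0, "C": 0.0, "D": 6.0}

score_inverse = connectivity_score(distances, significance)
score_exponential = connectivity_score(distances, significance, exponential_weight)
```

With inverse-distance weighting the weights here are 1, 1, 1 and 0.5, so the score is (4 + 2 + 0 + 3) / 3.5.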
After calculating a connectivity score for each member of the network 100, interesting sub-network 110I or data sub-network 110D, the members may then be ranked (e.g., in decreasing order) according to their connectivity scores. Members with high connectivity scores are then identified as “nexus” members or highly interacting nodes representing molecules that may be potential therapeutic targets for a disease process under study.
A further normalization or thresholding function may be applied to normalize the connectivity scores of all the molecules in network 100, interesting sub-network 110I or data sub-network 110D. Some example techniques for normalization or thresholding may include (any combination of and not restricted to) the following: (i) normalize each connectivity score by dividing by the number of nodes or edges/links in the member-specific sub-network; (ii) set a threshold on the number of nodes or edges in the member-specific sub-network, such that all nodes with a corresponding sub-network with the number of nodes or edges less than the threshold are given a connectivity score of zero.
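The two example techniques may be sketched together. This is an illustration under stated assumptions: sub-network size is measured here in nodes (edges could be used instead), and the function name `apply_size_rules` and all values are hypothetical.

```python
def apply_size_rules(raw_scores, subnetwork_sizes, min_size=None, divide_by_size=False):
    """Apply size-based thresholding and/or normalization.

    (ii) members whose sub-network is smaller than `min_size` get zero;
    (i)  remaining scores are optionally divided by sub-network size.
    """
    adjusted = {}
    for member, score in raw_scores.items():
        size = subnetwork_sizes[member]
        if min_size is not None and size < min_size:
            adjusted[member] = 0.0
        elif divide_by_size:
            adjusted[member] = score / size
        else:
            adjusted[member] = score
    return adjusted


# Hypothetical raw scores and member-specific sub-network sizes (in nodes).
raw_scores = {"g1": 12.0, "g2": 9.0}
sizes = {"g1": 6, "g2": 2}

normalized_only = apply_size_rules(raw_scores, sizes, divide_by_size=True)
thresholded = apply_size_rules(raw_scores, sizes, min_size=5, divide_by_size=True)
```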
For example, the connectivity score for “A” in
Connectivity scores may be computed directly from the biological diagram, interesting sub-network or data sub-network, without extracting member-specific sub-networks, if desired. That is, given a node, all the node's neighbors (up to the pre-defined level L) may be located by traversing the links in the network (e.g., biological network, interesting sub-network or data sub-network) and computing the connectivity score from the significance scores of the given node and identified neighboring nodes. Once accomplished, member-specific sub-networks may then be extracted to construct a super-network, as described, or a super-network extraction may be performed to extract all of the identified nodes and neighbors (or a subset thereof as determined by ranked connectivity scores that exceed a threshold) to thereby construct the super-network. Member-specific sub-networks can be determined directly from the biological diagram in the same manner as described with regard to the interesting sub-network or data sub-network. Filtering may first be performed to eliminate connectivity scores based on all nodes that have been determined to be non-significant by the fact that they do not appear on the interesting list. Alternatively, connectivity scores for all nodes may be computed.
After extracting member-specific sub-networks as described, extracted member-specific networks may be combined to form a super-network. For example, the member-specific sub-networks for the highest ranked nodes representative of the highest ranked members (those with connectivity scores greater than a user-defined and modifiable threshold) may be combined together to form a super-network of interest that potentially significantly discriminates the disease process from the normal process, or more generally, that discriminates the experimental condition being studied from the control. In other words, the super-network is constructed by merging the “member-specific sub-network” for every member whose connectivity score is greater than a threshold. If a member-specific sub-network does not have a node in common with the super-network that is being generated, it may be displayed alongside the super-network without any connecting links between it and the super-network constructed thus far. “Nexus” members refer to those members with the highest relative connectivity scores and are included within the super-network. The resulting super-network and “nexus” members define a significant context around the disease process/condition being studied, and can be further analyzed for therapeutic targeting.
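The merging step may be sketched as a union of node and link sets. The sketch assumes each member-specific sub-network is available as a pair of nodes and undirected links; the name `build_super_network`, the member names and the scores are hypothetical.

```python
def build_super_network(member_subnetworks, connectivity, threshold):
    """Merge the sub-networks of members scoring above `threshold`.

    Returns the selected members in decreasing score order, plus the
    union of their nodes and (undirected) links.
    """
    ranked = sorted(connectivity, key=connectivity.get, reverse=True)
    selected = [m for m in ranked if connectivity[m] > threshold]
    super_nodes, super_edges = set(), set()
    for member in selected:
        nodes, edges = member_subnetworks[member]
        super_nodes |= set(nodes)
        # frozenset makes (u, v) and (v, u) the same undirected link
        super_edges |= {frozenset(edge) for edge in edges}
    return selected, super_nodes, super_edges


# Hypothetical member-specific sub-networks and connectivity scores.
member_subnetworks = {
    "A": ({"A", "B"}, [("A", "B")]),
    "B": ({"A", "B", "C"}, [("B", "A"), ("B", "C")]),
    "X": ({"X", "Y"}, [("X", "Y")]),
    "C": ({"B", "C"}, [("C", "B")]),
}
connectivity = {"A": 3.0, "B": 2.5, "X": 2.2, "C": 0.5}

selected, super_nodes, super_edges = build_super_network(
    member_subnetworks, connectivity, threshold=2.0
)
```

In this toy case the sub-network of X shares no node with the others, so it would appear alongside the rest of the super-network without connecting links.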
The “Gene Name” column lists the name of the gene as commonly identified and may also list known or suspected functions of the gene. In column 208, the number of nodes that were identified in the member-specific sub-network for the gene reported in column 202 is reported. The significance value for the member is reported in column 210. In this case, the significance value is in terms of a d-score of the gene being reported on, as determined by SAM analysis. The significance score may be either a positive or a negative value. The higher the absolute value of the significance score, the more significant is the gene considered to be. A cumulative significance score (in this example, cumulative d-score) is calculated by summing the absolute values of the significance scores of all nodes in the member-specific sub-network and is reported in column 212. The average significance score (in this case, the average d-score) is calculated by dividing the cumulative significance score by the total number of nodes in the member-specific sub-network and is reported in column 214. Note that the connectivity score for a gene is set to the value of the average significance score calculated from the member-specific sub-network for that gene.
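The cumulative and average significance scores reported in the chart may be computed as in the following sketch. The function name `chart_row` and the d-scores are hypothetical; nodes absent from the dataset are treated as having a significance score of zero.

```python
def chart_row(member, subnetwork_nodes, d_scores):
    """Compute the per-member summary values described for the chart."""
    # Cumulative score: sum of absolute significance scores in the sub-network.
    cumulative = sum(abs(d_scores.get(node, 0.0)) for node in subnetwork_nodes)
    count = len(subnetwork_nodes)
    return {
        "member": member,
        "node_count": count,
        "d_score": d_scores.get(member, 0.0),
        "cumulative": cumulative,
        "average": cumulative / count,  # used here as the connectivity score
    }


# Hypothetical d-scores; g4 is in the diagram but not in the dataset.
d_scores = {"g1": 2.0, "g2": -3.0, "g3": 1.0}
row = chart_row("g1", ["g1", "g2", "g3", "g4"], d_scores)
```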
Columns 216 and 218 report values for thresholds that may be changed by a user. In column 216, Boolean flags (such as “0” and “1” or, as shown in
Even if connectivity scores such as average connectivity scores are normalized, the user may wish to further filter the connectivity scores by the number of nodes or the number of edges/links that are contained in the member-specific sub-network being considered. Consider, for example, a case where a member-specific sub-network has only two nodes and both nodes score relatively high for significance. Even with normalizing, this member-specific sub-network will receive a high average significance score. However, another member-specific sub-network may have ten nodes with five of the nodes scoring relatively high for significance. This larger member-specific sub-network will receive a substantially lower average significance score when the cumulative score is divided by ten, but may relay more useful information to a user than the member-specific sub-network containing only two nodes, since the larger member-specific sub-network contains five significant nodes/genes, while the smaller member-specific sub-network contains only two significant nodes/genes. To address this issue, the user may set a threshold so that very small member-specific sub-networks are not considered in the analysis. In the example of FIG. 4, the user has chosen to ignore member-specific sub-networks having a total of four nodes or fewer. As noted before, the value of this threshold may be changed by the user. As with column 216, Boolean values are entered into column 218 to indicate whether each member-specific sub-network considered passes the minimum node or link threshold requirement.
Since all genes (i.e., not only genes on the interesting list) were considered in this example, column 220 contains Boolean values to indicate whether the particular gene being considered was determined to be a significant member as determined by its significance score. The threshold level for what is considered to be significant may also be changed, and is compared to the absolute value of the significance score of the member being considered. Thus, column 220 identifies those members that make up an interesting list. Column 222 identifies the names of all nodes (representing members, in this case genes) that are included in the member-specific sub-network being considered.
Again, only a small portion of the total number of genes analyzed is shown in
The members in chart 300 have been sorted according to cumulative significance score (i.e., connectivity score) and may be selected for building a super-network based on this order.
A super-network 400 was generated from the member-specific sub-networks extracted for those members on the interesting list in the experiment described with regard to
A heatstrip 406 is displayed beneath each node to indicate the expression level of each cell (experiment) in the row of the array for the particular gene represented by that particular node. Further details regarding the visualization of heat strips can be found in co-pending, commonly owned application Ser. No. 10/928,494 filed Aug. 27, 2004 and titled “System and Methods for Visualizing and Manipulating Multiple Data Values with Graphical Views of Biological Relationships”, which is hereby incorporated herein, in its entirety, by reference thereto. Heatstrip 406 is also color coded, where yellow bars 406y represent expression of the diabetes class and blue bars 406b represent expression of the control class (no diabetes). It can be observed from super-network 400 that nodes il6, lif, c-src, tgif, igf1 and il1ra were the most highly connected nodes (genes) in the super-network 400, with il6 having the highest connectivity score of all (as already noted, the cumulative significance scores were used as the connectivity scores for the nodes in this experiment), having a score of 52.4669. Thus, il6 was identified as a nexus gene in coronary atherosclerosis and a key target in the pathology of diabetic coronary disease.
CPU 702 is also coupled to an interface 710 that includes one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. Finally, CPU 702 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 712. With such a network connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.
The hardware elements described above may implement the instructions of multiple software modules for performing the operations of this invention. For example, instructions for calculating connectivity scores may be stored on mass storage device 708 or 714 and executed on CPU 702 in conjunction with primary memory 706.
While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.
This application is a continuation-in-part application of Ser. No. 10/641,492, filed Aug. 14, 2003, pending, which is incorporated herein by reference in its entirety and to which application we claim priority under 35 USC §120. This application also claims the benefit of U.S. Provisional Application No. 60/682,048, filed May 17, 2005, which application is incorporated herein, in its entirety, by reference thereto.
Number | Date | Country |
---|---|---|
60682048 | May 2005 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 10641492 | Aug 2003 | US |
Child | 11264259 | Oct 2005 | US |