This disclosure relates generally to bioinformatics and more particularly to predicting biological pathways from biologic data stored in disparate biological data sources.
Biological pathways may be considered as a combination of Metabolic Pathways, Signal Transduction Pathways and perhaps others. Prior to the completion of the human genome project, researchers generally attempted to discover pathways in a wet lab environment. Researching pathways in a wet lab environment typically begins after discovering a new protein. Once a new protein has been discovered, researchers run assays and protein gels to separate various proteins involved in formation of the new protein. The researchers then classify each protein individually and build experiments designed to inhibit production of one or more of the proteins expressed in the gel. The researchers derive the pathway through a series of inhibition experiments and classification experiments of the expressed proteins. A drawback associated with developing pathways in the wet lab environment is that it generally takes years to develop and classify each individual protein expressed in a pathway.
Developing pathways has changed in light of the large amount of data generated from the human genome project and other projects that involve understanding disease mechanisms and additional cellular processes. Instead of using the wet lab environment to exclusively develop pathways, pieces of the pathways (e.g., proteins, protein expressions, protein interactions, protein functional information, protein structures, etc.) are found in publications generated as a result of the above-noted projects. To develop a pathway from the many pieces of biologic data, researchers have to manually search through public databases containing the publications and try to find data in the vast amount of literature that can be linked and correlated. If the researchers are successful, they can generate hypothetical models representing pathways. The researchers then can build experiments that test the hypotheses embodied in the hypothetical models. This approach to developing pathways is time consuming and researchers typically have to continually perform updated searches in order to ensure that all relevant data to a particular pathway is captured.
Researchers have contemplated using automated search tools to overcome some of the problems associated with developing a pathway from a manual search of public databases. A problem associated with using automated search tools in the hypothesis generation of pathways is that currently available computing techniques are unable to efficiently organize biological data (e.g., proteins, protein expressions, protein interactions, protein functional information, protein structures, etc.) stored in the many different public databases with useful annotations that advance pathway development. A reason that it is difficult to efficiently organize the biological data with useful annotations is that the databases each have their own unique schema and approach of representing pathways and biological data. For example, some databases focus primarily on protein-protein interactions, while other databases contain other information such as the direction of interactions and annotations that describe interacting proteins in a textual format. Another problem is that inconsistencies exist in the naming conventions used to represent protein and genomic names in each of the databases. Consequently, querying and associating the large amounts of data across these sources with currently available computing techniques is difficult and becomes more complex as the amount of biologic data generated increases.
Therefore, there is a need for an approach that can automatically generate hypothesis predictions of new pathways from the large amount of biologic data stored in databases having different schemas and approaches to representing the data.
As biological research proceeds beyond the genomic era, the variety and amount of experimental data will continue to grow, requiring new computational tools to be developed to aid in analysis. As this data explosion continues, the opportunity exists for bioinformatics to develop new algorithms and databases aimed at solving the puzzle of reconstructing biological pathways and deciphering their roles in cellular function and, more importantly, disease mechanisms. However, in order to create these algorithms, comprehensive databases must be created that integrate current bioinformatics tools and databases such as BIND, Transpath, MINT, Pronet and SMD into a single comprehensive and well-annotated resource. The system and method presented herein, which integrate pathway and microarray databases, are a first step toward accomplishing this goal.
In a first embodiment of this disclosure, there is a system for building a biological pathway. In this embodiment, there is a data extraction module that automatically extracts biological data from a plurality of biological data sources. A pathway database contains the extracted biological data. A pathway analysis module assimilates the biological data into a hypotheses prediction for generating a pathway. A visualization module generates a visual representation of the pathway generated by the pathway analysis module.
In another embodiment of this disclosure, there is a system for building a biological pathway. In this embodiment, there is a plurality of biological data sources each containing biological data. A data extraction module automatically extracts biological data from the plurality of biological data sources. A pathway database contains the extracted biological data and a pathway analysis module assimilates the biological data into a hypotheses prediction for generating a pathway. A visualization module generates a visual representation of the pathway generated by the pathway analysis module.
In a third embodiment of this disclosure, there is a method and computer readable medium that stores instructions for instructing a computer system to build a biological pathway. This embodiment comprises automatically extracting biological data from a plurality of biological data sources; storing the extracted biological data; assimilating the biological data into a hypotheses prediction for generating a pathway; and generating a visual representation of the pathway using the hypotheses prediction.
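The four steps recited in these embodiments (extracting, storing, assimilating, and visualizing) can be pictured as a simple pipeline. The following Python sketch is illustrative only: the function names, the in-memory list standing in for the pathway database, the sample records, and the plain-text rendering are all assumptions made for the example and are not the disclosed implementation.

```python
from collections import defaultdict

# Hypothetical stand-ins for two data sources; each holds
# (source molecule, interaction, target molecule) records.
SOURCE_A = [("IL-18", "upregulates", "ICAM-1")]
SOURCE_B = [("ICAM-1", "interacts_with", "LFA-1"),
            ("IL-18", "upregulates", "ICAM-1")]


def extract(sources):
    """Step 1: pull records from every configured data source."""
    for source in sources:
        yield from source


def store(records):
    """Step 2: keep the extracted records, dropping exact duplicates."""
    database = []
    for record in records:
        if record not in database:
            database.append(record)
    return database


def assimilate(database):
    """Step 3: link stored records into a candidate pathway (adjacency map)."""
    pathway = defaultdict(list)
    for src, action, dst in database:
        pathway[src].append((action, dst))
    return dict(pathway)


def render(pathway):
    """Step 4: produce a plain-text view of the candidate pathway."""
    lines = []
    for src in sorted(pathway):
        for action, dst in pathway[src]:
            lines.append(f"{src} --{action}--> {dst}")
    return "\n".join(lines)


database = store(extract([SOURCE_A, SOURCE_B]))
pathway = assimilate(database)
print(render(pathway))
```

Keeping each step behind its own function mirrors the module boundaries described above, so that any one stage could be replaced without touching the others.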
Embodiments of the disclosure provide data schema and data models for integrating disparate protein interaction and pathway data with experimental data from microarray chips. This integration of public genomic and proteomic databases containing protein-protein and protein-DNA interactions with microarray data enables a comprehensive platform for new bioinformatics analysis methods to be developed for elucidating biological pathways. This is a unique and comprehensive resource for analysis and elucidation of biological pathways. The herein described systems and methods will give new insights into disease mechanisms, such as those that underlie breast and other types of cancer, toward the development of new diagnostics and therapeutics. Other advantages also exist.
The input/output devices may comprise a keyboard 18 and a mouse 20 that enter data and instructions into the computer system 10. Also, a display 22 may be used to allow a user to see what the computer has accomplished. Other output devices may include a printer, plotter, synthesizer, speakers, and other devices. A communication device 24, such as a telephone or cable modem or a network card such as an Ethernet adapter, local area network (LAN) adapter, integrated services digital network (ISDN) adapter, or Digital Subscriber Line (DSL) adapter, enables the computer system 10 to access other computers and resources on a network such as a LAN, a wide area network (WAN) or the Internet. A mass storage device 26 may be used to allow the computer system 10 to permanently retain large amounts of data. The mass storage device may include all types of disk drives such as floppy disks, hard disks and optical disks, as well as tape drives that can read and write data onto a tape that could include digital audio tapes (DAT), digital linear tapes (DLT), or other magnetically coded media. The above-described computer system 10 can take the form of a hand-held digital computer, personal digital assistant computer, notebook computer, personal computer, workstation, mini-computer, mainframe computer or supercomputer.
A pathway database 32 stores the biological data retrieved by the data extraction module 30. In addition to the protein interactions, annotated protein sequences and textual information retrieved from the Pronet, BIND, Transpath, Swiss Prot, and PubMed databases, the pathway database 32 may store other data from these databases. For example, the BIND database provides other data with the protein interactions such as molecule short names, molecule types, species, experimental conditions and publication links. In addition to protein interaction data, Transpath includes molecule short names, synonyms, molecule full names, molecule classes and publication links. In addition to annotated protein sequences, Swiss Prot includes molecule short names, synonyms, molecule full names, species, homologs, publication references, amino acid sequences, molecular weights, lengths, tissue specificities and locations. Besides publications, PubMed includes other information such as full text abstracts, molecule short names, molecule full names, synonyms and interactions. All of this data, as well as other data, is capable of being extracted and stored in the pathway database 32.
The pathway database 32 is an object-oriented database; however, one of ordinary skill in the art will recognize that the pathway database may be a relational database.
As shown in
Referring again to
A visualization module 36 generates a visual representation of the pathway generated by the pathway data analysis module 34. For example, visualization module 36 may enable a set of integrated visualization and mapping algorithms to draw the associated data into viewable annotated representations of biological pathways. Users of the system may view the data through, e.g., a graphical user interface (GUI) that displays proteins of interest as nodes in a directed network and interactions between the proteins as directed edges, showing pathways as cascades of interacting proteins. In addition, the edges are annotated with the information mined from the various public data sources.
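The directed-network view described here (proteins as nodes, annotated interactions as directed edges, pathways as cascades) maps naturally onto an adjacency-list graph. The Python sketch below is one possible in-memory representation, not the disclosed visualization module; the class name `PathwayGraph`, its methods, and the placeholder molecules `P1` to `P3` are hypothetical.

```python
from collections import defaultdict, deque


class PathwayGraph:
    """Directed graph: nodes are molecules, edges are annotated interactions."""

    def __init__(self):
        # source molecule -> list of (target molecule, annotation)
        self._edges = defaultdict(list)

    def add_interaction(self, source, target, annotation):
        self._edges[source].append((target, annotation))

    def cascade(self, start):
        """Return the edges reachable from `start` in breadth-first order."""
        seen = {start}
        queue = deque([start])
        found = []
        while queue:
            node = queue.popleft()
            for target, annotation in self._edges.get(node, []):
                found.append((node, annotation, target))
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return found


# Placeholder molecules; the annotations stand in for mined edge labels.
graph = PathwayGraph()
graph.add_interaction("P1", "P2", "activates")
graph.add_interaction("P2", "P3", "inhibits")

for source, annotation, target in graph.cascade("P1"):
    print(f"{source} -[{annotation}]-> {target}")
```

A drawing layer would consume the edge list returned by `cascade` and place the nodes; the `seen` set keeps the traversal finite when the network contains feedback loops.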
Referring again to
If desired, the system 44 may have functionality that enables authentication and access control of users accessing the pathway generation system 28 and pathway database 32. Both authentication and access control can be handled at the web server level by the pathway generation system 28 itself, or by commercially available packages such as Netegrity SITEMINDER. Information to enable authentication and access control such as the user's name, location, telephone number, organization, login identification, password, access privileges to certain resources, physical devices in the network, services available to physical devices, etc. can be retained in a database directory. The database directory can take the form of a lightweight directory access protocol (LDAP) database; however, other directory type databases with other types of schema may be used including relational databases, object-oriented databases, flat files, or other data management systems.
In this implementation, the pathway generation system 28 may run on the web server 54 in the form of servlets, which are applets (e.g., Java applets) that run on a server. Alternatively, the pathway generation system 28 may run on the web server 54 in the form of CGI (Common Gateway Interface) programs. The servlets access the pathway database 32 and biological data sources 40 and 42 using JDBC or Java database connectivity, which is a Java application programming interface that enables Java programs to execute SQL (structured query language) statements. Alternatively, the servlets may access the pathway database 32 and biological data sources 40 and 42 using ODBC or open database connectivity. Using hypertext transfer protocol or HTTP, the web browser 48 obtains a variety of applets that execute the pathway generation system 28 on the computing unit 46 allowing the user to perform processing operations discussed below. Also, the web browser may be used to view Web pages containing biological data and access analysis tools, plotting tools, graphics programs, etc.
The system constructs the Pathway database by integrating several public databases containing protein-protein and protein-DNA interactions, genomic data, and proteomic data. These include databases such as BIND, TransPath, MINT, KEGG and commercial resources such as BioCarta and ProNet that have been designed to capture protein-protein and protein-DNA interactions obtained from high throughput experiments and represent this information in the form of biological pathway maps. In addition to these merged curated databases, we have further supplemented these data with interactions that were mined using a natural language processing engine that parses PubMed abstracts for protein-protein and protein-DNA relationships.
The spider 58 is similar to the spider 56, except that it uses a natural language parser 62 because the data sources 42 contain textual information. The natural language parser 62 analyzes the whole structure of the sentences retrieved by the spider 58 from the data sources 42 and extracts relationships from the articles and abstracts. In this disclosure, the natural language parser 62 uses a database of text extraction patterns 64 to assist in extracting relationships from the retrieved articles and abstracts. The natural language parser 62 operates by making multiple passes of the retrieved articles and abstracts and reducing the text to a set of tagged words. The thesaurus of molecules 60 also assists the natural language parser 62 in the tagging of words. An illustrative, but non-exhaustive, list of tags made by the natural language parser 62 includes protein and peptide names (short and long), molecule names (short and long), disease names (short and long), experiment names (short and long), cell names (short and long), action words (interaction keywords) and negators. As an example, the natural language parser 62 may tag the molecule lectin-like oxidized low density lipoprotein as the long name and LOX-1 as the short name.
The natural language parser 62 uses the tags to extract interactions between molecules. In particular, the natural language parser 62 examines the tags that relate to molecules and cell names and looks for other tags that indicate relationships between the molecules and cell names. Tags that indicate relationships between molecules and cell names include action words (interaction keywords) and negators such as “does not inhibit”, “inhibits,” etc. Below is an example of how the natural language parser 62 parses a sentence received from the spider 58. The sentence in this example is: “IL-10 inhibits the synthesis of a number of cytokines, including IFN-GAMMA, IL-2, IL-3, TNF and GM-CSF.”
For this sentence, the natural language parser 62 tags IL-10, IFN-GAMMA, IL-2, IL-3, TNF and GM-CSF as short name molecules. The natural language parser 62 also tags “inhibit” as an interaction keyword. The natural language parser 62 then extracts the following interactions:
The natural language parser 62 then places the extracted interactions in the pathway database 32.
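The tag-then-extract behavior described for the IL-10 sentence can be approximated with a short Python sketch. It is a simplified illustration and not the natural language parser 62 itself: the molecule set, the keyword and negator tables, the regular expression, and the four-field tuple format (agent, action, target, negated) are assumptions chosen for the example.

```python
import re

# Assumed dictionaries; a real system would draw these from the
# thesaurus of molecules and the text extraction patterns.
MOLECULES = {"IL-10", "IFN-GAMMA", "IL-2", "IL-3", "TNF", "GM-CSF"}
KEYWORDS = {"inhibits": "inhibit", "inhibit": "inhibit"}
NEGATORS = {"not"}


def tag(sentence):
    """Reduce a sentence to a list of (tag, value) pairs; drop other words."""
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", sentence)
    tagged = []
    for token in tokens:
        if token.upper() in MOLECULES:
            tagged.append(("MOLECULE", token.upper()))
        elif token.lower() in KEYWORDS:
            tagged.append(("ACTION", KEYWORDS[token.lower()]))
        elif token.lower() in NEGATORS:
            tagged.append(("NEGATOR", token.lower()))
    return tagged


def extract_interactions(tagged):
    """Pair the molecule before the action word with each molecule after it."""
    interactions = []
    agent, action, negated = None, None, False
    for kind, value in tagged:
        if kind == "NEGATOR":
            negated = True
        elif kind == "ACTION":
            action = value
        elif kind == "MOLECULE":
            if action is None:
                agent = value  # molecule seen before the action word
            elif agent is not None:
                interactions.append((agent, action, value, negated))
    return interactions


sentence = ("IL-10 inhibits the synthesis of a number of cytokines, "
            "including IFN-GAMMA, IL-2, IL-3, TNF and GM-CSF.")
for interaction in extract_interactions(tag(sentence)):
    print(interaction)
```

The `negated` flag shows why negators are tagged separately from action words: "does not inhibit" and "inhibits" share a keyword but must be stored as opposite findings.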
Below is an example of how the natural language parser 62 would process an abstract stored in the data source 40. The abstract in this example is: IL-18 (0-100 ng/ml) specifically upregulated ICAM-1 expression on monocytes in human PBMC as demonstrated in our previous study. In the present study, we examined whether the synergistic upregulation of ICAM-1 occurred after the stimulation with the combination of IL-18 and IL-12 and whether the synergistic production of IFN-gamma was dependent on the interaction between ICAM-1 on monocytes and LFA-1 on NK/T cells. The effect of IL-12 on ICAM-1 expression on monocytes was marginal even at the highest concentration (100 ng/ml). However, in the presence of IL-12 (100 ng/ml), the expression of ICAM-1 induced by IL-18 was significantly enhanced as compared with that obtained by IL-18 alone. In addition to the expression of ICAM-1 on monocytes, IFN-gamma production was synergistically stimulated by IL-18 and IL-12. Anti-ICAM-1 and anti-LFA-1 Abs exhibited significant inhibitory effect on enhanced production of IFN-gamma by the combination of two cytokines, in particular, anti-ICAM-1 showing the complete inhibition. These results as a whole indicated that synergistic effect of IL-18 and IL-12 on IFN-gamma production in human PBMC is ascribed to the synergism of the effect of two cytokines on ICAM-1 expression on monocytes and that the subsequent ICAM-1/LFA-1 interaction plays an important role in the enhanced production of IFN-gamma.
The natural language parser 62 tags the above abstract as follows:
IL-18 (0-100 ng/ml) specifically upregulated ICAM-1 expression on monocytes in human PBMC as demonstrated in our previous study. In the present study, we examined whether the synergistic upregulation of ICAM-1 occurred after the stimulation with the combination of IL-18 and IL-12 and whether the synergistic production of IFN-gamma was dependent on the interaction between ICAM-1 on monocytes and LFA-1 on NK/T cells. The effect of IL-12 on ICAM-1 expression on monocytes was marginal even at the highest concentration (100 ng/ml). However, in the presence of IL-12 (100 ng/ml), the expression of ICAM-1 induced by IL-18 was significantly enhanced as compared with that obtained by IL-18 alone. In addition to the expression of ICAM-1 on monocytes, IFN-gamma production was synergistically stimulated by IL-18 and IL-12. Anti-ICAM-1 and anti-LFA-1 Abs exhibited significant inhibitory effect on enhanced production of IFN-gamma by the combination of two cytokines, in particular, anti-ICAM-1 showing the complete inhibition. These results as a whole indicated that synergistic effect of IL-18 and IL-12 on IFN-gamma production in human PBMC is ascribed to the synergism of the effect of two cytokines on ICAM-1 expression on monocytes and that the subsequent ICAM-1/LFA-1 interaction plays an important role in the enhanced production of IFN-gamma.
The natural language parser 62 then extracts the following information:
Molecules in Abstract
In some embodiments, a Pathway Database may be built by integrating several public databases containing protein-protein, and protein-DNA interactions, genomic data, and proteomic data. In addition to these merged curated databases, these data may be supplemented with interactions mined using a natural language processing engine that parses other data sources, e.g., PubMed abstracts or the like, for protein-protein, and protein-DNA relationships.
The Microarray Database may be constructed by populating the schema with data generated internally from wet lab experiments in addition to data that is publicly available. In the following paragraphs, data-models for creating these databases as well as design and schema for the integrated PMD (Pathway Microarray Database) are disclosed.
Embodiments of a Pathway Database may be designed to store information about individual genes, proteins and small molecules and their functional relationships in an effort to explore and research biological pathways. A general model for this database is depicted in
One advantage gained by the addition of these two relationships to the storage of protein interactions and biological pathways is that it allows for this platform to easily integrate disparate databases. In many cases, two or more of the public databases co-reference each other, or both independently reference a third or fourth database such as LocusLink or GenBank. By finding and storing the accession identifiers in the collaborating source table, the above described data model allows records from different databases describing the same compound to be reconciled and merged. When the accession numbers for other databases are not available for this type of integration, resolution of the same compound record in two or more data-sources can be achieved through the use of the compound name dictionary.
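Reconciling records through shared accession identifiers amounts to grouping records that are connected by at least one common external reference. The Python sketch below uses a union-find structure for that grouping; it is an assumed illustration of the idea, and the record layout, the record identifiers, and the accession values (`L100`, `G200`, `G999`) are hypothetical.

```python
def merge_by_accession(records):
    """Group records that share at least one (database, accession) pair."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative, compressing paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (database, accession) -> index of first record citing it
    for index, record in enumerate(records):
        for key in record["accessions"]:
            if key in first_seen:
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index

    groups = {}
    for index, record in enumerate(records):
        groups.setdefault(find(index), []).append(record["id"])
    return sorted(groups.values())


# Hypothetical records from three source databases.
records = [
    {"id": "dbA:1", "accessions": {("LocusLink", "L100")}},
    {"id": "dbB:7", "accessions": {("LocusLink", "L100"), ("GenBank", "G200")}},
    {"id": "dbC:3", "accessions": {("GenBank", "G200")}},
    {"id": "dbA:2", "accessions": {("GenBank", "G999")}},
]
print(merge_by_accession(records))
```

Note that `dbA:1` and `dbC:3` share no identifier directly; they are merged because both co-reference `dbB:7`, which is the transitive case the paragraph above describes.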
The above disclosed technique compares the common abbreviations for genes, proteins, and small molecule names against the names present in the individual records of each database being integrated allowing for semi-accurate integration to occur. One possible limitation of this approach is the infrequent generation of false positives when two unrelated records are merged due to ambiguity in resolving the cited names in the records being integrated. Despite this possible limitation, this technique in combination with the co-referencing technique allows for the integration of disparate genomic, proteomic, and interaction databases into a comprehensive database to be accomplished.
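The name-dictionary fallback, together with the false-positive risk it carries, can be sketched as a lookup from normalized names to entities. This Python example is an assumption about how such matching might be organized, not the disclosed algorithm; the entity identifiers and the `XYZ1` synonym are hypothetical, while the LOX-1 long and short names come from the tagging example earlier in this description.

```python
def normalize(name):
    """Lower-case a name and strip everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_name_index(dictionary):
    """Map each normalized name or synonym to the set of entities using it."""
    index = {}
    for entity, names in dictionary.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(entity)
    return index


def resolve(name, index):
    """Return (entity, status); refuse to merge when a name is ambiguous."""
    matches = index.get(normalize(name), set())
    if len(matches) == 1:
        return next(iter(matches)), "matched"
    if len(matches) > 1:
        return None, "ambiguous"
    return None, "unknown"


# Hypothetical dictionary: two unrelated entities share the synonym "XYZ1".
name_dictionary = {
    "ENTITY_1": ["LOX-1", "lectin-like oxidized low density lipoprotein"],
    "ENTITY_2": ["XYZ1"],
    "ENTITY_3": ["XYZ-1"],
}
name_index = build_name_index(name_dictionary)

print(resolve("LOX 1", name_index))
print(resolve("xyz1", name_index))
print(resolve("unlisted", name_index))
```

Reporting a name shared by several entities as ambiguous, rather than merging on the first hit, is one way to trade some recall for fewer false-positive merges. Aggressive normalization has the opposite effect, since it can make distinct names collide.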
In some embodiments, a Microarray Database is designed to store and organize experimental data obtained from gene expression chips or other lab experiments. The object relationships used to store this data are described in
This hierarchy for storing experimental procedures and results allows for the capture of most microarray experiments with annotation. The simplicity and organization of this model also allows for the easy integration of data from other microarray databases such as the Stanford Microarray Database (SMD), RNA Abundance Database (RAD), GeneX, and the Yale Microarray Database (YMD). In addition, this model fits within the developing standards of MIAME and MAGE-ML. Overall, the flexibility represented in this model and its compliance to the emerging standards enable future expansion and the easy addition of new information sources and public microarray databases as they become available.
Embodiments of the Pathway and Microarray Database (PMD) are designed to merge the Pathway Database and Microarray Database using data references common to both databases. The ability to combine these data sources leverages the mechanism described for building the Pathway Database. As both the Microarray Database and the Pathway Database leverage the use of external databases such as LocusLink and Genbank to identify records, the use of the collaborating source table with captured data containing external database accession numbers serves as the integration point for these two disparate sources.
In the rare case where an accession identifier in the Microarray Database is not found in the lists of accession identifiers in the Pathway Database, the same algorithm used to resolve records using name matching in the Pathway Database can be applied as it is often the case that microarray data will contain gene names in addition to the accession identifiers in the annotation of each spot on the chip.
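The two-stage linkage between the databases (accession identifiers first, gene names as a fallback) can be sketched as follows. The Python below is a minimal illustration under stated assumptions: the dictionary-based record layout, the function names, and the accession strings are hypothetical and do not reflect the actual PMD schema.

```python
def build_indexes(compounds):
    """Index Pathway Database compounds by accession and by lower-cased name."""
    accession_index, name_index = {}, {}
    for compound in compounds:
        for accession in compound["accessions"]:
            accession_index[accession] = compound["id"]
        for name in compound["names"]:
            name_index[name.lower()] = compound["id"]
    return accession_index, name_index


def link_spot(spot, accession_index, name_index):
    """Link one microarray spot to a compound; try accessions before names."""
    for accession in spot.get("accessions", []):
        if accession in accession_index:
            return accession_index[accession], "accession"
    name = spot.get("gene_name")
    if name and name.lower() in name_index:
        return name_index[name.lower()], "name"
    return None, "unlinked"


# Hypothetical compounds and chip annotations.
compounds = [
    {"id": 1, "accessions": {"LocusLink:0001"}, "names": ["IL-18"]},
    {"id": 2, "accessions": {"GenBank:X0002"}, "names": ["ICAM-1"]},
]
accession_index, name_index = build_indexes(compounds)

spots = [
    {"spot": "A1", "accessions": ["LocusLink:0001"], "gene_name": "IL-18"},
    {"spot": "A2", "accessions": ["GenBank:UNKNOWN"], "gene_name": "icam-1"},
    {"spot": "A3", "accessions": [], "gene_name": None},
]
for spot in spots:
    print(spot["spot"], link_spot(spot, accession_index, name_index))
```

Returning the linkage method alongside the compound identifier lets downstream analysis treat name-based links, which are less reliable, with more caution.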
Embodiments of the system and method may be implemented as a web based database and visualization tool developed to facilitate the integration, organization, and display of information pertaining to protein, gene, and small molecule interactions and their roles in biological pathways.
As shown, data from any number of public databases may be integrated into a common data schema. These databases include BIND, caBIO, GENBANK, KEGG, LocusLink, MINT, ProNet, SWISSPROT and TransPath. Research from PubMed has also been added using an automated natural language engine developed to identify biological interactions from unstructured text sources. Researchers can access the tool via the Internet and search its contents through the use of intuitive search pages and data filters.
For example, when using the tool researchers are presented with several search options allowing them to navigate the database and build comprehensive annotated maps of pathways, protein-protein, and protein-DNA interactions. Researchers using the search page can query the database by entering the name of a protein, gene or small molecule as shown in
The operations described above allow the user to easily navigate the tool's accumulated database of molecular interactions. In addition, the tool supports new user-guided searches from the public databases and PubMed. Results of these requests are returned via e-mail to the requester, and can be viewed from within the website using the described search and display mechanisms. The user can also set up periodic searches to constantly mine for new information pertaining to a given molecule or interaction.
Additionally, since the biological data is mapped into an internal schema as previously described, users are able to search on any pathway or protein-protein interactions using the query capabilities supported by the underlying database. All searches generated by the interface are passed directly to the database for processing. This allows the visualization tool to directly exploit all searching capabilities provided by the database without imposing any additional constraints on the types of searches that can be performed.
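Passing a search straight through to the underlying database can be illustrated with a parameterized SQL query. The sketch below assumes a relational store and uses Python's standard `sqlite3` module purely for demonstration; the `interaction` table, its columns, and the sample rows are hypothetical and are not the internal schema referred to above.

```python
import sqlite3

# In-memory database standing in for the pathway database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interaction (source TEXT, action TEXT, target TEXT)"
)
conn.executemany(
    "INSERT INTO interaction VALUES (?, ?, ?)",
    [
        ("IL-10", "inhibit", "IFN-GAMMA"),
        ("IL-10", "inhibit", "TNF"),
        ("IL-18", "upregulate", "ICAM-1"),
    ],
)


def find_interactions(connection, molecule):
    """Return every stored interaction in which `molecule` takes part."""
    cursor = connection.execute(
        "SELECT source, action, target FROM interaction "
        "WHERE source = ? OR target = ? "
        "ORDER BY source, target",
        (molecule, molecule),
    )
    return cursor.fetchall()


print(find_interactions(conn, "IL-10"))
print(find_interactions(conn, "TNF"))
```

Because the interface only forwards the query, any filter the database engine supports (joins, pattern matching, sorting) is available without changes to the front end. Parameter binding with `?` keeps user-entered molecule names from being interpreted as SQL.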
The following is a discussion of an extraction and natural language processing (NLP) scheme implemented in some of the disclosed embodiments. One method for extracting protein, gene and small molecule (PGSM) interactions from unstructured texts can be divided into three separate parts: (1) a pathway database (PDB) consisting of dictionaries that are used by (2) a lexical analyzer to tokenize and tag relevant terms from scientific abstracts retrieved from PubMed (or other sources), whose output stream of tokens is then passed to (3) a parser constructed around a context free grammar (CFG) that is used to interpret the collection of tokens and output interactions based on the rules of the grammar.
The PDB may consist of two distinct dictionaries: (1) a name dictionary for recognizing PGSM names and their synonyms, and (2) a category/keyword dictionary for identifying terms that describe interactions. The name dictionary may be constructed by combining a limited set of PGSM names (e.g., from Swiss-Prot, GenBank, KEGG, or some other source). The resulting name dictionary may consist of an appropriate number (e.g., 67,326) of unique names and synonyms describing a total number (e.g., 37,546) of distinct entities. The category/keyword dictionary may be adapted from other sources (e.g., the NIH relevant term list for oncogene expression) with additional categories and keywords found to be prevalent in the corpus.
The lexical analyzer may be designed to accept unstructured text in addition to structured (e.g., PubMed) sources. The lexical analyzer then parses the input and generates a stream of tagged tokens based on a predetermined set of descriptions.
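A lexical analyzer driven by the two dictionaries can be sketched as a generator that emits tagged tokens, preferring the longest dictionary name at each position so that multi-word long names are recognized as a single token. The Python below is an assumed illustration rather than the disclosed analyzer; the dictionary contents, the tag names (`NAME`, `KEYWORD`, `WORD`), and the longest-match strategy are choices made for the example.

```python
import re

# Assumed name dictionary: tuple of lower-cased words -> canonical short name.
NAME_DICTIONARY = {
    ("lox-1",): "LOX-1",
    ("lectin-like", "oxidized", "low", "density", "lipoprotein"): "LOX-1",
    ("il-18",): "IL-18",
    ("icam-1",): "ICAM-1",
}
# Assumed category/keyword dictionary: surface form -> interaction category.
KEYWORD_DICTIONARY = {"upregulated": "UPREGULATE", "inhibits": "INHIBIT"}
MAX_NAME_WORDS = max(len(key) for key in NAME_DICTIONARY)


def tokenize(text):
    """Yield (tag, value) tokens, matching the longest known name first."""
    words = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    position = 0
    while position < len(words):
        longest = min(MAX_NAME_WORDS, len(words) - position)
        for span in range(longest, 0, -1):
            candidate = tuple(w.lower() for w in words[position:position + span])
            if candidate in NAME_DICTIONARY:
                yield ("NAME", NAME_DICTIONARY[candidate])
                position += span
                break
        else:
            # No dictionary name starts here; classify the single word.
            word = words[position]
            if word.lower() in KEYWORD_DICTIONARY:
                yield ("KEYWORD", KEYWORD_DICTIONARY[word.lower()])
            else:
                yield ("WORD", word)
            position += 1


for token in tokenize("IL-18 specifically upregulated ICAM-1 expression"):
    print(token)
```

Mapping both the long and the short form to one canonical value at tagging time means the parser downstream never has to reason about synonyms.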
The lexical analyzer tags the input text by iterating through the document as shown in
The resulting output stream of tokens is available for the parsing phase of the overall process. This phase is responsible for analyzing the token stream using the set of CFG productions for the purposes of extracting interaction information. As illustrated in
The parser was developed using a concise set of grammar production rules allowing for the detection of PGSM interactions. The production rules were derived by manually analyzing a large corpus of 500 non-topic specific scientific abstracts pulled from PubMed containing various representations of interaction data in unstructured text. The abstracts were also read by humans to determine relevant sentences describing interactions that were then used to derive the production rules. The resulting production rules were combined and represented in a CFG. Other methods of developing a CFG are also possible. Examples of CFG and interaction keywords may be found in an article by some of the named inventors, which can be found at Mark R. Gilder, et al., Extraction Of Protein Interaction Information From Unstructured Text Using A Context-Free Grammar, Bioinformatics, vol. 19, no. 16 pp. 2046-2053, and which is hereby incorporated by reference.
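To make the role of the grammar concrete, the following Python sketch implements a recursive-descent parser for a deliberately tiny, invented grammar. It is not the grammar from the cited article, whose production rules are not reproduced here; the productions in the docstring, the tag names, and the output tuple format are all assumptions.

```python
class InteractionParser:
    """Recursive-descent parser for a toy grammar (hypothetical):

        sentence -> NAME [NEGATOR] KEYWORD filler targets
        filler   -> WORD filler | empty
        targets  -> NAME ( [CONJ] NAME )*
    """

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        """Return the tag of the next token, or None at end of input."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self, kind):
        """Consume a token of the given tag or fail with SyntaxError."""
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind} at token {self.pos}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def sentence(self):
        agent = self.take("NAME")
        negated = False
        if self.peek() == "NEGATOR":
            self.take("NEGATOR")
            negated = True
        action = self.take("KEYWORD")
        while self.peek() == "WORD":  # filler
            self.take("WORD")
        targets = [self.take("NAME")]
        while self.peek() in ("CONJ", "NAME"):
            if self.peek() == "CONJ":
                self.take("CONJ")
            targets.append(self.take("NAME"))
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token at {self.pos}")
        return [(agent, action, target, negated) for target in targets]


# Token stream such as a lexical analyzer might emit for
# "IL-10 inhibits the synthesis of TNF and GM-CSF".
tokens = [
    ("NAME", "IL-10"), ("KEYWORD", "INHIBIT"),
    ("WORD", "the"), ("WORD", "synthesis"), ("WORD", "of"),
    ("NAME", "TNF"), ("CONJ", "and"), ("NAME", "GM-CSF"),
]
print(InteractionParser(tokens).sentence())
```

Each method corresponds to one production, so adding a sentence pattern found in the corpus means adding or extending a method. Token streams that fit no production raise `SyntaxError` and yield no interaction.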
The foregoing figures show embodiments of the functionality and operation of the system. In this regard, some of the blocks represent a module, component, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figure or, for example, may in fact be executed substantially concurrently or in the reverse order, depending upon the functionality involved. Furthermore, the functions can be implemented in programming languages such as Java; however, other languages can also be used.
The above-described systems comprise an ordered listing of executable instructions for implementing logical functions. The ordered listing can be embodied in any computer-readable medium for use by or in connection with a computer-based system that can retrieve the instructions and execute them. In the context of this application, the computer-readable medium can be any means that can contain, store, communicate, propagate, transmit or transport the instructions. The computer readable medium can be an electronic, a magnetic, an optical, an electromagnetic, or an infrared system, apparatus, or device. An illustrative, but non-exhaustive list of computer-readable mediums can include an electrical connection (electronic) having one or more wires, a portable computer diskette (magnetic), a random access memory (RAM) (magnetic), a read-only memory (ROM) (magnetic), an erasable programmable read-only memory (EPROM or Flash memory) (magnetic), an optical fiber (optical), and a portable compact disc read-only memory (CDROM) (optical).
The computer readable medium may comprise paper or another suitable medium upon which the instructions are printed. For instance, the instructions can be electronically captured via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is apparent that there has been provided a system, method and computer product for predicting biological pathways. While the invention has been particularly shown and described in conjunction with a preferred embodiment thereof, it will be appreciated that variations and modifications can be effected by a person of ordinary skill in the art without departing from the scope of the invention.
This application is a continuation in part of U.S. application Ser. No. 10/307,556, filed Dec. 2, 2002 and is related to U.S. application serial No. ______, filed ______, titled “SYSTEM, METHOD AND COMPUTER PRODUCT FOR PREDICTING PROTEIN-PROTEIN INTERACTIONS,” client docket no. RD-130,448 which is hereby incorporated by reference.
 | Number | Date | Country |
---|---|---|---|
Parent | 10307556 | Dec 2002 | US |
Child | 10840426 | May 2004 | US |