The present invention relates to the use of computers to assist in the identification of drugs, including the determination of further indications for existing drugs, and for other pharmaceutical investigations.
The development of new drugs has tended to follow the conventional pattern of scientific and medical research. Thus initially a disorder, such as an illness, symptom, syndrome or disease, is discovered and investigated, thereby permitting characterisation of the disorder in terms of the symptoms that it exhibits. Next an attempt is made to understand the metabolic and biochemical pathways underlying the disease. Typically such pathways involve one or more proteins, which in turn are coded by corresponding genes in the human genome (or in the genome of an infectious organism, if relevant).
Once the protein(s) involved in a disorder have been identified, attempts are made to find compounds (i.e. drug candidates) that bind to a relevant protein. The intention is to discover a drug that modifies the action of the protein in such a manner as to treat, rectify or at least alleviate the disorder, such as by masking undesired symptoms, or by managing a disorder. (Most drugs act by modifying the properties of a protein directly, although drugs can also work in other ways, such as by binding to DNA, RNA, fatty acids, or carbohydrates, or by catalysing modifications of these chemicals).
For example, a particular illness may be attributed to a change in the concentration in the body of a certain substance outside the normal limits. One possible counter to this problem might be to find a drug that is active against a protein responsible for making the substance, so as to modify the endogenous manufacturing process, and thereby alter the level of the substance in the human body. Alternatively, there may be a disposal or buffering process in the body, responsible for degrading or removing the substance from the human body. If a drug can find a protein target to suppress this disposal or buffering process, then this may also have the desired effect of altering the level of the substance in the body. Another strategy could be to design a compound to mimic the effect of the natural substance, or alternatively to administer the natural substrate directly to the patient from an exogenous source.
In the above drug development procedure, the initial discovery of a disease or illness is generally performed by health researchers and clinicians. Pharmaceutical companies are primarily involved in the two subsequent steps, namely identifying potential drug targets based on the biochemistry of a disorder, and then producing suitable drug candidates that are active against such targets. This work is often very challenging, involving many highly trained scientists, and with no certainty of a positive outcome being obtained.
Furthermore, even after a candidate drug has been identified from such research, it still has to survive several further phases of clinical evaluation and development before it can be marketed as a treatment for the relevant disorder. In particular, a series of trials must be performed to demonstrate the safety and efficacy of the drug. These trials are typically arranged in three phases, with phase one addressing toxicology and other safety issues, phase two addressing efficacy in relatively small-scale clinical trials, and then phase three looking at larger-scale clinical trials. The data obtained from this testing is submitted to a body such as the Food and Drug Administration (FDA) in the United States, the Medicines Control Agency (MCA) in the United Kingdom, the European Medicines Evaluation Agency (EMEA) in the European Union, or the Pharmaceutical and Medical Devices Evaluation Center (PMDEC) in Japan, in order to obtain marketing approval of the drug. The widespread clinical testing necessary for obtaining approval from a regulatory body means that marketing approval may not be obtained until many years after the initial identification of a candidate compound.
The entire drug discovery and development process is therefore very expensive. It has been estimated that the expenditure on research and development followed by the clinical testing for taking a new drug through to market might typically be in the region of $800 million. Of course, there are significant costs associated with work on drug candidates that never survive to marketing, whether because of safety or efficacy concerns or due to other considerations. The magnitude of drug development costs impacts the number and nature of drug research projects that the pharmaceutical industry can support.
There have been various attempts to improve the efficiency of the drug discovery and development procedure by applying large-scale computing technology. One approach has been to try to exploit the bioinformatics tools and infrastructure used to sequence and analyse the human genome. In particular, the Human Genome Project has identified and sequenced approximately 25,000 genes in the human genome, along with their corresponding proteins. This has significantly improved the process of target identification for drug discovery purposes. For example, computationally intensive sequence similarity algorithms (such as BLAST) can be used to search the entire human genome to identify relationships in sequences of amino acids between an unknown protein and various known proteins. Such similar or shared sequences of amino acids may indicate possible homologies, and therefore give clues as to the behaviour, structure or functionality of the unknown protein. In addition, it may be possible to estimate the likelihood of finding an effective drug against an unknown protein, again based on homology with other proteins having a common or similar amino acid sequence.
Another area in which the use of computing power is being introduced to help the drug development process is the provision of in silico cellular models. Although these are still largely in their infancy, it is hoped that such models can be used to simulate the behaviour of cells. These simulations can then lead to a better understanding of a disorder, such as by mimicking the effect of an excess or deficit of a particular protein. In addition, such models may be useful for exploring ideas about how to remedy such disorder, for example by investigating where to intervene in a particular pathway in order to correct the disorder.
WO 02/21420 describes creating and using knowledge patterns, such as a self-organising knowledge map, for recognising previously unseen or unknown patterns from large amounts of pharmaceutical data obtained by virtual screening. However, such an approach can be difficult from a user perspective due to the inherent complexity of the pattern-matching algorithms employed.
US 2002/0187514 describes the use of a two-dimensional table that maps compounds against targets. The table also stores experimental results from screening the compounds against the targets. The table can be used to help predict the potential use of a new compound as a drug, by looking in the database for targets that are known to interact with compounds associated with the new compound.
Computing in genomics and for biochemical modelling can therefore provide a way to accelerate the traditional drug development process. In particular, computers typically enable targets for new drugs to be identified more rapidly.
However, it is not generally appreciated that the large majority (about 90%) of all drugs approved each year can be classed as improvements upon existing drugs. In contrast, completely new drugs, which generally represent the primary focus of conventional drug research, form only a small proportion of marketing approvals. Thus each year the FDA typically approves about 40 drugs and biologics (therapeutics derived from living sources), and the majority of these cover modifications or enhancements of previous approvals.
For example, an existing drug may be approved for use in a different treatment regime, or in combination with certain other drugs, or for treating disorders that are closely related to the disorder for which the drug was originally approved. (Here, a closely related disorder may be regarded as generally sharing the same patho-physiological mechanisms and also covered within the same therapeutic area, e.g. depression and anxiety).
A rather different category of marketing approvals is where a previously approved drug is found to be valuable in a new and different context, such as in a different therapeutic area from the originally approved indication. Research has indicated that such secondary indications of drugs can be highly significant. For example Gellings et al examined the top twenty best selling US blockbuster drugs, and found that 40 percent of the revenues came from sales for secondary indications. (Gellings et al, (1998), New England Journal of Medicine, Volume 339, Number 10, pages 693-698). Moreover, 90 percent of the top twenty blockbusters were reported to have sales for such secondary indications. Similarly, Pritchard et al analysed the top 50 best selling drugs in the UK in 1999, and found that overall only 62 percent of revenues were for the original indication. (Pritchard et al, (2001), “Capturing the Unexpected Benefits of Medical Research”, Office of Health Economics, London). A further 25 percent of sales were for new and unlicensed indications, rather than for the originally launched indication. (The remaining 13 percent of prescriptions were classified as unknown, but many of these may have been for secondary indications as well). About half of the drugs examined in this survey had sales for additional indications.
One particularly well-known example where a secondary indication has proved of great commercial significance is for the drug sildenafil, developed by Pfizer Inc (and marketed under the trademark of Viagra). While this drug was being tested for the treatment of cardiac problems, it was observed that the drug was in fact active against male erectile dysfunction, which has since become the primary market for the drug.
In fact sildenafil was relatively unusual among such discoveries of new drug indications, in that it occurred around the time of the first testing in healthy human volunteers. In contrast, unexpected benefits for known medicines are usually observed after the drug is already on the market, since at this stage a large and heterogeneous patient population with a range of underlying diseases is exposed to the new agent.
Another example of the discovery of additional indications is the drug finasteride, developed by Merck & Co Inc (and marketed under the trademarks of Proscar and Propecia). This drug was originally approved for the treatment of benign prostatic hyperplasia in 1992. However, it was subsequently observed that the drug was also useful for the treatment of alopecia. Finasteride was approved for this secondary indication in 1998, and this has since become the primary market for the drug. Further studies, published in 2003, have revealed that finasteride may also be effective against prostate cancer.
Unfortunately, many potential discoveries of additional indications for existing drugs are lost or delayed, due to the huge amount of clinical data that is available once a drug goes onto the market. Much of this data may appear in the medical research literature, but never be returned from the various hospitals and doctors in the field to the pharmaceutical company responsible for the drug. Furthermore, even if a particular side effect is observed and reported to the relevant pharmaceutical company, the team working on a particular drug is normally specialised in the therapeutic area for which the drug was originally targeted. This team is likely to regard the side effect as a problem in the use of the drug for its primary indication, and is unlikely to appreciate that the same side effect may in fact have potential benefit in a completely different therapeutic area.
Consequently, although the discovery of new indications for existing drugs has been of considerable commercial significance, the pharmaceutical industry has generally concentrated instead on the traditional development of new drugs using a conventional scientific approach. The discovery of new indications for existing drugs has largely been left to serendipity.
Moreover, even in circumstances where the potential value of searching for secondary indications of existing drugs has been appreciated (see for example the article on therapeutic switching at www.arachnova.com), the implementation of such searches remains difficult. For example, the sheer volume of information available from clinical and biomedical literature databases, combined with the heterogeneous origins and terminology of such literature, represents a formidable obstacle to the use of such databases for the identification of possible secondary indications.
Accordingly, one embodiment of the invention provides a method of computer-assisted pharmaceutical investigation using a pharmaceutical knowledge base containing multiple information items. Typically the pharmaceutical knowledge base is stored in one or more computers. The method involves providing at least three axes representing pharmaceutical knowledge in a multi-dimensional coordinate space, in which a first axis pertains to disease, a second axis pertains to targets, and a third axis pertains to drug compounds. Pharmaceutical knowledge is mapped into the multi-dimensional coordinate space by assigning an information item to one or more locations in the coordinate space, dependent upon the data contained within the information item.
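By way of illustration only, the mapping of information items into the three-axis coordinate space might be sketched as follows. This is a minimal sketch under stated assumptions: the function names `locate` and `add_item`, the use of `None` for an axis on which an item mentions no entity, and the placeholder entity and item identifiers are all hypothetical and are not taken from the embodiments described herein.

```python
from collections import defaultdict
from itertools import product

AXIS_ORDER = ("disease", "target", "compound")

def locate(item_entities):
    """Return every (disease, target, compound) coordinate an item occupies.

    item_entities maps an axis name to the entities the item mentions.
    An axis with no mentioned entity contributes None (an assumption).
    """
    per_axis = [item_entities.get(axis) or [None] for axis in AXIS_ORDER]
    return list(product(*per_axis))

# The coordinate space: each location holds the items assigned to it.
space = defaultdict(list)

def add_item(item_id, item_entities):
    """Assign an information item to one or more locations in the space."""
    for coord in locate(item_entities):
        space[coord].append(item_id)

# Placeholder records, invented purely for illustration.
add_item("paper-1", {"disease": ["disease-1"],
                     "target": ["target-1"],
                     "compound": ["compound-1"]})
add_item("paper-2", {"disease": ["disease-1", "disease-2"],
                     "compound": ["compound-2"]})
```

In this sketch an item that mentions two diseases and one compound occupies two locations, which corresponds to assigning an information item to one or more locations dependent upon the data it contains.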
Such an approach can be used to integrate various and diverse sources of textual, numerical, and graphical data to assist in identifying drugs and indications. A resulting analysis supports the systematic identification of potential indications and other medical utilities for drugs and drug targets, in contrast to earlier reliance on chance and serendipity.
In one embodiment, each axis is defined by multiple entities along the axis. Thus each entity on the first axis is a disease, each entity on the second axis is a target, and each entity on the third axis is a compound. A unique identifier is allocated to each entity. This addresses the frequent situation that a single entity has multiple names. For example, tuberculosis might also be referred to as TB, as consumption, as phthisis, or as Mycobacterium infection. The use of the unique identifier therefore helps to prevent the same underlying entity from appearing multiple times on the same axis.
In one embodiment, one or more ancillary parameters are provided for at least some of the multiple entities. The ancillary parameters can be used to describe properties of the entity concerned. One possibility is to provide a set of synonyms for the entity, which again helps to address the widespread variations in terminology. In other words, if an entity is allocated the name of tuberculosis, then TB, consumption, phthisis, and Mycobacterium infection might all be listed as synonyms. The use of such synonyms allows all information items that relate to tuberculosis to be identified, irrespective of how they refer to the disease.
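The combination of a unique identifier with a set of synonyms might be represented along the following lines. This is a sketch only: the `Entity` class, its field names, the identifier format "D0001" and the helper `build_name_index` are assumptions introduced for illustration, while the tuberculosis synonyms are those given above.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str          # unique identifier (format is hypothetical)
    axis: str               # "disease", "target" or "compound"
    name: str               # the allocated name of the entity
    synonyms: set = field(default_factory=set)  # ancillary parameter

    def all_names(self):
        """The name plus every synonym, lower-cased for matching."""
        return {self.name.lower()} | {s.lower() for s in self.synonyms}

def build_name_index(entities):
    """Map every name or synonym to the identifier of its entity."""
    index = {}
    for entity in entities:
        for name in entity.all_names():
            index[name] = entity.entity_id
    return index

tb = Entity("D0001", "disease", "tuberculosis",
            {"TB", "consumption", "phthisis", "Mycobacterium infection"})
name_index = build_name_index([tb])
```

Looking up any of the five names in `name_index` yields the same identifier, so the disease appears only once on its axis however it is referred to.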
Another potential ancillary parameter may be used to map from a first entity on one axis to a second entity on another axis. For example, a compound (drug) entity may store the names of diseases that the drug is used to treat. Such information is typically available from industry databases, e.g. of available drugs. The approach described herein is primarily intended to go beyond such known mappings, and to uncover associations that have not hitherto been generally recognised (even if they may have been suggested somewhere in the literature).
Thus an information item (typically a research paper or such-like) is assigned a location in the multi-dimensional coordinate space by identifying a link between the information item and two or more entities. The position of the linked entities on their associated axes determines the location of the information item in the coordinate space. Turning this around, the existence of the information item can be regarded as providing evidence of some linkage between the entities concerned.
In one embodiment, an information item is linked to an entity by performing a textual search of the information item for the name of the entity. Typically the information items represent entries in a literature database of pharmaceutical, biological and medical research papers. The information items may incorporate the whole text of the papers, or perhaps just their abstracts, potentially with other bibliographic details.
As previously indicated, there is frequently a range of terminology that can be used with any given entity, as represented by the set of synonyms for the entity. Accordingly, in one embodiment, an information item may also be linked to an entity by performing a textual search of the information items for the synonyms (as well as the name) associated with the entity. The use of synonyms in this manner is found to significantly enhance the power of the approach described herein.
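The effect of searching on synonyms as well as the entity name can be illustrated with a simple case-insensitive, whole-word text match. This is a sketch under stated assumptions: the function `mentions` and the sample abstract are invented for illustration, and a practical system might use a search engine or more sophisticated matching instead.

```python
import re

def mentions(text, names):
    """True if any name occurs in text as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)
        for name in names
    )

# Synonym set taken from the tuberculosis example in the text.
TB_NAMES = ["tuberculosis", "TB", "consumption", "phthisis",
            "Mycobacterium infection"]

# Invented abstract text, for illustration only.
abstract = "A retrospective study of patients with TB treated in rural clinics."

name_only = mentions(abstract, ["tuberculosis"])   # misses the item
with_synonyms = mentions(abstract, TB_NAMES)       # links the item
```

A purely textual match of this kind can also produce spurious links. For example, the synonym "consumption" would match the phrase "oxygen consumption", which is one reason why more sophisticated tests of linkage may be of interest.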
Another embodiment of the invention provides a method of computer-assisted pharmaceutical investigation using a pharmaceutical knowledge base containing multiple information items. A computer system can be used to store a first set of named entities corresponding to one axis and a second set of named entities corresponding to another axis. The axes are selected from a disease axis, a target axis, and a drug compound axis. Each entity incorporates a set of synonyms for the entity name. The information items can then be searched for any linkage between a specified entity on a first axis and each of the set of entities on a second axis, where a linkage is potentially indicative of a pharmaceutical connection between the entities concerned.
Typically, a linkage is found between a first entity and a second entity if both the first and second entities are related to a single information item. In one embodiment, an entity is related to an information item if the name or any synonym of the entity is present in the information item. In other embodiments, more sophisticated tests of linkage might be employed, for example based on semantic analysis, that might be used to assign a confidence or relevance to the linkage.
In one embodiment, the output from the searching is presented as a (text-based) listing of the entities on the second axis. The listing may omit entities on the second axis that do not have any linkage to the specified entity of the first axis, in other words, those entities for which no connecting information items were located. Typically, the entities on the second axis are ordered in the listing according to the number of information items for which there is a linkage between the specified entity and the entity on the second axis. Thus if there are many information items linking an entity on the second axis to the specified entity on the first axis, this is suggestive of a strong connection, and so may be presented near the top of the listing.
As previously indicated, certain associations between the axes may already be known, and recorded in information associated with one or more of the relevant entities. In one embodiment therefore, the listing may omit entities from the second axis that have such a recognised linkage to the specified entity of the first axis. This helps the user to focus on any linkages that have not hitherto been appreciated, and which are therefore of potentially the greatest interest.
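A listing of this kind, ordered by the number of linking information items and with recognised linkages omitted, might be produced as sketched below. The function `ranked_listing` and the item records are assumptions made for illustration. The records are invented toy data and do not represent real literature counts. Each item is reduced here to the set of entity names it has already been linked to.

```python
from collections import Counter

def ranked_listing(first, second_axis, items, known=()):
    """List second-axis entities linked to `first`, most items first.

    Entities with no linking item are omitted, as are entities in
    `known` (linkages that are already recognised).
    """
    counts = Counter()
    for entities in items.values():
        if first not in entities:
            continue
        for candidate in second_axis:
            if candidate in entities and candidate not in known:
                counts[candidate] += 1
    # Sort by descending count, then by name for a stable order.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

# Invented toy records: item id -> entities mentioned in the item.
ITEMS = {
    "item-1": {"finasteride", "benign prostatic hyperplasia"},
    "item-2": {"finasteride", "alopecia"},
    "item-3": {"finasteride", "alopecia"},
    "item-4": {"finasteride", "prostate cancer"},
    "item-5": {"sildenafil", "erectile dysfunction"},
}
DISEASES = ["alopecia", "benign prostatic hyperplasia",
            "erectile dysfunction", "prostate cancer"]
KNOWN = {"benign prostatic hyperplasia"}  # the originally approved use

listing = ranked_listing("finasteride", DISEASES, ITEMS, KNOWN)
```

With these toy records the listing contains alopecia (two items) followed by prostate cancer (one item). The recognised indication is suppressed, and erectile dysfunction is absent because no item links it to the specified compound.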
Generating search results for each entity on the second axis is often a computationally intensive task. In one embodiment therefore, the search results are (pre)computed on a periodic basis, and then stored for subsequent retrieval in response to particular user requests. Since it is not known in advance which entity on the first axis the user will specify, this generally involves precomputing and storing listings for every entity on the first axis (or at least, precomputing and storing some form of data structure(s) from which the relevant listings can be recreated).
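One possible form of precomputation is to build an inverted index from each entity to the information items that mention it, and then to store, for every entity on the first axis, the counts of shared items with each entity on the second axis. The sketch below assumes this particular data structure, which is only one of many that could be used. The function names and the placeholder identifiers are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(items):
    """Map each entity to the set of item ids that mention it."""
    index = defaultdict(set)
    for item_id, entities in items.items():
        for entity in entities:
            index[entity].add(item_id)
    return index

def precompute_listings(first_axis, second_axis, index):
    """For every first-axis entity, count shared items per second-axis entity."""
    listings = {}
    for a in first_axis:
        row = {}
        for b in second_axis:
            shared = len(index.get(a, set()) & index.get(b, set()))
            if shared:
                row[b] = shared
        listings[a] = row
    return listings

# Placeholder records, invented for illustration.
items = {
    "i1": {"compound-1", "disease-1"},
    "i2": {"compound-1", "disease-1", "disease-2"},
    "i3": {"compound-2", "disease-2"},
}
index = build_inverted_index(items)
stored = precompute_listings(["compound-1", "compound-2"],
                             ["disease-1", "disease-2"], index)
```

A user request for a given first-axis entity can then be answered by reading and sorting the stored row, without searching the information items again.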
Typically the first axis is different from the second axis. For example, the listing might represent linkages between a compound entity on the first axis, and disease entities on the second axis. However, the same approach may be used even where the first and second axes both relate to the same property—e.g. both are disease axes, or both are compound axes. Finding linkages between the same form of axes may be pharmaceutically useful, for example to understand co-occurrences of diseases.
In one particular embodiment, a third set of named entities is provided. The first set, second set and third set of entities correspond to different ones of a disease axis, a target axis, and a drug compound axis. These three axes then define a space that can accommodate all relevant pharmaceutical knowledge.
In order to improve the power of the system, it is possible for the set of entities for at least one axis to be substantially comprehensive for pharmaceutical knowledge relating to that axis. For example, the entities on the disease axis may incorporate substantially all known diseases, while the set of entities for the drug compound axis may be substantially comprehensive for compounds currently marketed or under public development as drugs.
Another embodiment of the invention provides a method of computer-assisted pharmaceutical investigation. The method includes specifying a candidate hypothesis of the generic formula “A is related to B”, where A is selected from a first axis, and B is selected from a second axis. Queries are then generated for investigating the candidate hypothesis in a systematic and comprehensive manner with respect to each possible value of B along the second axis. These queries can then be used for searching a pharmaceutical knowledge base containing multiple information items for evidence in support of the candidate hypothesis for each possible value of B along the second axis.
The above approach can be used for investigating a very wide range of hypotheses including the identification of:
(a) compounds B that may be useful medicaments for the treatment of a disease A.
(b) targets B that may be useful for the treatment of a disease A.
(c) further disease indications B for a compound A that is known to be active against at least one other disease indication.
(d) further disease indications B for a target A that is known to be relevant to at least one other disease indication.
(e) compounds B that may be useful biomarkers or diagnostics for a disease A.
(f) compounds B that are known to have no effect in relation to a disease A.
(g) compounds B that have an adverse effect in relation to a disease A.
(h) compounds B that have an interaction with a compound A for determining drug-drug synergies.
(i) compounds B that have an interaction with a compound A for determining drug-drug adverse effects.
(j) compounds B that have an interaction with a target A.
(k) targets B with which a compound A has an interaction.
(l) diseases B that have a co-occurrence relationship with a disease A.
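The systematic generation of queries for a candidate hypothesis of the form "A is related to B" might be sketched as follows, with one query produced for each possible value of B along the second axis. The boolean query syntax shown (quoted phrases joined by OR and AND) is an assumption for illustration and is not tied to any particular search engine. The function name and the synonym lists are likewise hypothetical.

```python
def generate_queries(a_names, second_axis):
    """Yield (B, query) for every entity B on the second axis.

    a_names: the name and synonyms of the specified entity A.
    second_axis: mapping of each entity B to its name and synonyms.
    """
    def any_of(names):
        return "(" + " OR ".join(f'"{name}"' for name in names) + ")"

    for b, b_names in second_axis.items():
        yield b, f"{any_of(a_names)} AND {any_of(b_names)}"

# Illustrative synonym lists; a real axis would be far larger.
disease_axis = {
    "erectile dysfunction": ["erectile dysfunction", "impotence"],
    "angina": ["angina"],
}
queries = dict(generate_queries(["sildenafil", "Viagra"], disease_axis))
```

Each generated query can then be run against the knowledge base, and the number of matching information items recorded as evidence for the corresponding value of B.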
In one embodiment the first axis corresponds to disease, target or compound, and the second axis corresponds to disease, target or compound. Other embodiments may support additional possibilities for the first and second axes, such as anatomy, tissue type, cell type, experimental procedure and so on. The second axis may be the same as or different from the first axis.
In order to permit a complete and systematic analysis, the axes themselves can be set up to be substantially comprehensive. For example, the disease axis may be derived from one or more dictionaries or encyclopaedias of diseases. The compound axis may be derived from databases of drugs that are being marketed or that have been disclosed as under development. Although such databases are not comprehensive for all possible compounds, they do include all known marketed, experimental and prototype drugs, and so allow a complete search for secondary indications to be made. The target axis may be derived from the list of genes and protein products expressed from one or more genomes. In another embodiment the target axis is derived from targets that are known to interact with compounds on the compound axis. Although this is less complete than deriving the target axis from an entire genome, it helps to focus on targets that are known to be susceptible to drug action (these will be referred to herein as having good druggability). Such targets may correspond to a relatively small proportion of the entire genome.
An axis based on anatomy, cell type, tissue type, and so on can likewise be set up based on information from biological encyclopaedias and other appropriate reference sources. Note that the number of entities on such an axis may be significantly less than the number of entities on an axis such as disease or compound. For example, there may only be hundreds of different cell types defined as entities on an entity axis, whereas there may be thousands of different diseases defined as entities for a disease axis.
Various filters may be applied to the axes and/or search results, in order to improve their usefulness. For example, if the second axis represents target, the values of B along the second axis might be filtered to exclude those targets for which no drug compound has been launched. This therefore concentrates investigations onto targets that are known to have some marketed drug available. Another possibility is that values along the target axis are filtered to exclude those targets having poor druggability.
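A filter of the first kind mentioned above, which excludes targets for which no drug compound has been launched, might be as simple as the following sketch. The function name, the target names and the launch records are hypothetical and are used purely for illustration.

```python
def filter_targets(targets, launched_drugs):
    """Keep only targets that have at least one launched drug recorded."""
    return [target for target in targets if launched_drugs.get(target)]

# Hypothetical target names and launch records, invented for illustration.
targets = ["target-A", "target-B", "target-C"]
launched_drugs = {"target-A": ["drug-1"], "target-B": []}

kept = filter_targets(targets, launched_drugs)
```

Here only target-A is retained: target-B has an empty record and target-C has no record at all, so both are excluded from the values of B.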
The search results are generally presented as an ordered listing of the values of B for which the generated queries provided evidence in support of the candidate hypothesis. Usually, the listing is ordered according to the number of information items that support the candidate hypothesis, this being some indication of the amount of evidence to back up the relevant hypothesis. Results may also be ranked (and/or filtered) using semantic algorithms, which typically generate and rank correlations between terms.
Another possibility is that the listing is ordered according to confidence in the candidate hypothesis. This reflects the fact that various information items may provide different amounts of support for a particular hypothesis. One way of assessing such confidence is to perform some form of semantic processing on the information items, rather than simply scanning for the presence of particular text strings.
It is also possible to order the results in accordance with some ontology relevant to the second axis itself, rather than strength of support for the candidate hypothesis. One advantage of this approach is that spatial relationships in the listings may then have physical implications. For example, if the target axis is ordered in accordance with some protein property, and many links are found between a disease and targets that have a similar set of protein properties, this will then appear as a cluster in the listing, which may have pharmaceutical significance. It will be appreciated that such clustering can be detected by visual inspection of suitable graphical plots of the data, or by using appropriate statistical techniques.
More broadly, a wide range of triaging, filtering, ranking, clustering and sorting methods may be employed with respect to investigations of the information items and/or the output listings from the queries. Such investigations may also employ text-mining, semantic algorithms, statistical pattern matching, network analysis, heuristic algorithms, neural network algorithms, and so on.
Another embodiment of the invention provides a method of determining a drug for the treatment of a disease by identifying the drug as a potential treatment for a disease using the approach described above, and then confirming by experiment that the drug can be used as a treatment for said disease. If the confirmation is successful, then the drug can proceed through development and testing to manufacture.
The approach described herein therefore supports computer-assisted drug or indications discovery based on the systematic and comprehensive calculation of potential scientific hypotheses relevant to drug investigations, and in particular involving compounds, targets and diseases. Various data sources may be searched for evidence to support the generated hypotheses. The data sources may be provided by a single, combined or federated database of information (for example the MEDLINE collection of biomedical literature), by single entry additions, by feeds of non-database information, such as news-wires, by proprietary documents or results, (e.g. internal company reports) and so on. It will be appreciated that two or more such data sources can be combined as appropriate.
Compounds that may be useful medicaments for the treatment of a disease may be identified by the systematic analysis of known compounds (as defined in databases such as the British National Formulary, the Investigational Drugs Database, or a proprietary database of biologically modulating agents). Likewise, targets that may be useful in identifying medicaments for the treatment of a disease may be identified by the systematic analysis of known targets. Potential targets include all the gene and protein products expressed from an organism's genome, gene transcript products such as RNA, and DNA itself in order to modulate gene expression or function. Such analysis is greatly enhanced by using synonyms of the compounds or targets concerned. Similarly, diseases, indications and medical utilities for a known compound or target which may lead to a useful medicament for the treatment of another disease may be identified by the systematic analysis of known diseases, indications and their synonyms (as defined in databases such as The International Statistical Classification of Diseases and Related Health Problems). The systematic analysis is performed against a literature database and/or other set(s) of data sources.
Similar forms of analysis can also be used to identify new combinations of medicaments for therapeutic purposes, and also to identify new biomarkers and surrogate markers (whether biochemical, metabolic, protein, genetic, physiological, phenotypic or technological) to aid drug discovery, clinical diagnostics and/or patient profiling for identified indication(s). The analysis can also be directed towards questions of toxicity, adverse effects, or other drug safety data, to aid in the development of medicaments for therapeutic purposes, and/or towards finding drug-drug interactions for identifying adverse effects or undiscovered synergies for the development of medicaments for therapeutic purposes. Another possibility is to investigate questions relating to drug absorption, excretion, metabolism, and/or transport properties. In addition, disease co-occurrences and epidemiological hypotheses can be identified and explored.
Such analysis can therefore be regarded from one perspective as a form of virtual throughput screening, for example to identify medicaments for therapeutic purposes, or to identify targets that bind existing medicaments, drugs or biologically active compounds. Such screening can be used to systematically and comprehensively calculate all potential scientific hypotheses relevant to drug discovery and to search various data sources for evidence to support the generated hypotheses. The comprehensive nature of such screening is feasible since the total number of drug discovery hypotheses is limited by the number of known genes in the human genome (for example the ~25,000 protein-coding genes, although this is expandable to include the genomes of known pathogenic organisms, such as viruses and bacteria) and by the total number of recognised human diseases (for example those listed in disease dictionaries). Note however that even where such screening does lead to possible or suggested drugs or targets, in many circumstances there may still be a considerable amount of effort and ingenuity required in the laboratory in order to confirm and exploit the results of the virtual screening.
A systematic, computerised analysis of various data sources may also assist with the development of ontologies and classification systems for diseases, indications, proteins, drug targets, medicaments and so on. In particular, clustering, semantic correlation, and other statistical techniques can be used to analyse various ontologies, and to determine those that are particularly valuable for pharmaceutical investigations by revealing unanticipated connections in the data sources.
In order to allow searching to be performed on the basis of structural similarity (rather than chemical name), for example using the Tanimoto method of similarity, knowledge of chemical structure, such as derived from databases, dictionaries, and/or modelling programs, can be associated, linked or embedded into the system. Likewise the system can be provided with a knowledge of target structure, for example based on or derived from gene sequence, in order to allow searching on the basis of target structure (for example using protein structure similarity algorithms such as threading, Dali, or Papia). In some embodiments, the search engine or database may natively support searching by structural similarity. In other embodiments, a tool may be used to derive the names of chemicals (or targets) having a structural similarity, with these names then being used as synonyms during the search.
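By way of illustration only, the Tanimoto coefficient between two compounds can be computed from binary structural fingerprints as the number of features the two share divided by the number of features present in either. The following Python sketch assumes that each fingerprint is held as a set of feature indices. The fingerprints, the compound names and the threshold of 0.5 are hypothetical and are not taken from any particular database or modelling program.

```python
def tanimoto(fp_a: frozenset, fp_b: frozenset) -> float:
    """Tanimoto coefficient: |A & B| / |A | B| for two feature sets."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(fp_a & fp_b) / union


def similar_names(query_fp, library, threshold=0.5):
    """Return names of library compounds whose similarity meets the threshold."""
    return sorted(name for name, fp in library.items()
                  if tanimoto(query_fp, fp) >= threshold)


# Hypothetical fingerprints (sets of "on" bit positions), for illustration only.
library = {
    "compound_x": frozenset({1, 2, 3, 4}),
    "compound_y": frozenset({1, 2, 3, 9}),
    "compound_z": frozenset({7, 8}),
}
query = frozenset({1, 2, 3, 4})
synonyms_for_search = similar_names(query, library)
```

In an embodiment of the last kind mentioned above, the names returned by such a tool would then be supplied to the search as additional synonyms.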
The above approach for pharmaceutical investigations may be implemented in the form of a method, a system, a computer program and/or a computer program product. It will be appreciated that these various forms will all generally benefit from the same particular features described herein. Note that program instructions for implementing the invention are typically provided on some fixed, non-volatile storage such as a hard disk or flash memory, and loaded for use into random access memory (RAM) for execution by a system processor. Rather than being stored on the hard disk or other fixed device, part or all of the program instructions may also be stored on a removable storage medium, such as an optical (CD ROM, DVD, etc), magnetic (floppy disk, tape, etc), or semiconductor (removable flash memory) device. Alternatively, the program instructions may be downloaded via a transmission signal medium over a network, for example, a local area network (LAN), the Internet, and so on. Data for manipulation by the program instructions may be provided with the program instructions themselves, and/or may be provided from additional source(s).
Various embodiments of the invention will now be described in detail, by way of example only, with reference to the following drawings, in which like reference numerals pertain to like elements, and in which:
Knowledge for pharmaceutical drug discovery purposes can be located as appropriate within the three-axis space of disease, target and compound shown in
In particular,
The presence of point A may of course suggest that compound C1 is useful for treating disease D1 by acting upon target T1. Alternatively, there may be other reasons for the linkage shown, such as that compound C1 in acting upon target T1 is known to cause disease D1 (this will be discussed in more detail below). Note that any given information item may define multiple vectors in the matrix, for example, an information item may discuss the use of a range of compounds against a particular target.
Each vector in the matrix of
[D, dx, dy, dz . . . ; T, tx, ty, tz . . . ; C, cx, cy, cz . . . ]
Here, D, T and C represent the disease, target and compound identifiers respectively, while dx, dy, and dz represent additional parameters associated with the disease; tx, ty, and tz represent additional parameters associated with the target; and cx, cy, cz represent additional parameters associated with the compound or drug. We will refer herein to D, T and C as the primary parameters, since they define the three axes of the matrix, and additional parameters such as dx, tx, and cx as ancillary parameters. As just indicated, for any given information item, one or more of the primary and/or ancillary parameters may be missing.
The ancillary parameters associated with a disease might include clinical information such as therapeutic area, epidemiological data, such as number of sufferers, and so on. The ancillary parameters associated with a target might include genetic information, such as known polymorphisms, chemical information, such as crystallography data, and so on. The ancillary parameters associated with a compound or drug might include chemical information, such as formula, physical properties (molecular weight, melting point, etc), medical information, such as toxicological studies, business information such as current marketplace status (approved, in phase 2 trials, etc), as well as ownership of patent rights, and so on.
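One possible in-memory representation of such a vector is sketched below in Python, purely as an illustration. The class and field names are assumptions introduced here, as are the example parameter values. Each axis carries a primary identifier together with a dictionary of ancillary parameters, and any axis may be absent for a given information item.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AxisEntry:
    """One primary parameter (D, T or C) plus its ancillary parameters."""
    identifier: str
    ancillary: dict = field(default_factory=dict)


@dataclass
class MatrixVector:
    """A vector [D, dx...; T, tx...; C, cx...]; any axis may be missing."""
    disease: Optional[AxisEntry] = None
    target: Optional[AxisEntry] = None
    compound: Optional[AxisEntry] = None

    def present_axes(self):
        """Names of the axes for which this information item has a value."""
        return [name for name in ("disease", "target", "compound")
                if getattr(self, name) is not None]


# Hypothetical item linking disease D1 and target T1, with no compound given.
item = MatrixVector(
    disease=AxisEntry("D1", {"therapeutic_area": "cardiovascular"}),
    target=AxisEntry("T1", {"polymorphisms": ["hypothetical-variant"]}),
)
```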
Many information items may contain parameter values for only two of the axes in the matrix. The vectors representing such items can then be located on a plane passing through the origin and normal to the axis corresponding to the missing data item. As an example,
One special form of two-dimensional diagram is where the same parameter is plotted on both axes. This is illustrated in
It will be appreciated that in general there is no intrinsic ordering of the different axes (e.g. there is no inherent linear scale of disease). The axes can therefore be constructed in some quasi-arbitrary fashion, for example by alphabetic (or numerical) ordering of a unique identifier for the primary parameters of the respective axes. Alternatively, the ancillary parameters may be used to determine the ordering of one or more axes, such as by defining the location of the relevant primary parameter(s) on the corresponding axes. Thus the disease axis may be ordered in terms of clinical area, so that cardiovascular disorders (for example) are clustered together on the disease axis.
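The two ordering schemes just described can be illustrated with a short Python sketch. The disease records and identifiers below are hypothetical. The first ordering sorts by unique identifier alone, while the second uses an ancillary parameter (clinical area) so that related disorders occupy adjacent positions on the axis.

```python
# Hypothetical disease records: (unique identifier, name, clinical area).
diseases = [
    ("D003", "malaria", "anti-infectives"),
    ("D001", "hypertension", "cardiovascular"),
    ("D002", "angina", "cardiovascular"),
]

# Quasi-arbitrary ordering: sort by unique identifier alone.
by_identifier = sorted(diseases, key=lambda d: d[0])

# Ordering by an ancillary parameter: cluster by clinical area, then by name.
by_clinical_area = sorted(diseases, key=lambda d: (d[2], d[1]))

# Position of each disease along the axis under the second ordering.
axis_position = {d[0]: i for i, d in enumerate(by_clinical_area)}
```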
The benefit of ordering the different axes depends in part on how the matrix is being used. Thus if the main objective is to discover point intersections (as described in more detail below), then such activities are relatively independent of the ordering of the axes. On the other hand, there may be circumstances where the spatial relationships between different vectors in the matrix are potentially valuable. For example, it may be known that certain compounds are effective against a particular target, but that precise details of the interaction are poorly understood. If the compound axis is plotted in terms of some physical property (e.g. pH) that leads to the interacting compounds being clustered together, then this may give some insight into the underlying biochemistry. In other words, the ancillary parameters can be used to establish various classification schemes or ontologies for the different axes, and these can then be used to organise and hence further investigate the information items.
Subsequently, we assume that scientific research into the disease discovers one or more targets that are potentially relevant to the disease. Each such target can be defined by a line in the plane of
Further research may now be performed, this time in order to discover a compound that is effective against the target (T1), which is known to be relevant to the disease of interest (i.e. D1). Each candidate compound can be defined by a line in the plane of
Having formed both lines A and B in
As previously discussed, the activities of both
One aspect of the new drug discovery strategy is schematically illustrated in
Proceeding to
Considering the intersection of D1 and T1 at point I5, this allows us to define line A in
The plots of
It will be appreciated that in some respects the drug discovery procedure of
Note also that the research and development effort associated with the procedure of
It should also be noted that investigations of targets and diseases (such as depicted in
It will be appreciated that although the knowledge underlying intersection I7 may in fact already be available in the public domain, the huge volume of medical literature renders the chances of discovering intersection I7 by serendipity alone very slim. In contrast, the use of the pharmacological matrix permits such discoveries to be sought in a systematic and structured manner.
Database 750 is shown in
System 700 further includes a content based retrieval engine 730 that accesses items in database 750. An index 740 is provided to facilitate such access (this index may be maintained as part of the retrieval engine 730 or as integral to the database 750 itself). In the current implementation, the retrieval engine comprises the Verity K2 Enterprise product available from Verity Inc of California, USA.
Although retrieval engine 730 could be used on an ad hoc basis for processing user queries, for performance reasons that will become clearer later, system 700 generally precomputes query results, which are then stored into database 755. Accordingly, user queries are generally satisfied from database 755, rather than underlying data source 750. The information in database 755 is then updated on a periodic basis, for example weekly, although the update interval can be varied as required (e.g. daily, or after a certain number of updates have been made to database 750). Of course, in other embodiments, users might interact directly with database 750, thereby obviating the need to precompute any results.
System 700 also includes relational database 760, which comprises three tables, one for each axis in the pharmacological matrix. Thus a first table 761A comprises records relating to diseases, a second table 761B comprises records relating to targets, and a third table 761C comprises records relating to compounds or drugs. Each table stores the primary and ancillary parameters as well as the synonym information for the corresponding axis.
(It will be appreciated that the logical table model shown in
System 700 also has access to one or more external databases 765. These can be used to obtain additional information about items stored in tables 716A, 716B, and 716C. For example, with respect to target information 716B, system 700 may have a link to a gene database that provides a full sequence listing for the gene corresponding to this particular target. Note that external databases 765 may be (partly) internal to the pharmaceutical company, although external to system 700 per se, e.g. one such database might list which research groups within the company are working on which particular targets. System 700 provides convenient (and in some cases seamless) access to these databases, which can then be used to supplement and augment the findings of the Pharmamatrix system itself.
The two remaining portions of system 700 are a client portion 710, which in one embodiment is provided by a conventional Internet browser, and a server application portion 720. The server portion 720 defines multiple views 711A, 711B of the underlying data, which are defined to reflect the structure and intended workflows within the pharmacological matrix.
The server application portion 720 is responsible for formulating search queries, dependent upon the view chosen by the user, as qualified (e.g. filtered) by any particular user selections. For example, the user may request to see a certain type of data relating to a specified clinical area. The application portion 720 therefore has to access relational database 760 in order to retrieve a listing of diseases (including synonyms) corresponding to that clinical area. This listing is then used in performing the search of database 750.
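The three-table arrangement of database 760 and the clinical-area lookup just described can be sketched using Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions rather than the schema actually used: the column names, the identifier D1, the assignment of tuberculosis to a clinical area, and the use of a separate synonym table (rather than synonym fields within each axis table) are all choices made here for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE disease  (id TEXT PRIMARY KEY, name TEXT, clinical_area TEXT);
CREATE TABLE target   (id TEXT PRIMARY KEY, name TEXT, target_class TEXT);
CREATE TABLE compound (id TEXT PRIMARY KEY, name TEXT, status TEXT);
-- Synonyms kept in one side table here; 'axis' names the owning table.
CREATE TABLE synonym  (axis TEXT, entry_id TEXT, term TEXT);
""")

conn.execute("INSERT INTO disease VALUES ('D1', 'tuberculosis', 'anti-infectives')")
conn.executemany(
    "INSERT INTO synonym VALUES ('disease', 'D1', ?)",
    [("tuberculosis",), ("TB",), ("consumption",), ("phthisis",)],
)


def synonyms_for_area(area):
    """All disease synonyms for a clinical area, as used to build a search."""
    rows = conn.execute(
        "SELECT s.term FROM synonym s JOIN disease d ON d.id = s.entry_id "
        "WHERE s.axis = 'disease' AND d.clinical_area = ? ORDER BY s.term",
        (area,),
    )
    return [term for (term,) in rows]
```

The list returned by the hypothetical synonyms_for_area function corresponds to the listing of diseases and synonyms that the application portion would pass on to the search of database 750.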
In the embodiment of
It will be appreciated that compute grid 777 may comprise computers of varying compute capacities and running different operating systems, which may be dedicated to grid tasks or may be shared with other non-grid related tasks. The compute grid 777 is inherently scaleable to very large sizes, and so is able to provide search results in direct response to user queries in a reasonable time, thereby avoiding having to pre-compute query results. One advantage of this is that new information can be queried as soon as it has been loaded into grid 777, without having to wait for this data to be incorporated into the next set of precomputed search results. In addition, the architecture of
In one particular embodiment, grid compute task distribution engine 776 may comprise Sun Grid Engine software, and compute grid 777 may comprise at least a Sun 6800 server and a Sun E450 server (all available from Sun Microsystems Inc.). The text mining implementation for processing the query may be implemented by the LexiMine product from SPSS, Inc. The result processing engine 778 may be implemented as a straightforward application to concatenate the results and to return them to the web and application server 720.
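The general pattern of distributing a query across grid nodes and concatenating the per-node results can be illustrated as follows. This Python sketch uses a local thread pool as a stand-in for the compute grid and does not use or represent the interface of any grid or text mining product. The partition contents and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of the literature, one per grid node.
partitions = [
    ["malaria and histamine h1 receptor", "tuberculosis case report"],
    ["ketotifen in malaria", "unrelated abstract"],
]


def search_partition(docs, terms):
    """Return the documents in one partition that contain every search term."""
    return [d for d in docs if all(t in d for t in terms)]


def grid_search(terms):
    """Fan the query out to all partitions and concatenate the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_node = pool.map(lambda docs: search_partition(docs, terms), partitions)
        return [hit for hits in per_node for hit in hits]


hits = grid_search(["malaria"])
```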
It will be appreciated that although the architecture of
In constructing the system of
With regard to the first of these aspects (comprehensiveness), a significant contribution to system 700 is the recognition that the universe of available pharmaceutical knowledge is finite. Consequently, such knowledge can be feasibly incorporated into and investigated by a single system. In particular, two of the axes in the pharmacological matrix are inherently limited, namely, the disease axis, which can be generated from appropriate medical encyclopaedia listing known diseases, and the target axis, which can be generated from genes sequenced as part of the human genome. The set of compounds for the compound axis is in contrast infinite (in theory). However, if the primary use of system 700 is to search for secondary indications, then the compound axis now also becomes finite, since it is restricted to compounds that are already known to have some pharmacological activity.
With regard to the second aspect, various information items may refer to the same underlying identifier in different terms, particularly where the information items come from a diverse range of heterogeneous sources. For example, one article may refer to the disease tuberculosis, but another to TB or to consumption or phthisis or Mycobacterium infection. Likewise, one article may use the chemical name of a drug, such as sildenafil, while another article may use the trade name (Viagra or Patrex or Penegra or Wan Ai Ke). Yet other papers may refer to the same compound as sildenafil citrate or UK-92,480 or UK-92480 or UK92480 or refer to it by its CAS registry number 171599-83-0 (or 139755-83-2 for the free base version). A further possibility is to use the chemical IUPAC name 5-[2-Ethoxy-5-(4-methylpiperazin-1-ylsulfonyl)phenyl]-1-methyl-3-propyl-6,7-dihydro-1H-pyrazolo[4,3-d]pyrimidin-7-one citrate. Similarly for targets, it is common for the same biological entity, such as a protein, to be known by a variety of synonyms. For example the protein phosphodiesterase 5 could also be written as phosphodiesterase type 5 or PDE 5 or phosphodiesterase type V, or phosphodiesterase V or PDE V.
For each axis therefore, a thesaurus of synonyms has been developed. Each group of synonyms for a disease, target or compound has been assigned a unique identifier. This unique identifier is then used to provide a consistent location for information items pertaining to that disease (or other parameter(s)) within the pharmacological matrix. The use of synonyms in this manner can be applied to both the primary and ancillary parameters as appropriate. In addition, the synonyms of a particular entity may be grouped in a variety of ways, depending upon the particular ontologies and classification systems employed.
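The mapping from synonyms to a unique identifier can be sketched in Python as follows. The identifiers (DIS:0001 and so on) and the matching rule are assumptions made for illustration, while the synonyms themselves are a subset of those quoted above. Each synonym group is inverted into a lookup table, and a piece of text is then resolved to the set of identifiers whose synonyms it mentions.

```python
import re

# Hypothetical identifiers; the synonyms are those quoted in the text above.
thesaurus = {
    "DIS:0001": ["tuberculosis", "TB", "consumption", "phthisis"],
    "CMP:0001": ["sildenafil", "Viagra", "sildenafil citrate", "UK-92,480"],
    "TGT:0001": ["phosphodiesterase 5", "PDE 5", "phosphodiesterase type 5"],
}

# Reverse index from lower-cased synonym to unique identifier.
lookup = {syn.lower(): uid for uid, syns in thesaurus.items() for syn in syns}


def identifiers_in(text):
    """Return the set of unique identifiers whose synonyms occur in the text."""
    lowered = text.lower()
    found = set()
    for syn, uid in lookup.items():
        # Word-boundary style match so that 'TB' does not match inside a word.
        if re.search(r"(?<!\w)" + re.escape(syn) + r"(?!\w)", lowered):
            found.add(uid)
    return found
```

In this sketch two articles that use different names for the same entity resolve to the same identifier, and hence to the same location on the relevant axis.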
In some embodiments it is useful for the synonyms of compounds that interact with a particular target to also be included in the list of synonyms for that particular target. This leads (for example) to the synonyms for phosphodiesterase 5 being combined with the synonyms for sildenafil (and/or vice versa).
Various ancillary parameters have been entered with respect to this disease, for example the class of the disease. Thus malaria is indicated as belonging to the anti-parasitic and anti-infectives disease areas, as well as being a neglected disease (an indication that it has been the subject of relatively little pharmaceutical research to date). In addition, malaria is indicated as having a medical need score of 4.88. This is a quantitative assessment of the medical value of developing a drug that is effective against malaria. A high medical need score would tend to indicate a large number of sufferers, a serious disease, and a lack of or problems with existing treatments. The “yes/no” button for “TA Interest” indicates that there is currently a therapeutic area looking at the disease malaria (i.e. it is indicative of current operations within Pfizer).
The disease relevant in vivo/in vitro assay (DRIVA) field in
The two remaining fields in
The “Search Terms” box is used if searching external databases that require predefined search terms (e.g. certain keywords), rather than being able to search on any given word. For instance, literature relating to malaria might be indexed in a particular database using the abbreviated term “MALR”, which would then be used for searching purposes. However, the literature database 750 utilised in the current implementation does not impose any limitations on search terminology, and so searches are conducted using the disease name plus the full range of synonyms. Hence no special search terms are provided in
In addition, a set of ligands are provided. These are compounds that bind to or otherwise interact with the relevant target. It will be appreciated that these ligands are therefore compounds, and so link to the third axis in Pharmamatrix (for compounds). Note that these links represent connections that are already recognised in various formal industry databases, such as the Investigational Drugs Database (IDDB). In contrast, the searching capability of system 700 is aimed at finding potential links that are suggested in the wider set of literature, as represented by database 750, but that have not yet been fully recognised or exploited.
Two further ancillary parameters shown in
Further information provided for the compound entry includes links to both the other two axes of Pharmamatrix. Thus finasteride is indicated as being used against three indications, namely prostatic hypertrophy, urinary dysfunction, and alopecia. In addition, two targets for finasteride are identified, namely alpha reductase and testosterone 5 alpha reductase (which are both indicated as being enzymes). Information is also provided on the various known mechanisms whereby the compound interacts with its targets.
At the bottom of
Note that in the current implementation, synonym data for the compound axis is stored in a separate external database 765 rather than in system 700 itself, and is accessed as and when required. As an example of the listing of synonyms for a compound, those provided for finasteride include: CP-087534 (Pfizer Compound File), andozac (Trade Name), chibro-proscar (Trade Name), eutiz (Trade Name), finaspros (Trade Name), finasteride (USAN, BANN, INN), finastid (Trade Name), mk-0906 (Research Code), mk-906 (Research Code), procure (Trade Name), prodel (Trade Name), propecia (Trade Name), proscar (Trade Name), prostide (Trade Name), ym-152 (Research Code). Of course, it will be appreciated that in other embodiments, the compound synonym information could be stored in system 700 itself, along with the other data shown in
The axis data for tables 716A, 716B, and 716C of the database 760 can be obtained from various standard sources, whether hard copy or on-line. Depending upon the data source(s), this information may have to be entered into database 760 by hand, such as by using the screens of
In the current implementation, the disease information is obtained from various medical dictionaries and encyclopaedia, such as the International Statistical Classification of Diseases and Related Health Problems Revision 10, ISBN 92 4 154419 8. Note that diseases can include conditions that may be unwanted for cosmetic or other reasons and which can potentially be treated or prevented by pharmaceuticals (e.g. baldness, pregnancy, etc). It will be appreciated that very obscure or rare diseases (e.g. that only affect people with an extremely uncommon genetic disorder) may be omitted from Pharmamatrix for reasons of practicality (such diseases would in any event be considered as having very low medical need).
The compound information is obtained primarily from the Investigational Drugs Database (IDDB). This includes entries for publicly disclosed drugs at different stages of development. As previously discussed, the IDDB contains only a subset of possible pharmacologically active compounds, although it can be considered as largely complete for the purpose of searching for secondary indications of existing drugs. Of course, there are many other databases of chemical compounds available, and these could be added to system 700 if so desired.
Note that pharmaceutical companies tend to be particularly interested in drugs formed from small compounds, since these generally provide the most convenient and flexible medicaments. Thus small compound drugs can normally be provided in pill form for oral administration. In contrast, larger molecules, such as proteins, are typically unable to pass through the stomach wall and/or are broken down by enzymes in the intestine, and so often have to be administered by a less convenient route, such as injections. Accordingly, additions to the compound axis of the pharmacological matrix may focus preferentially on smaller compounds as being the most attractive for pharmaceutical development.
In terms of the target axis, one possible route for populating this is to utilise the full set of human genes sequenced as part of the human genome project. In the current implementation however, a somewhat different strategy has been used, which is to incorporate all targets that are known to have at least one drug active against them. This information can be derived from the IDDB and other similar sources, by extracting the target information for each listed drug.
One motivation for adopting this approach is that only a certain proportion of genes in the complete genome appear to be amenable to small compound ligand-binding, which is the conventional mode of action for most pharmaceuticals. Moreover, only a subset of these genes actually seem to have direct relevance for therapeutic purposes. For example, there is a lot of redundancy built into the genome, so that even if the behaviour of one gene is somehow modified, this alteration can often be compensated for or masked by other genes. Indeed, one estimate is that there may only be a few hundred genes that provide medically useful targets for small compound drugs (see A. L. Hopkins & C. R. Groom, “The Druggable Genome.” Nature Reviews Drug Discovery, 1, 727-730 (2002)).
In such circumstances, it is generally most efficient for the pharmacological matrix to focus on those targets that are already known or suspected to be pharmaceutically relevant, based on the action of current drugs and drug candidates (as derived, for example, from the IDDB). Nevertheless, it will be appreciated that other embodiments may expand the target axis to accommodate the entire human genome (plus any other potential targets, such as the genome of known parasites).
As previously indicated, the data relating to the axes of the pharmacological matrix has been carefully curated (i.e. checked for consistency, etc.). The performance of this curation is routine for those of ordinary skill in the art, albeit somewhat time-consuming, since it is generally performed by hand. This especially applies to the creation of links between the different axes (such as the target field and the indication field shown in
Although system 700 is initially populated during the development phase, it will be appreciated that by its nature the system is subject to further modification, in order to update or insert new information. Thus there is ongoing work to enhance the system, for example, to accommodate newly recognised diseases (e.g. the recent outbreak of the SARS virus), or newly discovered drugs, etc., or simply to add further synonyms that have been found in various papers.
It will be appreciated that once database 760 has been created, then it can be accessed using standard database technology. For example, views 711A, 711B can be developed to perform selection (filtering) of specific records within the database, and of specific fields within the records. Results can then be presented with rows and columns ordered as appropriate.
As shown in
The two remaining search types shown in
In the example shown in
To the right of each entry shown in
Assuming that the user selects the third icon, corresponding to a search of the Pharmamatrix system, this takes us to the screen of
The results shown in
The amount of processing to generate the screen of
In order to reduce response time, these searches are performed in advance, and the results stored into database 755. Accordingly, the information for screen
Apart from computational difficulties, the very large number of articles available in a typical medical literature database 750 can cause other problems. In particular, there is a danger that a pharmaceutical researcher trying to investigate a particular disease suffers from “information overload”, given the vast number of available papers. For example,
However, the presentation of
Note the benefit here of using the curated lists to define the axes of the pharmacological matrix. Thus taking as an example the numbers given above, namely 10 synonyms for malaria and 20 synonyms for a typical target, it will be appreciated that each row of
Each target entry in
Selecting the Count column in the screen of
The entry for each information item in
(It will be appreciated that ketotifen is a compound rather than a target per se. However, it has been found useful in the current implementation to include compounds that are known to act against a particular target as synonyms for the target itself—in this case ketotifen acts against the histamine h1 receptor. This then provides a direct mapping from disease to drug compound, as in the example of
The second icon illustrated for each information item in
In the particular context of
Hitting the Search Pharmamatrix button in
Pursuing this last option (i.e. selecting the fourth icon) leads us to the screen of
As discussed in relation to
Each entry in
Returning to
Returning now to the top screen of the Pharmamatrix system (see e.g.
A further example of performing such a search, which could be followed by selecting this first question, is illustrated in
Returning to the top menu (see
More particularly, the method starts at step 801, and proceeds to loop first by disease (step 805) and then by target (step 810). For the relevant disease-target combination, the method now loops by disease synonym (step 815) and by target synonym (step 820). Within the innermost loop, search results are retrieved from database 750 for the relevant combination of disease synonym and target synonym (step 825). These results are then accumulated for the particular target-disease combination (step 830).
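The loop structure of steps 805 to 830 can be sketched in Python as set out below. This is an illustration under stated assumptions and not the implementation itself: the synonym lists and abstracts are hypothetical, a simple substring test stands in for the retrieval engine, and the use of a set to avoid counting the same article twice under different synonyms is a design choice made here.

```python
from collections import defaultdict

# Hypothetical synonym lists and abstracts standing in for database 750.
disease_synonyms = {"D1": ["malaria", "paludism"]}
target_synonyms = {"T1": ["histamine h1 receptor", "ketotifen"], "T2": ["pde 5"]}
abstracts = {
    101: "ketotifen reverses resistance in malaria parasites",
    102: "histamine h1 receptor antagonists and malaria; ketotifen studied",
    103: "pde 5 inhibition in pulmonary hypertension",
}


def search(d_syn, t_syn):
    """Step 825: references of abstracts containing both search terms."""
    return {ref for ref, text in abstracts.items() if d_syn in text and t_syn in text}


precomputed = defaultdict(set)                         # stands in for database 755
for disease, d_syns in disease_synonyms.items():       # step 805
    for target, t_syns in target_synonyms.items():     # step 810
        for d_syn in d_syns:                           # step 815
            for t_syn in t_syns:                       # step 820
                # Step 830: accumulate; a set avoids counting an article twice.
                precomputed[(disease, target)] |= search(d_syn, t_syn)

counts = {pair: len(refs) for pair, refs in precomputed.items()}
```

Storing only the article references, as here, corresponds to the embodiment in which a reference alone is retrieved and further details are fetched when needed.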
Note that the form of search at step 825 may vary according to the particular embodiment. In the current implementation, the database 750 incorporates abstracts and other bibliographic information (rather than the full text of the articles). Accordingly, the searches are performed within the available abstracts and fields. However, in other embodiments the full text of the articles may be available for searching.
In addition, the precise data retrieved at step 825 may vary from one implementation to another. In one embodiment, only a reference is retrieved to a matching article (i.e. an article that contains both of the search terms). This reference can then be stored in database 755, thereby allowing other information about the article to be readily accessed in the future. In an alternative embodiment, the system 700 retrieves and stores in database 755 all information needed to populate the screen of
Once all the results have been accumulated for all synonyms of a given disease-target combination (steps 835, 840), they are counted and saved to the particular target. This can be viewed as completing one line of
The processing of
It will be appreciated that the general processing of
Although the current implementation of Pharmamatrix provides certain predetermined usage strategies, it will be appreciated that there is a very wide range of other investigations that may be performed with the system 700. Such investigations may be performed either by the development of additional views 711, or by using standard database access facilities to access the data in the relevant databases, or by any other appropriate mechanism.
For example, a facility could be provided to search by compound (although to some extent this is obviated in the current implementation by the provision of compounds as synonyms for targets). This would ensure that the order in which the data in the system is accessed is arbitrary and can be selected by a user at the time of submitting a query. In particular, it would be possible to enter initially from the compound, target or disease perspective and then to extend the analysis along any axis.
The results of a compound search could be categorised either by disease or by target. The former option would produce a view resembling that of
The latter option, mapping a compound against all targets, can be employed for the discovery of new drug targets associated with a drug, and thus can be used as a way of virtual screening. It is not uncommon to discover that a drug binds to more than one target. The action of the drug at the second target may elucidate the mechanism of action underlying a new indication, pharmacological property or toxicological (safety) concern.
The system so far described produces a simple yes/no for each information item, according to the sole criterion of whether or not the relevant textual search terms appear in the information item. As previously mentioned, this process identifies a variety of connections between axes. For example, in a search of disease A against compound B, the presence in a single information item of both disease A and compound B might potentially be due to one (or more) of the following reasons:
(a) compound B is potentially effective as a treatment against disease A;
(b) compound B has no effectiveness as a treatment against disease A;
(c) disease A is a side effect of taking compound B for some other purpose;
(d) compound B increases (or decreases) vulnerability to disease A; and
(e) compound B is potentially effective as a biomarker for disease A (e.g. the presence of compound B in the bloodstream is indicative that the patient is suffering from disease A).
The above list is not exhaustive. One other possibility is that the mention of A in combination with B may be purely coincidental and have no direct pharmaceutical relevance: e.g. some people in a trial were observed to have disease A, and some disease C, and some of those with disease C were taking compound B for treating disease C. In other cases, the form of interaction may be somewhat more complex, but potentially of interest: e.g. when treating disease A with compound D at the same time (and in the same person) as treating disease C with compound B, the effectiveness of compound D might be reduced (or enhanced).
It will be appreciated that analogous sets of possible relationships exist between the compound and target axes, and also between the target and disease axes. Accordingly, Pharmamatrix can be used to search for a wide range of classes of interaction. For example, the system can be employed not just for finding targets or compounds that might be used to treat a particular disease, but also for identifying targets or compounds that might be useful as a biomarker for that disease.
Rather than simple yes/no counting based on the presence (or otherwise) of the selected search terms, a more sophisticated analysis of the information items could be performed. One possibility is to estimate a relevance, weight or confidence for each information item by using the bibliographic information—e.g. precedence might be accorded to more recent articles, or to those in certain more prestigious journals. The text of the article (or abstract) can also be used for determining relevance. For example, the presence of a search term in the title of an article generally indicates a higher relevance than simply having the search term in the abstract (or main text) of an article. Likewise repeated mentions of the search term generally indicate a higher relevance and confidence than a solitary mention. The absence of other search terms might also indicate a higher degree of relevance for the particular search term that is present (although this is computationally more time-consuming to determine).
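By way of illustration only, the weighting criteria just described might be combined into a single score as sketched below. The class `InformationItem`, the function `relevance`, the weights, and the ten-year recency scale are all assumptions made for this sketch; the description does not prescribe any particular formula.

```python
from dataclasses import dataclass

# All names and weights below are hypothetical, chosen only for illustration.
TITLE_WEIGHT = 3.0      # a title mention counts more than an abstract mention
ABSTRACT_WEIGHT = 1.0   # each repeated abstract mention adds to the score


@dataclass
class InformationItem:
    title: str
    abstract: str
    year: int


def relevance(item: InformationItem, term: str, current_year: int) -> float:
    """Score one information item for one search term (0.0 means no match)."""
    needle = term.lower()
    in_title = needle in item.title.lower()
    abstract_mentions = item.abstract.lower().count(needle)
    if not in_title and abstract_mentions == 0:
        return 0.0
    score = (TITLE_WEIGHT if in_title else 0.0) + ABSTRACT_WEIGHT * abstract_mentions
    # Assumed recency discount: an item ten years old scores half as much.
    age = max(current_year - item.year, 0)
    return score / (1.0 + age / 10.0)


recent = InformationItem("Compound B in disease A",
                         "Disease A responded. Disease A recurred.", 2003)
older = InformationItem(recent.title, recent.abstract, 1993)
print(relevance(recent, "disease A", 2003), relevance(older, "disease A", 2003))
```

A score of this kind could replace the simple yes/no count, or could be used only to order the items within a listing.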
More specialised criteria for assessing relevance can also be used. For example, papers that report results from human trials could be given precedence over results from animal trials, which in turn could be given precedence over in vitro experiments. This form of assessment might be made by simply searching for predetermined words or phrases in an information item (e.g. “animal trial”). This approach could be formalised by building a dictionary or vocabulary of key words to be used in ranking (or filtering) articles. Alternatively, a more complex semantic analysis might be performed (natural language processing).
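A minimal sketch of such a key-word dictionary is given below, assuming a small hand-built vocabulary. The phrases, the tier names, and the functions `evidence_tier` and `rank_by_evidence` are hypothetical and serve only to illustrate the ordering of human trials ahead of animal trials ahead of in vitro experiments.

```python
# Hypothetical vocabulary: tiers are checked in order, strongest evidence first.
EVIDENCE_VOCABULARY = [
    ("human", ("clinical trial", "human trial", "patients")),
    ("animal", ("animal trial", "animal model")),
    ("in_vitro", ("in vitro", "cell culture")),
]
EVIDENCE_RANK = {"human": 3, "animal": 2, "in_vitro": 1, "unknown": 0}


def evidence_tier(text: str) -> str:
    """Return the first tier whose key phrases appear in the text."""
    lowered = text.lower()
    for tier, phrases in EVIDENCE_VOCABULARY:
        if any(phrase in lowered for phrase in phrases):
            return tier
    return "unknown"


def rank_by_evidence(texts: list[str]) -> list[str]:
    """Order information items so that stronger evidence comes first."""
    return sorted(texts, key=lambda t: EVIDENCE_RANK[evidence_tier(t)], reverse=True)


abstracts = [
    "Compound B inhibited growth in vitro.",
    "An animal trial of compound B.",
    "A clinical trial in 40 patients.",
]
for text in rank_by_evidence(abstracts):
    print(evidence_tier(text), "|", text)
```

The same tiers could equally be used for filtering, by discarding any item that falls below a chosen rank.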
Further methodologies and criteria for assessing relevance are known to the person of ordinary skill in the art (such as those used in Internet search engines). It will be appreciated that the various techniques for assessing relevance may be combined as appropriate.
If relevance information is determined it can be utilised in various ways. For example, a listing of articles, such as shown in
The relevance information might also be used in relation to the view of
As previously discussed,
(a) language of the information item (e.g. a user might only be interested in locating English language articles);
(b) application area (such as whether relevant primarily for human treatment or for veterinary uses);
(c) source of information (e.g. limiting the text search to articles from a defined group of journals recognised as having particular importance);
(d) mode of available compound delivery (such as whether available in a form for oral administration); and
(e) patent situation (including status and ownership of any relevant patents).
Note that the filtering may be applied at various stages of the analysis. Thus in some circumstances, the filtering may be applied, prior to the search, to the data of the relevant axis 716, utilising the relevant ancillary parameters. (This is the case for
The various filtering criteria may also be used after the search, for ranking the results. For example, an article in a prestigious journal might be valued ahead of an article in a less prestigious journal when assessing relevance. Similarly, drug compounds available in pill form might be ranked above drug compounds that have to be taken intravenously.
Some of the techniques discussed for filtering or ranking (assessing relevance) can also be helpful in automatically allocating information items to one of the possible types of relationship listed above (as (a) to (e)). Again, this filtering might simply be based on scanning for certain words (e.g. “treatment”, “marker”, etc), and/or by performing a more complex semantic analysis.
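The word-scanning approach to allocating relationship types might be sketched as follows. The cue words and the function `relationship_types` are assumptions made for illustration; the labels correspond to relationships (a) to (e) listed earlier, and an item may match more than one label or none.

```python
# Hypothetical cue words for relationship types (a) to (e).
RELATIONSHIP_CUES = {
    "a": ("treatment", "therapy"),            # potentially effective treatment
    "b": ("ineffective", "no efficacy"),      # no effectiveness as a treatment
    "c": ("side effect", "adverse"),          # disease as a side effect
    "d": ("risk factor", "susceptib", "vulnerab"),  # altered vulnerability
    "e": ("marker",),                         # biomarker for the disease
}


def relationship_types(text: str) -> list[str]:
    """Return every relationship label whose cue words appear in the text."""
    lowered = text.lower()
    return [label for label, cues in RELATIONSHIP_CUES.items()
            if any(cue in lowered for cue in cues)]


print(relationship_types("Elevated compound B is a marker for disease A."))
print(relationship_types("Disease A was reported as a side effect of compound B."))
```

Simple substring matching of this kind is easily misled (for example by negation), which is one reason a more complex semantic analysis might be preferred.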
Note that data and ontologies relating to the axes (as held in database 760) can also be used in determining and enhancing the relevance of results. Thus one possibility might be to provide the user with an option to filter out recognised associations. For example, referring to
Another example of the use of axis data to determine relevance is where the ontology of the axis provides some mechanism for weighting the search results involving that axis. For example, as previously indicated, not all genes are susceptible to small compound binding. Consequently, one might establish an ontology for the target axis based on one or more parameters such as druggability (i.e. how likely a small compound binding is to be found for the target) and therapeutic usefulness (i.e. whether interacting with the target is expected to impact biochemical behaviour). Such parameters can potentially be estimated from research into the human genome, for example, and then used to limit or to order the search results. For example, the target entries in the view of
Note that in
In one implementation, the Pharmamatrix system can be used to map one axis onto itself. This might be used, for example, to derive a listing analogous to
Investigating the disease-disease mapping locates information items that reference multiple diseases, and can be valuable in uncovering co-occurring diseases or other disease or epidemiological associations. Such disease-disease associations can then be mapped onto biochemical pathways to reveal previously unknown biochemical or molecular pathways, or to find environmental or infectious agents as a common pathology between two or more previously unconnected diseases.
Similarly, calculating a target versus target matrix locates information items that contain a link or association between two different targets. Such target versus target information can be valuable for elucidating protein-protein interactions or for uncovering synergies that might be the basis for combination therapies. In addition, a compound-compound mapping may be used to find links or associations between drugs, which can be valuable for identifying potential combination therapies.
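As a sketch only, mapping an axis onto itself amounts to counting the information items in which each pair of entities from that axis appears together. The function `self_mapping` and the sample texts below are invented for illustration and are not taken from any database.

```python
from collections import Counter
from itertools import combinations


def self_mapping(items: list[str], entities: list[str]) -> Counter:
    """Count, for each pair of entities on one axis, the items naming both."""
    counts: Counter = Counter()
    for text in items:
        lowered = text.lower()
        present = sorted(e for e in entities if e.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Invented information items, for illustration only.
items = [
    "Asthma and eczema were both recorded in the cohort.",
    "A case report on eczema.",
    "Malaria incidence alongside asthma and eczema.",
]
matrix = self_mapping(items, ["asthma", "eczema", "malaria"])
print(matrix[("asthma", "eczema")])
```

The same routine applies unchanged to a target versus target or compound versus compound mapping, given the entity names of the relevant axis.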
The mappings described so far have generally been:
(i) two-dimensional—in other words finding information items that pertain to X and Y (where X and Y may be taken from the same or different axes); and
(ii) first order—in other words, the retrieval for X and Y looks for information items that directly contain both X and Y.
However, the Pharmamatrix system may be expanded to relax both these constraints if appropriate.
For example, in some circumstances three-dimensional mappings might be utilised to find information items pertaining to X, Y, and Z (again X, Y, Z may be taken from the same or different axes). There are various ways in which such a multi-dimensional query might be formulated. For example, a query might search for articles that mention a particular disease, target and compound, listed perhaps by compound, or for articles that mention a disease and two particular targets, listed perhaps by disease.
Similarly, Pharmamatrix might be searched for second or higher order associations. Thus if X and Y both appear in (or are otherwise linked by) a single article, there is a first order link between X and Y. A second order link between X and Y then occurs if there is a first order link between X and Z and another first order link between Z and Y (with higher order links defined analogously). An example of a second order search might be to locate a second order link between a compound and a disease, where the compound has a first order link to a target, and there is also a first order link from the target to the disease.
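The definition of a second order link can be illustrated with the following sketch, which assumes that the first order links have already been extracted as pairs of entity names. The functions `build_neighbours` and `second_order_links` and the sample entity names are hypothetical.

```python
from collections import defaultdict


def build_neighbours(first_order_links):
    """Index first order links so each entity maps to the entities it is linked to."""
    neighbours = defaultdict(set)
    for x, y in first_order_links:
        neighbours[x].add(y)
        neighbours[y].add(x)
    return neighbours


def second_order_links(first_order_links, x, y):
    """Return every intermediate Z with first order links X-Z and Z-Y."""
    neighbours = build_neighbours(first_order_links)
    return sorted((neighbours[x] & neighbours[y]) - {x, y})


# Illustrative first order links (each pair co-occurs in at least one article).
links = [
    ("compound B", "target T"),
    ("target T", "disease A"),
    ("compound B", "target U"),
]
print(second_order_links(links, "compound B", "disease A"))
```

Here the compound and the disease never appear in the same article, yet they are connected through the shared target. Higher order links could be found in the same way by repeating the neighbour expansion.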
It will be appreciated that output in the current implementation, such as shown in
Nevertheless, it will be appreciated that a graphical presentation provides a valid representation of the underlying data, and accordingly may be utilised as appropriate for the particular circumstances. For example, if targets are ordered in correspondence with location on the human genome, then spatial location of various targets along the target axis might possibly have pharmaceutical relevance. This could then be investigated visually on a graphical plot, or by using statistical (spatial) clustering or other such analysis techniques.
Furthermore, in the embodiments so far described, the compound axis 716C has primarily been defined on a textual basis, by using the names of the relevant compounds. However, in other embodiments non-textual parameters might be utilised, such as chemical structure. Note that some information about structure is already stored on the compound axis (see
One possibility is to impose a structure-based ontology onto the presentation of results. For example, if the system supports a view of search results by compound (analogous to the view by target of
Another possibility is that information on chemical structure could be used during the search itself, rather than simply in the presentation of search results. For example, searching for a given compound already incorporates searching for name synonyms of this compound. This concept of synonyms could be extended to include searching for chemical homologues or analogues of the specified compound (i.e. to include compounds that are closely related from a structural or chemical perspective to the compound to be searched).
There are various ways in which such searching of structural synonyms might be implemented. In certain embodiments, database 750 might directly support searching for structural synonyms. In other words, a chemical structure might be input as a search term, and database 750 would have the ability to match to corresponding or similar structures.
Alternatively, structural synonyms might be handled in a similar manner to name synonyms. In other words, a listing of compounds that are structural synonyms of the compound to be searched could be generated, with each entry in the listing being separately searched, and the results then collated for the entire listing. The information for deriving the listing of structural synonyms could be incorporated as one or more ancillary parameters within the compound axis 716C. Alternatively, this might perhaps be implemented by a dedicated tool that accepts a compound name, and then returns a listing of compounds having structural similarities to the originally provided compound. Such a tool could interface as appropriate to system 700, such as to compound axis 716C or search engine 730.
There are a number of potential uses for the ability to accommodate structural synonyms. One possible situation (as contemplated above) is where results may be summed across a set of structural synonyms to provide stronger evidence for an interaction than can be obtained from any one compound within this group. Another circumstance is when a certain drug is known to be pharmacologically effective, but to suffer from disadvantages (e.g. high toxicity). In this case, the database might be searched for evidence to support the use of a compound that has structural similarities to the known drug, and so might possibly share its efficacy, yet might not suffer from its disadvantage(s).
On the other hand, there may be situations where it is nevertheless desirable to perform a search solely in relation to a specific compound, without including structural synonyms. Accordingly, the facility to include structural synonyms could be made optional, whereby it can be switched on or off for any particular view or search.
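The listing-based handling of structural synonyms, together with the optional switch, might be sketched as below. The function `search_with_synonyms`, the `search` callable, the synonym table, and the toy index are all assumptions for illustration; they do not represent the interface of search engine 730 or database 750.

```python
def search_with_synonyms(compound, disease, search, structural_synonyms,
                         include_synonyms=True):
    """Search each structural synonym separately, then collate the results.

    `search(compound, disease)` is a hypothetical callable returning the set of
    identifiers of information items that mention both terms.
    """
    names = [compound]
    if include_synonyms:
        names += structural_synonyms.get(compound, [])
    per_compound = {name: search(name, disease) for name in names}
    collated = set().union(*per_compound.values())
    return per_compound, collated


# Toy index standing in for the database of information items.
index = {
    ("compound B", "disease A"): {1, 2},
    ("compound B2", "disease A"): {2, 3},
}
synonyms = {"compound B": ["compound B2"]}


def toy_search(compound, disease):
    return index.get((compound, disease), set())


_, with_synonyms = search_with_synonyms("compound B", "disease A", toy_search, synonyms)
_, without = search_with_synonyms("compound B", "disease A", toy_search, synonyms,
                                  include_synonyms=False)
print(sorted(with_synonyms), sorted(without))
```

Keeping the per-compound results alongside the collated set allows a view to show either the combined evidence for the group or the contribution of each individual compound.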
The above techniques for investigating structural synonyms could also be implemented on the target axis, based typically on similarities in DNA sequences in genes, or amino acid sequences in proteins. Such a facility could be used for example to identify compounds that are known to be effective against targets that are structurally synonymous with the particular target under investigation (and so might also be effective against this target). Note that suitable facilities for identifying similarities in gene sequences already exist, such as the BLAST algorithm mentioned above.
In one embodiment of the invention, the Pharmamatrix system is extended to support further axes in addition to (or potentially instead of) disease, compound, and target, such as axes for anatomy, tissue type, cell type, or experimental methodology. It will be appreciated that the entities for an axis for anatomy, tissue type, and cell type can be readily derived from medical encyclopaedias and other reference sources, and can be constructed in a relatively complete fashion. The total number of entities on such axes is somewhat smaller than on the disease or target axes (typically hundreds rather than thousands).
As an example of the use of such additional axes, there may be a report in the literature that a particular drug tends to accumulate in a certain part of the anatomy (say the brain) or in a certain tissue type, even if this does not appear to cause any adverse medical condition (i.e. no disease). The accumulation may be irrelevant to the primary indication of the drug, which may perhaps relate to heart medication. However, the accumulation of the drug in the brain may be of potential interest to a researcher who is looking for a mechanism to deliver a different compound to the brain. The report of the drug accumulation in the brain could then be found within the Pharmamatrix system by searching along the compound axis for the anatomy entity of “brain”, analogous to the search performed along the compound axis for the disease entity of malaria (see
The pharmaceutical investigations described above have been mainly presented in the context of human medical applications, but can also be applied to veterinary medicine. In this case, appropriate other sources of information can be utilised for defining the axes of the Pharmamatrix system, and also for providing the database(s) of information items to search. One particular benefit of being able to handle both human and veterinary medicine is the ability to discover linkages between human diseases and animal diseases, for example by searching with human diseases on one axis and animal diseases on another. This may be especially significant in terms of certain infectious diseases (such as BSE in cows and CJD in humans).
In conclusion, a variety of particular embodiments have been described in detail herein, but it will be appreciated that this is by way of exemplification only. The skilled person will be aware of many further potential modifications and adaptations that fall within the scope of the claimed invention and its equivalents.
Number | Date | Country | Kind |
---|---|---|---|
UK 0321708.0 | Sep 2003 | GB | national |
Number | Date | Country | |
---|---|---|---|
60512382 | Oct 2003 | US |