This disclosure relates generally to bioinformatics techniques, and more particularly to an apparatus and method for finding genes associated with diseases.
Biological and medical literature (including written papers, books, studies, and/or reports) is now increasingly being electronically published or stored in electronic media. For example, MedLine <http://www4.ncbi.nlm.nih.gov/PubMed/> is an electronic database containing over 11 million citations (titles and/or abstracts) covering publications since 1960 as compiled by the National Library of Medicine. By utilizing these collections of information, it may be possible to discover novel gene expression pathways that can help in the development of new or improved methods for treating particular human diseases.
However, a researcher with access to this electronic collection of information must also be able to identify and filter out the irrelevant articles. For example, the word “leukemia” appears in over 22,177 articles in MedLine. Thus, a great amount of effort and time would be required to manually extract useful information embedded in such a large volume of stored data.
Various methods are available for automated extraction of biomedical knowledge. However, these methods do not sufficiently reduce the number of retrieved articles that are irrelevant to the topic being searched. For example, these current methods retrieve many citations that are false positives because the methods are unable to disambiguate the relevant citations that are stored in an electronic database. Therefore, the current technologies are limited to particular capabilities and suffer from various constraints.
In an embodiment of the present invention, a method of finding genes associated with a disease includes: finding all potential gene symbols in articles (or titles/abstracts) in a database (or some repository); folding any aliases into official gene symbols; and computing the relevance of each official symbol to the disease. The method may further include eliminating non-gene symbols by use of contextual clues.
In another embodiment, an apparatus for finding genes associated with a disease, includes: a database for storing information; and a server coupled to the database and configured to find all potential gene symbols in the stored information, to fold at least one alias into official gene symbols, and to compute the relevance of each official symbol to the disease. The server may be configured to eliminate non-gene symbols by use of contextual clues.
These and other features of an embodiment of the present invention will be readily apparent to persons of ordinary skill in the art upon reading the entirety of this disclosure, which includes the accompanying drawings and claims.
Non-limiting and non-exhaustive embodiments of the present invention are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified.
In the description herein, numerous specific details are provided, such as examples of components and/or methods, to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that an embodiment of the invention can be practiced without one or more of the specific details, or with other apparatus, systems, methods, components, materials, parts, and/or the like. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of embodiments of the invention.
In describing the process of method 200, Medline is used as an example of the database 105 (
Gene Frequencies in MedLine (or Other Database)
In procedure or action (205) in
Additionally, the procedure (205) may record the publication date of each article and may determine whether the article's abstract or title contains a word or words pertaining to a particular disease or gene expression pathway. For example, if the search focuses on leukemia, then a search is made for the words “leukemia” or “leukaemia” in the Medline database. The method of procedure (205) can then isolate the lists of genes in those articles pertaining to leukemia.
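The selection of a focus subset described above can be sketched as follows. This is only a minimal illustration, not the disclosed implementation: the record layout (a mapping from PMID to a text string and a gene list), the sample records, and the names `FOCUS_PATTERN` and `focus_subset` are assumptions introduced here.

```python
import re

# Matches both spellings named in the text: "leukemia" and "leukaemia".
FOCUS_PATTERN = re.compile(r"\bleuka?emia", re.IGNORECASE)

def focus_subset(articles, pattern=FOCUS_PATTERN):
    """Return {pmid: gene_list} for articles whose title/abstract matches the pattern.

    articles: hypothetical mapping {pmid: (title_and_abstract_text, [gene symbols])}
    """
    return {
        pmid: genes
        for pmid, (text, genes) in articles.items()
        if pattern.search(text)
    }

# Hypothetical records used only for illustration.
articles = {
    1: ("Acute myeloid leukemia and MLL rearrangement", ["MLL"]),
    2: ("Childhood leukaemia outcomes", ["HOXA9"]),
    3: ("Estrogen receptor status in breast tumors", ["ESR1"]),
}
focus = focus_subset(articles)
```

In this sketch, articles 1 and 2 would form the focus subset, and their gene lists are the ones passed on to the later counting steps.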
Coping with Alias Symbols
It is noted that gene names can be represented by gene symbols (see, e.g., <http://www.gene.ucl.ac.uk/public-files/nomen/ens2.txt>) and aliases (see, e.g., <http://www.gene.ucl.ac.uk/public-files/nomen/ens3.txt>) typically listed by three (3) online gene databases: HUGO (Human Genome Organization), OMIM (Online Mendelian Inheritance in Man), and LocusLink (an online database of gene loci). The use of gene symbols and/or aliases for a given gene name adds to the current difficulty in distinguishing between relevant and irrelevant articles in database searches for that given gene, since a given gene may have multiple identifiers.
The process of identifying gene mentions by the occurrence of gene symbols is also naturally error prone. A gene symbol can coincide with another common acronym, or with an acronym constructed by the author for the purposes of the article. For example, an author might have used the acronym CGH to mean “comparative genomic hybridization”, while CGH might be recorded as an alias for the gene HTC2. As long as the errors are equally likely to occur within the focus set as in all of Medline, the embodiments of algorithms (as disclosed herein) will not be misled by the errors.
However, when an acronym is specific to a focus set, and yet does not represent a gene, further processing is needed to disambiguate the meaning of the acronym. Applicants present their approach or method to dealing with this problem by use of a procedure (220) as illustrated in
Even when a word in a document is being used to denote a gene, frequently the word is an alias rather than an approved gene name. Thus, in an embodiment of the invention, a post-processing procedure (210) may be required to match an alias to a particular gene, as shown by the flowchart in
To match an alias to a particular gene, a count is performed of all occurrences of gene names (official symbols and aliases) within the entire article set and within the focus subset. Here, “entire article set” might refer to the Medline database, while “focus subset” might pertain to only those articles whose titles or abstracts contain the word “leukemia”, for example. For each alias occurrence, the procedure (210) adds to the count of both the alias and the official gene or genes it represents. For example, if the symbol OS, an alias for MID1, occurred in 49 articles, while MID1 occurred in 3, MID1 would have a count of 52. The procedure (210) keeps track of the fact that 49 of the counts for MID1 originated with OS, so that it can relate back to the articles and modify the document gene lists as described below. Because OS frequently stands for “overall survival”, it is important to keep track of its contribution to MID1's counts, as MID1 could otherwise be incorrectly related to a disease.
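The counting step above, including the bookkeeping of where each count originated, might be sketched as follows. The function name `fold_alias_counts` and the dictionary-based inputs are assumptions made for illustration; the OS/MID1 figures are the ones given in the text.

```python
from collections import Counter, defaultdict

def fold_alias_counts(symbol_counts, alias_to_officials):
    """Fold alias counts into official symbols while remembering their origin.

    symbol_counts: {symbol: number of articles mentioning it}
    alias_to_officials: {alias: [official symbols the alias may represent]}
    Returns (totals, provenance), where provenance[official][surface_symbol]
    is the part of the official symbol's count contributed by that surface symbol.
    """
    totals = Counter()
    provenance = defaultdict(Counter)
    for symbol, n in symbol_counts.items():
        # Each occurrence counts toward the symbol as written...
        totals[symbol] += n
        provenance[symbol][symbol] += n
        # ...and toward every official symbol it is an alias of.
        for official in alias_to_officials.get(symbol, []):
            totals[official] += n
            provenance[official][symbol] += n
    return totals, provenance

totals, provenance = fold_alias_counts({"OS": 49, "MID1": 3}, {"OS": ["MID1"]})
```

Here `totals["MID1"]` is 52, and `provenance["MID1"]` records that 49 of those counts came from OS and 3 from MID1 itself.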
In procedure (210), there is a modification of the PMID/gene lists for the entire set and the focus subset to account for alias symbols. For each alias symbol, there are typically four possibilities:
In all cases, the procedure (210) keeps the information about where the counts originally came from and indicates this information in our results. For example, let's say our results implicate an obscure official symbol, which almost always appeared as the well-used alias symbol, in some disease. The original counts would show the user that 95% of the time that the gene was mentioned in connection with the disease, it was mentioned as the alias and not as the obscure official symbol, hopefully mitigating any confusion.
The procedure (210) in
If the alias symbol is, for example, an alias “A” of only one official name, for example, “O” (procedure 315), the various following conditions are considered. If “O” is mentioned elsewhere at least once in an article (procedure 320), then the alias symbol is deleted (335). If “O” is never mentioned in any article (procedure 325), then the symbol “A” is changed (335) to “O”. If the article under consideration contains both “A” and “O” (procedure 330), then the symbol is deleted (335).
If the alias symbol is, for example, an alias “A” of several official names “O”, “P”, etc. (procedure 340), then the various following conditions are considered. If none of “O”, “P”, etc. is ever mentioned (procedure 345), then the alias symbol is kept (310). If only one of “O”, “P”, etc. (say “O”) is ever mentioned in any document (procedure 350), then the symbol “A” is changed (355) to “O”. If more than one of “O”, “P”, etc. are mentioned in other articles (procedure 360), then the symbol “A” is kept, and an attempt to remove ambiguity is later performed by considering the text (procedure 370). If the article under consideration contains “A” and one of “O”, “P”, etc. (procedure 365), then the symbol is deleted (355).
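The case of an alias with several official names, as stated in the preceding paragraph, can be sketched as a small decision function. The function name, the returned action labels, and the order in which the conditions are tested (the per-article check first) are assumptions made here; the text does not specify an evaluation order.

```python
def resolve_ambiguous_alias(alias, officials, article_symbols, corpus_symbols):
    """Decide what to do with an alias that maps to several official symbols.

    alias: the alias symbol found in the article (e.g. "A")
    officials: official symbols the alias may represent (e.g. ["O", "P"])
    article_symbols: set of symbols found in the article under consideration
    corpus_symbols: set of symbols found anywhere in the article set
    Returns (action, symbol) with action one of
    "delete", "keep", "rename", "keep_and_disambiguate".
    """
    # The article already names one of the official symbols: drop the alias.
    if any(o in article_symbols for o in officials):
        return ("delete", None)
    mentioned = [o for o in officials if o in corpus_symbols]
    if not mentioned:
        # None of the official symbols is ever mentioned: keep the alias.
        return ("keep", alias)
    if len(mentioned) == 1:
        # Exactly one official symbol is in use: fold the alias into it.
        return ("rename", mentioned[0])
    # Several candidates remain: keep the alias and disambiguate from the text later.
    return ("keep_and_disambiguate", alias)
```

For example, with officials `["O", "P"]`, an article containing only `"A"`, and a corpus in which only `"O"` ever appears, the sketch returns `("rename", "O")`.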
Counting N-Tuple Occurrences
From the simplified PMID/gene lists, the method 200 can create data sets containing counts for each n-tuple of genes. For example, the Medline article with PMID number 8563753 discusses human myeloid leukemia and mentions the genes NUP98, HOXA9, and NUP214. So from this article, we obtained one count for each of these three genes, one count each for the pairs NUP98-NUP214, NUP98-HOXA9, and NUP214-HOXA9, as well as one count for the triple containing all three genes NUP98-HOXA9-NUP214.
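The n-tuple counting described above can be sketched with the standard library. The function name `count_tuples` and the input layout are assumptions; the example article and its three genes are taken from the text.

```python
from collections import Counter
from itertools import combinations

def count_tuples(pmid_to_genes, max_size=3):
    """Count, over all articles, each gene, gene pair, and gene triple.

    pmid_to_genes: {pmid: [gene symbols after alias folding]}
    Each tuple is stored in sorted order so that (A, B) and (B, A) coincide,
    and each article contributes at most one count per tuple.
    """
    counts = Counter()
    for genes in pmid_to_genes.values():
        unique = sorted(set(genes))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return counts

counts = count_tuples({8563753: ["NUP98", "HOXA9", "NUP214"]})
```

For the single example article this yields seven entries: three singles, three pairs, and one triple, each with a count of one.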
In the method (200), we initially created data sets for individual gene occurrences (post-modification for aliases), gene pairs and gene triples.
Measuring the Relevance of Individual Genes
A detailed discussion is now made on the procedure (215) for sorting the relevance of genes to a disease. A discussion is first made on a method 380 (
As shown in
Focusing on leukemia, consider, for example, the gene MLL, which our measure shows to be most tied to leukemia. The official HUGO symbol MLL stands for myeloid/lymphoid or mixed-lineage leukemia (trithorax (Drosophila) homolog). The gene MLL aliases include HTRX1, HRX, and ALL-1.
The symbol MLL occurs in 548 of the 39710 articles mentioning leukemia and containing a gene symbol, and 633 times in the 2 million articles containing gene symbols. If we put aside for the moment that the name MLL itself states the relationship of the gene to leukemia, we could use the above data to determine how strong the relationship is between MLL and leukemia.
We do this by measuring (382) how unlikely it would be to see the number of gene mentions in SL, given how frequently the gene is mentioned overall. Let's represent all the MLL documents with black balls, and all other documents as white balls. If we assume that there is no correlation between MLL and leukemia, then the distribution of the number of MLL documents in SL (the number of black balls drawn) is given by the Binomial distribution.
The expected number of MLL documents is given by $E[n_{MLL}] = N_L \, p_{MLL}$, where $p_{MLL}$ is the probability of drawing a black ball (0.0003) and $N_L$ is the number of documents in SL (the number of draws from the urn). The standard deviation is given by $\sigma(n_{MLL}) = \sqrt{N_L \, p_{MLL} \, (1 - p_{MLL})}$. Also, $n_{MLL}$ is the number of observed documents (in this case, in the leukemia set) with MLL. We measure the strength of the relationship ($c_{MLL}$) between MLL and leukemia by measuring how much the observed number of MLL documents (black balls) deviates from the expected number had the draw been random, as shown in equation (1).

$$c_{MLL} = \frac{n_{MLL} - E[n_{MLL}]}{\sigma(n_{MLL})} \qquad (1)$$
We find that $c_{MLL} = 133.5$, which is a very high value. We have used the normal approximation to the binomial distribution, valid in the case of large N. Using the normal distribution we can also find that the probability that 548 or more MLL documents are found among a random draw of 39710 documents is less than $10^{-16}$. Our finding is consistent with a summary from the Atlas of Genetics and Cytogenetics in Oncology and Haematology <http://www.infobiogen.fr/services/chromcancer/index.html>: “MLL is implicated in at least 10% of acute leukemias (AL) of various types”.
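The relevance measure can be sketched numerically as follows, assuming that the strength of the relationship is the deviation of the observed count from its binomial expectation, expressed in units of the binomial standard deviation. The function names are introduced here for illustration. Because the counts quoted in the text are rounded, the sketch is not expected to reproduce the reported value exactly.

```python
import math

def relevance_score(n_observed, n_focus, p_gene):
    """Standardized deviation of the observed count from the binomial expectation.

    n_observed: documents in the focus set that mention the gene
    n_focus: number of documents in the focus set (number of draws)
    p_gene: probability that a randomly drawn document mentions the gene
    """
    expected = n_focus * p_gene
    sigma = math.sqrt(n_focus * p_gene * (1.0 - p_gene))
    return (n_observed - expected) / sigma

def upper_tail_probability(score):
    """Normal-approximation probability of a score at least this large."""
    return 0.5 * math.erfc(score / math.sqrt(2.0))

# Rounded figures quoted in the text for MLL and leukemia.
mll_score = relevance_score(n_observed=548, n_focus=39710, p_gene=0.0003)
mll_tail = upper_tail_probability(mll_score)
```

A gene mentioned exactly as often as chance predicts scores zero, and a gene mentioned less often than expected scores negative, which matches the notion of little or negative correlation used for most genes.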
Most genes, however, show little or negative correlation with leukemia as demonstrated in the distribution 400 in
Table 1 shows an example of the output of the algorithm identifying relevant breast cancer genes. The results shown in Table 1 may be shown, for example, in the display 115 of the server 110 (
If and when the algorithm does make mistakes, it is in rare cases where the symbol is absent from the gene alias databases. An error can also occur when the gene symbol is genuine but overlaps with another common acronym and has no supporting definitions occurring in text. For example, the FOR alias for the WWOX gene occurs 139 times in articles mentioning breast cancer. However, it is never accompanied by a definition, and so is rejected as a gene symbol based on the overall likelihood that FOR is a gene symbol, which is only about 10%. The WWOX gene symbol itself would nevertheless be identified as relevant, as it occurs 4 out of 5 times with the words “breast cancer/tumor”.
Relevance of Gene Pairs
As shown in the method 390 in
Method (b) uses the probabilities of A and B occurring in the entire document collection. This means that most pairs of genes that were individually relevant to SL will appear positively correlated simply because they occur more frequently in SL, increasing the chance that they occur together in SL. Hence, method (a) is preferable to (b) in determining whether A and B act together with regard to SL.
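The argument above can be illustrated with a small calculation. The sketch assumes that method (a) estimates the individual gene probabilities from the focus set SL itself, whereas method (b), as stated, estimates them from the entire collection; the definition of method (a) used here is an assumption, and all counts below are hypothetical.

```python
def expected_pair_count(n_focus, p_a, p_b):
    """Expected number of focus-set documents mentioning both genes,
    if the two genes occurred independently with probabilities p_a and p_b."""
    return n_focus * p_a * p_b

# Hypothetical counts for two genes that are each individually enriched in SL.
N_ALL, N_FOCUS = 2_000_000, 40_000
A_ALL, A_FOCUS = 600, 500
B_ALL, B_FOCUS = 500, 400

# Assumed method (a): marginal probabilities taken from the focus set.
expected_a = expected_pair_count(N_FOCUS, A_FOCUS / N_FOCUS, B_FOCUS / N_FOCUS)
# Method (b): marginal probabilities taken from the entire collection.
expected_b = expected_pair_count(N_FOCUS, A_ALL / N_ALL, B_ALL / N_ALL)
```

With these hypothetical counts, the focus-set marginals predict about 5 co-occurrences by chance, while the collection-wide marginals predict about 0.003. An observed pair count of 5 would therefore look strongly correlated under method (b) even though it is exactly what independence within SL predicts.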
Method (c) can be used to measure the relevance of a gene pair to a disease, just as one can measure the relevance of a single gene. If a gene pair occurs more frequently in SL than in the entire document collection, then the pair is considered relevant to SL. Using method (c), we find that the CBFB-MYH11 pair occurs 28 times with leukemia, and 32 times overall, giving the pair a relevance score of 32.49 to leukemia.
Searching through the literature, we find why CBFB and MYH11 are complementary to such an extent: “In human acute myeloid leukemia samples with chromosome 16 inversion, a fusion gene CBFB-MYH11 is created and expressed. This novel gene includes most of the CBFB gene, a hematopoietic transcription factor, and the last half of MYH11” <http://www.umassmed.edu/pgfe/faculty/castilla.cfm>.
We find that genes located on the same chromosome are frequently studied together, which may or may not indicate an interesting gene interaction.
Disambiguating Gene Symbols
When attempting to extract gene symbols from text, we face the problem of polysemy—the use of one symbol to refer to several terms. Ideally, we would like to know whether a symbol refers to a gene in order to correctly match genes to particular diseases or conditions. As shown in the method 220 in
The method 220 calculates the likelihood that a symbol represents a gene by comparing the number of article titles and abstracts containing the symbol as well as words such as “gene”, “DNA”, “inhibit”, “express”, to the total number of articles in which the symbol occurs. The higher the value of this ratio rG, the greater the likelihood that any given instance of the symbol is a gene reference. Thus, if the ratio rG is above a threshold, then the method 220 can accept (435) the symbol as a gene reference. Typically, the threshold may be set to approximately 0.5. Otherwise, if the ratio rG is below the threshold, then the method 220 can reject (440) the symbol as a gene reference.
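A minimal sketch of the ratio rG follows. The four context words are the examples named in the text; treating them as case-insensitive word prefixes (so that “expression” counts as “express”) is an assumption, as are the function names and the sample texts.

```python
import re

# Context words named in the text, matched as word prefixes (an assumption).
# Note that a prefix such as "gene" also matches unrelated words like "general".
CONTEXT_RE = re.compile(r"\b(?:gene|dna|inhibit|express)", re.IGNORECASE)

def gene_context_ratio(symbol, texts):
    """Fraction of texts containing the symbol that also contain a gene context word."""
    symbol_re = re.compile(r"\b" + re.escape(symbol) + r"\b")
    with_symbol = [t for t in texts if symbol_re.search(t)]
    if not with_symbol:
        return 0.0
    with_context = sum(1 for t in with_symbol if CONTEXT_RE.search(t))
    return with_context / len(with_symbol)

def is_gene_symbol(ratio, threshold=0.5):
    """Accept the symbol as a gene reference when the ratio exceeds the threshold."""
    return ratio > threshold

# Hypothetical titles/abstracts used only for illustration.
texts = [
    "DCC gene expression is reduced in colon tumors",
    "DCC assay of receptor levels",
    "Loss of DCC inhibits differentiation",
    "No symbol here, but a gene is mentioned",
]
r_g = gene_context_ratio("DCC", texts)
```

In this toy example, three texts contain the symbol and two of them also contain a context word, so the ratio is 2/3 and the symbol would be accepted at a threshold of 0.5.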
While using rG alone can be useful for positively identifying gene symbols with little ambiguity (i.e., the symbol is almost always used to refer to a gene), additional information may be needed to disambiguate symbols with multiple meanings. For example, the symbol DCC, used to denote the “deleted in colon cancer” gene, also occurs in the Medline abstracts as an abbreviation for “dextran coated charcoal”, “dicyclohexylcarbodiimide”, “day care center” and many other concepts. Its rG is only 0.46, which places it below our threshold of 0.5. This information alone does not allow us to judge with certainty whether the symbol DCC refers to the gene in any given article.
Fortunately, authors sometimes offer on first mention a definition followed by the symbol itself in parentheses. In procedure (405), the method 220 extracts the words preceding the parentheses and selects those most likely to form a definition, and then compares the definitions with the official gene name or names associated with an alias, if available. It is typically necessary for this operation to be fuzzy, as definitions are not always exact matches. For example, one author may define the symbol ER as “estrogen receptor” (an exact match for the definition) while another may define it as “estrogen receptors.” To support this variability, the algorithm attempts to break definitions into smaller components and compare the overlap of those to the initial definition. Specifically, the technique used is the deconstruction of definitions into n-grams, or substrings of length n. The 3-grams for “estrogen receptor,” for example, are: est, str, tro, rog, etc. The power of such a technique is that it extracts “root” meanings from terms that are impossible to determine by direct comparison. For example, “estradiol receptor” and “estrogen receptor” are basically the same thing, but only a technique such as n-grams will be able to determine this. The similarity between the official definition and the proposed definition is:
Here, the numerator is the number of intersecting n-grams between the true definition, A, and the proposed definition, B. The denominator is a normalization factor based on the number of n-grams in both definitions. The resulting similarity value is then compared to a threshold. If the match is above the threshold, then the symbol is accepted (410) as a valid gene symbol. If the match is below the threshold and there are few definitions, the symbol is still accepted (420) as a valid gene symbol, because this condition indicates a high overall likelihood that the symbol is valid. In contrast, if there are many definitions, then the symbol is rejected (425) as a valid gene symbol.
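The n-gram comparison can be sketched as follows. The exact normalization factor is not reproduced in the text above, so this sketch assumes a Dice-style normalization (twice the number of shared n-grams divided by the total number of n-grams in both definitions); the function names are likewise introduced here for illustration.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a definition, compared case-insensitively."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Similarity in [0, 1] between two definitions.

    Numerator: number of n-grams shared by both definitions.
    Denominator: assumed Dice-style normalization over both n-gram sets.
    """
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

exact = ngram_similarity("estrogen receptor", "estrogen receptor")
plural = ngram_similarity("estrogen receptors", "estrogen receptor")
related = ngram_similarity("estradiol receptor", "estrogen receptor")
unrelated = ngram_similarity("dextran coated charcoal", "estrogen receptor")
```

Under this assumed normalization, an exact match scores 1, the plural form scores only slightly lower, “estradiol receptor” retains a substantial overlap, and an unrelated expansion such as “dextran coated charcoal” shares no 3-grams with “estrogen receptor”.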
As an example, Table 2 lists an evaluation of the symbol DCC as a possible reference to the “deleted in colon cancer” gene for two diseases: breast cancer and colon cancer. The number of occurrences and the matching score (from 0 to 1, low to high) are given after each extracted definition of the symbol. Thus, Table 2 shows how the symbol “DCC” is disambiguated in two contexts, one of breast cancer and the other of colon cancer. Although the symbol occurs twice as often in documents dealing with breast cancer, an embodiment of the invention allows us to recognize that DCC in the context of colon cancer stands for the “deleted in colon cancer” gene, but stands for “dextran coated charcoal” in the breast cancer context. The dextran coated charcoal assay is the preferred method used to quantify the presence of estrogen and progesterone receptors in breast cancer tissue. This makes the symbol DCC highly relevant to breast cancer, but not the gene DCC itself. By analyzing the definitions accompanying the symbol, we were able to give opposite, but correct, classifications for DCC in two different contexts. The results shown in Table 2 may be shown, for example, in the display 115 of the server 110 (
Alternative Features or Other Modifications
The various engines or modules discussed herein may be, for example, software, commands, data files, programs, code, modules, instructions, or the like, and may also include suitable mechanisms.
Reference throughout this specification to “one embodiment”, “an embodiment”, or “a specific embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment”, “in an embodiment”, or “in a specific embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Other variations and modifications of the above-described embodiments and methods are possible in light of the foregoing teaching.
Further, at least some of the components of an embodiment of the invention may be implemented by using a programmed general purpose digital computer, by using application specific integrated circuits, programmable logic devices, or field programmable gate arrays, or by using a network of interconnected components and circuits. Connections may be wired, wireless, by modem, and the like.
It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application.
It is also within the scope of the present invention to implement a program or code that can be stored in a machine-readable medium to permit a computer to perform any of the methods described above.
Additionally, the signal arrows in the drawings/Figures are considered as exemplary and are not limiting, unless otherwise specifically noted. Furthermore, the term “or” as used in this disclosure is generally intended to mean “and/or” unless otherwise indicated. Combinations of components or actions will also be considered as being noted, where terminology is foreseen as rendering the ability to separate or combine unclear.
As used in the description herein and throughout the claims that follow, “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the claims. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
Relation | Number | Date | Country
---|---|---|---
Parent | 10107377 | Mar 2002 | US
Child | 11188538 | Jul 2005 | US