The technical field relates to methods of identifying common properties within a set of biomolecules and properties that connect two or more sets of biomolecules, and also relates to methods for deriving functional explanations or hypotheses to explain the relationship between a set of biomolecules (e.g., genes, proteins) and between multiple sets of biomolecules.
Biomedical research is in the midst of an unprecedented data explosion. Complete genome sequences of prokaryotic organisms are appearing in the literature and on the World Wide Web on almost a weekly basis. See e.g., http://igweb.integratedgenomics.com/GOLD/. Several complete genomes from model eukaryotic organisms have also been sequenced, and many more sequencing projects are in various stages of planning and execution. See e.g., http://www.nih.gov/science/models/. The sequence of the human genome is also now freely available in “finished” form. See e.g., http://www.ncbi.nlm.nih.gov/genome/guide/human/ or http://www.ensembl.org/Homo_sapiens/. Combined with the growing availability of high-throughput and genome-wide experimental methods, this deluge of data makes it possible to compare sequence, structure, mRNA- or protein-expression levels, and function between all human genes and the genes of model organisms. It also opens up new challenges for determining the functional and cellular role of the many as yet uncharacterized genes within these organisms.
As research into genomics and proteomics progresses, experimental results are beginning to transcend a single gene of interest and more commonly involve sets of genes or other biomolecules that behave in some sense “similarly” or share a common property. Although computational tools that allow for a comparison of one gene to all other known genes at the level of primary nucleic acid or amino acid sequence have existed for some time (e.g., BLAST; Altschul et al., 1990), such comparisons often do not yield sufficient information to allow for the identification of a specific function for that gene. Indeed, it is very common for genes that share little or no similarity at the nucleic acid sequence level to encode proteins that have related functions or roles. For example, two genes might encode enzymes that catalyze adjacent steps in the same biochemical pathway, and the functional disruption of either gene might lead to a similar outcome for the cell or organism (e.g., a human disease). These genes would be unlikely to exhibit similarity at the primary nucleic acid sequence level, and thus current search strategies would not identify these genes as being related despite the similar phenotype that would result from their functional disruption. By way of additional example, this problem is also encountered in areas such as transcriptome analysis, where lists of genes with similar expression levels or time-profiles are generated from each experiment. Thus, there persists a great need for computational methods for determining the underlying commonality among a set of genes and for ways of assigning consensus annotations to such gene sets.
One currently available approach for analyzing genes is a World Wide Web-based tool that collects and displays information gene-by-gene for a predefined set of genes, such as disease candidates, by creating a “home page” for each gene in the set. Halushka et al., 1999. This and other approaches (see e.g., Bouton and Pevsner, 2000; Bouton and Pevsner, 2002; Khatri et al., 2002; Ostermeier et al., 2002) lack breadth and do not comprehensively address the universe of possible interactions, traits, and characteristics between genes.
Some other approaches involving text mining of published scientific abstracts have been developed for use in gene expression profiling (see e.g., Tanabe et al., 1999; Masys et al., 2001; Blaschke et al. 2001), or for finding links between genes and diseases (Jenssen et al., 2001; Perez-Iratxeta et al., 2002a). The latter group has recently demonstrated the feasibility of mining MEDLINE abstracts to generate lists of candidate genes that are believed to be associated with a group of inherited diseases. Perez-Iratxeta et al., 2002b.
Computational methods have been proposed that pertain to partitioning of genotype variation into clusters that predict quantitative trait variation, such as elevated plasma triglyceride levels. Nelson et al., 2001. An extension of this method has been used to uncover a combination of polymorphisms in several estrogen metabolism genes that correlates with increased sporadic breast cancer occurrence. Ritchie et al., 2001. A support-vector machine approach was employed to make gene functional classifications based on phylogenetic profiles and expression data. Pavlidis et al., 2002. Additionally, a graph theoretic method for combining microarray data with protein interaction maps as a way of annotating sets of genes from transcriptome experiments has been described. del Rio et al., 2001.
While the above methods attempt to address the general problem of assigning consensus annotations to gene sets, these approaches do not offer a comprehensive solution to the problem of identifying the properties of a set of biomolecules and correlating these properties with other sets of biomolecules for which a common property has been defined.
What is needed, therefore, is a method of identifying various properties of a given set of biomolecules and correlating these properties with multiple sets of biomolecules that are common to a given biological process or pathway. Such a method would facilitate the characterization of a set of unknown biomolecules, including an assessment of the function of the unknown biomolecules. These and other problems are addressed herein.
Provided is a method of identifying a relationship between one or more candidate biomolecules and one or more reference biomolecules. In one embodiment, the method comprises: (a) inputting to a computer a query set describing the one or more candidate biomolecules; (b) comparing the query set with a target database describing the one or more reference biomolecules, wherein the one or more reference biomolecules are grouped into one or more buckets, and wherein the one or more reference biomolecules of each bucket share a common property; (c) counting a number of matches between each query set and each bucket of the target database; and (d) statistically analyzing each match, wherein the presence of a statistically significant match identifies a relationship between the query set and a bucket of the target database.
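By way of illustration only, steps (a)-(d) can be sketched as follows. The sketch assumes that candidates and references are matched by shared identifier and that the statistic is a hypergeometric tail probability; neither choice is prescribed by the text above, and the function names `hypergeom_tail` and `analyze` are hypothetical.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n items are drawn without replacement from a
    universe of N items, K of which belong to the bucket.
    Assumption: a hypergeometric model; the text does not fix the statistic."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

def analyze(query, buckets, universe_size):
    """Count matches between a query set and each bucket, then score each count.
    Assumption: a 'match' is a shared identifier (set intersection)."""
    results = []
    for name, members in buckets.items():
        matches = len(query & members)
        p = hypergeom_tail(matches, len(query), len(members), universe_size)
        results.append((name, matches, p))
    # Most significant (smallest p-value) first.
    return sorted(results, key=lambda r: r[2])

# Hypothetical identifiers and bucket names, for illustration only.
query = {"g1", "g2", "g3"}
buckets = {"kinase": {"g1", "g2", "g3", "g4"}, "receptor": {"g9"}}
for name, matches, p in analyze(query, buckets, universe_size=100):
    print(name, matches, p)
```

In a sequence-based embodiment, the set intersection would be replaced by the output of a similarity search.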
Also provided is a method of identifying a relationship between two or more region sets, each region set describing one or more candidate biomolecules, and a target database describing one or more reference biomolecules grouped into one or more buckets. In one embodiment, the method comprises: (a) providing a query set describing two or more region sets, each region set comprising one or more candidate biomolecule sequences extracted from one genetic region; (b) comparing the query set with target database sequences describing one or more reference biomolecule sequences, wherein the target database sequences are grouped into one or more buckets, and wherein the one or more reference biomolecules of each bucket share a common property; (c) counting a number of matches between each query set and each bucket of the target database; and (d) statistically analyzing each match, wherein the presence of a statistically significant match identifies a relationship between the query set and a bucket of the target database. In one embodiment, the method further comprises (e) constructing a plurality of replicates of the one or more query sets; (f) modeling the replicates at random chromosomal locations to form a random location data set; (g) processing the random location data set by following steps (a)-(d); (h) quantifying the number of times each match is found to surpass a predetermined threshold to form a statistically significant set of random location matches; and (i) comparing the statistically significant set of random location matches to the statistically significant relationship of steps (a)-(d).
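The random-location comparison of steps (e)-(i) can be sketched as a simple resampling procedure. The sketch below makes several assumptions not stated above: the genome is modeled as a single ordered list of gene identifiers, each replicate is a contiguous window of the same size as the original region set, and the "predetermined threshold" is taken to be the observed match count. All names are hypothetical.

```python
import random

def region_matches(region, bucket):
    """Number of genes in the region that are members of the bucket."""
    return len(set(region) & bucket)

def random_location_fraction(genome, region_size, bucket, threshold,
                             replicates=1000, seed=0):
    """Place `replicates` windows of `region_size` genes at random positions
    and return the fraction whose match count reaches `threshold`."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    hits = 0
    for _ in range(replicates):
        start = rng.randrange(len(genome) - region_size + 1)
        window = genome[start:start + region_size]
        if region_matches(window, bucket) >= threshold:
            hits += 1
    return hits / replicates

# Hypothetical genome of 100 ordered genes and a bucket of three neighbors.
genome = [f"g{i}" for i in range(100)]
bucket = {"g10", "g11", "g12"}
observed = region_matches(genome[10:13], bucket)
print(random_location_fraction(genome, 3, bucket, observed))
```

A small fraction suggests that the observed relationship is unlikely to arise from chromosomal position alone.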
In various embodiments, query sets comprise one or more sequences, including, but not limited to, DNA, RNA, or protein sequences. In one embodiment, these sequences are derived from one genetic region. In one embodiment, the one or more candidate biomolecules and the one or more reference biomolecules are all selected from the group consisting of proteins, nucleic acids, and small molecules. In one embodiment, the comparing comprises employing a BLAST-based algorithm to identify similar or identical sequences. In one embodiment, the counting comprises applying one or more principles chosen from the group consisting of (a) each query set candidate sequence can match at most one reference sequence in any given bucket; (b) each query set candidate sequence can possess a match in one or more different buckets; and (c) once a candidate sequence in the query set matches a specific bucket reference sequence in the target database, any subsequent matches of that same candidate sequence to other reference sequences in that bucket do not increase the match count for the bucket. In one embodiment, the statistically analyzing comprises computing one or more statistics for each match, which can optionally be sorted and/or outputted to a webpage comprising one or more hyperlinks.
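The counting principles (a)-(c) amount to counting, for each bucket, the number of distinct candidate sequences with at least one match. A minimal sketch follows, assuming that the comparison step has already produced a list of (candidate, bucket, reference) hits; this tuple layout and the function name `count_matches` are illustrative assumptions.

```python
from collections import defaultdict

def count_matches(hits):
    """Return {bucket: match count}, where each candidate contributes at most
    one match per bucket but may contribute to several different buckets."""
    matched = defaultdict(set)
    for candidate, bucket, _reference in hits:
        # A set ignores repeat hits of the same candidate within a bucket.
        matched[bucket].add(candidate)
    return {bucket: len(candidates) for bucket, candidates in matched.items()}

# Hypothetical hits: q1 matches two kinase references and one receptor reference.
hits = [
    ("q1", "kinase", "r1"),
    ("q1", "kinase", "r2"),
    ("q1", "receptor", "r7"),
    ("q2", "kinase", "r1"),
]
print(count_matches(hits))
```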
Also provided is a computer-readable medium having stored thereon a data structure having multiple data fields, comprising (a) a first data field containing data representing a bucket; (b) a second data field containing data representing a name for the bucket; and (c) a third data field containing data representing a list of members of the bucket, wherein the members have a common property.
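One possible in-memory form of such a data structure is sketched below. The class name, field names, and bucket name are hypothetical; only the three-field layout is taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Bucket:
    bucket_id: int          # first data field: represents the bucket itself
    name: str               # second data field: a name for the bucket
    members: List[str] = field(default_factory=list)  # third data field: members sharing a property

# Hypothetical bucket holding a single accession number as its member.
globins = Bucket(bucket_id=1, name="globin", members=["AF007546"])
print(globins)
```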
Also provided is a method of making a target database. In one embodiment, the method comprises: (a) identifying a source of informative content; (b) arranging informative content from the source of informative content into a set of buckets, wherein the buckets are given names; (c) gathering the names of the buckets and a list of biomolecules present in each bucket; and (d) creating and loading into a database data fields containing data representing (i) the set of buckets; (ii) the list of biomolecules present in each bucket; and (iii) a description for each biomolecule present in each bucket. In one embodiment, the source of informative content is a publicly available database, including, but not limited to, SwissProt, TrEMBL, and NCBI. In one embodiment, the gathering is accomplished using a source-specific parsing script. In one embodiment, the creating and loading is accomplished using a database loading script. In one embodiment, the data representing a description for each biomolecule present in each bucket is selected from the group consisting of a nucleic acid sequence, an amino acid sequence, or an identification number, wherein the identification number allows for retrieval of a nucleic acid sequence or an amino acid sequence.
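The creating-and-loading step (d) can be sketched with a small relational schema. The sketch assumes SQLite as the database and assumes that a source-specific parsing script has already produced a mapping from (bucket name, source) to a list of (identifier, description) pairs; the table names, column names, and sample values are hypothetical.

```python
import sqlite3

def load_target_database(conn, parsed):
    """Create the bucket and member tables if needed and load parsed content."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bucket "
        "(id INTEGER PRIMARY KEY, name TEXT, source TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS member "
        "(bucket_id INTEGER, identifier TEXT, description TEXT)"
    )
    for (name, source), members in parsed.items():
        cur = conn.execute(
            "INSERT INTO bucket (name, source) VALUES (?, ?)", (name, source)
        )
        conn.executemany(
            "INSERT INTO member (bucket_id, identifier, description) VALUES (?, ?, ?)",
            [(cur.lastrowid, ident, desc) for ident, desc in members],
        )
    conn.commit()

# Hypothetical parsed content from one source.
parsed = {("glycolysis", "example_source"): [("id_1", "MKT"), ("id_2", "MSA")]}
conn = sqlite3.connect(":memory:")
load_target_database(conn, parsed)
print(conn.execute("SELECT COUNT(*) FROM member").fetchone()[0])
```

Storing the source alongside the name keeps same-named buckets from different sources distinguishable.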
Also provided is a computer readable storage device embodying programs of instructions executable by a computer for performing the disclosed methods.
Accordingly, it is an object to provide a novel method for characterizing a set of biomolecules. This and other objects are achieved in whole or in part as disclosed herein.
An object having been stated hereinabove, other objects will be evident as the description proceeds, when taken in connection with the accompanying drawings and examples as best described hereinbelow.
The disclosed methods and data structures can be implemented in hardware, firmware, software, or any combination thereof. In one exemplary embodiment, the methods and data structures disclosed herein for classifying biomolecules can be implemented as computer readable instructions and data structures embodied in a computer-readable medium.
With reference to
Hard disk drive 107, magnetic disk drive 108, and optical disk drive 110 are connected to system bus 103 by a hard disk drive interface 112, a magnetic disk drive interface 113, and an optical disk drive interface 114, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules, and other data for personal computer 100. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 109, and a removable optical disk 111, it will be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories, read only memories, and the like can also be used in the exemplary operating environment.
A number of program modules can be stored on the hard disk, magnetic disk 109, optical disk 111, ROM 104, or RAM 105, including an operating system 115, one or more applications programs 116, other program modules 117, and program data 118.
A user can enter commands and information into personal computer 100 through input devices such as a keyboard 120 and a pointing device 122. Other input devices (not shown) can include a microphone, touch panel, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to processing unit 101 through a serial port interface 126 that is coupled to the system bus, but can be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB). A monitor 127 or other type of display device is also connected to system bus 103 via an interface, such as a video adapter 128. In addition to the monitor, personal computers typically include other peripheral output devices, not shown, such as speakers and printers. The user can use one of the input devices to input data indicating the user's preference between alternatives presented to the user via monitor 127.
Personal computer 100 can operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 129. Remote computer 129 can be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to personal computer 100, although only a memory storage device 130 has been illustrated in
System area networking environments are used to interconnect nodes within a distributed computing system, such as a cluster. For example, in the illustrated embodiment, personal computer 100 can comprise a first node in a cluster and remote computer 129 can comprise a second node in the cluster. In such an environment, it is preferable that personal computer 100 and remote computer 129 be under a common administrative domain. Thus, although computer 129 is labeled “remote”, computer 129 can be in close physical proximity to personal computer 100.
When used in a LAN or SAN networking environment, personal computer 100 is connected to local network 131 or system network 133 through network interface adapters 134 and 134a. Network interface adapters 134 and 134a can include processing units 135 and 135a and one or more memory units 136 and 136a.
When used in a WAN networking environment, personal computer 100 typically includes a modem 138 or other device for establishing communications over WAN 132. Modem 138, which can be internal or external, is connected to system bus 103 via serial port interface 126. In a networked environment, program modules depicted relative to personal computer 100, or portions thereof, can be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other approaches to establishing a communications link between the computers can be used.
I. Definitions
Following long-standing patent law convention, the terms “a” and “an” mean “one or more” when used in this application, including the claims.
As used herein, the term “about,” when referring to a value or to an amount of mass, weight, time, volume, concentration or percentage is meant to encompass variations of ±20% or ±10%, in another example ±5%, in another example ±1%, and in still another example ±0.1% from the specified amount, as such variations are appropriate to perform the disclosed method.
As used herein, the terms “amino acid” and “amino acid residue” are used interchangeably and mean any of the twenty naturally occurring amino acids. An amino acid is formed upon chemical digestion (hydrolysis) of a polypeptide at its peptide linkages. In keeping with standard polypeptide nomenclature, abbreviations for amino acid residues are shown in tabular form presented hereinabove. In addition, the phrases “amino acid” and “amino acid residue” are broadly defined to include modified and unusual amino acids.
As used herein, the term “biomolecule” means any molecule isolated from, derived from, or based on a molecule found in a living organism, including viruses. The term biomolecule includes, but is not limited to, both proteins and nucleic acids (RNA and DNA). Biomolecules can be polymeric in nature and can comprise a unique sequence of monomers; for example, a biomolecule can comprise a nucleic acid (e.g., a gene, and fragments thereof), an amino acid, a derivatized protein (e.g., a glycosylated protein), a nucleic acid comprising a nucleic acid analog, a peptide nucleic acid (PNA), an antibody, as well as peptides, polypeptides, proteins and fragments thereof. As used herein, the term “biomolecule” also refers to any molecule that is capable of producing a biological effect or participating in a biological process. In this context, a biomolecule includes, but is not limited to, a small molecule such as a drug.
As used herein, the term “BLAST-formatted database” means a database wherein the data representing a nucleic acid or amino acid sequence of a candidate or reference biomolecule is in a form amenable to manipulation by BLAST and BLAST-based algorithms. The proper form for such sequences is described in Altschul et al., (1990). See also http://blast.wustl.edu/doc/FAQ-Indexing.html. The BLAST-formatted database acts as a master repository for all nucleic acid and amino acid sequences. It includes data entries for nucleic acid and amino acid sequences corresponding to all reference biomolecules as well as identification or accession numbers by which these sequences can be accessed for use in the methods and devices disclosed herein. In addition, data is automatically added to the BLAST-formatted database corresponding to the nucleic acid and amino acid sequences of all candidate biomolecules.
As used herein, the term “bucket” means any grouping of biomolecules (e.g., genes or gene products) that share a biological property. For example, a gene or gene product can have an identifier and/or an associated sequence (amino acid or nucleic acid). In one example, an identifier is a standard name for the gene or gene product (e.g., “human beta-globin”). In another example, the identifier is an identification number or an accession number that allows the sequence of the gene or gene product to be retrieved from a source (e.g., the NCBI accession number for the human beta-globin complete coding sequence is AF007546). A source includes, but is not limited to a public or private database. The identifier need not be unique, and a given gene can be a member of one or more buckets.
Each bucket can have a unique name, which can also indicate its origin and/or creator. Buckets and collections of buckets can be created by individuals or they can be defined as the results from various types of analyses. For example, a bucket can comprise a set of genes found to be more highly expressed in a particular tumor cell compared with a normal cell. Buckets can also be created from public-domain databases. As an additional example, a bucket can include all the component enzymes in a metabolic pathway, all the protein components in a signaling pathway, biomolecules mentioned in the same publication, biomolecules mentioned in publications on the same subject, sets of proteins sharing a particular sequence motif or domain, sets of genes known to be present on an oligonucleotide array or chip, genes classified into particular categories according to an ontology, gene products present in a particular tissue or organ or subcellular location, or genes in which a particular keyword occurs somewhere in their associated annotations. A bucket can form an element of a target database.
As used herein, the term “bucket source” means any medium or entity to which the origin of the bucket can be traced. For example, a bucket source can be a user. In another example, a bucket source can be a database. In yet another example, a bucket source can be the results of a search of a database done with user-specified parameters. Defining a bucket source can be useful as an approach for identifying different buckets that have the same name. The use of bucket sources also allows broad categories of buckets to be defined, such as “pathway” or “function” buckets.
As used herein, the terms “candidate biomolecule” and “candidate sequence” are used interchangeably, and mean a biomolecule or sequence that is part of a query set to be compared to a target database. Candidate biomolecules are ones that the user is attempting to characterize as having or not having the various properties that are represented by the buckets of the target database. This characterization is accomplished by comparing a candidate biomolecule to the reference biomolecules of the target database and statistically analyzing the number of matches that result from the comparison. When a statistically significant match (of the query set) is found to a particular bucket, the user can infer that the candidate biomolecule has the property that is common to the reference biomolecules that are members of the bucket to which the match was made.
As used herein, the terms “describing” and “description” as they relate to biomolecules mean any categorization of the biomolecule that relates to its identity or to a property it possesses. In one example, a biomolecule can be described by its common name, such as “human beta-globin”, “mouse erythropoietin receptor”, “Drosophila fushi tarazu”, etc. In another example, a biomolecule can be described by its nucleic acid or amino acid sequence. In another example, a biomolecule can be described by an identification number or accession number that allows its corresponding nucleic acid and/or amino acid sequence to be retrieved from a source such as a public or private database. In yet another example, a biomolecule can be described by a property that it possesses. In one example, a property can be a functional description of the biomolecule such as “kinase”, “receptor”, “cytokine”, “oncogene”, “ligand”, etc. In another example, a property can include the organism from which the biomolecule was isolated. In another example, the property can include a biochemical pathway in which the gene product plays a role including, but not limited to pyrimidine biosynthesis, the citric acid cycle, fatty acid biosynthesis, the pentose cycle, amino acid biosynthesis, etc. In yet another example, the property can include a three-dimensional (3D) structural feature of the biomolecule. Several ways of incorporating structural information into a search exist including, but not limited to creating sets of buckets based on public databases that impose a structural hierarchy on those proteins with known three-dimensional structures. Exemplary public databases include CATH (http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html) and SCOP (http://scop.berkeley.edu/). It is generally believed that proteins with at least 30% overall amino acid identity are likely to fold into very similar structural conformations (McGuffin & Jones 2002).
A more general method might be to reduce or project known 3D structures to a sequence-like character string, comprising the secondary structure adopted by each amino acid (e.g., hhhhhhhhhsssshhhhhhhhhhhhh as a helix-loop-helix motif). A BLAST-like method could optionally be used to compare the length and order of secondary-structural elements of known proteins. Secondary structure predictions for proteins with no known structure could also be compared to those of a database of known structures (see Aurora & Rose 1998).
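As a rough sketch of this projection, a per-residue secondary-structure string can be collapsed into runs so that the length and order of structural elements can be compared. The one-letter alphabet and the function names below are assumptions, and the comparison shown is a simple order check rather than a BLAST-like alignment.

```python
from itertools import groupby

def secondary_structure_runs(ss):
    """Collapse a per-residue string into (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(ss)]

def same_element_order(ss_a, ss_b):
    """True if two strings contain the same elements in the same order,
    ignoring the length of each element."""
    order_a = [symbol for symbol, _ in secondary_structure_runs(ss_a)]
    order_b = [symbol for symbol, _ in secondary_structure_runs(ss_b)]
    return order_a == order_b

print(secondary_structure_runs("hhhsshh"))
print(same_element_order("hhhsshh", "hssssh"))
```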
Yet another possibility is to create structure-specific buckets by computing a root-mean-squared distance (rmsd) measure between the 3D structural coordinates of any two proteins. For example, buckets for all structures within 2 Å rmsd of each other could be defined.
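A minimal sketch of such an rmsd-based bucket follows. It assumes that the structures have already been superimposed and have equal numbers of corresponding atoms, and it simplifies "within 2 Å of each other" to "within 2 Å of a chosen reference structure"; the function names and coordinates are hypothetical.

```python
from math import sqrt

def rmsd(coords_a, coords_b):
    """Root-mean-squared distance between two equal-length lists of (x, y, z).
    Assumption: the two structures are already optimally superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return sqrt(squared / len(coords_a))

def structure_bucket(reference, structures, cutoff=2.0):
    """Names of all structures whose rmsd to the reference is within the cutoff."""
    return sorted(
        name for name, coords in structures.items()
        if rmsd(reference, coords) <= cutoff
    )

# Hypothetical two-atom structures, coordinates in angstroms.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structures = {
    "near": [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],
    "far": [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0)],
}
print(structure_bucket(reference, structures))
```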
As used herein, the phrase “extracted from one genetic region” refers to sequences derived from genes that are present in a contiguous region of a genome or to protein sequences that are encoded by sequences derived from genes that are present in a contiguous region of a genome. “One genetic region” and “the same region of a genome” include, but are not limited to a chromosome, an arm of a chromosome, a portion of a chromosome contained between two markers, and a band of a chromosome as visualized by banding techniques that are known in the art such as Giemsa banding. These terms also include any other measure of physical proximity on a chromosome, including but not limited to a kilobase, a megabase, or a centimorgan (cM).
As used herein, the term “mutation” carries its traditional connotation and means a change, inherited, naturally occurring, or introduced, in a nucleic acid or polypeptide sequence, and is used in its sense as generally known to those of skill in the art. A mutation can be any (or a combination of) detectable, unnatural change affecting the chemical or physical constitution, mutability, replication, phenotypic function, or recombination of one or more deoxyribonucleotides. Nucleotides can be added, deleted, substituted for, inverted, or transposed to new positions with and without inversion. Mutations can occur spontaneously and can be induced experimentally by application of mutagens. A mutant variation of a nucleic acid molecule results from a mutation. A mutant polypeptide can result from a mutant nucleic acid molecule and can also refer to a polypeptide that is modified at one or more amino acid residues from the wild-type (i.e., naturally occurring) polypeptide. For example, the mutation can be a point mutation or the addition, deletion, insertion, and/or substitution of one or more nucleotides, or any combination thereof. The mutation can be a missense or frameshift mutation. Modifications can be, for example, conserved or non-conserved, natural or unnatural.
As used herein, “nucleic acid” and “nucleic acid molecule” refer to any of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), oligonucleotides, fragments generated by the polymerase chain reaction (PCR), and fragments generated by any of ligation, scission, endonuclease action, and exonuclease action. Nucleic acids can comprise monomers that are naturally occurring nucleotides (such as deoxyribonucleotides and ribonucleotides), or analogs of naturally occurring nucleotides (e.g., α-enantiomeric forms of naturally occurring nucleotides), or a combination of both. Modified nucleotides can have modifications in sugar moieties and/or in pyrimidine or purine base moieties. Sugar modifications include, for example, replacement of one or more hydroxyl groups with halogens, alkyl groups, amines, and azido groups. Sugars can also be functionalized as ethers or esters. Moreover, the entire sugar moiety can be replaced with sterically and electronically similar structures, such as aza-sugars and carbocyclic sugar analogs. Examples of modifications in a base moiety include alkylated purines and pyrimidines, acylated purines or pyrimidines, or other well-known heterocyclic substitutes. Nucleic acid monomers can be linked by phosphodiester bonds or analogs of phosphodiester bonds. Analogs of phosphodiester linkages include phosphorothioate, phosphorodithioate, phosphoroselenoate, phosphorodiselenoate, phosphoroanilothioate, phosphoranilidate, phosphoramidate, and the like. The term “nucleic acid” also includes so-called “peptide nucleic acids,” which comprise naturally occurring or modified nucleic acid bases attached to a polyamide backbone. Nucleic acids can be either single stranded or double stranded.
As used herein, the term “property” denotes any feature of a biomolecule. Properties include, but are not limited to, sequence similarity and/or identity, chromosomal location, involvement in a particular biochemical pathway, association with genetic disease, expression in a context, three-dimensional structural features, and having or encoding a particular functional domain. Representative functional domains include, but are not limited to, kinase domains, growth factor binding domains, phosphorylation sites, glycosylation sites, protein and/or nucleic acid binding sites, protein-protein interaction domains, and post-translational modification sites.
As used herein, the term “quality checking” means the application of subjective criteria to assess the usefulness of a bucket. Quality checking ensures that all reference biomolecules that have been grouped into a bucket share the common property used to describe the bucket. These criteria attempt to take into account the nature of the data analysis involved in assembling the bucket. For example, reliable human-annotated sources (e.g., the SwissProt database) would receive a higher rank than a set generated by some automated computational procedure.
As used herein, the term “query set” means any item or group of items arranged in such a way as to allow for comparison to a target database. By way of example and not limitation, a query set can include a nucleic acid sequence, an amino acid sequence, or a combination thereof. Query sets can be produced by manual grouping of items. In another example, a query set can be produced by techniques including, but not limited to text mining of sequence databases and literature, homology searches, annotation keyword searches, or any other technique that generates a group of items that are believed to share a common property. Query sets can comprise results from one or more biological experiments, for example as raw data or as a product of statistical or other data analyses.
As used herein, the term “query sequence” means a member of a query set. In one example, a query sequence is a nucleic acid or amino acid sequence. In one embodiment, query sequences can be grouped together to form one or more query sets. The query set(s) is/are then compared to a target database that has been organized into buckets, the members of each bucket sharing a common property.
As used herein, the term “reference biomolecule” refers to the members of the buckets that make up a target database. In one embodiment, a “reference biomolecule” is a “reference sequence”. Reference biomolecules are arranged in a target database into buckets, wherein the reference biomolecules in each bucket share a common property.
As used herein, the term “region set” or “region sets” means a set containing at least some, and optionally all, of the known and predicted genes that lie within a contiguous region of a genome. A region set might have as its members all the genes either known or predicted to reside, in one example, on a certain chromosome; in another example, on one arm of a certain chromosome; in another example, on a portion of a chromosome contained between two markers; in another example, on that area of a certain chromosome corresponding to a particular chromosomal band as visualized by G-banding with Giemsa stain; or in yet another example, within a certain distance of each other on a certain chromosome. This distance can be measured in bases, kilobases, megabases, or centimorgans (cM).
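For illustration, a region set bounded by two positions on a chromosome could be assembled as in the sketch below. It assumes each gene is annotated with a chromosome name and a single base-pair position; the tuple layout, gene names, and positions are hypothetical.

```python
def region_set(genes, chromosome, start, end):
    """Return the names of genes on `chromosome` whose position lies
    between `start` and `end` (inclusive, in base pairs)."""
    return [
        name for name, chrom, position in genes
        if chrom == chromosome and start <= position <= end
    ]

# Hypothetical annotations: (gene name, chromosome, position in base pairs).
genes = [
    ("geneA", "chr1", 1_000),
    ("geneB", "chr1", 250_000),
    ("geneC", "chr1", 2_000_000),
    ("geneD", "chr2", 5_000),
]
print(region_set(genes, "chr1", 0, 1_000_000))
```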
As used herein, the term “relationship” means any association between one or more entities. Relationships include, but are not limited to nucleic acid and/or amino acid sequence similarity and/or identity, presence in the same region of a genome or being encoded by genes present in the same region of the genome, having the same or a similar function, containing or encoding a common functional domain, containing a common three-dimensional structural feature, association with a similar phenotype such as a disease state, involvement in the same biochemical pathway, and any combination thereof.
As used herein, the term “relevant universe of all characterized sequences” means all sequences that have been characterized to an extent sufficient to allow the user to conclude that the corresponding biomolecules should or should not be placed into a bucket. This conclusion can be based upon an assessment or a hypothesis as to whether or not a given biomolecule has the property shared by the members of a given bucket. When several complementary or competing sources exist for assigning biomolecules to a bucket (e.g., kinase buckets as defined from several different sources or methods) all can be included rather than attempting to choose one “best” set.
As used herein, the terms “significance” and “significant” relate to a statistical analysis of the probability that there is a non-random association, or a more unusual relationship, between two or more entities. In one example, “significance” refers to the probability that an observed relationship occurred by chance. To determine whether or not a relationship is “significant” or has “significance”, statistical manipulations of the data can be performed to calculate a probability, expressed as a “p-value”. Those p-values that fall below a user-defined cutoff point are regarded as significant. In one example, a p-value less than or equal to 0.05, in another example less than 0.01, in another example less than 0.005, and in yet another example less than 0.001, are regarded as significant.
The term “similarity” can be contrasted with the term “identity”. Similarity is determined using an algorithm including, but not limited to, the BLAST-based algorithms or the GAP program (available from the University of Wisconsin Genetics Computer Group, now part of Accelrys Inc., San Diego, Calif., United States of America). “Identity”, however, means a nucleic acid or amino acid sequence having the same nucleic acid or amino acid at the same relative position in a given family member of a gene family or in a homologous nucleic acid or amino acid in a different organism. Homology and similarity are generally viewed as broader terms than the term identity. Biochemically similar amino acids, for example, leucine/isoleucine or glutamate/aspartate, can be present at the same position in a biomolecule; these are not identical per se, but are biochemically “similar.” These are referred to herein as conservative differences or conservative substitutions. This differs from a conservative substitution or mutation at the DNA level, which is defined as a change in a nucleic acid residue that does not result in a change in the amino acid encoded by the DNA at the altered position (e.g., TCC to TCA, both of which encode serine).
As used herein, the term “size” as it relates to a query set, a target database bucket, a genome, or a relevant universe of all characterized sequences, means the number of members present in the referenced item. For example, the size of a query set or a target database bucket would be the number of candidate biomolecules or reference biomolecules that make up the query set or target database bucket, respectively. Similarly, the size of a genome is the number of genes present in a genome or the number of gene products encoded by those genes. Also similarly, the size of the relevant universe of all characterized sequences is the number of sequences that have been characterized sufficiently such that a user can either include or exclude a given biomolecule from a given bucket based upon the biomolecule having or lacking the property shared by the members of the bucket. The “size of the relevant universe” will typically be less than or equal to the size of the genome. It is also possible to define an “effective size” for a bucket, or for an entire genome, by performing redundancy analysis. Thus, if several very closely related sequences exist within a bucket (several mutant versions of the same protein, for example), one can define the number of substantially different members to be the “effective size” for that bucket. A similar correction could be applied on a per-genome basis as well.
As used herein, the term “source of informative content” means any source of information that describes a relationship between biomolecules or assigns a property to a biomolecule. A source of informative content includes, but is not limited to an annotated database of nucleic acid or amino acid sequences. In this example, the annotations can include references to suspected functions, expression patterns, homologs or orthologs from the same or different species, presence on a particular microarray chip or in a particular cDNA library, or presence on a particular chromosome or region of a chromosome. Other non-limiting sources of informative content include journal articles, public databases, web pages or trees, scientific abstracts and/or posters, technical data sheets, or personal communications. Experimental results, whether raw or resulting from prior analysis, can also be sources of informative content.
As used herein, the term “target database” means a collection of descriptions of one or more reference biomolecules. The reference biomolecules described in the collection are arranged in the target database into one or more buckets, wherein the members of each bucket share a common property. The reference biomolecules are further arranged such that the members of a bucket can be compared to a query set.
II. Biomolecule Analysis
A representative embodiment is adapted to identify properties that are common between a query set and a target database. The method can be employed, for example, to identify the function of a gene product of one or more genes that form a query set.
Referring now to
As shown in
The problem of correlating a given query set with a target database is addressed. The methods and data structures disclosed herein can be readily implemented and employed in a range of applications. Additionally, the methods are able to tolerate small numbers of “contaminant” sequences in a bucket without significantly degraded performance.
II.A. Construction of Target Database
One property of the method is the generality of its application. Given any source of informative content about a particular set of biomolecules that share one or more common properties, the methods can create appropriate buckets and add them to an iterative, ever-expanding, and evolving target database. A target database thus comprises various classifications of biomolecules (e.g., genes and gene products) into collections, also known as “buckets”, of entities having one or more common properties.
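By way of illustration and not limitation, the bucket organization described above can be sketched in Python as follows. The class name `TargetDatabase`, its methods, the gene symbols, and the source labels are hypothetical and are introduced only for this sketch; the disclosure does not prescribe any particular data structure.

```python
from collections import defaultdict


class TargetDatabase:
    """Toy target database: named buckets of reference biomolecule identifiers."""

    def __init__(self):
        self.buckets = {}                     # bucket name -> set of member identifiers
        self.sources = {}                     # bucket name -> label of the source that defined it
        self.member_index = defaultdict(set)  # member identifier -> names of buckets containing it

    def add_bucket(self, name, members, source="user"):
        """Add a bucket, or extend an existing one; members are stored as a set."""
        bucket = self.buckets.setdefault(name, set())
        self.sources.setdefault(name, source)
        for member in members:
            bucket.add(member)
            self.member_index[member].add(name)

    def buckets_for(self, member):
        """Return the names of every bucket that contains the given member."""
        return set(self.member_index.get(member, set()))

    def universe_size(self):
        """Number of unique biomolecules placed in at least one bucket."""
        return len(self.member_index)


# Illustrative content only; the bucket names and members are made up.
db = TargetDatabase()
db.add_bucket("cell cycle", ["CCNC", "CDK2", "E2F1"], source="curated")
db.add_bucket("apoptosis", ["CASP3", "E2F1"], source="curated")
db.add_bucket("user: altered expression", ["CDK2", "CASP3"], source="user")
```

Keeping an index from each member to its buckets means that a match to a reference biomolecule can be translated directly into matches to every bucket of which that biomolecule is a member, and new buckets can be appended at any time without rebuilding the collection.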
A target database can be constructed. For example, as shown in
Referring now to boxes 408 and 410 in
Continuing with boxes 408 and 410 in
Continuing with
The addition of user buckets can result in an enhancement in a given target database. For example, it is possible to add any (or all) gene clusters, dose-response or time-course gene sets, and lists of genes with altered expression derived from any experiment to a target database. Such additions can be made available to an entire project, group, site, or a corporate entity. Further, by identifying the user responsible for adding a specific user bucket (e.g., by using bucket source identifiers as discussed hereinbelow), any user who finds that his or her query set is similar to that of another user will be able to immediately recognize this event and notify the other user. Thus, communication of experimental results (e.g., results related to the implication of genes or gene products in different disease conditions) can be enhanced.
Continuing with
An ad hoc rating system for relative ranking of the quality of each bucket source is optionally employed. In this rating system, reliable human-annotated sources (e.g., SwissProt accessible via the World Wide Web at http://us.expasy.org/sprot/) can receive a higher rank than a set generated by an automated computational procedure.
Continuing with box 414 in
II.B. Comparison of a Query Set with a Target Database
The comparing of a query set (e.g., a user-defined set of nucleic acid or amino acid sequences) with a target database is disclosed, as is scoring and ranking the matches, and reporting the results.
II.B.1. Searching a Target Database
Pre-computed relationships of identity or similarity between biomolecules from other sources can be used. The identity relationships can be based on equivalence of accessions, identifiers, or names of genes and proteins from data sources such as NCBI's LocusLink, Swissprot, or HUGO. Thus, any member of the query set with a name, accession, identifier, or sequence identical to one in the target database can be considered a match. This identity relationship can be determined by the use of associative arrays, string matching, or regular expressions. More domain-specific techniques might be applied for biomolecule sequences, such as BLAST (Altschul et al., 1990) or dynamic programming. A database of these pre-computed relationships, or a method for computing these relationships that determines the identity or similarity of a member of the query set to a reference biomolecule, can be employed.
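As a minimal sketch of the associative-array approach mentioned above, identifiers from a query set can be looked up in a table that maps names and accessions to a canonical reference identifier. The alias table, the identifiers, and the case-insensitive normalization are assumptions made for illustration only.

```python
def normalize(identifier):
    """Assumed normalization: trim whitespace and ignore letter case."""
    return identifier.strip().upper()


def match_by_identifier(query_ids, alias_to_reference):
    """Map each query identifier to a canonical reference identifier, if one is known."""
    matches = {}
    for query_id in query_ids:
        reference = alias_to_reference.get(normalize(query_id))
        if reference is not None:
            matches[query_id] = reference
    return matches


# Hypothetical alias table: several names or accessions can point to one reference entry.
alias_to_reference = {
    "GENE_A": "REF0001",
    "ACC12345": "REF0001",
    "GENE_B": "REF0002",
}

hits = match_by_identifier(["gene_a", " ACC12345", "UNKNOWN9"], alias_to_reference)
```

Here two different query identifiers resolve to the same reference entry, which is relevant when matches are later counted, and the unrecognized identifier simply produces no match.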
When all of the sequences comprising a query set and a target database comprise nucleic acid and/or protein sequences, the BLAST algorithms (Altschul et al., 1990) can be employed to rapidly perform pairwise nucleic acid-nucleic acid, protein-protein, or nucleic acid-protein comparisons between each member of the query set and each member of a target database. In one embodiment, stringent BLAST parameters can be employed to enforce a strict matching criterion, thereby reducing the comparison to a binary response (i.e. match/no match) for each sequence pair. Stringent BLAST parameters can include, but are not limited to, parameters that require that in order for a match to be scored, two sequences must be sufficiently identical (e.g. 95%, 96%, 97%, 98% or greater) and the match region must be sufficiently long (e.g. 100 or more residues, or encompassing the entire length of any biomolecule of less than 100 residues in length). In this regard, it is noted that each target database match is not only a match to a specific sequence, but also a match to the bucket(s) of which the sequence is a member.
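The reduction of an alignment to a binary match/no match response can be sketched as follows. The function name and argument names are hypothetical; the default thresholds (95% identity, 100 residues) are taken from the examples given above, and the percent identity and alignment length are assumed to have been obtained from a prior BLAST run.

```python
def is_stringent_match(percent_identity, alignment_length,
                       query_length, subject_length,
                       min_identity=95.0, min_length=100):
    """Return True when an alignment satisfies the strict matching criterion.

    The aligned region must span at least min_length residues, or the entire
    length of the shorter sequence when that sequence is shorter than min_length.
    """
    shorter = min(query_length, subject_length)
    required_length = min(min_length, shorter)
    return percent_identity >= min_identity and alignment_length >= required_length
```

For example, `is_stringent_match(98.0, 150, 300, 400)` evaluates to True, whereas `is_stringent_match(98.0, 80, 300, 400)` evaluates to False because the aligned region is too short for sequences of that length.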
BLAST is one approach to identifying a degree of similarity between two or more sequences. Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (NCBI: http://www.ncbi.nlm.nih.gov/), and also can be licensed from Washington University, St. Louis, Mo., United States of America (http://blast.wustl.edu).
The basic BLAST algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in a query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold. These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence until the cumulative alignment score falls a predetermined value below the maximum achieved score. Cumulative scores are calculated using, for nucleic acid sequences, the parameters M (reward score for a pair of matching residues; always >0) and N (penalty score for mismatching residues; always <0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when the cumulative alignment score decreases by the quantity X from its maximum achieved value, the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments, or the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. For BLASTN searches, parameter settings such as “W=13; hitdist=28; M=1; N=−2; Q=1; R=1; X=6; gapw=20” from Washington University BLAST (WU-BLAST: http://blast.wustl.edu/blast/TO-FLY.html#blastn) can be employed. For protein searches, parameter settings of “W=4; T=1000; matrix=PAM10; E=1e-10” to determine identities can be used.
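The extension step described above can be illustrated with a simplified, ungapped, one-directional sketch for nucleic acid sequences. This is not the BLAST implementation; it only mirrors the three halting conditions stated in the text, using the M, N, and X values from the BLASTN example. The function name, the toy sequences, and the assumption that a seed word has already been found and scored are all hypothetical.

```python
def extend_right(query, subject, q_start, s_start, seed_score,
                 match=1, mismatch=-2, x_drop=6):
    """Extend a seed hit to the right without gaps.

    Returns (best_score, best_length), where best_length is the number of
    residues added to the seed at the point of maximum cumulative score.
    """
    score = best_score = seed_score
    best_length = 0
    offset = 0
    # Halting condition 3: stop at the end of either sequence.
    while q_start + offset < len(query) and s_start + offset < len(subject):
        same = query[q_start + offset] == subject[s_start + offset]
        score += match if same else mismatch
        offset += 1
        if score > best_score:
            best_score, best_length = score, offset
        # Halting conditions 1 and 2: score dropped by X from its maximum, or fell to zero or below.
        elif best_score - score >= x_drop or score <= 0:
            break
    return best_score, best_length


# Toy sequences: a 4-residue seed at position 0 (score 4 with M=1), extended from position 4.
query = "ACGTACGTAAAA"
subject = "ACGTACGTCCCC"
result = extend_right(query, subject, q_start=4, s_start=4, seed_score=4)
```

In this toy case the extension gains four matching residues and then stops after three consecutive mismatches, once the cumulative score has fallen by X=6 from its maximum.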
In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences. See, e.g., Karlin and Altschul, 1993. One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleic acid or amino acid sequences would occur by chance. For example, a test nucleic acid sequence is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid sequence to the reference nucleic acid sequence is less than about 0.1, in another example less than about 0.01, and in still another example less than about 0.001.
Percent similarity of a DNA or peptide sequence can also be determined, for example, by comparing sequence information using the GAP computer program, available from the University of Wisconsin Genetics Computer Group (now part of Accelrys Inc., San Diego, Calif., United States of America). The GAP program utilizes the alignment method of Needleman and Wunsch (1970), as revised by Smith and Waterman (1981). Briefly, the GAP program defines similarity as the number of aligned symbols (i.e., nucleotides or amino acids) that are similar, divided by the total number of symbols in the shorter of the two sequences. See, e.g., Schwartz and Dayhoff, 1979, pp. 357-358, Gribskov and Burgess, 1986.
II.B.2. Counting Matches
In another aspect, guidelines are provided for counting matches. For example, when a candidate biomolecule of a query set matches a reference biomolecule of a target database (i.e. meets or exceeds a user-defined stringency requirement), a match is counted. When counting matches between sets, the following guidelines for counting matches can be employed:
The third guideline ensures that for a query set with Q members and a bucket with B members, the two cannot share more matches than the minimum of B and Q. A result of a counting procedure is a list of all the buckets in a target database that have one or more matches to a given query set.
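One way to honor the bound stated above can be sketched as follows, under the simplifying assumption that each query member has already been resolved to at most one reference identifier. Counting distinct matched references, rather than raw hits, keeps the count from exceeding either the size of the query set or the size of the bucket. The function names and identifiers are hypothetical.

```python
def count_matches(query_to_reference, bucket_members):
    """Count distinct reference biomolecules in a bucket that are matched by the query set."""
    matched_references = set(query_to_reference.values())  # two queries hitting one reference count once
    return len(matched_references & set(bucket_members))


def buckets_with_matches(query_to_reference, buckets):
    """Return {bucket name: match count} for every bucket with one or more matches."""
    counts = {name: count_matches(query_to_reference, members)
              for name, members in buckets.items()}
    return {name: k for name, k in counts.items() if k > 0}


# Hypothetical data: q1 and q2 resolve to the same reference entry.
query_to_reference = {"q1": "R1", "q2": "R1", "q3": "R2", "q4": "R9"}
buckets = {"A": {"R1", "R2", "R3"}, "B": {"R9"}, "C": {"R7"}}
matched = buckets_with_matches(query_to_reference, buckets)
```

Bucket "C" has no matches and is therefore absent from the resulting list.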
II.B.3. Statistical Significance of the Number of Matches
In another aspect, the number of matches between the members of a query set and a bucket of a target database, identified and counted as described herein, can be analyzed to determine the statistical significance of the match. That is, the number of matches can be analyzed to determine, generally speaking, the likelihood that the number of matches is due to random coincidence, as opposed to a true property in common between the query set and the bucket of the target database.
In general, the significance of a match will depend on the size of the query set, the size of each target database bucket that matched, the number of matches, and the total size of the relevant universe of all characterized sequences (approximated by the number of unique biomolecules in the reference collection). By way of example, the significance of a match can be modeled on the basis of a hypergeometric distribution as follows. If Q is the size of the query set, B is the size of a particular target database bucket that matched the query set, k is the number of matches between Q and B, and G is the size of the relevant universe, then the probability of exactly k matches is given by a hypergeometric distribution, which is defined as:
P(Q∩B=k)=C(B,k)C(G−B,Q−k)/C(G,Q)
where C(x,y) is the binomial coefficient: x!/[y!(x-y)!] and x! indicates factorial (the product of all integers from 1 to x). The P-value is indicated by the tail of the distribution:
The parameter G can be fixed as a constant for all computations.
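The hypergeometric computation can be sketched as follows. The tail is read here as the probability of observing k or more matches, that is, the sum of P(Q∩B=i) for i from k up to the minimum of Q and B; this reading is an assumption consistent with the distribution defined above. The function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, Q, B, G):
    """P(Q∩B = k) = C(B,k) C(G−B,Q−k) / C(G,Q)."""
    return comb(B, k) * comb(G - B, Q - k) / comb(G, Q)


def hypergeom_pvalue(k, Q, B, G):
    """Tail of the distribution: probability of k or more matches by chance."""
    upper = min(Q, B)
    return sum(hypergeom_pmf(i, Q, B, G) for i in range(k, upper + 1))


# Small worked case: universe of 10, bucket of 4, query set of 3, 2 matches.
p_small = hypergeom_pvalue(2, Q=3, B=4, G=10)
```

In the small worked case, the probabilities of exactly 2 and exactly 3 matches are 36/120 and 4/120, so the tail is 40/120, or one third. Python integers have arbitrary precision, so the binomial coefficients remain exact even when G is on the order of tens of thousands; only the final division is performed in floating point.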
Although draft forms of the sequence of the human genome are available to the public for searching (See http://www.ncbi.nlm.nih.gov/genome/guide/human/), there is still no agreement on the number of genes or proteins it encodes. See e.g., Smaglik (2000), and compare Lander et al. (2001) estimating 30,000-40,000 protein-coding genes to Venter et al. (2001) estimating 38,000 genes. Although current estimates of the number of human genes range upwards from 20,000, many genes have not yet been characterized while others are likely incorrectly or incompletely characterized. There is no doubt that some genes have not yet been identified, so the true number of genes in the human genome is likely to be uncertain for many years to come. Therefore, any estimate made with respect to genome size is, to some extent, arbitrary.
An aspect, therefore, pertains to the characterization of the number of genes comprising the human genome as a number reflecting how many human genes have been identified, annotated, or otherwise classified. Regardless, the specific value for the genome size has no impact upon the rank order of the buckets that are reported as significant matches. This degree of uncertainty in the size of the genome only affects the cutoff level for statistical significance. Thus, the relative ordering of the buckets is unaffected by any assumptions made concerning the size of the genome. It is also possible to compute an effective size for the genome of any organism by counting up all the unique sequences from that organism that have been partitioned into one or more buckets. Similarly, one could restrict the genome size to the number of probe sets (or number of unique genes) available on a specific DNA microarray or chip, for purposes of analyzing experimental data from RNA expression studies.
In one embodiment, the results of comparing a query set to a target database can be presented as a list of buckets ranked by p-value, and can be bounded by a predefined statistical cutoff. In one embodiment, for each of the buckets in the results list, a hyperlink can be incorporated in an output display that takes the user to a summary page. The summary page can be configured to show which query set sequences matched which bucket elements, as well as which bucket elements had no matches in the query set. One or more additional hyperlinks can also be included. These hyperlinks include, but are not limited to links to a database entry for each query set sequence (such as a link to the entry in SwissProt, NCBI, or a private database).
II.B.4. Representative Steps
The following section describes an embodiment of the method. The section generally describes a series of steps that can be performed when practicing the disclosed method. The following steps describe only one example. Variations on the disclosed method will be apparent to those of ordinary skill in the art, upon consideration of the present disclosure, and are encompassed by the appended claims. Reference is also made to
First, as shown at step ST202 and/or ST204a and ST204b in
Next, as shown in steps ST206 and ST208 in
After a search has been performed, as shown in step ST208 of
Following a search, as shown in step ST220 of
Standard cut-offs for p-values (such as 0.05, 0.01, 0.001) can be used to guide significance. These p-values can be corrected for multiple hypothesis testing using a suitable approach, such as but not limited to one or more of a conservative Bonferroni correction (which multiplies each p-value by the number of hypotheses tested, which in this embodiment equals the number of buckets) and computing an empirical p-value based on simulations with random input sets. This empirical p-value can be obtained by using multiple random input sets of genes and computing the number of times any bucket is observed below a certain statistic. For example, the algorithm can be simulated 1000 times on random input sets of genes (each set with 50 members). The distribution of the best observed hypergeometric statistic from each of those 1000 computations can be plotted, and a statistic chosen, such that only 50 of the 1000 simulations have a statistic as good. This effectively gives the statistic that represents an empirical p-value of 0.05 for query sets of size 50. This can be repeated for query sets of varying sizes.
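Both corrections can be sketched as follows. The bucket contents and universe in the toy example are randomly generated placeholders, the function names are hypothetical, and the choice of the best (smallest) hypergeometric tail probability per simulation as the recorded statistic is one reading of the procedure described above.

```python
import random
from math import comb


def tail_p(k, Q, B, G):
    """Hypergeometric tail: probability of k or more matches by chance."""
    return sum(comb(B, i) * comb(G - B, Q - i) for i in range(k, min(Q, B) + 1)) / comb(G, Q)


def bonferroni(p, n_buckets):
    """Multiply the p-value by the number of buckets tested, capped at 1."""
    return min(1.0, p * n_buckets)


def empirical_cutoff(buckets, universe, query_size, n_sim=1000, alpha=0.05, seed=0):
    """Statistic that only a fraction alpha of random query sets match or beat."""
    rng = random.Random(seed)
    G = len(universe)
    best = []
    for _ in range(n_sim):
        query = set(rng.sample(universe, query_size))
        # Record the best (smallest) tail probability over all buckets for this random set.
        best.append(min(tail_p(len(query & members), query_size, len(members), G)
                        for members in buckets.values()))
    best.sort()
    return best[max(1, round(alpha * n_sim)) - 1]


# Toy data (placeholders): 200 genes, 20 random buckets of 10 genes each.
universe = [f"gene{i}" for i in range(200)]
toy_rng = random.Random(1)
buckets = {f"bucket{j}": set(toy_rng.sample(universe, 10)) for j in range(20)}
cutoff = empirical_cutoff(buckets, universe, query_size=10, n_sim=200)
```

With 1000 simulations and alpha of 0.05, the returned value is the 50th smallest of the best statistics, matching the example in which only 50 of the 1000 simulations have a statistic as good. The computation would be repeated for each query set size of interest.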
The results of the statistical operation can then optionally be sorted by increasing or decreasing significance, as shown in step ST222 of
III. Genomic Region Analysis
The methods disclosed herein can be employed to identify a property common to a set of candidate biomolecules from one genomic region that form a query set and a set of reference biomolecules that form one or more buckets of a target database. However, the present method is not limited to a comparison of a query set comprising a single set of candidate biomolecules and a target database. As described in the following sections, one embodiment of the method can be employed to identify a property common to a query set comprising two or more region sets and a target database. Representative steps are as follows:
A non-limiting example of this embodiment can be described in the context of a disease gene association analysis, and is referred to generally at 300 in
A query set, which comprises two or more region sets, is then compared, region set by region set, with a target database, at step ST308 in
Continuing with step ST308 in
As shown at steps ST310, ST312, and ST314 in
III.A. Searching a Target Database
The steps summarized immediately above will now be discussed in detail. The identification of one or more properties of a candidate sequence of a query set can be achieved by searching a target database of reference sequences that have been grouped into buckets representing groups of sequences that have the same properties. Such a search can follow another experiment, the results of which can form elements of a query set. Some technologies and experiments generate a powerset of genes, for example S = (S1, S2 . . . Sn). A subsequent goal is then to find a property P such that there is at least one gene in a significant number of sets Sk that has property P. The sets S1 . . . Sn have no pairwise intersection (e.g., non-overlapping genomic regions). Thus if biological pathways are considered as potential properties, then the goal might be to find a pathway that threads or connects these sets of genes.
Consider a linkage analysis experiment, where genetic markers have been genotyped in both a disease and a normal population and log of the odds ratio (LOD) scores have been obtained across the human genome. For most common diseases, multiple linkage peaks are observed. The presence of multiple linkage peaks can be explained as:
If the latter proposition is true, then one of the following cases must apply.
Exploration of known properties that might explain hypothesis H above can be accomplished using a genomic region analysis embodiment. Consider n genomic regions with corresponding region sets of genes: S1, S2 . . . Sn. If an additional set of genes R is added to (S1, S2 . . . Sn), where R contains all remaining genes not in S1 . . . Sn, then the superset (S1 . . . Sn, R) can be considered a partition on the human genome. Thus, consider the data superset S=(S1 . . . Sn, R) and let the number of genes with property P overlapping with these sets be (Pj1 . . . Pjn, Pjr). The probability of this event (or partition j) is given by the multivariate form of the hypergeometric distribution (sampling without replacement):
where C( ) is the binomial coefficient, and |S| is the cardinality (i.e., the number of members) of the set S. The probability of seeing this by chance can be estimated by summing the above term over all events j that would be considered significant, for example, events in which 3 or more of the Pjk are greater than 0.
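For illustration, the standard multivariate hypergeometric probability can be computed as the product of C(|Sk|, Pjk) over the region sets and the remainder R, divided by C(N, n), where N is the total number of genes in the partition and n is the total number of genes with property P. This is the textbook form and is assumed here to correspond to the expression referred to above. The function names are hypothetical, and the exhaustive enumeration is practical only for small numbers of regions.

```python
from itertools import product
from math import comb, prod


def multivariate_hypergeom(set_sizes, hits):
    """Probability of one partition: set_sizes are |S1|..|Sn|,|R|; hits are Pj1..Pjn,Pjr."""
    total_genes = sum(set_sizes)
    total_hits = sum(hits)
    return prod(comb(s, h) for s, h in zip(set_sizes, hits)) / comb(total_genes, total_hits)


def prob_regions_hit(region_sizes, rest_size, n_property, min_regions):
    """Sum the partition probability over all events in which at least
    min_regions region sets contain one or more genes with the property."""
    sizes = list(region_sizes) + [rest_size]
    total = 0.0
    ranges = [range(min(s, n_property) + 1) for s in region_sizes]
    for region_hits in product(*ranges):
        rest_hits = n_property - sum(region_hits)
        if rest_hits < 0 or rest_hits > rest_size:
            continue  # not a valid partition of the n_property genes
        if sum(1 for h in region_hits if h > 0) >= min_regions:
            total += multivariate_hypergeom(sizes, list(region_hits) + [rest_hits])
    return total


# Toy case: two regions of 2 genes, 2 remaining genes, 2 genes carry the property.
p_both = prob_regions_hit([2, 2], rest_size=2, n_property=2, min_regions=2)
```

In the toy case the only qualifying event places one property-bearing gene in each region, giving C(2,1)·C(2,1)·C(2,0)/C(6,2) = 4/15.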
In certain applications, it might only be important that the region set S have at least one biomolecule in common with (or identical to) one having property P. It might not add any more evidence if it has two or more molecules with property P. In such cases, computing an exact significance (p-value) becomes a difficult task, and Monte-Carlo techniques can be used to acquire estimates, as discussed in the next section.
III.C. Statistical Significance of Matches
Statistical measures make assumptions of independence among set members that do not completely hold for biological sequences. Thus, the significance of any result is assessed using negative controls. True negative controls are hard to obtain, as that would require knowing that a certain powerset of biological sequences shares no property in a significant measure. A solution is to generate multiple random sets and use simulations to compute the background frequency of a property. A similar approach is adopted here. A plurality of replicates of a query set is constructed. These replicates are matched to the query set in the sense that the number of genes in each replicate is equal to the number of genes in each set of the powerset and as far as possible arises from a similar bioinformatics process. The replicates are then modeled at random chromosomal locations to form a random location data set. The random location data set is then processed using the same method steps described above. For example, if the original powerset represented linkage regions, then each random set would be a set of contiguously ordered genes from a single chromosome, and a random set of genes from a contiguous region of the genome can be generated. For each property, the number of times that property is observed in the random powerset is counted with P(event) equal to or lower than that observed in the actual powerset. This provides a simulation-based or empirical p-value.
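The simulation-based approach can be sketched as follows. Several simplifications are assumed and are not taken from the disclosure: the statistic is the number of region sets containing at least one gene with the property, the chromosome for each random region is chosen uniformly among those long enough, and random regions within a replicate are allowed to overlap. All names and the toy gene order are hypothetical.

```python
import random


def random_region(chromosomes, n_genes, rng):
    """Pick n_genes contiguously ordered genes from one randomly chosen chromosome."""
    eligible = [genes for genes in chromosomes.values() if len(genes) >= n_genes]
    genes = rng.choice(eligible)
    start = rng.randrange(len(genes) - n_genes + 1)
    return set(genes[start:start + n_genes])


def regions_hit(region_sets, bucket):
    """Number of region sets that share at least one gene with the bucket."""
    return sum(1 for region in region_sets if region & bucket)


def empirical_p(region_sets, bucket, chromosomes, n_sim=1000, seed=0):
    """Fraction of random replicates that hit the bucket in at least as many regions."""
    rng = random.Random(seed)
    observed = regions_hit(region_sets, bucket)
    sizes = [len(region) for region in region_sets]  # replicates match the original set sizes
    as_extreme = 0
    for _ in range(n_sim):
        replicate = [random_region(chromosomes, n, rng) for n in sizes]
        if regions_hit(replicate, bucket) >= observed:
            as_extreme += 1
    return as_extreme / n_sim


# Toy genome: gene order along two made-up chromosomes.
chromosomes = {"chrA": [f"a{i}" for i in range(50)],
               "chrB": [f"b{i}" for i in range(40)]}
bucket = {"a3", "b7", "b30"}
region_sets = [set(chromosomes["chrA"][0:5]), set(chromosomes["chrB"][5:10])]
p_empirical = empirical_p(region_sets, bucket, chromosomes, n_sim=500)
```

A more faithful replicate generator would weight chromosomes by gene count and reproduce the bioinformatics process that produced the original regions, as the text above recommends.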
A similar approach can also be used in the analysis of time-series data, such as data gathered from microarray expression experiments over time. For example, some experiments produce a list of genes with significantly perturbed expression at the earliest time point, and various other sets that experienced expression changes at successively later time points. Some pathway or process that connects this set of measurements can also be identified. As an example, assume that there were three time-points measured—(E)arly, (I)ntermediate, and (L)ate. For each time point there would be an associated set of genes (EI, II, LI) whose expression levels had changed relative to the control (time=0). If these sets are non-overlapping, one can apply the method to discover any processes that contain one or more genes that are present in each timepoint set, thus forming a hypothesis as to the pathway and the causal steps involved in the experimental process. Statistical corrections are employed to handle the case where the sets are not completely disjoint.
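The time-point example can be sketched as a search for buckets that contain at least one gene from every time-point set. The gene identifiers, pathway names, and function name are placeholders introduced for illustration, and the sketch omits the statistical corrections mentioned above.

```python
def buckets_spanning_all_sets(gene_sets, buckets):
    """Return buckets that contain one or more genes from every set in gene_sets.

    The result maps each qualifying bucket name to the matching genes per set.
    """
    spanning = {}
    for name, members in buckets.items():
        per_set = [sorted(members & genes) for genes in gene_sets.values()]
        if all(per_set):  # every time-point set contributed at least one gene
            spanning[name] = dict(zip(gene_sets, per_set))
    return spanning


# Placeholder data: three non-overlapping time-point sets and two candidate pathways.
gene_sets = {"early": {"G1", "G2"}, "intermediate": {"G3", "G4"}, "late": {"G5"}}
buckets = {"pathway X": {"G1", "G3", "G5", "G9"}, "pathway Y": {"G2", "G4"}}
connected = buckets_spanning_all_sets(gene_sets, buckets)
```

Only "pathway X" is reported here, because "pathway Y" has no member in the late set; the ordering of the matched genes across time points can then suggest a hypothesis about the causal steps involved.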
By way of additional example, schizophrenia is a multifactorial disease. A number of linkage studies have been published implicating the following chromosomal regions: 1q21-22, 1q32-42, 6p24-22, 8p21, 10p14, 13q32, 18p11, and 22q11-13. Blouin et al., 1998; Berrettini, 2000; Straub et al., 1995; Brzustowicz et al., 2000; Ekelund et al., 2001. For the chromosome 1q region, conflicting evidence also exists. Levinson et al., 2002. Given suitable markers or other methods to determine the physical boundaries of each region, one can extract the set of known and predicted genes within each such region. The genomic region analysis embodiment is then used to probe for pathways or other biological processes that have components in some or all of the linkage regions. Simulations can also be performed to repeatedly generate randomly located chromosomal regions of comparable size and gene content to assess whether the results occur frequently by chance alone. The findings are then used as hypotheses for guiding experimental studies.
The following Examples have been included to illustrate certain embodiments. Certain aspects of the following Examples are described in terms of techniques and procedures found or contemplated to work well in the practice of the embodiments. These Examples are exemplified through the use of standard practices of applicants. In light of the present disclosure and the general level of skill in the art, those of skill will appreciate that the following Examples are intended to be exemplary only and that numerous changes, modifications, and alterations can be employed without departing from the spirit and scope of the present disclosure.
Stanelle reported 29 genes as being regulated by the transcription factor E2F1. Stanelle et al., 2002. The authors divided this set of genes into five categories: cell cycle, apoptosis, cancer-related, E2F1 targets, and unknown. Submitting the same unordered list to an embodiment of the present method results in a ranked list of approximately 100 buckets significant at p ≤ 0.05. Presently there are approximately 80,000 buckets in the target database. These buckets have been created from a combination of publicly available databases and internal experimental results. These buckets cover many types of biological data including, but not limited to, genomic location, diseases, tissue expression, functions, pathways, transcriptional regulation, families, domains, and literature abstracts. The most significant hits of this input set to the target database are shown in Table 1. Some of the sources that appear are keywords and families from Swissprot, protein domains from EnsEMBL Interpro, human disease sets from OMIM, and sets derived from the Gene Ontology Consortium website and NCBI's LocusLink. This list includes several overlapping buckets related to each of the known categories supplied by the authors, with cyclin C (cell-cycle) determined to be the most significant bucket. In addition to confirming the authors' classifications, more specific links, such as associations with MAPKKK signaling and multiple myeloma, were also uncovered. See Table 1.
The references listed below as well as all references cited in the specification are incorporated herein by reference to the extent that they supplement, explain, provide a background for or teach methodology, techniques, and/or compositions employed herein.
It will be understood that various details can be changed without departing from the scope of the disclosure. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation. Indeed, variations on the present disclosure will be apparent to those of ordinary skill in the art, upon consideration of the present disclosure, and are encompassed by the appended claims.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US04/19932 | 6/22/2004 | WO | | 12/21/2005 |
Number | Date | Country | |
---|---|---|---|
60482420 | Jun 2003 | US |