23andMe®, a personal genomics services company, has built up a large database comprising personal information (e.g., family information, genetic information, etc.) of hundreds of thousands of users. One application provided by the company is Relative Finder, which uses genetic information to help users find genetic relatives (i.e., people who share a common ancestor) in the database. Within the large database, an individual may have many relatives, and there can be many ways the individual may be connected to a particular relative. Once the relatives of an individual are identified, it is often equally important for the individual to understand how the connections are formed. Additional services are needed to provide insight into the family connections of individuals.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Determining family connections (also referred to as relative connections) between two individuals is described. In some embodiments, a relative connections graph is formed for individuals whose genetic and/or family data is stored in a database. The relative connections graph indicates the relative relationships of these individuals. Based on the relative connections graph, a relative connections path connecting two individuals is determined. In some embodiments, the relative connections path is a shortest path.
Processor 102 is coupled bi-directionally with memory 110, which can include a first primary storage area, typically a random access memory (RAM), and a second primary storage area, typically a read-only memory (ROM). As is well known in the art, primary storage can be used as a general storage area and as scratch-pad memory, and can also be used to store input data and processed data. Primary storage can also store programming instructions and data, in the form of data objects and text objects, in addition to other data and instructions for processes operating on processor 102. Also as is well known in the art, primary storage typically includes basic operating instructions, program code, data, and objects used by the processor 102 to perform its functions (e.g., programmed instructions). For example, memory 110 can include any suitable computer-readable storage media, described below, depending on whether, for example, data access needs to be bi-directional or uni-directional. For example, processor 102 can also directly and very rapidly retrieve and store frequently needed data in a cache memory (not shown).
A removable mass storage device 112 provides additional data storage capacity for the computer system 100, and is coupled either bi-directionally (read/write) or uni-directionally (read only) to processor 102. For example, storage 112 can also include computer-readable media such as magnetic tape, flash memory, PC-CARDS, portable mass storage devices, holographic storage devices, and other storage devices. A fixed mass storage 120 can also, for example, provide additional data storage capacity. The most common example of mass storage 120 is a hard disk drive. Mass storage 112, 120 generally store additional programming instructions, data, and the like that typically are not in active use by the processor 102. It will be appreciated that the information retained within mass storage 112 and 120 can be incorporated, if needed, in standard fashion as part of memory 110 (e.g., RAM) as virtual memory.
In addition to providing processor 102 access to storage subsystems, bus 114 can also be used to provide access to other subsystems and devices. As shown, these can include a display monitor 118, a network interface 116, a keyboard 104, and a pointing device 106, as well as an auxiliary input/output device interface, a sound card, speakers, and other subsystems as needed. For example, the pointing device 106 can be a mouse, stylus, track ball, or tablet, and is useful for interacting with a graphical user interface.
The network interface 116 allows processor 102 to be coupled to another computer, computer network, or telecommunications network using a network connection as shown. For example, through the network interface 116, the processor 102 can receive information (e.g., data objects or program instructions) from another network or output information to another network in the course of performing method/process steps. Information, often represented as a sequence of instructions to be executed on a processor, can be received from and outputted to another network. An interface card or similar device and appropriate software implemented by (e.g., executed/performed on) processor 102 can be used to connect the computer system 100 to an external network and transfer data according to standard protocols. For example, various process embodiments disclosed herein can be executed on processor 102, or can be performed across a network such as the Internet, intranet networks, or local area networks, in conjunction with a remote processor that shares a portion of the processing. Additional mass storage devices (not shown) can also be connected to processor 102 through network interface 116.
An auxiliary I/O device interface (not shown) can be used in conjunction with computer system 100. The auxiliary I/O device interface can include general and customized interfaces that allow the processor 102 to send and, more typically, receive data from other devices such as microphones, touch-sensitive displays, transducer card readers, tape readers, voice or handwriting recognizers, biometrics readers, cameras, portable mass storage devices, and other computers.
In addition, various embodiments disclosed herein further relate to computer storage products with a computer readable medium that includes program code for performing various computer-implemented operations. The computer-readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of computer-readable media include, but are not limited to, all the media mentioned above: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optical disks; and specially configured hardware devices such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices. Examples of program code include both machine code, as produced, for example, by a compiler, and files containing higher level code (e.g., script) that can be executed using an interpreter.
The computer system shown in
In this example, personal information (including genetic information, phenotype information, family information, population group information, etc., or a combination thereof) pertaining to a plurality of individuals is stored in a database 210, which can be implemented on an integral storage component of the pathfinder engine, an attached storage device, a separate storage device accessible by the pathfinder engine, or a combination thereof.
At least a portion of the database includes genotype data, specifically genotype data of genetic markers of individuals' deoxyribonucleic acid (DNA). Examples of such genetic markers include Single Nucleotide Polymorphisms (SNPs), which are points along the genome each corresponding to two or more common variations; Short Tandem Repeats (STRs), which are repeated patterns of two or more repeated nucleotide sequences adjacent to each other; and Copy-Number Variants (CNVs), which include longer sequences of DNA that could be present in varying numbers in different individuals. Although SNP-based genotype data is described extensively below for purposes of illustration, the technique is also applicable to other forms of genotype data such as STRs, CNVs, etc.
In this example, genotype data is used to represent the individuals' genomes. In some embodiments, the genotype data is obtained from DNA samples such as saliva or blood submitted by individuals. The genotype data can be obtained while an individual is still alive, or posthumously. A laboratory analyzes the samples using a genotyping platform, for example the Illumina OmniExpress™ genotyping chip, which includes probes to assay allele values for a specific set of SNPs. One genotyping process is known as hybridization, which yields different hybridization intensity values for each allele. The laboratory assigns genotype values to the alleles of each SNP by comparing the relative strength of these intensities. The resulting genotype data is stored in database 210. Other genotyping techniques can be used.
In some embodiments, the pathfinder engine is a part of a personal genomic services platform providing a variety of services such as genetic counseling, ancestry finding, social networking, etc. In some embodiments, individuals whose data is stored in database 210 are registered users of a personal genomic service platform, which provides access to the data and a variety of personal genetics-related services that the individuals have consented to participate in. Users such as Alice and Bob are genotyped and their genotype data is stored in database 210. They access the platform via a network 204 using client devices such as 206 and 208, and interact with the platform via appropriate user interfaces (UIs) and applications. For example, a pathfinder application implemented as a browser enabled application or a standalone application is used by the users to identify specific connection paths to other individuals in the database.
A relative connections graph is formed based on data in database 210 and used by the pathfinder engine. In various embodiments, the relative connections graph is formed based on genetic analysis of relative relationships, user-reported relative relationships, or a combination thereof. For purposes of example, the relative connections graph described in detail below is formed primarily based on genetically determined relative relationships, specifically relative relationships of individuals who are deemed to have descended from a common ancestor within a certain number (N) of generations. The technique is also applicable to other types of relative relationships such as relative relationships due to marriage, relative relationships determined using other means such as self-reporting by the individuals themselves, etc.
In some embodiments, a relationship is assigned a weight, which is represented by the length of the line representing the relationship. A smaller weight indicates a closer relationship. For example, the relationship between individuals 10 and 24 is father and son, and the relationship between individuals 14 and 19 is third cousins. Accordingly, the line connecting 10 and 24 is shorter than the line connecting 14 and 19. Other representations of relationships are possible; for example, a greater weight may be used to indicate a closer relationship in some embodiments.
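As a minimal sketch of how such a weighted relative connections graph might be held in memory, the following Python uses an adjacency map keyed by individual identifier. The identifiers 10, 24, 14, and 19 come from the example above; the numeric weights (1 for father and son, 4 for third cousins) and the helper name `add_relationship` are illustrative assumptions, not values given in this description.

```python
from collections import defaultdict

# Adjacency map: individual id -> {neighbor id: weight}.
# Convention assumed here: a smaller weight indicates a closer relationship.
graph = defaultdict(dict)

def add_relationship(graph, a, b, weight):
    """Record an undirected relationship between individuals a and b."""
    graph[a][b] = weight
    graph[b][a] = weight

# The weights below are assumed for illustration only.
add_relationship(graph, 10, 24, 1)  # father and son
add_relationship(graph, 14, 19, 4)  # third cousins
```

Storing each relationship in both directions keeps neighbor lookups symmetric, which suits the undirected connections described above.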
In some embodiments, the graph is available to be viewed by a user via a user interface display similar to
The relative connections graph can be formed based at least in part on user-reported data. For example, via a family tree interface, user 1 reports that user 14 is her uncle and thus establishes the connection between them. In some embodiments, the relative connections graph is formed based at least in part on genetic data. For instance, 23andMe® provides a Relative Finder feature to automatically identify relative relationships on the basis of shared genetic material. Relatives are identified based on “Identity by Descent” (IBD) regions of their DNA. Because of recombination and independent assortment of chromosomes, the autosomal deoxyribonucleic acid (DNA) and X chromosome DNA (collectively referred to as recombinable DNA) from the parents is shuffled at the next generation, with small amounts of mutation. Thus, only relatives will share long stretches of genome regions where their recombinable DNA is completely or nearly identical. Such regions are referred to as IBD regions because they arose from the same DNA sequences in an earlier generation. IBD regions of two individuals' genomes or genotype sequences are determined using tools such as fastIBD™ or other appropriate techniques. Based on statistical distribution patterns of the amount of IBD shared and the degree of relationship (i.e., the number of generations within which two people share an ancestor), a predicted degree of relationship is determined. Additional details of how to determine relative relationships based on IBD regions are described in U.S. Pat. No. 8,463,554 entitled FINDING RELATIVES IN A DATABASE which is incorporated herein by reference in its entirety for all purposes.
The relative connections graph is used by the pathfinder engine to identify the shortest path between two individuals. In various embodiments, the length of the path is measured by the number of connections, the sum of the weights associated with the connections in the path, any other appropriate metric, or combinations thereof. A user of the genomics services platform can invoke pathfinding for any individuals on the platform this user is permitted to see. For example, a first user invokes pathfinding to identify the relative relationships between himself and a second user. The first user may find the second user by name or other types of search, select the second user from an extended family tree, or otherwise identify the second user.
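The two length metrics mentioned above can give different answers for the same pair of individuals. The following Python sketch illustrates this under assumed data: the function names `hop_count` and `weighted_length`, the individuals "A", "B", and "C", and all weights are hypothetical.

```python
def hop_count(path):
    """Length measured as the number of connections in the path."""
    return max(len(path) - 1, 0)

def weighted_length(graph, path):
    """Length measured as the sum of the weights of the connections."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))

# Hypothetical graph: individual id -> {neighbor id: weight}.
example_graph = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 2},
    "C": {"A": 5, "B": 2},
}
direct = ["A", "C"]      # one connection, total weight 5
via_b = ["A", "B", "C"]  # two connections, total weight 3
```

In this assumed graph, `direct` is shorter by connection count while `via_b` is shorter by weighted sum, so the choice of metric determines which path is reported as the shortest.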
At 502, identification information of a first individual and identification information of a second individual are obtained. In some embodiments, the identification information is obtained as input parameters to the pathfinding function. In some embodiments, the identification information of at least one of the individuals is obtained by analyzing the context in which the process is invoked. For example, when Jerry Maxwell identifies Alice Robbins as one of the individuals involved in the pathfinding process, the context of the invocation identifies Jerry as another individual involved in the pathfinding process.
At 504, based at least in part on a genetic connections graph such as the one shown in
In some embodiments, the specific connections path is the shortest path. The length of a path can be measured in different ways. In some embodiments, the length of a path is determined based on the number of connections in the path, and the shortest path corresponds to a path connecting two individuals with the fewest connections. Referring to
A number of techniques are usable to determine the specific genetic connections path. Two example techniques (breadth-first search and weighted Dijkstra) are described in greater detail below. Any other appropriate graph-based search techniques can be used.
At 506, information pertaining to the determined path is output. In some embodiments, the path is shown in a user interface display. Additional information about individuals included in the path, such as their profile or other metadata information, their relationships to each other, etc., is optionally output.
In some embodiments, breadth-first search is applied to the genetic connections graph to identify the shortest path.
At 602, the node corresponding to the first individual is enqueued (i.e., added to the queue).
At 604, a node is dequeued (i.e., removed from the queue). This node is also referred to as the current node.
At 606, it is determined whether the current node corresponds to the second individual. If so, a path is found and at 608, the length of the path connecting the first individual and the second individual is computed. Depending on implementation, the computation includes counting the number of connections, computing a weighted sum of the connections, or a combination. The result is kept on record (e.g., in memory or other storage) for later comparison.
If the current node does not correspond to the second individual, then, at 610, any direct child nodes (i.e., nodes connected to the current node) that have not yet been processed are enqueued.
At 612, it is determined whether the queue is empty.
If the queue is not empty, the process returns to 604 and repeats.
If the queue is empty, then every node reachable from the first individual has been examined. The process continues to 614, where the lengths of all the computed paths (e.g., results obtained from 608) are compared to determine the shortest path.
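Steps 602 through 614 can be sketched in Python as follows. This is an illustrative reading of the process, not a definitive implementation: the function name `bfs_shortest_path`, the adjacency-map graph format, the choice to carry each candidate path in the queue, and the optional `length_of` callback are all assumptions. A node counts as processed once it has been dequeued and expanded.

```python
from collections import deque

def bfs_shortest_path(graph, first, second, length_of=None):
    """Breadth-first search over graph (id -> {neighbor id: weight})."""
    if length_of is None:
        length_of = lambda path: len(path) - 1  # number of connections

    queue = deque([(first, [first])])  # 602: enqueue the first individual
    processed = set()
    found = []                         # (length, path) records kept at 608

    while queue:                       # 612: stop when the queue is empty
        node, path = queue.popleft()   # 604: dequeue the current node
        if node == second:             # 606: destination reached
            found.append((length_of(path), path))  # 608
            continue
        if node in processed:
            continue
        processed.add(node)
        for neighbor in graph.get(node, {}):       # 610
            if neighbor not in processed:
                queue.append((neighbor, path + [neighbor]))

    if not found:
        return None                    # the two individuals are not connected
    return min(found, key=lambda item: item[0])[1]  # 614: compare lengths
```

When length is the number of connections, the first path recorded is already a shortest one. When a weighted `length_of` is supplied, this sketch compares only the candidate paths the traversal happens to reach, so it is not guaranteed to return the minimum-weight path; a weighted search such as Dijkstra's Algorithm is the usual choice in that case.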
In some embodiments, Dijkstra's Algorithm is used to identify the shortest path on the genetic connections graph.
At 702, the process is initialized. Specifically, every node in the genetic connections graph is assigned a tentative distance value, 0 for the initial node corresponding to the first individual and infinity for all other nodes; all nodes are marked as unvisited; the initial node corresponding to the first individual is set as the current node; a set of the unvisited nodes forms an unvisited set, which comprises all of the nodes except the initial node.
At 704, for the current node, tentative distances to its unvisited neighbors are calculated and kept on record. For example, if the current node (“Bob Smith”) has a tentative distance of 6, and the connection with a neighbor (“Clara Jones”) has a weighted length of 2, then the distance to Clara Jones (through Bob Smith) will be 6+2=8. If this distance is less than the previously recorded tentative distance of Clara Jones (e.g., infinity), then the previous tentative distance is overwritten. At this point the neighbor nodes remain in the unvisited set.
At 706, the current node is marked as visited and is removed from the unvisited set.
At 708, it is determined whether the destination node (i.e., the node corresponding to the second individual) has been marked as visited. If so, at 710, the tentative distance associated with the destination node is deemed to be the length of the shortest path and is returned; otherwise, at 712, the unvisited node that is associated with the smallest tentative distance is set as the new current node, and the process returns to 704.
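Steps 702 through 712 can be sketched in Python as follows. The function name `dijkstra_shortest_path`, the adjacency-map graph format, and the `previous` map used to recover the path are assumptions made for illustration; the sketch also assumes that every individual appears as a key in the graph and that all weights are non-negative. The individual "Alice" and the weight 9 in the usage lines are hypothetical, while the weights 6 and 2 mirror the Bob Smith and Clara Jones example above.

```python
import math

def dijkstra_shortest_path(graph, first, second):
    """Return (distance, path) from first to second; graph is id -> {id: weight}."""
    # 702: tentative distance 0 for the initial node, infinity for the rest.
    distance = {node: math.inf for node in graph}
    distance[first] = 0
    previous = {}                      # assumed helper for rebuilding the path
    unvisited = set(graph) - {first}
    current = first

    while True:
        # 704: update tentative distances of the unvisited neighbors.
        for neighbor, weight in graph[current].items():
            if neighbor in unvisited:
                candidate = distance[current] + weight
                if candidate < distance[neighbor]:
                    distance[neighbor] = candidate
                    previous[neighbor] = current
        # 706: mark the current node as visited.
        unvisited.discard(current)
        # 708/710: stop once the destination has been visited.
        if current == second:
            break
        # 712: move to the unvisited node with the smallest tentative distance.
        if not unvisited:
            return math.inf, None
        current = min(unvisited, key=distance.__getitem__)
        if distance[current] == math.inf:
            return math.inf, None      # destination is unreachable

    path = [second]
    while path[-1] != first:
        path.append(previous[path[-1]])
    path.reverse()
    return distance[second], path

example = {
    "Alice": {"Bob Smith": 6, "Clara Jones": 9},
    "Bob Smith": {"Alice": 6, "Clara Jones": 2},
    "Clara Jones": {"Alice": 9, "Bob Smith": 2},
}
result = dijkstra_shortest_path(example, "Alice", "Clara Jones")
# result == (8, ["Alice", "Bob Smith", "Clara Jones"])
```

The linear scan for the smallest tentative distance follows the description step by step; a priority queue could replace it for larger graphs.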
Breadth-first search and Dijkstra's Algorithm are example techniques used to identify the shortest path. Other techniques such as iterative deepening depth-first search can also be used.
Once the shortest path is determined, the result is optionally displayed to the user who invoked the pathfinding function to inform the user of how the two focal individuals are connected.
In this example, Shirley Jones has authorized the platform to display her name in the pathfinding application. The individual represented by box 808, however, has not given authorization to display his name, and is therefore shown as “Anonymous.” Both Shirley and Anonymous have authorized certain metadata to be displayed. In this example, the metadata includes certain profile information provided by Shirley and Anonymous such as age, gender, and current city of residence. The metadata displayed can also include certain information inferred by the system. For example, by comparing the individuals' genotype information (e.g., DNA markers) with reference individuals known to be of a specific ancestry, it is determined that Shirley is of Irish ancestry and Anonymous is of African ancestry.
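The authorization behavior described above might be sketched as follows. The `Profile` class, its field names, and the label format are hypothetical and are shown only to illustrate how a display name and metadata could be gated on what each individual has authorized.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    name: str
    show_name: bool = False            # has the individual authorized name display?
    shared_metadata: dict = field(default_factory=dict)  # only authorized fields

def display_label(profile):
    """Build the text shown for one individual on a displayed path."""
    shown = profile.name if profile.show_name else "Anonymous"
    details = ", ".join(f"{k}: {v}" for k, v in profile.shared_metadata.items())
    return f"{shown} ({details})" if details else shown

shirley = Profile("Shirley Jones", True, {"ancestry": "Irish"})
hidden = Profile("withheld", False, {"ancestry": "African"})
```

Here `display_label(shirley)` yields "Shirley Jones (ancestry: Irish)" and `display_label(hidden)` yields "Anonymous (ancestry: African)".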
In some embodiments, the pathfinder application permits a user to select a celebrity as an individual in a focal pair. For example, instead of Alice Robbins, the second individual may be specified as Sergey Brin or Albert Einstein. How celebrities are identified depends on implementation. In some embodiments, a system administrator manually identifies celebrities as they join the personal genomics services platform, and marks their personal data accordingly. In some embodiments, celebrities are automatically identified by comparing their names and occupation with a database of celebrities. In some embodiments, out of privacy concerns, the platform places certain restrictions on how connections near a celebrity may be displayed. For example, paths including close relatives (e.g., people who are relatives within two generations) are excluded from consideration in some embodiments; as another example, in some embodiments, on a path involving a close relative of a celebrity, the name and metadata associated with that close relative are not displayed.
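The two example restrictions above could be sketched as follows. Both function names, the `close_relatives` set, and the "Hidden" placeholder are hypothetical; the sketch also assumes that only intermediate individuals on a path, not the two focal individuals, are subject to the restriction.

```python
def filter_paths_near_celebrity(paths, close_relatives):
    """Drop candidate paths that pass through a close relative of the celebrity."""
    return [
        path for path in paths
        if not any(person in close_relatives for person in path[1:-1])
    ]

def redact_close_relatives(path, close_relatives, placeholder="Hidden"):
    """Keep the path but withhold the identity of intermediate close relatives."""
    last = len(path) - 1
    return [
        placeholder if 0 < i < last and person in close_relatives else person
        for i, person in enumerate(path)
    ]
```

The first function corresponds to excluding such paths from consideration; the second corresponds to keeping the path while suppressing the close relative's name and metadata.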
In some cases, multiple shortest paths are found.
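One way to collect every shortest path, measured by number of connections, is a breadth-first search that records all predecessors lying one level closer to the first individual. The following Python is a sketch under that assumption; the function name `all_shortest_paths` and the graph format (individual id mapped to its neighbors) are illustrative.

```python
from collections import deque

def all_shortest_paths(graph, first, second):
    """Return every path from first to second with the fewest connections."""
    depth = {first: 0}
    parents = {first: []}
    queue = deque([first])

    while queue:
        node = queue.popleft()
        if node == second:
            break  # all predecessors of the destination are already recorded
        for neighbor in graph.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                parents[neighbor] = [node]
                queue.append(neighbor)
            elif depth[neighbor] == depth[node] + 1:
                parents[neighbor].append(node)  # another equally short route

    if second not in depth:
        return []

    def build(node):
        # Walk the predecessor lists back to the first individual.
        if node == first:
            return [[first]]
        return [p + [node] for parent in parents[node] for p in build(parent)]

    return build(second)
```

For a graph in which individual 1 is connected to 2 and 3, and both 2 and 3 are connected to 4, the call `all_shortest_paths(graph, 1, 4)` returns the two paths through 2 and through 3.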
In some embodiments, instead of or in addition to displaying metadata of the individuals in the paths in the manner shown in
Finding a relative connections path between two individuals in a database has been described. By utilizing a relative connections graph, the pathfinder application can quickly determine a shortest connections path, providing insight into how the individuals are related.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application claims priority to U.S. Provisional Patent Application No. 61/656,298 entitled DETERMINING FAMILY CONNECTIONS OF INDIVIDUALS IN A DATABASE, filed Jun. 6, 2012, which is incorporated herein by reference in its entirety for all purposes.
Number | Name | Date | Kind |
---|---|---|---|
6416325 | Gross | Jul 2002 | B2 |
7512612 | Akella et al. | Mar 2009 | B1 |
7957907 | Sorenson et al. | Jun 2011 | B2 |
8187811 | Eriksson et al. | May 2012 | B2 |
8271201 | Chakraborty et al. | Sep 2012 | B2 |
8428886 | Wong et al. | Apr 2013 | B2 |
8463554 | Hon et al. | Jun 2013 | B2 |
8510057 | Avey et al. | Aug 2013 | B1 |
8543339 | Wojcicki et al. | Sep 2013 | B2 |
8589437 | Khomenko et al. | Nov 2013 | B1 |
8645343 | Wong et al. | Feb 2014 | B2 |
8719304 | Golze | May 2014 | B2 |
8738297 | Sorenson et al. | May 2014 | B2 |
8786603 | Rasmussen et al. | Jul 2014 | B2 |
8855935 | Myres et al. | Oct 2014 | B2 |
8913797 | Siddavanahalli | Dec 2014 | B1 |
8990250 | Chowdry et al. | Mar 2015 | B1 |
9116882 | Macpherson et al. | Aug 2015 | B1 |
9213944 | Do et al. | Dec 2015 | B1 |
9213947 | Do et al. | Dec 2015 | B1 |
9218451 | Wong et al. | Dec 2015 | B2 |
9336177 | Hawthorne et al. | May 2016 | B2 |
9367800 | Do et al. | Jun 2016 | B1 |
9390225 | Barber et al. | Jul 2016 | B2 |
9405818 | Chowdry et al. | Aug 2016 | B2 |
9836576 | Do et al. | Dec 2017 | B1 |
9864835 | Avey et al. | Jan 2018 | B2 |
20020032687 | Huff | Mar 2002 | A1 |
20030172065 | Sorenson et al. | Sep 2003 | A1 |
20050075917 | Flores | Apr 2005 | A1 |
20050114364 | Tebbs et al. | May 2005 | A1 |
20050147947 | Cookson, Jr. | Jul 2005 | A1 |
20060287876 | Jedlicka | Dec 2006 | A1 |
20070168368 | Stone | Jul 2007 | A1 |
20070178500 | Martin et al. | Aug 2007 | A1 |
20070226248 | Darr | Sep 2007 | A1 |
20080040046 | Chakraborty | Feb 2008 | A1 |
20080154566 | Myres et al. | Jun 2008 | A1 |
20080215301 | Eyal et al. | Sep 2008 | A1 |
20080227063 | Kenedy et al. | Sep 2008 | A1 |
20090118131 | Avey et al. | May 2009 | A1 |
20090119083 | Avey et al. | May 2009 | A1 |
20090240722 | Yu et al. | Sep 2009 | A1 |
20100049736 | Rolls et al. | Feb 2010 | A1 |
20100138374 | Chakraborty et al. | Jun 2010 | A1 |
20100223281 | Hon et al. | Sep 2010 | A1 |
20110004581 | Schmidt et al. | Jan 2011 | A1 |
20110137944 | Rolls | Jun 2011 | A1 |
20110202846 | Najork | Aug 2011 | A1 |
20120207690 | Weill et al. | Aug 2012 | A1 |
20120232796 | Keerthi | Sep 2012 | A1 |
20120270794 | Eriksson et al. | Oct 2012 | A1 |
20130131994 | Birdwell et al. | May 2013 | A1 |
20130254213 | Cheng et al. | Sep 2013 | A1 |
20130345988 | Avey et al. | Dec 2013 | A1 |
20140006433 | Hon et al. | Jan 2014 | A1 |
20140067355 | Noto et al. | Mar 2014 | A1 |
20140278138 | Barber et al. | Sep 2014 | A1 |
20160026755 | Byrnes et al. | Jan 2016 | A1 |
20160103950 | Myres et al. | Apr 2016 | A1 |
20160171155 | Do et al. | Jun 2016 | A1 |
20160277408 | Hawthorne et al. | Sep 2016 | A1 |
20160350479 | Han et al. | Dec 2016 | A1 |
20170011042 | Kermany et al. | Jan 2017 | A1 |
20170017752 | Noto et al. | Jan 2017 | A1 |
20170220738 | Barber et al. | Aug 2017 | A1 |
20170228498 | Hon et al. | Aug 2017 | A1 |
20170277827 | Granka et al. | Sep 2017 | A1 |
20170277828 | Avey et al. | Sep 2017 | A1 |
20170329891 | Macpherson et al. | Nov 2017 | A1 |
20170329899 | Bryc et al. | Nov 2017 | A1 |
20170329901 | Chowdry et al. | Nov 2017 | A1 |
20170329902 | Bryc et al. | Nov 2017 | A1 |
20170329904 | Naughton et al. | Nov 2017 | A1 |
20170329915 | Kittredge et al. | Nov 2017 | A1 |
20170329924 | Macpherson et al. | Nov 2017 | A1 |
20170330358 | Macpherson et al. | Nov 2017 | A1 |
Number | Date | Country |
---|---|---|
2016073953 | May 2016 | WO |
Entry |
---|
Dodds, et al., “An Experimental Study of Search in Global Social Networks,” Science vol. 301, Aug. 8, 2003, pp. 827-829. |
Travis, et al., “An Experimental Study of the Small World Problem,” Sociometry, vol. 32, No. 4, Dec. 1969, pp. 425-443. |
Easley, et al., “Networks, Crowds, and Markets: Reasoning about a Highly Connected World,” Chapter 20: The Small-World Phenomenon, Draft version, Jun. 10, 2010, pp. 611-644. |
Milgram, S., “The Small World Problem,” Psychology Today, vol. 1, No. 1, May 1967, pp. 61-67. |
Schnettler, S., “A small world on feet of clay? A comparison of empirical small-world studies against best-practice criteria,” Social Networks vol. 31, 2009, pp. 179-189. |
Schnettler, S., “A structured overview of 50 years of small-world research,” Social Networks, vol. 31, 2009, pp. 165-178. |
Number | Date | Country |
---|---|---|
20170329866 A1 | Nov 2017 | US |
Number | Date | Country |
---|---|---|
61656298 | Jun 2012 | US |