A nearest neighbor query looks to a group of objects to find the object among the group that has the shortest distance to a query point. Different variations on this query are possible.
An application of this query may be used when a user wants to plan several trips to different locations in some sequence. The user may alternatively desire to make a trip to different types of locations in some sequence. It may be desirable to find the optimal route between the points selected in this way.
The present application describes techniques which enable determination of an optimal sequenced route.
Embodiments describe techniques to carry this out via a query, for example, using spatial databases. Other embodiments describe techniques to minimize the amount of processing, and/or the memory space, used for this operation.
FIGS. 3a-3h show different iterations carried out in a first embodiment;
The embodiment describes a feature called the optimal sequenced route determination. The determination can be made based on a query. Consider one application of the optimal sequenced route query.
A user may plan a trip, for example by automobile, where the trip planner intends to first leave home towards a gas station to fuel the car, then to a library branch to check in a book, and finally to a post office to mail a package. The user typically prefers to drive the minimum overall distance.
Defining the locations of the points, with gas station gi, library branch lj, and post office pk, the problem can be considered as one of choosing the sequence between these points which shortens the trip in distance or time. The way of doing this may be based on the user's preferences, that is considering distance or time. This route is referred to herein as the optimal sequenced route.
Commercial applications for this kind of nearest neighbor query may include automated navigation devices for vehicles and computerized map services. These queries may also be used in crisis management, as well as in defense and intelligence systems. This kind of query may be useful to provide an ability to respond to a series of incidents in the shortest possible time in these and other analogous applications.
Simply performing a series of independent nearest neighbor queries to the different locations will produce an answer, but one that is not likely to be the optimal answer.
One simple way of solving the problem will be dubbed the “greedy” approach. The greedy approach might first locate the closest gas station to p, which in
However, examining
Embodiments describe finding the optimal sequenced route. The problem of doing so is closely related to the known traveling salesman problem. The traveling salesman problem asks for the minimum “cost” of a round-trip route from a starting point to a given set of points. The traveling salesman problem is effectively a search for the Hamiltonian cycle with the least weight in a weighted graph. There are, however, differences between the traveling salesman problem and the present problem of the optimal sequenced route. While the traveling salesman problem requires that all of the points in the set be visited, the optimal sequenced route visits only one point from each point set, with the sets taken in a specified sequence.
Another similar problem is the sequential ordering problem, in which a Hamiltonian path with a specific node precedence constraint is required. The sequential ordering problem, however, requires a solution which passes through all the points in the set, like in all the traveling salesman problems.
The inventors recognized that certain applications require a very different analysis, specifically efficient selection of a sequence of points, each of which can be any member of a given point set. This differs from many conventional searches of this type, such as the Yellow Pages on Yahoo and MapQuest. A search only for the K-nearest neighbors in one specific category or point set to a given query location cannot find the optimal sequenced route from the query location to a group of point sets.
The embodiment describes how this new kind of query can be carried out.
Defining the problem—U1, U2, U3 . . . Un are n sets, each containing points in a d-dimensional space Rd. D(.) is a distance metric defined in Rd, where D(.) obeys the triangular inequality.
As an example,
First, the problem is defined mathematically by the following definitions, using the notation reproduced in table 1.
Definition 1: Given n, the number of point sets Ui, we say M=(M1, M2, . . . , Mm) is a sequence if and only if 1≦Mi≦n for 1≦i≦m. That is, given the point sets Ui, a user's OSR query is valid only if it asks for existing location types. For the example of
Definition 2: R=(P1,P2, . . . ,Pr) is a route if and only if Pi∈Rd for each 1≦i≦r. p⊕R=(p,P1, . . . ,Pr) denotes a new route that starts from starting point p and goes sequentially through P1 to Pr. The route p⊕R is the result of adding p to the head of route R.
Definition 3: The length of a route R=(P1, P2, . . . , Pr) is defined as
L(R)=D(P1,P2)+D(P2,P3)+ . . . +D(Pr−1,Pr) (1)
Note that L(R)=0 for r=1. For example, the length of the route (g2, l2, g3) in
Definition 4: Let M=(M1, M2, . . . , Mm) be a sequence. We refer to the route R=(P1,P2, . . . ,Pm) as a sequenced route that follows sequence M if and only if Pi∈UMi for each 1≦i≦m.
Definition 5: Given the starting point p, a sequence M=(M1, . . . , Mm), and point sets {U1, . . . , Un}, we refer to Rg(p,M)=(P1, . . . , Pm) as the greedy sequenced route that follows M from point p if and only if it satisfies the following:
1. P1 is the closest point to p in UM1
2. For 1≦i<m, Pi+1 is the closest point to Pi in UMi+1
Rg(p,M) is unique for a given point p, a sequence M, and the sets Ui. Moreover, by definition, the optimal sequenced route R is never longer than the greedy sequenced route for the given sequence M, i.e., L(p,R)≦L(p, Rg(p,M)).
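Definitions 3 and 5 can be illustrated with a minimal Python sketch. It assumes Euclidean distance, points stored as coordinate tuples, and a dictionary `point_sets` keyed by set index; the function names and the sample coordinates are hypothetical and are not taken from the figures.

```python
from math import dist

def route_length(route):
    """L(R): sum of the distances between consecutive points (0 for one point)."""
    return sum(dist(a, b) for a, b in zip(route, route[1:]))

def greedy_route(p, sequence, point_sets):
    """Rg(p, M): repeatedly move to the nearest point of the next set in M."""
    route, current = [], p
    for set_id in sequence:
        nearest = min(point_sets[set_id], key=lambda x: dist(current, x))
        route.append(nearest)
        current = nearest
    return route

# Hypothetical data: two point sets and the sequence M = (1, 2)
p = (0.0, 0.0)
point_sets = {1: [(1.0, 0.0), (0.0, 2.0)], 2: [(0.0, 3.0), (10.0, 0.0)]}
rg = greedy_route(p, (1, 2), point_sets)
greedy_total = dist(p, rg[0]) + route_length(rg)   # L(p, Rg(p, M))
```

With this data the greedy route starts at (1, 0), while the route through (0, 2) and (0, 3) is shorter overall, which is the kind of gap the OSR query is meant to close.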
The actual query for the optimal sequenced route is then defined as:
Definition 6: Assume that we are given a sequence M=(M1, M2 . . . , Mm). For a given starting point p in Rd and the sequence M, the Optimal Sequenced Route (OSR) Query, Q(p,M), is defined as finding a sequenced route R that follows M where the value of the following function L is minimum over all the sequenced routes that follow M:
L(p,R)=D(p,P1)+L(R) (2)
Note that L(p,R) is in fact the length of route Rp=p⊕R.
Q(p,M)=(P1,P2, . . . , Pm) is used to denote the optimal SR, the answer to the OSR query Q. For the example above where (U1, U2, U3)=(black, white, gray), M=(2,1,3), and D is the shortest path, the answer to the OSR query is Q(p,M)=(g1, l1, p1). The term “candidate SR” is used to refer to all other sequenced routes that follow sequence M.
In order to answer the query, a number of properties of the routes are used to advantage.
Property 1: for a route R=(P1, . . . ,Pi, Pi+1, . . . ,Pr) and a given point p:
L(p,R)≧D(p,Pi)+L((Pi, . . . ,Pr)) (3)
Proof: The triangular inequality implies that D(p,P1)+L((P1, . . . ,Pi))≧D(p,Pi). Adding L((Pi, . . . ,Pr)) to both sides of the inequality and considering the definition of the function L( ) in Equation 2, yields Equation 3.
Property 1 is used to reduce the set of candidate sequenced routes for Q(p,M) by filtering out the points whose distance to p is greater than a threshold, and hence cannot possibly be the optimal route. Note that this property is applicable to all routes in the space.
The answer to the OSR query Q(p,M) demonstrates the following two unique properties. We utilize these properties to improve the exhaustive search among all potential routes of a given sequence.
Property 2: If Q(p,M)=R=(P1, . . . ,Pm−1,Pm), then Pm is the closest point to Pm−1 in UMm.
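Proof: The proof of this property is by contradiction. Assume that the closest point to Pm−1 in UMm is a point P′ other than Pm, so that D(Pm−1,P′)<D(Pm−1,Pm). Then the route (P1, . . . ,Pm−1,P′) follows M and has L(p,(P1, . . . ,Pm−1,P′))<L(p,R), which contradicts the assumption that R is the optimal sequenced route.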
Property 2 states that given that P1, . . . , Pm−1 are on the optimal route, it is only required to find the first nearest neighbor of Pm−1 to complete the route; subsequent nearest neighbors cannot possibly be on the optimal route and hence will not be examined. Note that this property does not prove that the greedy route is always optimal. Instead, it implies that only the last point of the optimal sequenced route R (i.e., Pm) is guaranteed to be the nearest point to its previous point in the route (i.e., Pm−1).
Property 3: If Q(p,M)=(P1, . . . ,Pi, Pi+1, . . . , Pm) for the sequence M=(M1, . . . , Mi, Mi+1, . . . , Mm), then for any point Pi and M′=(Mi+1, . . . , Mm), we have Q(Pi,M′)=(Pi+1, . . . , Pm).
Proof: The proof of this property is by contradiction. Assume that Q(Pi,M′)=R′=(P′1, . . . , P′m−i) is different from (Pi+1, . . . , Pm). Obviously (Pi+1, . . . , Pm) follows sequence M′, therefore we have L(Pi,R′)<L(Pi,(Pi+1, . . . , Pm)). We add L(p,(P1, . . . , Pi)) to both sides of this inequality to get L(p,(P1, . . . , Pi, P′1, . . . , P′m−i))<L(p,(P1, . . . , Pm)).
The above inequality shows that the answer to Q(p,M) must be (P1, . . . , Pi, P′1, . . . , P′m−i) which clearly follows sequence M. This contradicts our assumption that Q(p,M)=(P1, . . . , Pm).
The variables mentioned above are set forth in table 1.
Taking advantage of the above, the optimal sequenced route can be determined.
This can be calculated based on Dijkstra's algorithm.
An OSR query is carried out for a network with a starting point p, a sequence M, and point sets {UM1, . . . , UMm}. A weighted directed graph G is constructed for the network. The vertex set V is the union of the sets UMi for 1≦i≦m, together with the starting point p.
The operation proceeds according to the flowchart of
This graph in fact shows all the possible candidate sequenced routes for the given M and the set of Us. Mathematically, this graph shows all the routes Rp=p⊕R where R is any candidate sequenced route.
From the definitions above, the optimal route for a given query is the candidate sequenced route for which Rp has the minimum length. 710 illustrates examining all the paths to find the minimum length. Graph G illustrates how the optimal sequenced route query can be treated as finding the shortest, or minimum weight, paths from p to each of the vertices that correspond to the points in UMm. The shortest of these paths is then taken as the optimal route.
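The graph-based search can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: it assumes Euclidean edge weights, assumes that p is connected to every point of UM1 and every point of UMi to every point of UMi+1, and generates edges on the fly rather than storing G. The name `osr_dijkstra` and the vertex encoding `(layer, point)` are hypothetical.

```python
import heapq
from math import dist

def osr_dijkstra(p, sequence, point_sets):
    """Shortest path from p through U_M1, ..., U_Mm, one point per set."""
    m = len(sequence)
    start = (0, p)                       # vertex = (layer index, point); layer 0 holds only p
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        d, vertex = heapq.heappop(heap)
        if d > best[vertex]:
            continue                     # stale heap entry
        layer, point = vertex
        if layer == m:                   # first settled vertex of the last layer is optimal
            route = []
            while vertex != start:
                route.append(vertex[1])
                vertex = parent[vertex]
            return d, route[::-1]
        for nxt in point_sets[sequence[layer]]:
            cand = (layer + 1, nxt)
            nd = d + dist(point, nxt)
            if nd < best.get(cand, float("inf")):
                best[cand] = nd
                parent[cand] = vertex
                heapq.heappush(heap, (nd, cand))
    return float("inf"), []

# Hypothetical data
p = (0.0, 0.0)
point_sets = {1: [(1.0, 0.0), (0.0, 2.0)], 2: [(0.0, 3.0), (10.0, 0.0)]}
length, route = osr_dijkstra(p, (1, 2), point_sets)
```

Because every vertex of the last layer is an acceptable destination, the search can stop as soon as the first such vertex is settled.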
This solution may become difficult to implement for larger sets because of the large cardinality of the sets Ui. For example, for a real world data set with 40,000 points and m being 3, the graph G may have 124 million edges. The running time of Dijkstra's algorithm grows with the number of edges multiplied by the log of the number of vertices, so the complexity of this technique scales accordingly. Also, the graph must be built and maintained in main memory 205. Accordingly, the memory necessary also scales with the size of the graph.
705 illustrates a set reduction technique that reduces the size of the sets. Different embodiments implement this in different ways. One embodiment that improves the performance chooses a value L. A range query is then carried out to select only those points that are closer to the starting point than L. For example, L may be the length of the greedy route Rg(p,M), or of any other route that can be easily calculated, e.g., using one calculation per leg of the trip. By Property 1, any route through a point outside this range is longer than the greedy route, and hence the point can be ignored.
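The set reduction step might look like the following sketch, which uses the greedy route length as the bound and a brute-force scan as a stand-in for the range query. The helper names and the data are hypothetical, and Euclidean distance is assumed.

```python
from math import dist

def greedy_length(p, sequence, point_sets):
    """Length of p followed by the greedy route, one nearest-neighbor hop per leg."""
    total, current = 0.0, p
    for set_id in sequence:
        nearest = min(point_sets[set_id], key=lambda x: dist(current, x))
        total += dist(current, nearest)
        current = nearest
    return total

def reduce_sets(p, point_sets, bound):
    """Keep only the points within `bound` of p (stand-in for a range query)."""
    return {k: [x for x in pts if dist(p, x) <= bound] for k, pts in point_sets.items()}

# Hypothetical data
p = (0.0, 0.0)
point_sets = {1: [(1.0, 0.0), (0.0, 2.0)], 2: [(0.0, 3.0), (10.0, 0.0)]}
bound = greedy_length(p, (1, 2), point_sets)
reduced = reduce_sets(p, point_sets, bound)
```

Here the point (10, 0) lies farther from p than the whole greedy route is long, so it is dropped before any graph is built.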
Another embodiment calculates the optimal sequenced route in vector space.
This embodiment assumes that the distance function D is the Euclidean distance between points in the space Rd.
A first embodiment is considered a light algorithm, since it is light in terms of memory usage/workspace required. According to this embodiment, and as shown in 800 of
This embodiment uses two different thresholds to minimize the amount of work and/or workspace at 805. A variable threshold Tv changes at each iteration. A constant threshold Tc represents the length of the greedy route. These thresholds are used to eliminate possibilities, and hence to minimize the size of the solution space. In this embodiment, only those points in the set that can be added to the partial sequenced routes and will not generate routes that are longer than the variable threshold value Tv, are added. The embodiment also examines the partial sequenced routes by calculating their lengths after adding the value p and discards those routes at 810 whose corresponding length is more than a constant threshold value Tc, where Tc is the length of the so-called “greedy” route.
FIG. 3a depicts a starting point p and 3 different sets of points U1, U2, and U3, which are respectively shown as filled points, hollow points and shaded points. The optimal sequenced route requires finding the route R with the minimum L(p,R) from white to black to gray from the start point. The query is therefore formulated as Q(p,(2,1,3)).
The program first issues m=3 consecutive nearest neighbor queries, to find the greedy route that follows (2, 1, 3) from p. This is done, as described above, by first finding the closest w to p, which here is w2. Then it finds the closest b to w2, here b2. Then, it finds the closest g to b2, here g2.
FIG. 3b shows the greedy route Rg(p,(2,1,3)) as (w2, b2, g2).
The embodiment initializes the threshold values Tv and Tc to the length of the route p⊕Rg(p,M). The value of Tc remains constant, while the value of Tv is reduced after each iteration.
Subsequently, the system discards all the points whose distances to p are greater than Tv, that is the points that are outside the circle shown in
The system then generates a set S of partial candidate routes and inserts the “gray nodes” which are inside the circle in
In the first iteration, each point χεUM
As another simplification, at 815, if there are partial sequenced routes which have the same first point, only the partial sequenced route with the shortest length will be kept in S, based on property 2.
In addition, any partial sequenced route that cannot have x added to it will be discarded. For example, in
In the example, at the end of the first iteration, the threshold Tv is decreased at 802 as follows. Suppose that Q(p,M)=(q1, . . . , qi, . . . ,qm) and we are examining iteration (m−i+1) (i.e., the partial SRs in S are in the form of (Pi+1, . . . ,Pm)). The definition of the greedy route implies that L(p,(q1, . . . ,qm))≦L(p,Rg(p,M))=Tc and by considering Property 1, we have:
D(p,qi)+L((qi+1, . . . ,qm))<D(p,qi)+L((qi, . . . ,qm))≦Tc which can be rewritten as:
D(p,qi)≦Tc−L((qi+1, . . . ,qm)) (4)
Note that the inequality 4 must hold for all points qi that are to be examined at iteration (m−i+1). Hence, by replacing L((qi+1, . . . ,qm)) with its minimum value, we obtain the maximum value for D(p,qi) for any qi. Therefore, for any point qi that is examined in iteration (m−i+1), we must have D(p,qi)≦Tv=Tc−minPSR∈S(L(PSR)).
Note that at each iteration, the lengths of the partial SRs in S, and hence the value of minPSR∈S(L(PSR)), increase. This yields smaller values for Tv after each iteration. This is also shown in
At the end of each iteration, the value of the variable threshold Tv is decreased. {(b6,g5), (b4,g3), (b3,g3), (b2,g2), (b1,g2)}
The subsequent iterations are performed in a similar way. The partial routes in the set S become more complete routes, that is candidate sequenced routes that follow M after the last iteration is completed.
As the final step, the technique examines the distance from p to the first point in each complete route in the set (i.e., {(w2,b2, g2), (w3,b4,g3)}) and selects the route that generates the minimum total distance, that is the route with a minimum value for the L( ) function as a result of Q(p, (2,1,3)). This is shown in
This can be carried out according to the following pseudo code:
In the pseudocode, lines 3 through 5 perform the first range query using the variable threshold, and initialize the set of partial sequenced routes. The iterations are performed in lines 6-16. Lines 9 and 12 check to see if a point can be added to the partial sequenced routes, and line 16 updates the value of the variable threshold. Finally, line 17 returns the route with the minimum L( ) as the result of Q.
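The threshold-based procedure described above can be approximated by the following Python sketch. It is not the pseudocode referred to in the text and its line numbers do not correspond to it; it assumes Euclidean distance, brute-force scans in place of range queries, and a small tolerance `EPS` (an addition made here) so that floating point rounding never prunes the greedy route itself. The function name `lord` and the data are hypothetical.

```python
from math import dist

EPS = 1e-9  # assumed tolerance; not part of the described technique

def lord(p, sequence, point_sets):
    # The greedy route gives the constant threshold Tc
    current, tc = p, 0.0
    for set_id in sequence:
        nearest = min(point_sets[set_id], key=lambda x: dist(current, x))
        tc += dist(current, nearest)
        current = nearest
    tv = tc                                           # variable threshold Tv
    # S maps the first point of each partial route to (length, route)
    s = {x: (0.0, (x,)) for x in point_sets[sequence[-1]] if dist(p, x) <= tv + EPS}
    for set_id in reversed(sequence[:-1]):            # build the routes back to front
        new_s = {}
        for x in point_sets[set_id]:
            d_px = dist(p, x)
            if d_px > tv + EPS:                       # outside the circle of radius Tv
                continue
            options = [(dist(x, first) + length, (x,) + route)
                       for first, (length, route) in s.items()
                       if d_px + dist(x, first) + length <= tc + EPS]
            if options:                               # keep only the shortest route per first point
                new_s[x] = min(options)
        s = new_s
        tv = tc - min(length for length, _ in s.values())
    return min((dist(p, route[0]) + length, route) for length, route in s.values())

# Hypothetical data with M = (1, 2, 3)
p = (0.0, 0.0)
point_sets = {1: [(1.0, 0.0), (0.0, 2.0)],
              2: [(0.0, 3.0), (10.0, 0.0)],
              3: [(0.0, 4.0), (5.0, 5.0)]}
best_length, best_route = lord(p, (1, 2, 3), point_sets)
```

In this sketch the set S is a dictionary keyed by the first point of each partial route, which makes the rule of keeping one route per first point a simple overwrite.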
Another embodiment allows the points in Ui to be stored in an R-tree index structure. This embodiment uses the neighborhood information of the points that is inherently stored in the R-tree to more efficiently prune the candidate points at each iteration. In the embodiment, the point selection criterion is changed to a range query of the type that is applicable on an R-tree. This point selection can be performed using a single range query.
In this embodiment, and as in the previous embodiment, the system prunes the points in Um. A first pruning step eliminates points of the set that are farther than the variable threshold from the starting point. This is done with a range query (Q1) using a circle with radius Tv surrounding the starting point p.
A second pruning step checks the points that are returned from the first query step against other partial sequenced routes. If adding a point to that partial sequenced route makes it greater than the length of the greedy route (Tc), then the point is not added. Otherwise, a new partial sequenced route is generated.
To identify Range (Q2), we first find the locus of the points χ which can possibly be added to a PSR=(P1, . . . , P|PSR|)∈S. For such a point χ, we must have D(χ,p)+D(χ,P1)≦Tc−L(PSR) (Line 12 in the pseudocode). As L(PSR) and Tc are constant values for a given PSR and query Q(p,M), the sum of χ's distances from two fixed points p and P1 cannot be larger than a constant. Hence, χ must be on or inside an ellipse defined by the foci p and P1 and the constant Tc−L(PSR).
Query Q2 is defined in terms of the set of partial SRs stored in S in the current iteration. For each PSR, points inside ellipse E(p,PSR) are added to the head of the PSR in order to build a new partial candidate route. All such ellipses, each corresponding to a partial SR in S, intersect, as they all share the common focus point p. The union of these ellipses contains all the points χ (of the appropriate set), where for each, there is exactly one route starting with χ built at the end of the current iteration. In other words, this union should be the range used in query Q2.
Up to this point, we have identified the range of the two main queries Q1 and Q2 used in the program. The following shows that any ellipse for the range Q2 is entirely inside the circle for range Q1 and hence, the range of Q2 is completely inside that of Q1.
Lemma 1. During each iteration of the program for Q(p,M), given a partial SR PSR∈S, any point χ inside or on the ellipse E(p,PSR) has a distance less than the current value of the variable threshold Tv from point p (i.e., D(χ,p)<Tv).
Proof. As point χ is inside or on ellipse E(p,PSR) corresponding to the route PSR, we have
D(χ,p)+D(χ,P1)≦Tc−L(PSR)≦Tc−minPSR∈S(L(PSR))
The right side of the above inequality has the same value as that of the current value of Tv. It directly yields that D(χ,p)≦Tv−D(χ,P1) and subsequently, we have D(χ,p)<Tv.
Lemma 1 shows that any ellipse E(p,PSR) is completely inside the circular range of Q1. Now, as Range (Q2) is the union of all ellipses E(p,PSR) corresponding to all the partial SRs in S, it can be concluded that it is entirely inside Range (Q1).
Note that at each iteration, the program builds a new route using only the points in the intersection of Range (Q1) and Range (Q2). Given Lemma 1, this intersection is the same as Range (Q2). Hence, the algorithm must only consider the points which are within the range of Q2 from p, to be added to the partial SRs in S.
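The ellipse test and the resulting Range (Q2) can be sketched as follows. The sketch assumes Euclidean distance and represents each partial SR as a `(route, length)` pair; the function names and the sample values are hypothetical.

```python
from math import dist

def in_ellipse(x, p, route, route_length, tc):
    """True if x is on or inside E(p, PSR): foci p and P1, constant Tc - L(PSR)."""
    return dist(x, p) + dist(x, route[0]) <= tc - route_length

def range_q2(candidates, p, partial_routes, tc):
    """Points lying in the union of the ellipses of all partial SRs in S."""
    return [x for x in candidates
            if any(in_ellipse(x, p, r, length, tc) for r, length in partial_routes)]

# Hypothetical values: one partial SR of length 1 and Tc = 6
p = (0.0, 0.0)
tc = 6.0
partial_routes = [(((4.0, 0.0), (5.0, 0.0)), 1.0)]
candidates = [(2.0, 1.0), (2.0, 2.0), (-1.0, 0.0)]
selected = range_q2(candidates, p, partial_routes, tc)
tv = tc - min(length for _, length in partial_routes)   # current variable threshold
```

Every selected point is also within Tv of p, which is the containment stated by Lemma 1.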
This embodiment acts as an R-tree Friendly Program by transforming the threshold values into range queries that can be performed on R-tree index structures. The above has shown that the two range queries Q1 and Q2 employed by the program can be reduced to only one, as Q2 is entirely inside Q1. However, as
To retrieve the points in a specific range, we need to traverse the R-tree from its root down to the leaves and report those points that are within the given range. To make the search efficient, existing search algorithms on R-trees prune subtrees of the main tree utilizing some metrics. The most common metric, mindist(N,q), provides a lower bound on the smallest distance between the point q and any point in the subtree of node N. We utilize the minimum distance for Q1 as its range is relative to a fixed point p. Any Rj-tree node N with mindist(N,p) greater than threshold Tv cannot contain a point q with the distance D(p,q) less than or equal to Tv. Such a node can be easily pruned when traversing the R-tree during our first range query (i.e., Q1). Moreover, query Q1 is used to initialize the PSRs of LORD (Lines 3-5 in the pseudocode).
The second rectangular range query (i.e., MBR (Q2)) can be performed as follows. We first check whether a node N of the R-tree intersects with the rectangle. If their intersection is empty, the node N is pruned; otherwise, the child nodes of N must be checked for their intersection with MBR (Q2).
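The two pruning tests can be sketched in Python as follows. The node layout (a dictionary with an `mbr` and either `children` or `points`) is a hypothetical stand-in for an R-tree node, not the API of any particular R-tree library; rectangles are `(xmin, ymin, xmax, ymax)` tuples in two dimensions.

```python
from math import hypot

def mindist(mbr, q):
    """Lower bound on the distance from q to any point inside the rectangle."""
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return hypot(dx, dy)

def intersects(a, b):
    """True if rectangles a and b overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query_q1(node, p, tv):
    """Circular range query: report points within tv of p, pruning by mindist."""
    if mindist(node["mbr"], p) > tv:
        return []                                   # whole subtree is out of range
    if "points" in node:                            # leaf node
        return [x for x in node["points"] if hypot(x[0] - p[0], x[1] - p[1]) <= tv]
    return [x for child in node["children"] for x in range_query_q1(child, p, tv)]

# Hypothetical two-leaf tree
tree = {"mbr": (0.0, 0.0, 10.0, 10.0), "children": [
    {"mbr": (0.0, 0.0, 2.0, 2.0), "points": [(1.0, 1.0), (2.0, 2.0)]},
    {"mbr": (8.0, 8.0, 10.0, 10.0), "points": [(9.0, 9.0)]},
]}
hits = range_query_q1(tree, (0.0, 0.0), 2.0)
```

The rectangular query MBR (Q2) would use `intersects` in place of `mindist` at each node, descending only into children whose rectangles overlap the query rectangle.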
Now that both of the range queries used to select the points have been identified, and their use has been studied, another embodiment, called R-LORD, is described: the R-tree version of LORD. A difference between R-LORD and LORD is that R-LORD incorporates the R-tree implementation of the two range queries of LORD in its iterations. First, it initializes the set S with the partial SRs of length zero, each including a single point of the set of points returned from the function RQ1(p,Tc,Mm) (
The embodiments discussed above may be efficiently carried out in vector space. However, these embodiments may be difficult to use in a metric space. Certain of the functions applied above may render it difficult to use these features in metric spaces where the distance is usually a computationally complex function.
Another embodiment, intended for use in metric space, uses progressive neighbor exploration to address optimal sequenced route queries in metric spaces for arbitrary values of m. Progressive neighbor exploration incrementally creates a set of candidate routes for Q(p,M) in the same sequence as M, that is from p to UMm. In the embodiment, this is done through an iterative process which starts by examining the nearest neighbor to p in the set UM1, generates the partial sequenced route from p to this neighbor, and stores the candidate route in a heap based on its length. Each subsequent iteration fetches the partial sequenced route from the top of the heap and examines it as follows.
1. If |PSR|=m, meaning that the number of nodes in the partial SR is equal to the number of items in M and hence PSR is a candidate SR that follows M, the PSR is selected as the optimal route for Q(p,M) since it also has the shortest length.
2. If |PSR|≠m:
(a) First the last point in PSR,r|PSR|, (which belongs to UM
(b) We then find the nearest neighbor in UM
A concrete example is described using the above example. The weighted directed graph of
Next, the next nearest gi to p, g1, is found and placed into the heap. Similarly to the above, this process repeats until the route on the top of the heap is a candidate sequenced route that follows the sequence M.
Note that this technique requires keeping only one candidate sequenced route in the heap. If during any step 2(b), a route with m points is generated, it is only added to the heap if there is no other candidate sequenced route that has a shorter length in the heap. Moreover, any time a candidate sequenced route is added to the heap, any other candidate sequenced route with a longer length is discarded. For example, table 2 illustrates the different steps. In step 6, adding the route (g2,l3,p3) with the length of 14 to the heap will result in discarding the route (g2,l2,p2) with the length of 15 from the heap (crossed out in the Figure).
The only requirement for PNE is a nearest neighbor approach that can progressively generate the neighbors. Hence, by employing an approach similar to INE [16] or VN3 [12], which are explicitly designed for metric spaces, PNE can address OSR queries in metric spaces. In theory PNE can work for vector spaces in a similar way; however, it is inefficient for these spaces where distance computation is not expensive. The reason is that PNE explores the candidate routes from the starting point, which might result in an exhaustive search. Instead, R-LORD optimizes this search by building the routes in the reverse sequence utilizing the R-tree index structure.
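Progressive neighbor exploration can be sketched in Python as follows. The sketch is an illustration under stated assumptions: Euclidean distance stands in for the metric, a cached brute-force sort stands in for an incremental nearest neighbor search such as INE or VN3, every point set is non-empty, and the rule of keeping only one complete candidate in the heap is omitted for brevity. The names `pne` and `kth_nearest` are hypothetical.

```python
import heapq
from math import dist

def pne(p, sequence, point_sets):
    m = len(sequence)
    cache = {}

    def kth_nearest(src, set_id, k):
        # Brute-force stand-in for a progressive nearest neighbor search
        key = (src, set_id)
        if key not in cache:
            cache[key] = sorted(point_sets[set_id], key=lambda x: dist(src, x))
        ranked = cache[key]
        return ranked[k] if k < len(ranked) else None

    first = kth_nearest(p, sequence[0], 0)
    heap = [(dist(p, first), (first,), 0)]       # (length from p, route, rank of last point)
    while heap:
        length, route, rank = heapq.heappop(heap)
        if len(route) == m:
            return length, route                 # shortest complete route reaches the top first
        prev = route[-2] if len(route) > 1 else p
        last = route[-1]
        # (a) replace the last point by the next nearest neighbor of its predecessor
        sibling = kth_nearest(prev, sequence[len(route) - 1], rank + 1)
        if sibling is not None:
            heapq.heappush(heap, (length - dist(prev, last) + dist(prev, sibling),
                                  route[:-1] + (sibling,), rank + 1))
        # (b) extend the route with the nearest neighbor of its last point in the next set
        child = kth_nearest(last, sequence[len(route)], 0)
        heapq.heappush(heap, (length + dist(last, child), route + (child,), 0))
    return None

# Hypothetical data with M = (1, 2, 3)
p = (0.0, 0.0)
point_sets = {1: [(1.0, 0.0), (0.0, 2.0)],
              2: [(0.0, 3.0), (10.0, 0.0)],
              3: [(0.0, 4.0), (5.0, 5.0)]}
length, route = pne(p, (1, 2, 3), point_sets)
```

Both pushes produce routes that are no shorter than the route just fetched, which is why the first complete route to reach the top of the heap can be returned.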
Another embodiment adds the additional parameter of a separate endpoint to any of the above embodiments.
Initially, this is defined as a query:
Definition 8: Given source point p, destination point q and a sequence M, the OSR-I query is defined as R=(P1, . . . , Pm), a sequenced route that follows M, where the following function G is minimum over all sequenced routes that follow M:
G(p,R,q)=D(p,P1)+L(R)+D(Pm,q) (6)
The above equation is similar to L(p,R)+D(Pm,q). We show that this new form of OSR can easily be reduced to the general form of OSR.
We define a new set Un+1={q}. Including this new set in the set of Ui's makes M′=(M1, . . . , Mm, n+1) a valid sequence in the new setting of the problem. Now if we assume that Q(p,M′)=R′=(P′1, . . . , P′m+1), we know that P′m+1 will be q as q is the only member of Un+1. Moreover, L(p,R′) is minimum over all candidate routes that follow M′. Recall that the length of the route R′p=p⊕R′ (i.e., L(p,R′)) is equal to D(p,P′1)+L(R′). We define the route R as (P′1, . . . , P′m) by excluding q from R′. It is clear that L(p,R′) is the same as D(p,P1)+L(R)+D(Pm,q). By comparing the latter expression with G(p,R,q) of Equation 6, we conclude that R is the answer to the OSR-I query given the source p, destination q and sequence M.
Since we have shown that OSR-I can be reduced to a general OSR problem, we are able to use our LORD (or R-LORD) algorithm to answer this query. Specifically, the answer to OSR-I given the source p, destination q, and sequence M is the same as the answer to LORD(p,M′) excluding the point q, where Un+1={q} and M′=(M1, . . . ,Mm,n+1). Although R-LORD can similarly solve OSR-I, we can further optimize it for OSR-I. This is achieved by neglecting the range query Q1 (i.e., RQ1(p,Tc,n+1)). This is because we know that the only point in this range is q. Therefore, the set S can be directly initialized to {(q)}.
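The reduction of OSR-I to the general OSR query can be sketched as follows. To keep the example short, an exhaustive solver stands in for LORD or R-LORD; the function names are hypothetical, Euclidean distance is assumed, and the point sets are assumed to be keyed by the integers 1 to n.

```python
from itertools import product
from math import dist

def osr_brute_force(p, sequence, point_sets):
    """Exhaustive OSR solver, used here only to illustrate the reduction."""
    def total(route):
        return dist(p, route[0]) + sum(dist(a, b) for a, b in zip(route, route[1:]))
    return min(product(*(point_sets[k] for k in sequence)), key=total)

def osr_with_destination(p, q, sequence, point_sets):
    new_id = max(point_sets) + 1                 # index n+1 of the added singleton set
    extended = dict(point_sets)
    extended[new_id] = [q]                       # U_{n+1} = {q}
    route = osr_brute_force(p, tuple(sequence) + (new_id,), extended)
    return route[:-1]                            # exclude q to obtain R

# Hypothetical data
p, q = (0.0, 0.0), (10.0, 1.0)
point_sets = {1: [(1.0, 0.0), (0.0, 2.0)], 2: [(0.0, 3.0), (10.0, 0.0)]}
without_destination = osr_brute_force(p, (1, 2), point_sets)
with_destination = osr_with_destination(p, q, (1, 2), point_sets)
```

With this data the fixed destination changes the answer: the route that is optimal without q ends far from q, so a different route minimizes G.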
The second variation of OSR is when the user asks for the k routes with the minimum total distances to its location. We define this as k-OSR query. We can easily address this type of query using our PNE approach discussed above.
Recall that in PNE, we maintain a heap of the partially completed sequenced routes and only keep one candidate sequenced route (or, in other words, a route that follows M), that is the one that has the minimum total length. By modifying this policy to maintain k candidate SRs in the heap and continuing the iterations until k candidate SRs are fetched from the heap, PNE can also address k-OSR queries.
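The k-OSR variant can be sketched by letting the heap-based exploration continue past the first complete route. The sketch below makes the same assumptions as a basic PNE illustration (Euclidean distance, brute-force ranking in place of an incremental nearest neighbor search, non-empty point sets); the names `routes_by_length` and `k_osr` are hypothetical.

```python
import heapq
from itertools import islice
from math import dist

def routes_by_length(p, sequence, point_sets):
    """Yield (length, route) for sequenced routes in non-decreasing order of length."""
    m = len(sequence)

    def ranked(src, set_id):
        return sorted(point_sets[set_id], key=lambda x: dist(src, x))

    first = ranked(p, sequence[0])[0]
    heap = [(dist(p, first), (first,), 0)]       # (length from p, route, rank of last point)
    while heap:
        length, route, rank = heapq.heappop(heap)
        prev = route[-2] if len(route) > 1 else p
        siblings = ranked(prev, sequence[len(route) - 1])
        if rank + 1 < len(siblings):             # next nearest neighbor of the predecessor
            nxt = siblings[rank + 1]
            heapq.heappush(heap, (length - dist(prev, route[-1]) + dist(prev, nxt),
                                  route[:-1] + (nxt,), rank + 1))
        if len(route) == m:
            yield length, route                  # complete candidate SR
        else:
            child = ranked(route[-1], sequence[len(route)])[0]
            heapq.heappush(heap, (length + dist(route[-1], child), route + (child,), 0))

def k_osr(p, sequence, point_sets, k):
    return list(islice(routes_by_length(p, sequence, point_sets), k))

# Hypothetical data
p = (0.0, 0.0)
point_sets = {1: [(1.0, 0.0), (0.0, 2.0)], 2: [(0.0, 3.0), (10.0, 0.0)]}
top2 = k_osr(p, (1, 2), point_sets, 2)
```

Unlike the single-route case, a complete route that is fetched still contributes its next-nearest alternative to the heap, so later answers are not lost.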
Although only a few embodiments have been disclosed in detail above, other embodiments are possible and the inventor(s) intend these to be encompassed within this specification. The specification describes specific examples to accomplish a more general goal that may be accomplished in another way. This disclosure is intended to be exemplary, and the claims are intended to cover any modification or alternative which might be predictable to a person having ordinary skill in the art. For example, other computers may be used, and may calculate the values in other space.
The computers described herein may be any kind of computer, either general purpose, or some specific purpose computer such as a workstation. The computer may be a Pentium class computer, running Windows XP or Linux, or may be a Macintosh computer. The programs may be written in C, or Java, or any other programming language. The programs may be resident on a storage medium, e.g., magnetic or optical, e.g. the computer hard drive, a removable disk or other removable medium. The programs may also be run over a network, for example, with a server or other machine sending signals to the local machine, which allows the local machine to carry out the operations described herein.
Also, the inventor(s) intend that only those claims which use the words “means for” are intended to be interpreted under 35 USC 112, sixth paragraph. Moreover, no limitations from the specification are intended to be read into any claims, unless those limitations are expressly included in the claims.
This application claims priority to U.S. Application Ser. No. 60/692,730, filed on Jun. 21, 2005. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.
The U.S. Government may have certain rights in this invention pursuant to Grant Nos. EEC-9529152, IIS-0324955 (ITR) and IIS-0238560 (PECASE) awarded by NSF.
Number | Date | Country
---|---|---
60692730 | Jun 2005 | US