One of the fundamental problems in internet image searches is to rank visual images according to a given textual query. Existing search engines can depend on text descriptions associated with visual images for ranking the images, or leverage query-image pairs annotated by human labelers to train a series of ranking functions. However, there are at least two major limitations to these approaches: 1) text descriptions associated with visual images are often noisy or too few to accurately or sufficiently describe salient aspects of image content, and 2) human labeling can be resource-intensive and can produce incomplete and/or erroneous labels. The present implementations can mitigate the above two fundamental challenges, among others.
The description relates to click-through-based cross-view learning for internet searches. One implementation includes receiving textual queries from a textual query space that has a first structure, visual images from a visual image space that has a second structure, and click-through data related to the textual queries and the visual images. Mapping functions can be learned that map the textual queries and the visual images into a click-through-based structured latent subspace based on the first structure, the second structure, and the click-through data. Another implementation includes determining distances among the textual queries and/or the visual images in the click-through-based structured latent subspace. Given new content, results can be sorted based on the distances in the click-through-based structured latent subspace.
The above listed example is intended to provide a quick reference to aid the reader and is not intended to define the scope of the concepts described herein.
The accompanying drawings illustrate implementations of the concepts conveyed in the present document. Features of the illustrated implementations can be more readily understood by reference to the following description taken in conjunction with the accompanying drawings. Like reference numbers in the various drawings are used wherever feasible to indicate like elements. In some cases parentheticals are utilized after a reference number to distinguish like elements. Use of the reference number without the associated parenthetical is generic to the element. Further, the left-most numeral of each reference number conveys the FIG. and associated discussion where the reference number is first introduced.
This description relates to improving results for internet searches and more specifically to click-through-based cross-view learning (CCL). In some implementations, click-through-based cross-view learning can include projecting textual queries and visual images into a latent subspace (e.g., a low-dimensional feature representation space). Click-through-based cross-view learning can make the different modalities (e.g., views) of the textual queries and the visual images comparable in the latent subspace (e.g., shared latent subspace, common latent subspace). For example, the textual queries and visual images can be compared by mapping distances in the latent subspace. In some cases, the distances can be mapped based on click-through data and structures from an original textual query space and an original visual image space. As such, click-through-based cross-view learning can 1) reduce (and potentially minimize) the distance between mappings of textual queries and visual images in the latent subspace, and 2) preserve inherent structure from the original textual query and visual image spaces in the latent subspace. In these cases, the latent subspace can be considered a click-through-based structured latent subspace.
The latent subspace mapped using click-through-based cross-view learning techniques can be used to improve visual image search results with textual queries. For example, relevance scores (e.g., similarities) between the textual queries and the visual images can be determined based on the distances in the mapped latent subspace. In some cases, the relevance scores can provide improved search results for visual images from textual queries, and a visual image search list can be returned for a textual query by sorting the relevance scores. In some cases, click-through-based cross-view learning techniques can achieve improvements over other methods (e.g., other subspace learning techniques) in terms of relevance of textual query to visual image results. Additionally, click-through-based cross-view learning techniques can reduce feature dimension by several orders of magnitude (e.g., from thousands to tens) compared with that of the original textual query and/or visual image spaces, producing memory savings compared to existing search systems.
To summarize, textual queries and visual images can be projected into a latent subspace using click-through-based cross-view learning techniques. The textual queries and visual images within the latent subspace can be mapped. Distances between relevant textual queries and visual images can be reduced, and structure from original textual query and visual image spaces can be preserved. The mapped latent subspace can be used to determine relevance scores for visual images corresponding to textual queries, and an image search list can be returned for a textual query by sorting the relevance scores.
As shown in
Also as shown in
In some implementations, the graphical arrangement of the links (e.g., link 114, link 116) between individual textual queries 104 and/or the thicknesses/strengths of the links can constitute a structure of the textual query space 102. Similarly, the visual image space 106 can have a structure that can be represented by the graphical arrangement of the links (e.g., link 118, link 120) between individual visual images 108 and/or the thicknesses/strengths of the links.
In example click-through-based cross-view learning scenario 100, the click-through bipartite graph 110 can include click-through data 122 (e.g., “crowdsourced” human intelligence, click counts). The click-through data 122 can be associated with edges between individual textual queries 104 and individual visual images 108 (e.g., textual query-visual image pairs) in the click-through bipartite graph 110, indicating that a user clicked an individual visual image in response to an individual textual query. For example, in
In some implementations, the click-through data 122 and the structures from the textual query space 102 and the visual image space 106 can be used to generate click-through-based cross-view learning mapping functions 112. The mapping functions 112 can be used to project (e.g., map) the textual queries 104 and the visual images 108 into a latent subspace, which will be described relative to
Referring to
In some implementations, relevance scores (RS) 204 between textual query 104-visual image 108 pairs can be directly computed based on their mappings in the latent subspace 200. In some cases, the relevance scores can be the distances between the textual query-visual image pairs in the latent subspace.
The calculated relevance scores 204 for textual query 104-visual image 108 pairs can be sorted. Based on the sorted relevance scores, a ranked image search list 206 can be returned for any given textual query 104. For example, in
Referring to
Additionally, relevance scores 204 can be computed between two textual queries 104 and/or between two visual images 108 based on distances mapped in the latent subspace 200. The mapped latent subspace can therefore be useful for comparing relevance between two textual queries, between two visual images, and/or between textual query-visual image pairs.
Note that in the example shown in
In scenario 100 shown in
A second click-through-based cross-view learning scenario will now be described. In this second scenario, a click-through bipartite graph (such as click-through bipartite graph 110 as described relative to
In this example, G=(V, ε) can denote a click-through bipartite graph. V=Q ∪ V can be a set of vertices, which can consist of a textual query set Q and a visual image set V. ε can be a set of edges between textual query vertices and visual image vertices. A number associated with an edge can represent a number of times a visual image was clicked in image search results of a particular textual query. In some cases, there can be n triads {qi, vi, ci}i=1n generated from the click-through bipartite graph, where ci can be the individual click-through data points (e.g., click counts) of visual image vi in response to textual query qi. In this case, Q={q1, q2, . . . , qn}T ∈ ℝn×dq and V={v1, v2, . . . , vn}T ∈ ℝn×dv can denote the feature matrices of the textual queries and the visual images, respectively, where dq and dv can be the dimensionalities of the original textual query and visual image spaces.
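For illustration only, the following minimal Python/NumPy sketch shows one way the matrices Q and V and the click counts could be assembled from such triads. The `embed_query` and `embed_image` feature extractors and the triad format are hypothetical placeholders, not part of the described implementations.

```python
import numpy as np

def build_training_matrices(triads, embed_query, embed_image):
    """Assemble Q (n x dq), V (n x dv), and click counts c (length n) from
    {query, image, click} triads drawn from a click-through bipartite graph.
    `embed_query`/`embed_image` are assumed feature extractors returning
    fixed-length vectors."""
    Q = np.vstack([embed_query(q) for q, _, _ in triads])
    V = np.vstack([embed_image(v) for _, v, _ in triads])
    c = np.array([click for _, _, click in triads], dtype=float)
    return Q, V, c
```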
A low-dimensional, latent subspace (e.g., common subspace) can exist for representation of textual queries and visual images. A linear mapping function can be derived from the latent subspace:
f(qi)=qiWq, and f(vi)=viWv (1)
where d can be the dimensionality of the latent subspace, and Wq ∈ ℝdq×d and Wv ∈ ℝdv×d can be the linear mapping matrices that project the textual query features and the visual image features into the latent subspace, respectively.
To measure relations between the textual query and visual image content, one example can be to measure a distance between their mappings in the latent subspace as:
Σi=1n ci∥qiWq−viWv∥2=tr((QWq−VWv)TC(QWq−VWv)),  (2)

where C can be a diagonal matrix whose diagonal entries are the click counts ci, and tr(•) can denote a trace function. The matrices Wq and Wv can have orthogonal columns, i.e., WqTWq=WvTWv=I, where I can be an identity matrix. The constraints can restrict Wq and Wv to converge to reasonable solutions rather than go to 0, which can be essentially meaningless in practice.
Specifically, a click number (e.g., click count) of a textual query-visual image pair can be viewed as an indicator of their relevance. In the case of image search, search engines can display results as thumbnails. Users can see an entire image before clicking on it. As such, barring distracting images and user intent changes, users predominantly tend to click on images that are relevant to their query. Therefore, click data can serve as a reliable connection between textual queries and visual images. An underlying assumption can be that the higher the click number, the smaller the distance between the textual query and the visual image in the latent subspace.
To learn the latent subspace across different views, the distance can be intuitively incorporated as a regularization on the mapping matrices Wq and Wv weighted by the click number.
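As a rough illustration, and assuming the click-weighted distance of Eq. (2), this cross-view term could be evaluated along the following lines (a minimal sketch, not a normative implementation):

```python
import numpy as np

def cross_view_distance(Q, V, Wq, Wv, c):
    """Click-weighted cross-view distance: sum_i c_i * ||q_i Wq - v_i Wv||^2."""
    diff = Q @ Wq - V @ Wv                      # per-pair differences in the latent subspace
    return float(np.sum(c * np.sum(diff * diff, axis=1)))
```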
Structure preservation or manifold regularization can be effective for semi-supervised learning and/or multiview learning. This regularizer can indicate that similar points (e.g., similar textual queries) in an original space (e.g., a textual query space) should be mapped to relatively close positions in the latent subspace. An estimation of underlying structure can be measured by appropriate pairwise similarity between training samples. Specifically, the estimation can be given by:
½Σi,jSijq∥qiWq−qjWq∥2+½Σi,jSijv∥viWv−vjWv∥2,  (3)

where Sq ∈ ℝn×n and Sv ∈ ℝn×n can denote affinity matrices defined on the textual queries and visual images, respectively. Under the structure preservation criterion, it is reasonable to reduce and potentially minimize Eq. (3), because it might incur a heavy penalty if two similar examples are mapped far away from each other.
The affinity matrices Sq and Sv can be defined many ways. In this case, the elements can be computed by Gaussian functions, for example:
Sijt=exp(−∥ti−tj∥2/σt2) if ti ∈ Nk(tj) or tj ∈ Nk(ti), and Sijt=0 otherwise,  (4)

where t ∈ {q, v} for simplicity, e.g., t can be replaced by any one of q and v. σt can be a bandwidth parameter. Nk(ti) can represent a set of k nearest neighbors of ti.
By defining the graph Laplacian Lt=Dt−St for t ∈ {q, v}, where Dt can be a diagonal matrix with its elements defined as Diit=ΣjSijt, Eq. (3) can be rewritten as:
tr((QWq)TLq(QWq))+tr((VWv)TLv(VWv)). (5)
By reducing and potentially minimizing this term, a similarity between examples in the original space can be preserved in the learned latent subspace. Therefore, this regularizer can be added in the framework of the click-through-based cross-view learning technique, potentially for optimization of the technique.
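A minimal sketch of this structure-preservation machinery is shown below, assuming Gaussian k-nearest-neighbor affinities as in Eq. (4) and the Laplacian form of Eq. (5); parameter choices and helper names are illustrative only.

```python
import numpy as np

def knn_gaussian_affinity(X, k=5, sigma=1.0):
    """Symmetric k-NN affinity matrix with Gaussian weights (Eq. (4))."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d2[i])[1:k + 1]                    # k nearest neighbors, excluding self
        S[i, neighbors] = np.exp(-d2[i, neighbors] / sigma ** 2)
    return np.maximum(S, S.T)                                     # symmetrize

def graph_laplacian(S):
    """Graph Laplacian L = D - S, with D the diagonal degree matrix."""
    return np.diag(S.sum(axis=1)) - S

def structure_term(Q, V, Wq, Wv, Lq, Lv):
    """Structure-preservation regularizer of Eq. (5)."""
    return float(np.trace((Q @ Wq).T @ Lq @ (Q @ Wq))
                 + np.trace((V @ Wv).T @ Lv @ (V @ Wv)))
```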
An overall objective function can integrate the distance between views in Eq. (2) and the structure preservation in Eq. (5). Hence the following optimization (e.g., potentially optimizing) problem may be obtained:
minWq,Wv Σi=1n ci∥qiWq−viWv∥2+λ[tr((QWq)TLq(QWq))+tr((VWv)TLv(VWv))], s.t. WqTWq=WvTWv=I,  (6)

where λ can be the tradeoff parameter. The first term is the cross-view distance, while the second term represents structure preservation.
For simplicity, L(Wq,Wv) can be denoted as the objective function in Eq. (6). Thus, the optimization problem can be rewritten as:
minWq,Wv L(Wq,Wv), s.t. WqTWq=WvTWv=I.  (7)

The optimization above can be a non-convex problem. Nevertheless, the gradient of the objective function with respect to Wq and Wv can be easily obtained, and can be given by:

∇WqL=2QTC(QWq−VWv)+2λQTLqQWq, ∇WvL=2VTC(VWv−QWq)+2λVTLvVWv,  (8)

where C can be the diagonal matrix of click counts from Eq. (2).
In some implementations, Eq. (7) can represent a difficult non-convex problem due to the orthogonal constraints. In response, in some cases a gradient descent optimization procedure can be used with curvilinear search for a local optimal solution.
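A minimal sketch of the objective and gradient computation, assuming the forms of Eqs. (6) and (8), might look as follows; it is illustrative only and makes no attempt at the efficiency considerations discussed later.

```python
import numpy as np

def objective_and_grads(Q, V, Wq, Wv, c, Lq, Lv, lam):
    """Objective L(Wq, Wv) of Eq. (6) and its gradients with respect to Wq and Wv."""
    C = np.diag(c)
    diff = Q @ Wq - V @ Wv
    obj = (np.sum(c * np.sum(diff * diff, axis=1))
           + lam * (np.trace((Q @ Wq).T @ Lq @ (Q @ Wq))
                    + np.trace((V @ Wv).T @ Lv @ (V @ Wv))))
    Gq = 2 * Q.T @ (C @ diff) + 2 * lam * Q.T @ Lq @ Q @ Wq
    Gv = -2 * V.T @ (C @ diff) + 2 * lam * V.T @ Lv @ V @ Wv
    return float(obj), Gq, Gv
```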
In individual iterations of the gradient descent procedure, given the current feasible mapping matrices {Wq, Wv} and their corresponding gradients {Gq=∇WqL(Wq,Wv), Gv=∇WvL(Wq,Wv)}, the skew-symmetric matrices Pq and Pv can be defined as:
Pq=GqWqT−WqGqT, Pv=GvWvT−WvGvT.  (9)
A new point can be searched as a curvilinear function of a step size τ, such that:

Fq(τ)=(I+(τ/2)Pq)−1(I−(τ/2)Pq)Wq, Fv(τ)=(I+(τ/2)Pv)−1(I−(τ/2)Pv)Wv.  (10)
Then, it can be verified that Fq(τ) and Fv(τ) lead to several characteristics. The matrices Fq(τ) and Fv(τ) can satisfy (Fq(τ))TFq(τ)=(Fv(τ))TFv(τ)=I for all τ ∈ ℝ. The derivatives with respect to τ can be given as:

Fq′(τ)=−(I+(τ/2)Pq)−1Pq((Wq+Fq(τ))/2), Fv′(τ)=−(I+(τ/2)Pv)−1Pv((Wv+Fv(τ))/2).  (11)
In particular, some implementations can obtain Fq′ (0)=−PqWq and Fv′(0)=−PvWv. Then, {Fq(τ), Fv(τ)}τ≧0 can be a descent curve. Some implementations can use the classical Armijo-Wolfe based monotone curvilinear search algorithm to determine a suitable step τ as one satisfying the following conditions:
L(Fq(τ),Fv(τ))≦L(Fq(0),Fv(0))+ρ1τLτ′(Fq(0),Fv(0)),

Lτ′(Fq(τ),Fv(τ))≧ρ2Lτ′(Fq(0),Fv(0)), (12)
where ρ1 and ρ2 can be two parameters satisfying 0<ρ1<ρ2<1. Lτ′(Fq(τ),Fv(τ)) can be the derivative of L with respect to τ and can be calculated by:

Lτ′(Fq(τ),Fv(τ))=tr((Rq(τ))TFq′(τ))+tr((Rv(τ))TFv′(τ)),  (13)

where Rt(τ)=∇WtL(Fq(τ),Fv(τ)) for t ∈ {q, v} can denote the gradient of the objective with respect to Wt evaluated at the point (Fq(τ),Fv(τ)).
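One possible shape of a single iteration is sketched below. It uses the curvilinear update of Eq. (10) and, for brevity, a simple backtracking rule in place of the full Armijo-Wolfe conditions of Eq. (12); the `obj_fn` callback (returning the objective value and its two gradients, for example a partial application of the `objective_and_grads` sketch above) is an assumption of this sketch.

```python
import numpy as np

def cayley_step(W, P, tau):
    """Curvilinear update F(tau) = (I + tau/2 P)^(-1) (I - tau/2 P) W (Eq. (10)).
    Because P is skew-symmetric, F(tau) keeps orthogonal columns for every tau."""
    n = W.shape[0]
    A = np.eye(n) + 0.5 * tau * P
    return np.linalg.solve(A, (np.eye(n) - 0.5 * tau * P) @ W)

def curvilinear_iteration(Wq, Wv, obj_fn, tau=1.0, beta=0.5, max_tries=10):
    """One descent iteration that preserves the orthogonality constraints.
    `obj_fn(Wq, Wv)` is assumed to return (objective, grad_Wq, grad_Wv)."""
    obj, Gq, Gv = obj_fn(Wq, Wv)
    Pq = Gq @ Wq.T - Wq @ Gq.T          # skew-symmetric matrices of Eq. (9)
    Pv = Gv @ Wv.T - Wv @ Gv.T
    for _ in range(max_tries):
        Wq_new = cayley_step(Wq, Pq, tau)
        Wv_new = cayley_step(Wv, Pv, tau)
        if obj_fn(Wq_new, Wv_new)[0] < obj:
            return Wq_new, Wv_new
        tau *= beta                      # shrink the step and retry
    return Wq, Wv
```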
After the optimization (e.g., potential optimization) of Wq and Wv, the linear mapping functions defined in Eq. (1) can be obtained. With this, originally incomparable textual query and visual image modalities can become comparable. Specifically, given a test textual query-visual image pair (q̂ ∈ ℝdq, v̂ ∈ ℝdv), the distance between their mappings in the latent subspace can be computed as:
r(q̂,v̂)=∥q̂Wq−v̂Wv∥2. (15)
This value can reflect how relevant the textual query is to the visual image, and/or how well the textual query describes the visual image, with lower numbers indicating higher relevance. For any textual query, sorting by its corresponding values for all its associated visual images can give the retrieval ranking for these visual images. In this case, the algorithm is given in Algorithm 1.
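A minimal ranking sketch, assuming learned Wq and Wv and feature vectors in the original spaces, might look like the following:

```python
import numpy as np

def rank_images(q_vec, image_vecs, Wq, Wv):
    """Rank candidate images for one textual query by the distance of Eq. (15);
    a smaller distance indicates higher relevance."""
    q_lat = q_vec @ Wq                        # project the query into the latent subspace
    v_lat = image_vecs @ Wv                   # project the candidate images
    distances = np.linalg.norm(v_lat - q_lat, axis=1)
    return np.argsort(distances)              # indices ordered from most to least relevant
```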
The time complexity of the click-through-based cross-view learning technique can depend on computation of Gq, Gv, Pq, Pv, Fq(τ), Fv(τ), and Lτ′(Fq(τ),Fv(τ)). The computation complexity of Gq and Gv can be O(n2×dq) and O(n2×dv), respectively. Pq and Pv can take O(dq2×d) and O(dv2×d).
A matrix inverse (I+(τ/2)Pq)−1 can dominate the computation of Fq(τ) and Fv(τ) in Eq. (10). By forming Pq and Pv as an outer product of two low-rank matrices, the inverse computation cost can decrease significantly. As defined in Eq. (9), Pq=GqWqT−WqGqT and Pv=GvWvT−WvGvT, so Pq and Pv can be equivalently rewritten as Pq=XqYqT and Pv=XvYvT, where Xq=[Gq,Wq], Yq=[Wq,−Gq] and Xv=[Gv,Wv], Yv=[Wv,−Gv]. According to a Sherman-Morrison-Woodbury formula, for example:
(A+αXYT)−1=A−1−αA−1X(I+αYTA−1X)−1YTA−1,
the matrix inverse (I+(τ/2)XqYqT)−1 can be re-expressed as:

(I+(τ/2)XqYqT)−1=I−(τ/2)Xq(I+(τ/2)YqTXq)−1YqT.
Furthermore, Fq(τ) can be rewritten as:

Fq(τ)=Wq−τXq(I+(τ/2)YqTXq)−1YqTWq.
For Fv(τ), the click-through-based cross-view learning technique can get the corresponding conclusion. Since d<<dq can be typical in some cases, the cost of inverting the 2d×2d matrix (I+(τ/2)YqTXq) can be much lower than inverting the dq×dq matrix (I+(τ/2)Pq). The inverse of (I+(τ/2)YqTXq)
can take O(d3), thus the computation complexity of Fq(τ) can be O(dqd2)+O(d3). Similarly, Fv(τ) can be O(dvd2)+O(d3). The work of computing Lτ′(Fq(τ),Fv(τ)) can have a cost of O(n2×dq)+O(n2×dv)+O(dqd2)+O(dvd2)+O(d3).
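For illustration, the low-rank Sherman-Morrison-Woodbury form could be realized roughly as follows; the sketch assumes the curvilinear update of Eq. (10) and solves only a 2d×2d system instead of a dq×dq (or dv×dv) one.

```python
import numpy as np

def cayley_step_lowrank(W, G, tau):
    """F(tau) computed through the Sherman-Morrison-Woodbury identity, using the
    low-rank factorization P = X Y^T with X = [G, W] and Y = [W, -G]."""
    X = np.hstack([G, W])                                 # d_t x 2d
    Y = np.hstack([W, -G])                                # d_t x 2d
    small = np.eye(X.shape[1]) + 0.5 * tau * (Y.T @ X)    # only a 2d x 2d matrix is inverted
    return W - tau * X @ np.linalg.solve(small, Y.T @ W)
```

In this form the dominant costs are matrix products involving d_t×d and 2d×2d matrices, consistent with the O(dqd2)+O(d3) estimate above.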
As d<<dq,dv<<n, the overall complexity of Algorithm 1 can be Tmax×T×O(n2×max(dq,dv)), where T can be the number of searches for an appropriate τ satisfying the Armijo-Wolfe conditions, which can be less than ten in some cases. Given training of Wq and Wv on one million {query, image, click} triads with dv=1,024 and dq=10,000, for example, this algorithm can take around 32 hours on a server with an Intel E5-2665 2.40 GHz CPU and 128 GB RAM.
To summarize, click-through-based cross-view learning techniques can learn the multi-view distance between a textual query and a visual image by leveraging both click-through data and subspace learning techniques. The click-through data can represent the click relations between textual queries and visual images, while subspace learning can aim to learn a common latent subspace between multiple modalities. Click-through-based cross-view learning techniques can be used to solve the problem of seemingly incomparable modalities in a principled way. Specifically, two different linear mappings can be used to project textual queries and visual images into the latent subspace. The mappings can be learned by jointly reducing the distance of observed textual query-visual image pairs on a click-through bipartite graph, and also preserving inherent structure in original spaces of the textual queries and visual images. Moreover, orthogonal assumptions on the mapping matrices can be made. Then, mappings can be obtained efficiently through curvilinear search. An l2 norm can be taken between the projections of the textual query and the visual image in the latent subspace as a distance function to measure the relevance of a textual query-visual image pair.
Although only the distance function between textual queries and visual images on the learned mapping matrices is presented in Algorithm 1, the optimization actually can also help learning of query-query and image-image distances. Similar to the distance function between a textual query and a visual image, the distance between a textual query and another textual query, or a visual image and another visual image, can be computed as:
r(q̂1,q̂2)=∥q̂1Wq−q̂2Wq∥2 and r(v̂1,v̂2)=∥v̂1Wv−v̂2Wv∥2,

respectively. Furthermore, the obtained distance can be applied for several information retrieval (IR) applications, e.g., query suggestion, query expansion, image clustering, image classification, etc.
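For completeness, these within-view distances could be computed directly from the learned mappings, for example:

```python
import numpy as np

def query_query_distance(q1, q2, Wq):
    """Distance between two textual queries in the learned latent subspace."""
    return float(np.linalg.norm(q1 @ Wq - q2 @ Wq))

def image_image_distance(v1, v2, Wv):
    """Distance between two visual images in the learned latent subspace."""
    return float(np.linalg.norm(v1 @ Wv - v2 @ Wv))
```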
As shown in
In some implementations, the example search dataset 302 can be used to train click-through-based cross-view learning techniques and/or other techniques. In some cases, the example search dataset can be a large-scale, click-based image dataset (e.g., the “Clickture” dataset). The example search dataset can include two parts, for example a training set and a development set. In one example, the training set can consist of many {query, image, click} triads (e.g., millions of triads), where “query” can be a textual word or phrase, “image” can be a base64 encoded JPEG image thumbnail (for example), and “click” can be an integer which is no less than one. In this example, there can be potentially millions of distinct queries and millions of unique images in the training set.
In the development set, there can be potentially thousands of {query, image} pairs generated from hundreds of queries, for example. In some cases, each image to a corresponding query can be manually annotated on a three-point ordinal scale: “Excellent,” “Good,” and “Bad.” The training set can be used for learning a latent subspace (such as latent subspace 200 described relative to
As shown in
In example use-case scenario 300, the words in textual queries can be taken as “word features.” For any textual query, words can be stemmed and/or certain words (e.g., stop words) can be removed. With word features, each textual query can be represented by a ‘tf’ (term frequency) vector in a textual query space (such as textual query space 102 shown in
Click-through-based Cross-view Learning (CCL), such as the implementation described above in Algorithm 1, can be compared to other example techniques in use-case scenario 300. The other example Techniques (A-D) can include:
Technique A: N-Gram support vector machine (SVM) Modeling, or N-Gram SVM
Technique B: Canonical Correlation Analysis (CCA)
Technique C: Partial Least Squares (PLS)
Technique D: Polynomial Semantic Indexing (PSI)
In example use-case scenario 300, N-Gram SVM can be considered a baseline without low-dimensional, latent subspace learning, thus in N-Gram SVM the relevance score can be predicted on an original visual image. For the other four techniques in this example, which include latent subspace learning, the dimensionality of the latent subspace can be in the range of {40, 80, 120, 160} in this implementation. The k nearest neighbors preserved in Eq. (4) can be selected within {100, 500, 1000, 1500, 2000}. The tradeoff parameter λ in the overall objective function can be set within {0.1, 0.2, . . . , 1.0}. Some implementations can set μ=0.3, ρ1=0.2, and ρ2=0.9 in the curvilinear search by using a validation set.
In this example, for performance evaluation of visual image search, a Normalized Discounted Cumulative Gain (NDCG) technique can be adopted, which can take into account a measure of multi-level relevancy as the performance metric. Given an image ranked list, the NDCG score at the depth of d in the ranked list can be defined by:
NDCG@d=Zd Σj=1d (2rj−1)/log2(1+j),  (16)

where rj={Excellent=3, Good=2, Bad=0} can be the manually judged relevance for each image with respect to the query. Zd can be a normalization factor chosen so that the score of d Excellent results is 1. The final metric can be the average of NDCG@d for all queries in the test set.
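An illustrative implementation of this metric, assuming the gain and discount choices of Eq. (16), could be:

```python
import numpy as np

def ndcg_at_d(relevances, d):
    """NDCG@d for one ranked list. `relevances` holds the judged grades
    (Excellent=3, Good=2, Bad=0) in ranked order; the normalizer corresponds
    to a list of d Excellent results."""
    rel = np.asarray(relevances[:d], dtype=float)
    gains = (2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2))
    ideal = (2.0 ** 3 - 1.0) / np.log2(np.arange(2, d + 2))      # d Excellent results
    return float(gains.sum() / ideal.sum())
```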
In this example, as the step τ can be chosen to satisfy the Armijo-Wolfe conditions to achieve an approximate minimizer of L(Fq(τ),Fv(τ)) in Algorithm 1, instead of finding the global minimum due to its computational expense, the average overall objective value of Eq. (6) for one textual query-visual image pair versus iterations can be depicted to illustrate the convergence of the algorithm. In some cases, the value can decrease as the iterations increase, at all of the tested dimensionalities of the latent subspace. Specifically, after 100 iterations, the average objective value between the query mapping and the image projection can be around 10 when the latent subspace dimension is 40. Thus, the experiment can verify that Algorithm 1 can provide improved results and potentially reach a reasonable local optimum.
In the example shown in
In some cases, higher relevancy as shown by the relevance scales 402 can indicate that a certain technique has performed better than another technique in terms of returning relevant visual image search results 400 for the given textual query. For instance, in the example shown in
In some cases, relevance can be considered a measure of how similar visual image search results are to each other for a given technique. For example, in the column representing results from Technique A, search result 400(3) appears as a visual image of a cobra (e.g., a snake, not a car). The visual image of the cobra is not similar to the other visual images in the column. The associated relevance scale 402(3) includes a non-shaded box. In another example, in the column representing results from Technique C, search result 400(4) appears as a visual image of an engine. The visual image of the engine is not similar to the other visual images in the column. The associated relevance scale 402(4) includes no shaded boxes. In these cases, Techniques A and C could be viewed as under-performing in part due to dissimilarity among their top search results. In some implementations, similarity of returned search results can be an effect of the structure preservation regularization term in the overall objective of the click-through-based cross-view learning technique, described above (e.g., Eq. (5)). The structure preservation regularization term can restrict similar images in the original visual image space (such as visual image space 106 in
In general, use of click-through data can help bridge a user intention gap for image search. The user intention gap can relate to difficulty in knowing a user's intent based on textual query keywords, especially for ambiguous queries. The user intention gap can lead to biasing or error in manual annotation of relevance of textual query-visual image pairs by human labelers. For example, given the textual query “mustang cobra,” experts can tend to label images of the animals “mustang” and/or “cobra” as highly relevant. However, empirical evidence suggests that most users entering the textual query “mustang cobra” to a search engine wish to retrieve images of a specific car named “Mustang Cobra.” The experts' labels might therefore be erroneous. Such factors can bias a training set, and human ranking can be considered sub-optimal. On the other hand, click-through data can provide an alternative to address the user intention gap problem. In an image search engine, users can browse visual image search results before clicking a specific visual image. A user's decision to click on a visual image is likely dependent on the relevancy of the visual image. Therefore, the click-through data can serve as reliable and implicit feedback for visual image search. Most of the clicked visual images might be relevant to the given textual query, as judged by the real users.
In the example use-case scenario 300, performance of the click-through-based cross-view learning (CCL) technique and example alternative Techniques A-D can be measured with the Normalized Discounted Cumulative Gain (NDCG) technique described above relative to Eq. (16).
In bar chart 600, the bars can represent NDCG scores averaged for over 1000 textual queries. In this case, the prediction for Technique A is performed on original visual image features of 1,024 dimensions, for example. For Techniques B-D and CCL, the performances are given by choosing 80 as the dimensionality of the latent subspace, in this case.
In the example shown in
The example results shown in bar chart 600 also show a performance gap between Techniques B and D. Though both example techniques attempt to learn linear mapping functions for forming a latent subspace, they differ in that Technique D learns a cosine similarity function, while Technique B learns a dot product. As indicated by the results shown in bar chart 600, increasing (e.g., potentially maximizing) a correlation between mappings in the latent subspace can lead to better performance. Moreover, Technique C, which utilizes click-through data as relative relevance judgments rather than absolute click numbers, can be superior to Technique B, but still shows lower NDCG scores than the CCL technique in this case. Another observation is that the performance gain by the CCL technique is almost consistent when going deeper into the ranked list, which can represent another confirmation of the effectiveness of the CCL technique in this case.
In some cases, the CCL technique is robust to changes in the dimensionality of the latent subspace. Stated another way, the CCL technique can be shown to outperform example Techniques A-D for different dimensionalities of the latent subspace. Thus, the example CCL techniques can provide a solution to the technical problem of identifying appropriate images for web queries. The solution can enhance the end user experience while effectively utilizing resources on the server side (e.g., providing meaningful results per processing cycle).
For purposes of explanation, devices 702(1-4) can be thought of as operating on a client-side 706 (e.g., they are client-side devices). Devices 702(5-6) can be thought of as operating on a server-side 708 (e.g., they are server-side devices, such as in a datacenter or server farm). The server-side devices can provide various remote services, such as search functionalities, for the client-side devices. In some implementations, each device 702 can include an instance of a text-image correlation component (TICC) 710. This is only one possible configuration and other implementations may include the server-side text-image correlation components 710(5-6) but eliminate the client-side text-image correlation components 710(1-4), for example. Other implementations may be accomplished on a single, self-contained device, such as on a single client-side device.
Text-image correlation component 710 can function in cooperation with application(s) layer 800 and/or operating system layer 802. For instance, text-image correlation component 710 can be manifest as an application or an application part. In one such example, the text-image correlation component 710 can be an application part of (or work in cooperation with) a search engine application 816.
From one perspective, individual devices 702 can be thought of as a computer. Processor 808 can execute data in the form of computer-readable instructions to provide a functionality. Data, such as computer-readable instructions and/or user-related data, can be stored on storage 806, such as storage that can be internal or external to the computer. The storage can include any one or more of volatile or non-volatile memory, hard drives, flash storage devices, and/or optical storage devices (e.g., CDs, DVDs etc.), among others. As used herein, the term “computer-readable media” can include signals. In contrast, the term “computer-readable storage media” excludes signals. Computer-readable storage media includes “computer-readable storage devices.” Examples of computer-readable storage devices include volatile storage media, such as RAM, and non-volatile storage media, such as hard drives, optical discs, and flash memory, among others.
In some configurations, individual devices 702 can include a system on a chip (SOC) type design. In such a case, functionality provided by the computer can be integrated on a single SOC or multiple coupled SOCs. One or more processors can be configured to coordinate with shared resources, such as memory, storage, etc., and/or one or more dedicated resources, such as hardware blocks configured to perform certain specific functionality. Thus, the term “processor” as used herein can also refer to central processing units (CPUs), graphical processing units (GPUs), controllers, microcontrollers, processor cores, or other types of processing devices.
Generally, any of the functions described herein can be implemented using software, firmware, hardware (e.g., fixed-logic circuitry), manual processing, or a combination of these implementations. The term “component” as used herein generally represents software, firmware, hardware, whole devices or networks, or a combination thereof. In the case of a software implementation, for instance, these may represent program code that performs specified tasks when executed on a processor (e.g., CPU or CPUs). The program code can be stored in one or more computer-readable memory devices, such as computer-readable storage media. The features and techniques of the component are platform-independent, meaning that they may be implemented on a variety of commercial computing platforms having a variety of processing configurations.
The text-image correlation component 710 can include a subspace mapping module (SMM) 818 and a relevance determination module (RDM) 820. Briefly, these modules can accomplish specific facets of text-image correlation. The subspace mapping module 818 can be involved in learning mapping functions that can be used to map a latent subspace. The relevance determination module 820 can be involved in determining relevance between the textual queries and the visual images.
In some implementations, the subspace mapping module 818 can use click-through data related to textual queries and visual images to learn mapping functions, such as described relative to
In some implementations, the relevance determination module 820 can use the mapping functions produced by the subspace mapping module 818 to project textual queries and/or visual images into a latent space, such as described relative to
Referring to
For example, referring again to the example in
In summary, a text-image correlation component can learn a click-through-based structured latent subspace for correlation of textual queries and visual images. The latent subspace can be mapped based on click-through data and structures of original spaces of the textual queries and the visual images. The relevance of the textual queries and the visual images can then be used to rank visual image search results in response to the textual queries.
Note that the user's privacy can be protected while implementing the present concepts by only collecting user data upon the user giving his/her express consent. All privacy and security procedures can be implemented to safeguard the user. For instance, the user may provide an authorization (and/or define the conditions of the authorization) on his/her device or profile. Otherwise, user information is not gathered and functionalities can be offered to the user that do not utilize the user's personal information. Even when the user has given express consent the present implementations can offer advantages to the user while protecting the user's personal information, privacy, and security and limiting the scope of the use to the conditions of the authorization.
As shown in
At block 904, method 900 can receive visual images from a visual image space. The visual image space can have a second structure. In some cases, the second structure can be representative of similarities between pairs of the visual images in the visual image space.
At block 906, method 900 can receive click-through data related to the textual queries and the visual images. In some cases, the click-through data can include click numbers representing a number of times an individual visual image is clicked in response to an individual textual query.
At block 908, method 900 can create a latent subspace. In some implementations, the latent subspace can be a low-dimensional common subspace that can be used to represent the textual queries and the visual images.
Viewed from one perspective, the latent subspace can be defined as a new space shared by multiple views by assuming that the input views are generated from this latent subspace. The dimensionality of the latent subspace can be lower than that of any input view, so subspace learning is effective in reducing the “curse of dimensionality.” The construction of the latent subspace can be a core component of some of the inventive aspects, and some of those aspects can come from the exploration of cross-view distance and structure preservation, which have not been previously attempted.
At block 910, method 900 can map the textual queries and the visual images in the latent subspace. The mapping can include determining distances between textual queries and the visual images in the latent subspace. In some cases the distances can be based at least in part on the click numbers described relative to block 906. In some cases the mapping can also include preservation of the first structure from the textual query space and the second structure from the visual image space.
At block 912, method 900 can determine relevance between the textual queries and the visual images based on the mapping. In some cases, the relevance can be determined between a first textual query and a second textual query, a first visual image and a second visual image, and/or the first textual query and the first visual image. In some cases, the relevance between textual queries and visual images can be determined based on the mapped distances in the latent subspace.
At block 1008, method 1000 can learn mapping functions that map the textual queries and the visual images into a click-through-based structured latent subspace based on the first structure, the second structure, and the click-through data. At block 1010, method 1000 can output the learned mapping functions.
At block 1012, method 1000 can use the learned mapping functions (and/or other mapping functions) to determine distances among the textual queries and the visual images in the click-through-based structured latent subspace.
At block 1014, method 1000 can sort results for new content based on the distances. New content can include a new textual query, a new visual image, two or more new textual queries or visual images, or content from other modalities, such as audio or video. For example, a new textual query may be received that is not one of the textual queries used to learn the mapping functions in blocks 1002-1008. In this case, the method can use the mapping functions and/or the learned latent subspace to determine relevance of visual images to the new textual query, and sort the visual images into ranked search results. In other examples, the results can be textual queries, other modalities such as audio or video, or a mixture of modalities. For example, a ranked search result list could include visual images, video, and/or audio results for the new content.
Method 1000 may be performed by a single device or by multiple devices. In one case, a single device, such as a device performing a search engine functionality, could perform blocks 1002-1014. In another case, a first device may perform some of the blocks, such as blocks 1002-1010, to produce the learned mapping functions. Another device, such as a device performing the search engine functionality, could leverage the learned mapping functions in performing blocks 1012-1014 to produce improved image results when users submit new search queries.
The described methods can be performed by the systems and/or devices described above, and/or by other devices and/or systems. The order in which the methods are described is not intended to be construed as a limitation, and any number of the described acts can be combined in any order to implement the method, or an alternate method. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof, such that a device can implement the method. In one case, the method is stored on computer-readable storage media as a set of instructions such that execution by a computing device causes the computing device to perform the method.
The description relates to click-through-based cross-view learning. In one example, a click-through-based structured latent subspace can be used to directly compare textual queries and visual images. In some implementations, a click-through-based cross-view learning method can include determining distances between textual query and visual image mappings in the latent subspace. The distances between the textual queries and the visual images can be weighted by click numbers from click-through data. The click-through-based cross-view learning method can also include preserving structure relationships between textual queries and visual images in their respective original feature spaces. In some cases, after the mapping of the latent subspace, a relevance between textual queries and visual images can be measured by their mappings. In other cases, relevance between two textual queries and/or between two visual images can be measured by their mappings. The relevance scores can be used to rank images and/or queries in search results.
Although techniques, methods, devices, systems, etc., pertaining to providing accurate search results are described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed methods, devices, systems, etc.
Number | Date | Country
---|---|---
62009080 | Jun 2014 | US