This application claims the priority benefit of Chinese Patent Application No. 201210123853.5, filed on Apr. 25, 2012, the contents of which are incorporated by reference herein in their entirety for all purposes.
The present disclosure generally relates to image-based information retrieval, and more particularly, to methods and systems for identifying one or more items from a scanned image and retrieving information relating to the identified one or more items.
People very often encounter items in the physical world that they would like to have more information about. For example, someone looking at a movie poster on a wall may want to find out more about the director and actors involved with the movie, such as their previous work. He may also want to see a preview of the movie and, if he likes the preview, find a nearby theater and buy a ticket to see the movie. Similarly, a person browsing in a book store may want to read reviews of a particular book or cross-shop the book at online book stores.
There are a number of existing ways to obtain information relating to items such as the movie poster or book. One way is to conduct a manual search using, for example, a browser or application-based search engine on a PC or mobile device. This process is usually tedious and slow because it requires the user to manually enter a descriptive search string. Also, it may only work well for text-based searches. It is usually difficult to run a search for an image without specialized software.
Another existing mechanism for retrieving information regarding an item is to scan a barcode (linear or matrix) associated with the item. The barcode can usually be found on or in close proximity to the item. It can be scanned using, for example, a dedicated scanner, such as a common barcode scanner, or a mobile device equipped with a camera and the required scanning application. However, there are certain limitations to scanning barcodes. For example, the amount of information retrievable from a barcode is usually limited. Scanning the barcode on a product in a supermarket may only provide the name and price of the product. More advanced barcodes, such as Quick Response (QR) codes, can provide a Web link, a name, contact information such as an address, phone number, or email address, and/or some other similar data type when scanned. Nevertheless, the information retrievable from these barcodes is typically limited to the information available in the corresponding backend system/database, such as an inventory management system of a supermarket. Such a system/database may not have all the information desired by the person interested in the item.
Radio Frequency Identification (RFID) technology is another mechanism for automatically identifying and tracking items via tags attached to them. RFID technology relies on radio-frequency electromagnetic fields to transfer data without contact. An RFID system typically requires an RFID tag to be attached to the item and a reader for reading data associated with a particular item from the corresponding tag. The reader can transmit the data to a computer system to be further processed. Nevertheless, RFID technology has the same shortcoming as barcodes in that only a relatively limited amount of information can be retrieved from reading the RFID tags. Furthermore, the fact that it requires special tags and readers makes it a less desirable solution for retrieving information, since most people do not carry an RFID reader with them.
Accordingly, information retrieval systems and methods that can provide a simpler and more user-friendly experience and have access to a large information repository for providing information relating to a wide range of items are highly desirable.
This generally relates to systems and methods for retrieving information relating to an item based on a scanned image of the item. In particular, the systems and methods can involve using a device, such as a smartphone, to capture a 2-dimensional image of an item and transmit the captured image to a server. The server can analyze the image against pre-stored data to determine a corresponding item associated with the image and obtain information relating to the item from a data repository such as the Internet. The information can then be transmitted from the server to the device.
In one embodiment, an information-providing system is disclosed. The information-providing system can include an image-receiving module that receives an image from a device, an item-selection module that identifies an item based on the received image, an information-retrieving module that retrieves information relating to the item, and a data transmitting module that transmits the retrieved information to the device, wherein the item is identified by matching one or more features of the received image with features identified from a training image associated with the item.
In another embodiment, the system can also include a training image processing module that identifies one or more features from at least one training image. In another embodiment, the training image processing module can further include: a keypoint-identifying module that identifies at least one keypoint of the received image, a descriptor-generating module that generates a descriptor for each of the at least one keypoint, a feature ID generating module that quantizes a descriptor to generate at least one feature ID, and a database-access module that stores the at least one feature ID and its corresponding item(s) in a database. In another embodiment, the system can also include a database for storing the features identified from the training image. In another embodiment, the database can store the features and one or more items associated with each of the features. In another embodiment, the information is retrieved from the Internet. In another embodiment, the identified item is a book and the received image includes a book cover of the book.
In yet another embodiment, the item-selection module can further include: a keypoint-identifying module that identifies at least one keypoint of the received image, a descriptor-generating module that generates a descriptor for each of the at least one keypoint, a feature ID generating module that quantizes a descriptor to generate at least one feature ID, an item-selecting module that selects at least one item corresponding to each of the at least one feature ID, a hit-counting module that determines a total number of hits for each of the selected items, and a top item selection module that selects the one of the selected items that best matches the received image. In yet another embodiment, the item-selection module includes: a threshold module that determines whether the number of hits for an item exceeds a predetermined threshold, and an item-eliminating module that eliminates an item if the number of hits for the item does not exceed the predetermined threshold. In yet another embodiment, the item-selection module includes a geometric verification module that performs geometric verification on the received image and the training image associated with the best-matching item.
FIGS. 3a-3c are screen shots on the requesting device illustrating exemplary user interfaces for retrieving information based on a scanned image, according to an embodiment of the disclosure.
In the following description of preferred embodiments, reference is made to the accompanying drawings which form a part hereof, and in which it is shown by way of illustration specific embodiments in which the disclosure can be practiced. It is to be understood that other embodiments can be used and structural changes can be made without departing from the scope of the embodiments of this disclosure.
This generally relates to systems and methods for retrieving information relating to an item based on a scanned image of the item. In particular, the systems and methods can involve using a device, such as a smartphone, to capture a 2-dimensional image of an item and transmit the captured image to a server. The server can analyze the image against pre-stored data to determine a corresponding item associated with the image and obtain information relating to the item from a data repository such as the Internet. The information can then be transmitted from the server to the device.
Although only one device 100 is shown to be connected to the server 102, it should be understood that additional devices can also be connected to the server 102 and request information in the same fashion. The server 102 can be any suitable computing device or devices capable of receiving image data from one or more devices 100, identifying an item based on the image data, retrieving additional data regarding the identified item from internal and/or external sources such as the Internet 104, and transmitting the retrieved data back to the requesting device(s). Various methods can be employed by the server to extract information (e.g., features such as the color, brightness, and/or relative position of the pixels) from the image and identify, based on the extracted information, one or more items associated with the image (e.g., an item depicted in the image). The information extracted from an image can identify features that, alone or in combination, can be used to identify the image among a collection of images.
In some embodiments, this can be done by the server querying a database containing pre-stored feature IDs and the corresponding item(s). The feature IDs in the database can be generated from a collection of training images of various items. As referred to hereinafter, a training image can be any existing image depicting one or more items. One or more features can be identified from a training image by using a feature-extracting mechanism, an example of which will be discussed in detail below. A unique ID (e.g., a feature ID) for each of these features can be stored in the database along with the names or IDs of one or more items associated with the training image from which the features were identified. Essentially, the database can store pairings of various features and items.
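By way of illustration only, such a pairing of features and items can be thought of as an inverted index. The following minimal Python sketch (the function and variable names, and the numeric feature IDs, are illustrative assumptions and not part of the disclosure) shows one way such a database could be populated from training images:

```python
from collections import defaultdict

# Inverted index pairing feature IDs with the items whose training
# images produced those features (an illustrative in-memory stand-in
# for the database described above).
feature_index = defaultdict(set)

def add_training_image(item_name, feature_ids):
    """Record every feature ID extracted from a training image of an item."""
    for fid in feature_ids:
        feature_index[fid].add(item_name)

# Hypothetical feature IDs; two book covers sharing artwork would
# share some of the same IDs.
add_training_image("Harry Potter and the Chamber of Secrets", [3, 17, 42])
add_training_image("Harry Potter and the Goblet of Fire", [3, 17, 99])
```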
When a request (or a scanned image) from a requesting device is received by the server, the server can perform the same feature extracting process on the scanned image to generate one or more feature IDs from the scanned image. These feature IDs can then be used to query the database to find the matching item(s). The item with the best matching score can be identified as the item associated with the particular scanned image.
Referring again to FIG. 2, an image of an item of interest, such as a book cover, can first be captured using the camera of the requesting device (step 201).
After the image is captured, the requesting device can transmit the image to a server to identify the item depicted in or associated with the image (step 202). For example, the scanned book cover can be sent to the server to retrieve information regarding the particular book. The image-transmitting step can take place automatically after the requesting device determines that the scanning operation was successful. Additionally or alternatively, the device may perform one or more quality-assurance steps to ensure that the captured image meets certain criteria in terms of clarity, resolution, size, brightness, etc. so that it can be properly analyzed by the server. If the image does not meet one or more of the criteria, the device can prompt the user to scan the image again. In some embodiments, the user has to manually transmit the image to the server. In some embodiments, the user can also enter additional information, such as keywords specifying the type of information to be returned by the server, to be transmitted with the image to the server. In the book cover example, the scanned image of the book cover can be transmitted to the server. The user may optionally enter keywords, such as “author” and/or “cover designer,” to be transmitted with the image. These keywords may direct the server to search for information specifically relating to the author and/or cover designer of this particular book after the server identifies the book from the scanned image.
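The disclosure does not prescribe how such quality-assurance checks are implemented; a minimal sketch using OpenCV (an assumed library choice, with arbitrary illustrative thresholds) might test resolution and sharpness as follows:

```python
import cv2

def image_is_usable(image_path, min_width=480, min_height=480,
                    blur_threshold=100.0):
    """Reject captures that are too small or too blurry to analyze."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return False
    height, width = img.shape
    if width < min_width or height < min_height:
        return False
    # Variance of the Laplacian is a common sharpness proxy: a low
    # variance means few strong edges, i.e., a likely blurry capture.
    return cv2.Laplacian(img, cv2.CV_64F).var() >= blur_threshold
```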
After receiving the image transmitted from the requesting device, the server can identify one or more features of the image (step 203). In one embodiment, the one or more features can be represented by one or more feature IDs. For example, the image of the book cover may be processed by the server to extract certain features defined by, for example, the color, brightness, and/or relative position of the pixels of the image. Each of these features can then be quantized into a unique feature ID.
Next, the server can identify an item based on the feature IDs calculated from the image (step 204). This can involve looking up items corresponding to each of the feature IDs in a database and ranking these items based on the number of feature IDs to which they correspond. The item ranked the highest can be determined to be the best match for the scanned image (e.g., most likely to be the item depicted in the image). Steps 203 and 204 will be described in further detail below.
After the item is determined, the server can then search on the Internet (or another data repository) for information relating to the item (step 205). In the book cover example, the server can determine that the cover image corresponds to the cover of one of the Harry Potter books. The server can then run a search on the Internet for information, such as title and plot summary of all Harry Potter books, readers' reviews of the book, information relating to the graphic designer who designed the book cover, and online book stores offering this particular book for sale. Essentially, any information regarding the book that is available on the Internet can be found by the server and made available to the user device. In the embodiments where the scanned image is transmitted to the server along with keywords entered by the user, the search can incorporate these keywords to provide results tailored to the user's interest.
The server can then transmit the search results to the requesting device (step 206). The results can be displayed on the screen of the device for user browsing. They can also include web links to other websites where additional information can be available to the user. For example, the user may follow one of the links to an online book store to buy a copy of the Harry Potter book. Additionally or alternatively, he may also purchase and download an electronic copy of the book onto his device so that he can start reading right away.
The exemplary embodiments discussed above can provide a much simpler and more effective way of retrieving information than existing mechanisms, including those described in the Background section. First, the disclosed methods and systems do not require any customized hardware, such as a barcode scanner. Any device with a camera can be used to scan an image of an item and receive information regarding the particular item from the server. Furthermore, the item of interest does not have to come with a barcode, QR code, or any other type of code to enable the information retrieval process. All it takes is for the user to scan, or capture in another way, a two-dimensional image of the item using the camera on his mobile phone to have access to potentially all kinds of information relating to the item. From the user's end, the process could hardly be more straightforward. Another advantage of the disclosed systems and methods is that the backend server can have access to the whole Internet to find information relating to the item. This can overcome the limitations of existing systems in which only a limited amount of information (e.g., the information available in a closed system) can be returned in response to an inquiry based on, for example, a QR code.
FIGS. 3a-3c are exemplary screen shots on the requesting device illustrating user interfaces for retrieving information based on a scanned image, according to embodiments of the disclosure. In particular, FIG. 3a illustrates an exemplary interface 300 from which an image scan can be initiated.
If the user hits the “Image Scan” softkey on the interface 300 of FIG. 3a, the device can activate its camera and display a scanning interface 312 (FIG. 3b) for capturing an image of the item of interest. The captured image can then be transmitted to the server for processing.
The processes performed by the server will be discussed in detail in later paragraphs. After the server finds relevant information regarding the item associated with the scanned image, it can transmit this information in a specific format to the user device.
As apparent from the exemplary user interfaces 300, 312, 318 of FIGS. 3a-3c, only a few simple operations are required of the user to retrieve information relating to an item of interest based on a scanned image.
As mentioned above, when the server receives the scanned image from the requesting device, the server can identify various features of the image. The server can include a database of features extracted from a collection of known images (i.e., training images) of items. The server can then match the features extracted from the scanned image with the features from the collection of training images to identify a best-matching training image. Because each training image can be associated with at least one known item, the server can identify one or more items relating to the scanned image once one or more matching training images are found.
First, the processing of training images by the server to generate and store a list of features and their associated items is discussed.
In this embodiment, to extract one or more features from a training image, the server can first identify a number of keypoints of the image (step 402). Specifically, scale-invariant feature transform (SIFT) features can be extracted from the training image. SIFT features can be invariant with respect to, for example, rotation, scaling, and illumination changes of the image. They can also be relatively stable with respect to, for example, changes in viewing angle, affine transformation, noise, and other factors that may affect an image. In one embodiment, SIFT feature extraction can be carried out as follows.
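For illustration, keypoint detection and descriptor computation of this kind can be sketched with OpenCV's SIFT implementation (an assumed library choice; the disclosure does not mandate any particular implementation):

```python
import cv2

def extract_sift_features(image_path):
    """Detect SIFT keypoints and compute their descriptors."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()  # available in opencv-python >= 4.4
    keypoints, descriptors = sift.detectAndCompute(img, None)
    # descriptors is an N x 128 float32 array, one row per keypoint.
    return keypoints, descriptors
```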
First, scale space extrema can be detected. To effectively extract stable keypoints, the training image can be convolved with Gaussian kernels at multiple scales, and Difference-of-Gaussian (DoG) images can be computed from the differences of adjacent scales.
D(x,y,σ)=(G(x,y,kσ)−G(x,y,σ))*I(x,y)=L(x,y,kσ)−L(x,y,σ), where I(x,y) is the input image, G(x,y,σ) is a Gaussian kernel with scale σ, k is a constant multiplicative factor between adjacent scales, and L(x,y,σ)=G(x,y,σ)*I(x,y) is the Gaussian-smoothed image.
This can be achieved by generating Gaussian image pyramids. The image pyramids can be in P groups, and each group can include S layers. The layers of the first group can be generated by convolving the original image with Gaussian kernels at multiple scales (adjacent layers can have a scale difference of factor k). The next group can be generated by downsampling the previous group of images. A DoG pyramid can then be generated from the differences between adjacent layers of the Gaussian image pyramids.
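A minimal sketch of one such group of layers, assuming NumPy and OpenCV and illustrative parameter values, could be:

```python
import cv2
import numpy as np

def dog_group(img, num_layers=5, k=2 ** 0.5, sigma=1.6):
    """Build one group of Gaussian-blurred layers and their differences.

    The next group would be built the same way from a downsampled
    (e.g., half-resolution) version of the image.
    """
    gaussians = [cv2.GaussianBlur(img, (0, 0), sigma * (k ** i))
                 for i in range(num_layers)]
    # Subtracting adjacent layers yields the DoG images D(x, y, sigma).
    return [g2.astype(np.float32) - g1.astype(np.float32)
            for g1, g2 in zip(gaussians, gaussians[1:])]
```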
To locate the scale space extrema, each sampling point (e.g., pixel) in the DoG pyramid can be compared to its eight adjacent pixels at the same scale and the nine neighboring pixels in each of the two adjacent scales (a total of 8+9*2=26 pixels). If the value of the pixel is less than or greater than the values of all 26 neighboring pixels, the pixel can be identified as a local extremum (i.e., a keypoint).
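The 26-neighbor comparison can be sketched as follows, assuming the DoG layers of one group have been stacked into a single 3-dimensional NumPy array (an illustrative arrangement, not prescribed by the disclosure):

```python
import numpy as np

def is_local_extremum(dog_stack, layer, y, x):
    """Test a pixel against its 26 neighbors in scale space.

    dog_stack: array of shape (num_layers, height, width); layer, y,
    and x must not lie on the border of the stack.
    """
    value = dog_stack[layer, y, x]
    cube = dog_stack[layer - 1:layer + 2, y - 1:y + 2, x - 1:x + 2]
    # value is itself part of cube, so equality with the max (min)
    # means the pixel is no smaller (larger) than all 26 neighbors.
    return value >= cube.max() or value <= cube.min()
```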
Next, an accurate location of each keypoint can be determined. Specifically, this can be done by fitting a 3-dimensional quadratic function to accurately determine the location and scale of each keypoint. At the same time, low-contrast candidate points and response points along an edge can be discarded to improve consistency in the later feature-matching processes and to increase noise-rejection capability. Processing each remaining keypoint can further include determining a main orientation of the keypoint and generating a descriptor of the keypoint.
To determine the orientation of a keypoint, pixels in a neighborhood window centered on the keypoint can be sampled, and an orientation histogram can be used to determine the gradient orientations of the neighboring pixels. An orientation histogram with 36 bins can be formed, with each bin covering 10 degrees for a total range of 0-360 degrees. The peaks in this histogram can correspond to the dominant orientations of the neighboring gradients, and thus can be used as the dominant orientations of the keypoint. In addition, any peak in the gradient orientation histogram that reaches at least 80% of the highest peak can be kept as a supplementary orientation of the keypoint.
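Assuming the gradient magnitudes and orientations of the neighboring pixels have already been computed, the 36-bin histogram and the 80% rule could be sketched as follows (function and parameter names are illustrative):

```python
import numpy as np

def keypoint_orientations(magnitudes, angles_deg, peak_ratio=0.8):
    """Return the dominant and supplementary orientations in degrees."""
    hist, _ = np.histogram(np.asarray(angles_deg) % 360.0, bins=36,
                           range=(0.0, 360.0),
                           weights=np.asarray(magnitudes))
    # Keep every bin whose count reaches at least 80% of the highest peak.
    return [bin_idx * 10.0 for bin_idx, value in enumerate(hist)
            if value >= peak_ratio * hist.max()]
```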
Referring again to FIG. 4, a descriptor can next be generated for each keypoint (step 403). In one embodiment, the neighborhood around the keypoint can be divided into 4×4 subregions, and a gradient orientation histogram with eight bins can be computed for each subregion, producing a 4×4×8=128-dimensional SIFT feature vector for the keypoint.
The 128-dimensional SIFT feature vector can then be quantized into a feature ID (e.g., a number from 1 to 1,000,000) (step 404). That is, each SIFT feature vector representing a feature of the training image can have a corresponding numeric feature ID. Typically, more than one feature can be identified from a training image. Accordingly, each training image may be associated with multiple feature IDs. Because each training image can be associated with an item (e.g., the item depicted in the image), the item can also be associated with the multiple feature IDs.
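The disclosure does not specify the quantization scheme. One common possibility, sketched below purely as an assumption, is to map each descriptor to the nearest entry of a codebook of "visual words" (e.g., cluster centers learned with k-means over the training descriptors), with the entry's index serving as the feature ID:

```python
import numpy as np

def descriptor_to_feature_id(descriptor, codebook):
    """Quantize a 128-D SIFT descriptor to the ID of its nearest codeword.

    codebook: (num_words, 128) array of cluster centers; with
    num_words = 1,000,000 the IDs fall in the range described above.
    """
    distances = np.linalg.norm(codebook - descriptor, axis=1)
    return int(np.argmin(distances))
```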
Similarly, the same feature may be found in different training images. For example, the book cover of “Harry Potter and the Chamber of Secrets” may share some of the same features with that of “Harry Potter and the Goblet of Fire” (e.g., both covers may include an image of the text “Harry Potter”). Accordingly, each feature ID may be associated with multiple items. The relationship between the features (as identified by their respective feature ID) and the items can be captured and stored in a database accessible to the server (step 405).
In various embodiments, the database can be any suitable data storage program/format including, but not limited to, a list, a text file, a spreadsheet, a relational database, and/or an object-oriented database.
As shown in the table 500, each feature ID can correspond to one or more items. As previously discussed, when the same feature is found in two different training images, the corresponding feature ID can be associated with two different items. For example, as shown in the table of FIG. 5, a single feature ID can correspond to both “Harry Potter and the Chamber of Secrets” and “Harry Potter and the Goblet of Fire” because the two book covers can share a common feature.
When the server receives a scanned image from a user device as a request for information relating to the item in the image, the server can process the scanned image to extract feature IDs representing various features of the scanned image and look up the corresponding item(s) from the database (e.g., the table of FIG. 5).
With the feature IDs determined, the corresponding item(s) for each feature ID can be looked up from the database (e.g., the table of FIG. 5) and selected as candidate items (step 604).
Next, the total number of hits for each of the selected items can be determined (step 606). For example, “Harry Potter and the Chamber of Secrets” can have a total of one hit while “Harry Potter and the Goblet of Fire” can have a total of two hits based on the information in the table of FIG. 5. The item with the most hits can then be selected as the candidate item that best matches the scanned image. In some embodiments, an item can be eliminated from consideration if its number of hits does not exceed a predetermined threshold.
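Using the illustrative inverted index sketched earlier, the hit counting and threshold test could look like the following (min_hits stands in for the predetermined threshold and is an assumed parameter):

```python
from collections import Counter

def best_matching_item(scanned_feature_ids, feature_index, min_hits=3):
    """Vote for items across all feature IDs found in the scanned image."""
    hits = Counter()
    for fid in scanned_feature_ids:
        for item in feature_index.get(fid, ()):
            hits[item] += 1
    if not hits:
        return None
    item, count = hits.most_common(1)[0]
    # Eliminate the candidate if its hit count does not exceed the threshold.
    return item if count > min_hits else None
```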
In some embodiments, a geometric verification step can be performed to further verify that the scanned image matches the training image associated with the candidate item (step 610) before the item is determined to be the best match for the scanned image. In particular, geometric verification can involve matching the individual pixels or features from the scanned image with those from the training image of the item selected through the process described above. This can be done by, for example, measuring and comparing the relative distances between two or more pixels in each of the two images. Based on how well the relative distances between the pixels match in the two images, it can be determined whether the training image is a top match for the scanned image. If the geometric verification is successful, the item associated with the training image can be confirmed to be the item associated with the scanned image.
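One standard way to realize such a geometric check, offered here as an illustrative assumption rather than the disclosure's prescribed method, is to match SIFT descriptors between the two images and fit a homography with RANSAC, accepting the candidate only if enough matches are geometrically consistent:

```python
import cv2
import numpy as np

def geometric_verification(kp_scan, desc_scan, kp_train, desc_train,
                           min_inliers=15):
    """Accept the candidate only if enough matches fit one homography."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc_scan, desc_train, k=2)
    # Lowe's ratio test keeps only distinctive correspondences.
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    if len(good) < 4:  # findHomography needs at least 4 point pairs
        return False
    src = np.float32([kp_scan[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_train[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return mask is not None and int(mask.sum()) >= min_inliers
```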
After an item is determined to be the item corresponding to the scanned image, the server can search for information relating to the item in one or more data repositories. For example, if “Harry Potter and the Goblet of Fire” is determined to be the item associated with the scanned image received from the user device, the server can conduct a search for information relating to this particular book. The results from the search can then be transmitted back to the user device for display, as shown in the exemplary screen shot of FIG. 3c.
Although the above embodiments describe identifying books and movies from images of book covers and movie posters, respectively, the same methods and systems can be applied to obtain information relating to any item, as long as an image of the item can be captured by scanning or another mechanism and the item in the image can be recognized based on the information (e.g., information extracted from the training images) available to the server. In various embodiments, the item can also be, for example, the logo of a product, a screen shot from another device, a work of art such as a painting, or a 3-dimensional object such as a building. It should also be understood that the processes for extracting information such as feature IDs from an image are not limited to those described in the embodiments above. Without departing from the spirit of the disclosure, other suitable processes for recognizing text, graphics, facial expressions, geographic locations, 1D and 2D codes, etc. can also be used for identifying a particular item for the purpose of providing information relating to the item. Examples of other types of image processing systems and methods are described in, for example, Chinese Patent Application No. 201210123853.5, filed Apr. 25, 2012, the content of which is incorporated by reference herein in its entirety.
The above-described exemplary processes including, but not limited to, generating a list of feature IDs from training images, storing these feature IDs with their corresponding items in the database, determining a best-matching item for a scanned image using the information stored in the database, and obtaining information relating to the best-matching item can be implemented using various combinations of software, firmware, and hardware technologies on the server (or a cluster of servers). The server may include one or more modules for facilitating the various tasks of these processes.
The server can also include a training image processing module 702 for collecting and processing training images to identify various feature IDs and their corresponding items. In some embodiments, the training image processing module 702 can perform one or more of the steps illustrated in FIG. 4.
In some embodiments, the training image processing module 800 and the item-selection module 900 can share one or more of the keypoint-identifying module, descriptor-generating module, and feature ID generating module 904.
In some embodiments, one or more of these modules on the server can be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. In the context of this document, a “non-transitory computer-readable storage medium” can be any medium that can contain or store the program for use by or in connection with the instruction execution system, apparatus, or device. The non-transitory computer-readable storage medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, a portable computer diskette (magnetic), a random access memory (RAM) (magnetic), a read-only memory (ROM) (magnetic), an erasable programmable read-only memory (EPROM) (magnetic), a portable optical disc such as a CD, CD-R, CD-RW, DVD, DVD-R, or DVD-RW, or flash memory such as compact flash cards, secure digital (SD) cards, USB memory devices, memory sticks, and the like.
The non-transitory computer readable storage medium can be part of a computing system serving as the server.
Although the modules illustrated in FIGS. 7-9 are described as residing on a single dedicated server, in various embodiments the modules, and the tasks they perform, can be distributed across multiple devices. For example, referring again to FIG. 1, some or all of the image processing and information retrieval tasks can alternatively be performed by the requesting devices themselves.
In one embodiment, one of the requesting devices can process image-based information retrieval requests from one or more other requesting devices. As such, no dedicated server is necessary to carry out the processes of the methods discussed above. In other embodiments, the various steps and tasks involved in the processes described above can be divided between the server and the requesting devices in any suitable fashion.
Essentially, embodiments of the disclosure provide methods and systems that can allow a user to scan a 2-dimensional image of any item of interest using, for example, his smartphone, and provide him with all sorts of information relating to the item that can be ascertained from, for example, the Internet and/or other existing data repositories. This can provide a simple, low-cost, user-friendly, and effective way of looking up information relating to anything that can be captured in an image.
Although embodiments of this disclosure have been fully described with reference to the accompanying drawings, it is to be noted that various changes and modifications will become apparent to those skilled in the art. Such changes and modifications are to be understood as being included within the scope of embodiments of this disclosure as defined by the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
201210123853.5 | Apr 2012 | CN | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2013/074731 | 4/25/2013 | WO | 00 | 5/30/2013 |