The present disclosure relates generally to image classification, and more particularly to automated features for providing classification labels for businesses or other location entities based on images.
Computer-implemented search engines are used generally to implement a variety of services for a user. Search engines can help a user not only to identify information based on identified search terms, but also to locate businesses or other location entities of interest to a user. Oftentimes, search queries are performed that are locality-aware, e.g., by taking into account the current location of a user or a desired location for which a user is searching for location-based entity information. Examples of such queries can be initiated by entering a location term (e.g., street address, latitude/longitude position, “near me” or other current location indicator) and other search terms (e.g., pizza, furniture, pharmacy). Having a comprehensive database of entity information that includes accurate business listing information can be useful to respond to these types of search queries. Existing databases of business listings can include pieces of information including business names, locations, hours of operation, and even street level images of such businesses, offered within services such as Google Maps as “Street View” images. Including additional database information that accurately identifies categories associated with each business or location entity can also be helpful to accurately respond to location-based search queries from a user.
Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
One example aspect of the present disclosure is directed to a computer-implemented method of providing classification labels for location entities from imagery. The method can include providing, using one or more computing devices, one or more images of a location entity as input to a statistical model. The method can also include applying, by the one or more computing devices, the statistical model to the one or more images. The method can also include generating, using the one or more computing devices, a plurality of classification labels for the location entity in the one or more images. The plurality of classification labels can be generated by selecting from an ontology that identifies predetermined relationships between location entities and categories associated with corresponding classification labels at multiple levels of granularity. The method can still further include providing, using the one or more computing devices, the plurality of classification labels as an output of the statistical model.
Another example aspect of the present disclosure is directed to a computer-implemented method of processing a business-related search query. The method can include receiving, using one or more computing devices, a request for listing information for a particular type of business. The method can also include accessing, using the one or more computing devices, a database of business listings that comprises businesses, images of the businesses, and associations between the businesses and multiple classification labels. The associations between the businesses and multiple classification labels can be identified by providing each image of a business as input to a statistical model, applying the statistical model to each image of the business, generating the multiple classification labels for the business, and providing the multiple classification labels for the business as output of the statistical model. The method can also include providing, using the one or more computing devices, listing information including one or more business listings identified from the database of business listings at least in part by consulting the associations between the businesses and multiple classification labels.
Other example aspects of the present disclosure are directed to systems, apparatus, tangible, non-transitory computer-readable media, user interfaces, memory devices, and electronic devices for providing classification labels for location entities from imagery.
These and other features, aspects, and advantages of various embodiments will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the related principles.
Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:
Reference now will be made in detail to embodiments, one or more examples of which are illustrated in the drawings. Each example is provided by way of explanation of the embodiments, not limitation of the present disclosure. In fact, it will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments without departing from the scope or spirit of the present disclosure. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that aspects of the present disclosure cover such modifications and variations.
In some embodiments, in order to obtain the benefits of the techniques described herein, the user may be required to allow the collection and analysis of image data, location data, and other relevant information collected for various location entities. For example, in some embodiments, users may be provided with an opportunity to control whether programs or features collect such data or information. If the user does not allow collection and use of such signals, then the user may not receive the benefits of the techniques described herein. The user can also be provided with tools to revoke or modify consent. In addition, certain information or data can be treated in one or more ways before it is stored or used, so that personally identifiable data or other information is removed.
Example aspects of the present disclosure are directed to systems and methods of providing classification labels for a location entity based on images. Following the popularity of smart mobile devices, search engine users today perform a variety of locality-aware queries, such as “Japanese restaurant near me,” “Food nearby open now,” or “Asian stores in San Diego.” With the help of local business listings, these queries can be answered in a way that can be tailored to the user's location.
Creating accurate listings of local businesses can be time consuming and expensive. Categorizing such business listings is not a trivial task for humans, since it requires the ability to read the local language, familiarity with local chains and brands, and general expertise in complex categorization. To be useful for a search engine, the listings need to be accurate, extensive, and, importantly, contain a rich representation of the business category including more than one category. For example, recognizing that a “Japanese Restaurant” is a type of “Asian Store” that sells “Food” can be important in accurately answering a large variety of queries.
In addition to the complexities of creating accurate and comprehensive business listings, listing maintenance can be a never ending task as businesses often move or close down. It is estimated that about 10 percent of establishments go out of business every year. In some segments of the market, such as the restaurant industry, this rate can be as high as about 30 percent. The time, expense, and continuing maintenance of creating an accurate and comprehensive database of categorized business listings makes a compelling case for new technologies to automate the creation and maintenance of business listings.
The embodiments according to example aspects of the present disclosure can automatically create classification labels for location entities from images of the location entities. In general, this can be accomplished by providing location entity images as an input to a statistical model (e.g., a neural network or other model implemented through a machine learning process). The statistical model then can be applied to the image, at which point a plurality of classification labels for the location entity in the image can be generated and provided as an output of the statistical model. In some examples, a confidence score also can be generated for each of the plurality of classification labels to indicate a likelihood level that each generated classification label is accurate for its corresponding location entity.
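The end-to-end flow described above can be sketched as follows. This is a hypothetical illustration only: the model object, label names, scores, and the minimum-confidence threshold are assumptions made for the example, not part of the disclosure.

```python
# Hypothetical sketch: provide an image to a trained statistical model and
# receive classification labels with per-label confidence scores as output.

def classify_location_entity(image, model, threshold=0.1):
    """Apply a trained statistical model to an image and return
    (label, confidence) pairs whose confidence meets a minimum threshold."""
    scores = model.predict(image)  # assumed to yield per-label probabilities
    pairs = [(label, score) for label, score in scores.items()
             if score >= threshold]
    # Sort so the most confident classification labels come first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

class _ToyModel:
    """Stand-in for a trained network: returns fixed illustrative scores."""
    def predict(self, image):
        return {"Restaurant": 0.93, "Japanese Restaurant": 0.88,
                "Food & Drink": 0.95, "Pharmacy": 0.02}

pairs = classify_location_entity(object(), _ToyModel())
```

In a real system, the model would of course be a trained neural network rather than a fixed table of scores; the sketch only shows the input/output contract described in the text.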
Types of images and image preparation can vary in different embodiments of the disclosed technology. In some examples, the images correspond to panoramic street-level images, such as those offered by Google Maps as “Street View” images. In some examples, a bounding box can be applied to the images to identify at least one portion of each image that contains business related information. This identified portion can then be applied as an input to the statistical model.
Types of classification labels also can vary in different embodiments of the disclosed technology. In some examples, the location entities correspond to businesses such that classification labels provide multi-label fine grained classification of business storefronts. In some examples, the plurality of classification labels for the location entity identified in the images includes at least one classification label from a first hierarchical level of categorization and at least one classification label from a second hierarchical level of categorization. In some examples, the plurality of classification labels are generated by selecting from an ontology that identifies different predetermined relationships between location entities and different categories associated with corresponding classification labels at multiple levels of granularity. In some examples, the plurality of classification labels for the location entity can include at least one classification label from a general level of categorization that includes such options as an entertainment and recreation label, a health and beauty label, a lodging label, a nightlife label, a professional services label, a food and drink label and a shopping label.
Training the neural network or other statistical model can include using a set of training images of different location entities and data identifying the geographic location of the location entities within the training images, such that the neural network outputs a plurality of classification labels for each training image. In some examples, the neural network can be a distributed and scalable neural network. In some examples, the neural network can be a deep neural network and/or a convolutional neural network. The neural network can be customized in a variety of manners, including providing a specific top layer such as but not limited to a logistic regression top layer.
The generated plurality of classification labels provided as output from the neural network or other statistical model can be utilized in a variety of specific applications. In some examples, the images provided as input to the neural network are subsequently tagged with one or more of the plurality of classification labels generated as output. In some examples, an association between the location entity associated with each image and the plurality of generated classification labels can be stored in a database. In some examples, the location entities from the images correspond to businesses and the database of stored associations includes business information for the businesses as well as the associations between the business associated with each image and the plurality of generated classification labels. In some examples, images can be matched to an existing business in the database using the plurality of generated classification labels at least in part to perform the matching. In other examples, a request from a user for business information can be received. The requested business information then can be retrieved from the database that includes the stored associations between the business associated with an image and the plurality of generated classification labels.
According to an example embodiment, a search engine receives requests for various business-related, location-aware search queries, such as a request for listing information for a particular type of business. The request can optionally include additional time or location parameters. A database of business listings that comprises businesses, images of the businesses, and associations between the businesses and multiple classification labels can be accessed. In some examples, the associations between the businesses and multiple classification labels can be identified by providing each image of a business as input to a statistical model, applying the statistical model to each image of the business, generating the multiple classification labels for the business, and providing the multiple classification labels for the business as output of the statistical model. Listing information then can be provided as output, including one or more business listings identified from the database of business listings at least in part by consulting the associations between the businesses and multiple classification labels.
Referring now to the drawings, exemplary embodiments of the present disclosure will now be discussed in detail.
The statistical model 104 can be implemented in a variety of manners. In some embodiments, machine learning can be used to evaluate training images and develop classifiers that correlate predetermined image features to specific categories. For example, image features can be identified as training classifiers using a learning algorithm such as Neural Network, Support Vector Machine (SVM) or other machine learning process. Once classifiers within the statistical model are adequately trained with a series of training images, the statistical model can be employed in real time to analyze subsequent images provided as input to the statistical model.
In examples when statistical model 104 is implemented using a neural network, the neural network can be configured in a variety of particular ways. In some examples, the neural network can be a deep neural network and/or a convolutional neural network. In some examples, the neural network can be a distributed and scalable neural network. The neural network can be customized in a variety of manners, including providing a specific top layer such as but not limited to a logistic regression top layer. A convolutional neural network can be considered as a neural network that contains sets of nodes with tied parameters. A deep convolutional neural network can be considered as having a stacked structure with a plurality of layers.
Although statistical model 104 of
Referring still to
Types and amounts of classification labels 106 can vary in different embodiments of the disclosed technology. In some examples, the location entities correspond to businesses such that classification labels 106 provide multi-label fine grained classification of business storefronts. In some examples, the plurality of classification labels 106 for the location entity identified in image 102 includes at least one classification label 106 from a first hierarchical level of categorization (e.g., “Health & Beauty”) and at least one classification label from a second hierarchical level of categorization (e.g., “Dental.”) In some examples, the plurality of classification labels 106 are generated by selecting from an ontology that identifies different predetermined relationships between location entities and different categories associated with corresponding classification labels at multiple levels of granularity. In some examples, the plurality of classification labels 106 for the location entity can include at least one classification label from a general level of categorization that includes such options as an entertainment and recreation label, a health and beauty label, a lodging label, a nightlife label, a professional services label, a food and drink label and a shopping label. Although four different classification labels 106 and corresponding confidence scores are shown in the example of
Referring now to
The disclosed classification techniques effectively address potentially large within-class variance when accurately predicting the function or classification of businesses or other location entities. The number of possible categories can be large, and the similarity between different classes can be smaller than within-class variability. For example,
The disclosed classification techniques provide solutions for accurate business classification that do not rely purely on textual information within images. Although textual information in an image can assist the classification task, and can be used in combination with the disclosed techniques, OCR analysis of text strings available from an image is not required. This provides an advantage because of the various drawbacks that can potentially exist in some text-based models. The accuracy of text detection and transcription in real world images has increased significantly in recent years. However, relying solely on an ability to transcribe text can have drawbacks. For example, text can be in a language for which there is no trained model, or the language used can be different than what is expected based on the image location. In addition, determining which text in an image belongs to the business being classified can be a hard task and extracted text can sometimes be misleading.
Referring more particularly to
In light of potential issues that can arise as shown in
An ontology for classification labels as used herein helps to create large scale labeled training data for fine grained storefront classification. In general, information from an ontology of entities with geographical attributes can be fused to propagate category information such that each image can be paired with multiple classification labels having different levels of granularity.
It should be appreciated that the relatively small snippet of ontology depicted in
Ontologies can be designed in order to yield a multiple label classification approach that includes many plausible categories for a business and thus many different classification labels. Different classification labels used to describe a given business or other location entity represent different levels of specificity. For example, a hamburger restaurant is also generally considered to be a restaurant. There is a containment relationship between these categories. Ontologies can be a useful way to hold hierarchical representations of these containment relationships. If a specific classification label c is known for a particular image portion p, c can be located in the ontology. The containment relations described by the ontology can be followed in order to add higher-level categories to the label set of p.
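The containment walk described above can be sketched as follows: given a specific label c for an image portion p, higher-level categories are added by following parent links up the ontology. The category names and parent table below are illustrative assumptions, not the actual ontology of the disclosure.

```python
# Minimal sketch of ontology containment relations: each entry maps a
# category to its more general parent category.
PARENT = {
    "Hamburger Restaurant": "Restaurant",
    "Japanese Restaurant": "Restaurant",
    "Restaurant": "Food & Drink",
}

def expand_labels(c):
    """Return the label set for specific label c, following containment
    relations upward to add higher-level categories."""
    labels = [c]
    while c in PARENT:          # walk up the ontology
        c = PARENT[c]
        labels.append(c)
    return labels

labels = expand_labels("Hamburger Restaurant")
```

With this table, a hamburger restaurant is labeled at three levels of granularity, mirroring the multiple-label approach the text describes.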
Referring again to the example of
Referring now to
In some examples, building a set of training data for training statistical model 104 can include matching extracted image portions p and sets of relevant classification labels. Each image portion can be matched with a particular business instance from a database of previously known businesses β that were manually verified by operators. Textual information and geographical location of the image can be used to match the image portion to a business. Text areas can be detected in the image, then transcribed using Optical Character Recognition (OCR) software. Although this process requires a step of extracting text, it can be useful for creating a set of candidate matches. This provides a set S of text strings. The image portion can be geo-located and the location information can be combined with the textual data for that image. For each known business b ∈ β, the same description can be created by combining its location and the set T of all textual information that is available for that business (e.g., name, phone number, operating hours, etc.). Image portion p can be matched to a business b ∈ β if the geographical distance between them is less than approximately one city block and enough extracted text from S matches T. Using this technique, many pairs of data (p, b) can be created, for example, on the order of three million pairs or more.
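The matching rule above can be sketched as a simple predicate: an image portion matches a known business if the two are within roughly one city block and share enough OCR-extracted text. The distance approximation, the 100-meter block length, the overlap threshold, and the example coordinates are all assumptions made for illustration.

```python
import math

def distance_m(lat1, lon1, lat2, lon2):
    """Approximate ground distance in meters (equirectangular projection,
    adequate at city-block scale)."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    return 6371000.0 * math.hypot(dlat, dlon)

def matches(portion, business, block_m=100.0, min_overlap=2):
    """portion and business are (lat, lon, set-of-text-strings) tuples:
    S for the OCR text of the portion, T for the business's known text."""
    (plat, plon, S), (blat, blon, T) = portion, business
    close = distance_m(plat, plon, blat, blon) < block_m
    return close and len(S & T) >= min_overlap

p = (37.4220, -122.0841, {"joes", "pizza", "650-555-0100"})
b = (37.4221, -122.0842, {"joes", "pizza", "open", "650-555-0100"})
```

Here `matches(p, b)` succeeds: the two points are about 14 meters apart and share three text strings, so the (p, b) pair would be added to the training data.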
Referring still to a task of training the statistical model at (302), a train/test data split can be created such that a subset of images (e.g., 1.2 million images) are used for training the network and the remaining images (e.g., 100,000) are used for testing. Since a business can be imaged multiple times from different angles, the train/test data splitting can be location aware. The fact that Street View panoramas are geotagged can be used to further help the split between training and test data. In one example, the globe of the Earth can be covered with two types of tiles: big tiles with an area of 18 square kilometers and smaller tiles with an area of 2 square kilometers. The tiling can alternate between the two types of tiles, with a boundary area of 100 meters between adjacent tiles. Panoramas that fall inside a big tile can be assigned to the training set, and those that are located in the smaller tiles can be assigned to the test set. This can ensure that businesses in the test set are never observed in the training set while making sure that training and test sets are sampled from the same regions. This splitting procedure can be fast and stable over time. When new data is available and a new split is made, train/test contamination can be avoided as the geographical locations are fixed. This can allow for incremental improvements of the system over time.
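The location-aware split above can be sketched in simplified form. For clarity, a one-dimensional strip of alternating tiles stands in for the full tiling of the globe; the tile sizes and boundary margin follow the text, but the geometry and kilometer-based coordinates are illustrative assumptions.

```python
# Simplified sketch of the tile-based train/test split: big tiles feed the
# training set, small tiles feed the test set, and panoramas near a tile
# boundary are discarded to avoid train/test contamination.
BIG_KM, SMALL_KM, BOUNDARY_KM = 18.0, 2.0, 0.1
PERIOD = BIG_KM + SMALL_KM  # one big tile followed by one small tile

def split_assignment(x_km):
    """Assign a geotagged panorama at position x_km along the strip to
    'train', 'test', or None (inside the boundary area between tiles)."""
    offset = x_km % PERIOD
    near_boundary = (offset < BOUNDARY_KM
                     or abs(offset - BIG_KM) < BOUNDARY_KM
                     or offset > PERIOD - BOUNDARY_KM)
    if near_boundary:
        return None  # discard panoramas in the 100 m boundary area
    return "train" if offset < BIG_KM else "test"
```

Because assignments depend only on fixed geographic position, re-running the split on new data keeps earlier train/test membership stable, matching the incremental-improvement property described above.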
In some examples, training a statistical model at (302) can include pre-training using a predetermined subset of images and ground truth labels with a softmax top layer. Once the model has converged, the top layer in the statistical model can be replaced before the training process continues with a training set of images as described above. Such a pre-training procedure has been shown to be a powerful initialization for image classification tasks. Each image can be resized to a predetermined size, for example 256×256 pixels. During training, random crops of slightly different sizes (e.g., 220×220 pixels) can be given to the model as training images. The intensity of the images can be normalized, random photometric changes can be added and mirrored versions of the images can be created to increase the amount of training data and guide the model to generalize. In one testing example, a central box of size 220×220 pixels was used as input 102 to the statistical model 104, implemented as a neural network. The network was set to have a dropout rate of 70% (each neuron has a 70% chance of not being used) during training, and a logistic regression top layer was used. Each image was associated with a plurality of classification labels as described herein. This setup can be designed to push the network to share features between classes that are on the same path up the ontology.
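The augmentation recipe above can be sketched as follows. Images are represented as plain 2-D lists here purely for illustration; a real pipeline would use an image library, and the photometric changes are omitted. The crop and image sizes follow the text; everything else is a simplifying assumption.

```python
import random

def random_crop(img, size=220):
    """Take a size x size window at a random offset from a resized image
    (e.g., 256 x 256), yielding slightly different views each epoch."""
    h, w = len(img), len(img[0])
    top = random.randint(0, h - size)
    left = random.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def mirror(img):
    """Horizontally mirrored copy, doubling the available training data."""
    return [list(reversed(row)) for row in img]

# A stand-in 256 x 256 "image" of pixel indices.
img = [[r * 256 + c for c in range(256)] for r in range(256)]
crop = random_crop(img)
```

Each training pass would draw fresh random crops (and their mirrors), which is what guides the model to generalize rather than memorize exact pixel positions.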
Referring still to
It should be appreciated that the application of a bounding box at (304) to one or more images can be an optional step. In some embodiments, application of a bounding box or other cropping technique may not be required at all. This can often be the case with indoor images or images that are already focused on a particular location entity or that are already cropped when obtained or otherwise provided for analyses using the disclosed systems and methods.
The one or more images or identified portions thereof created upon application of a bounding box at (304) then can be provided as input to the statistical model at (306). The statistical model then can be applied to the one or more images at (308). Application of the statistical model at (308) can involve evaluating the image relative to trained classifiers within the model such that a plurality of classification labels are generated at (310) to categorize the location entity within each image at multiple levels of granularity. The plurality of classification labels generated at (310) can be selected from the predetermined ontology of labels used to train the statistical model at (302) by evaluating the one or more input images at multiple processing layers. In some examples, a confidence score also can be generated at (312) for each classification label generated at (310).
In example implementations of method (300) using actual statistical model training, image inputs, and corresponding classification label outputs, results can be achieved that have human level accuracy. Method (300) can learn to extract and associate text patterns in multiple languages to specific business categories without access to explicit text transcriptions. Method (300) can also be robust to the absence of text. In addition, when distinctive visual information is available, method (300) can accurately generate classification labels having relatively high confidence scores. Additional performance data and system description for actual example implementations of the disclosed techniques can be found in “Ontological Supervision for Fine Grained Classification of Street View Storefronts,” Movshovitz-Attias et al., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp. 1693-1702, which is incorporated by reference herein in its entirety for all purposes.
The steps in
In some examples of the disclosed technology, the generation (310) of a plurality of classification labels can be postponed unless and until a certain threshold amount of information is available for identifying at least one category or classification label. This option can be helpful to ensure that the classification of business listings generally remains at a very high level of accuracy. This can be useful by preventing unnecessary generation of inaccurate classification labels for a listing, which can potentially frustrate end users who are searching for business listings that use the classification labels generated by method (300). In such instances, a decision to complete generation (310) and later aspects of method (300) can be postponed until a later date if the category for some business images cannot be identified. Since a given business often can be imaged many times (from different angles and/or at different dates/times), it is possible that a category can be determined from a different image of the business. This affords the opportunity to build a classification label set for multiple imaged businesses incrementally as more image data becomes available, while keeping the overall accuracy of the listings high.
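The postponement policy described above can be sketched as a gate on label commitment: labels from one image are written to the listing only when at least one clears a confidence threshold, and otherwise the business simply waits for a later image. The threshold value and data shapes are illustrative assumptions.

```python
def update_listing(listing, scored_labels, threshold=0.8):
    """scored_labels: (label, confidence) pairs produced from one image of
    the business. Commit labels only if the best one is confident enough;
    otherwise postpone and return False so a later image can be tried."""
    if not scored_labels or max(s for _, s in scored_labels) < threshold:
        return False                      # postpone; accuracy stays high
    listing.setdefault("labels", set()).update(
        label for label, score in scored_labels if score >= threshold)
    return True

listing = {}
update_listing(listing, [("Restaurant", 0.4)])                  # postponed
update_listing(listing, [("Restaurant", 0.9), ("Food & Drink", 0.85)])
```

Because a business is typically imaged many times, a listing rejected on one image can still be labeled from a later, clearer image, building the label set incrementally as the text describes.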
Referring now to
Referring now to
Each server 602 and client 622 can include at least one computing device, such as depicted by server computing device 604 and client computing device 624. Although only one server computing device 604 and one client computing device 624 are illustrated in
The computing devices 604 and/or 624 can respectively include one or more processor(s) 606, 626 and one or more memory devices 608, 628. The one or more processor(s) 606, 626 can include any suitable processing device, such as a microprocessor, microcontroller, integrated circuit, logic device, one or more central processing units (CPUs), graphics processing units (GPUs) dedicated to efficiently rendering images or performing other specialized calculations, and/or other processing devices. The one or more memory devices 608, 628 can include one or more computer-readable media, including, but not limited to, non-transitory computer-readable media, RAM, ROM, hard drives, flash drives, or other memory devices. In some examples, memory devices 608, 628 can correspond to coordinated databases that are split over multiple locations.
The one or more memory devices 608, 628 store information accessible by the one or more processors 606, 626, including instructions that can be executed by the one or more processors 606, 626. For instance, server memory device 608 can store instructions for implementing an image classification algorithm configured to perform various functions disclosed herein. The client memory device 628 can store instructions for implementing a browser or application that allows a user to request information from server 602, including search query results, image classification information and the like.
The one or more memory devices 608, 628 can also include data 612, 632 that can be retrieved, manipulated, created, or stored by the one or more processors 606, 626. The data 612 stored at server 602 can include, for instance, a database 613 of listing information for businesses or other location entities. In some examples, business listing database 613 can include more particular subsets of data, including but not limited to name data 614 identifying the names of various businesses, location data 615 identifying the geographic location of the businesses, one or more images 616 of the businesses, and classification labels 617 generated from the image(s) 616 using aspects of the disclosed techniques.
Computing devices 604 and 624 can communicate with one another over a network 640. In such instances, the server 602 and one or more clients 622 can also respectively include a network interface used to communicate with one another over network 640. The network interface(s) can include any suitable components for interfacing with one or more networks, including for example, transmitters, receivers, ports, controllers, antennas, or other suitable components. The network 640 can be any type of communications network, such as a local area network (e.g. intranet), wide area network (e.g. Internet), cellular network, or some combination thereof. The network 640 can also include a direct connection between server computing device 604 and client computing device 624. In general, communication between the server computing device 604 and client computing device 624 can be carried via network interface using any type of wired and/or wireless connection, using a variety of communication protocols (e.g. TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g. HTML, XML), and/or protection schemes (e.g. VPN, secure HTTP, SSL).
The client 622 can include various input/output devices for providing and receiving information to/from a user. For instance, an input device 660 can include devices such as a touch screen, touch pad, data entry keys, and/or a microphone suitable for voice recognition. Input device 660 can be employed by a user to request business search queries in accordance with the disclosed embodiments, or to request the display of image inputs and corresponding classification label and/or confidence score outputs generated in accordance with the disclosed embodiments. An output device 662 can include audio or visual outputs such as speakers or displays for indicating outputted search query results, business listing information, and/or image analysis outputs and the like.
The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. One of ordinary skill in the art will recognize that the inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, server processes discussed herein may be implemented using a single server or multiple servers working in combination. Databases and applications may be implemented on a single system or distributed across multiple systems. Distributed components may operate sequentially or in parallel.
It will be appreciated that the computer-executable algorithms described herein can be implemented in hardware, application specific circuits, firmware and/or software controlling a general purpose processor. In one embodiment, the algorithms are program code files stored on the storage device, loaded into one or more memory devices and executed by one or more processors or can be provided from computer program products, for example computer executable instructions, that are stored in a tangible computer-readable storage medium such as RAM, flash drive, hard disk, or optical or magnetic media. When software is used, any suitable programming language or platform can be used to implement the algorithm.
While the present subject matter has been described in detail with respect to specific example embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the scope of the present disclosure is by way of example rather than by way of limitation, and the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.