Each year, more than 35 million people in the United States move to a new location. People move to find new or cheaper housing, for employment, to be closer to or farther from family members, and the like. People are often moving from a familiar place to a less familiar place. Currently, some real estate apps and websites recommend similar properties based on price, bedrooms, square footage, price per square foot, year built, and other aspects of the home.
The neighborhood similarity tool and method disclosed in the present application recognizes that addresses, neighborhoods, and cities vary on many dimensions. By analyzing key features of locations that are outside of the four walls of a home, the neighborhood similarity tool improves upon the current techniques and increases the accuracy of recommendations for identifying homes, apartments, hotels, vacation rentals, and the like when moving, temporarily relocating, or traveling.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:
The following disclosure describes a neighborhood similarity tool and method for detecting locations that are similar to each other, thereby improving the accuracy of recommendations for homes, apartments, vacation rentals, travel lodging, and the like. In furtherance of this tool, characteristics or features have been determined that provide unique character to a place. These characteristics or features may be grouped into categories and analyzed when comparing different locations for neighborhood similarity. In some embodiments, the neighborhood similarity tool may be provided as a network accessible application, such as a web page specified by a Uniform Resource Locator (URL) and displayable via a web browser, or, may be provided via a server or as a web service and integrated into another third party application.
The processor unit 102 is coupled to the memory 104, which is advantageously implemented as RAM memory holding software instructions that are executed by the processor unit 102. These software instructions represent computer-readable instructions and computer executable instructions. In this embodiment, the software instructions stored in the memory 104 include components (i.e., computer-readable components) for a neighborhood similarity tool 120, a runtime environment or operating system 122, and one or more other applications 124. The memory 104 may be on-board RAM, or the processor unit 102 and the memory 104 could collectively reside in an ASIC. In an alternate embodiment, the memory 104 could be composed of firmware or flash memory. Depending on the computing device 100, different groupings of the components for the neighborhood similarity tool 120 may reside on the device 100. For example, the components 120 residing on a mobile computing device may differ from the components 120 residing on a server.
The storage medium 106 may be implemented as any nonvolatile memory, such as ROM memory, flash memory, or a magnetic disk drive, just to name a few. The storage medium 106 could also be implemented as a combination of those or other technologies, such as a magnetic disk drive with cache (RAM) memory, or the like. In this particular embodiment, the storage medium 106 is used to store data during periods when the computing device 100 is powered off or without power. The storage medium 106 could be used to store metrics used during the similarity calculation, such as population density, walk score metric, median income, crime score metric, and the like. It will be appreciated that the functional components may reside on a computer-readable medium and have computer-executable instructions for performing the acts and/or events of the various methods of the claimed subject matter. The storage medium 106 is one example of a computer-readable medium.
The computing device 100 also includes a communications module 126 that enables bi-directional communication between the computing device 100 and one or more other computing devices. The communications module 126 may include components to enable RF or other wireless communications, such as a cellular telephone network, Bluetooth connection, wireless local area network, or perhaps a wireless wide area network. Alternatively, the communications module 126 may include components to enable land line or hard wired network communications, such as an Ethernet connection, RJ-11 connection, universal serial bus connection, IEEE 1394 (Firewire) connection, or the like. These are intended as non-exhaustive lists and many other alternatives are possible.
The audio unit 128 may be a component of the computing device 100 that is configured to convert signals between analog and digital format. The audio unit 128 is used by the computing device 100 to output sound using a speaker 130 and to receive input signals from a microphone 132. The speaker 130 could also be used to announce incoming calls.
A display 110 is used to output data or information in a graphical form. The display could be any form of display technology, such as LCD, LED, OLED, or the like. The input mechanism 108 includes a keypad-style input mechanism and other commonly known input mechanisms. Alternatively, the input mechanism 108 could be incorporated with the display 110, such as the case with a touch-sensitive display device. Other alternatives too numerous to mention are also possible.
When determining the similarity of two or more addresses or point locations, process 200 determines how much of the area to compare between the locations. A simple radius may be used to compare the area between addresses, but the inventors of the present technique have found that a walk shed area (the area reachable in a certain amount of walking time) yields a more accurate comparison between places. Techniques described in U.S. application Ser. No. 13/587,680 filed on Aug. 16, 2012, entitled “System and Method for the Calculation and Use of Travel Times in Search and Other Applications” may be used to determine the “walk shed” and is hereby incorporated by reference in its entirety. When determining the similarity of areas, process 200 determines the area, such as a neighborhood or city, that is used for comparing places. For example, all of the neighborhood areas in one city could be compared against all of the neighborhood areas of another city.
At block 204, a second location is obtained and a second area encompassing the second location is determined. As described above in block 202, the second location may be obtained in any suitable manner. Likewise, the second area may be determined as described above. Typically, the first area and the second area are determined using similar methods; however, this is not required. While process 200 is illustrated as comparing two locations, one skilled in the art will appreciate that multiple locations may be compared in bulk or batch form without departing from the claimed invention. In addition, one or more of the locations may have been previously stored, such as a home location associated with a mobile device.
At block 206, characteristics that create the unique character for the identified locations are determined. The characteristics may be grouped into categories, such as a built environment category, a people and jobs category, a social media and reviews category, and the like. Each of the categories may have any number of features related to the category. The built environment category may include features related to human-made space in which people live, such as buildings, transportation, home prices, rents, and the like. The people and jobs category may include the types of people who live and work in a neighborhood, which will aid in determining the character of a neighborhood. The social media and reviews category may include interesting information about the character of a neighborhood that may be obtained from social media, such as from TWITTER services, FOURSQUARE services, GOOGLE PLUS services, YELP services, and the like. Those skilled in the art will appreciate that other categories may be added or one of the aforementioned categories may be removed without departing from the scope of the claimed invention. In addition to these location similarity features, home amenity similarity features may be determined, such as the number of bedrooms, square footage, and the like.
At block 208, each location is processed. Processing may occur dynamically or may use stored values from prior processing, such as if the location is a home address that is used quite often. At block 220, metrics are obtained. The metrics may be based on a score that can be obtained from another source, data obtained from another source, generated data based on data obtained from one or more sources, data obtained from sensor technology, data aggregated from social media, and the like. These metrics provide some type of measure that can be used to compare two or more different locations. Each metric provides at least one of the dimensions in the multi-dimensional comparison of two locations. The following describes some example metrics that may be used in different categories.
In the built environment category, metrics may include one or more of the following measures: 1) a measure of a proximity of amenities (businesses, parks, schools, etc.) to an identified address or area; 2) a measure of how well an address or area is served by public transit (e.g., types of transit routes, frequency, and proximity to those routes); 3) a measure indicating bike-ability for a location (e.g., bike lanes or paths, the number of bike commuters); 4) a measure indicating a number and type of business, a number and type of transit lines, a number of car or bike shares, and the like; 5) a measure of building characteristics (e.g., age of buildings, heights, lot sizes, average or median home prices or prices per square foot, average or median rents per bedroom or per square foot, and other information related to buildings near a location); 6) a measure based on analysis of locations, such as types of businesses (e.g., retail versus restaurants versus industrial), price ranges of those businesses, number or area of parks, percentage of one type of restaurant versus another type, price range, rating, and/or review; 7) a measure indicating an average block length (e.g., does a neighborhood have short pedestrian friendly blocks or longer blocks), intersection density (high intersection density is more pedestrian friendly), speed limits, road width, sidewalks; 8) a measure indicating a distance to a city center or other commercial districts (e.g., neighborhood center) for differentiating between close-in (e.g., close to downtown or commercial districts) and fringe (further from downtown) neighborhoods; and 9) a measure indicating levels of traffic and congestion along roads near an address or in an area. These and other metrics may be used in determining neighborhood similarity.
For example, if congested road speed is on average 90% of the free-flow traffic speed in one neighborhood but it is only 10% of the free-flow traffic speed in another neighborhood, those neighborhoods may be considered dissimilar.
In the people and jobs category, metrics may include one or more of the following measures: 1) a measure indicating demographics (e.g., population density, age, gender, commute times, transportation preferences, and the like); 2) a measure indicating jobs (e.g., the number of jobs near an address or in an area, the types of jobs, income for the jobs, commute times for the jobs, and the like); 3) a measure indicating crime rates and types of crimes; and 4) a measure indicating noise volume, frequency, and the like. The data for determining these metrics may be obtained from various sources. For example, the United States Census or other entities and businesses may provide data for obtaining demographics metrics. In addition, Esri provides “tapestry segments” with detailed demographic information that may be used. The Longitudinal Employer-Household Dynamics (LEHD) from the United States Census may provide data for obtaining job metrics.
In the social media and reviews category, metrics may include a measure indicating an aggregation of social media terms from messages near a location to determine the most common words, topics, or phrases.
At block 210, the metrics obtained for each of the locations are compared in a meaningful manner. This may involve further normalization of the metrics if the locations being compared differ greatly in certain characteristics. For example, two neighborhoods might have very similar metrics related to the built environment category but have different metrics related to people and jobs. In another example, two neighborhoods may have a very similar built environment and population but one neighborhood might have older vs. newer buildings or higher vs. lower incomes. The difficulty is determining the similarity of places when the locations vary on many dimensions. In overview, there are a number of well-established mathematical and statistical techniques for determining pairwise distance between multidimensional data points including Euclidean distance, squared Euclidean distance, Manhattan distance, and Cosine similarity. These distance functions compute a numerical value that can be used to determine how “close” two multidimensional points are. Distance functions are one of the parameters used by clustering algorithms.
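By way of illustration only, the distance functions named above may be sketched as follows. The function names, the three-element metric vectors, and their values are hypothetical and are not taken from the disclosure; the sketch assumes both vectors have the same length and, for cosine similarity, that neither vector is all zeros.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared per-dimension differences (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two metric vectors."""
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical metric vectors: (walk score, transit score, bike score).
place_a = [90.0, 80.0, 70.0]
place_b = [87.0, 76.0, 70.0]

print(squared_euclidean(place_a, place_b))  # 25.0
print(euclidean(place_a, place_b))          # 5.0
print(manhattan(place_a, place_b))          # 7.0
```

Note that the first three functions return a distance (smaller means more alike), whereas cosine similarity returns a similarity (larger means more alike), so the two kinds of value should not be mixed without conversion.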
Clustering algorithms identify clusters among complex data, and the computed clusters indicate which items in a dataset are similar. The selection of a clustering algorithm is tied closely to the data being clustered. For the problem of determining location similarity, potentially applicable approaches include hierarchical clustering algorithms, centroid-based algorithms, and density-based clustering algorithms. Specific examples from these classes of algorithms include the k-means algorithm, DBSCAN, and OPTICS.
At block 212, results from the comparison of metrics are provided. The results then indicate the neighborhood similarities between two or more locations. As mentioned above, the results improve the accuracy of recommendations for homes, apartments, vacation rentals, travel lodging, and the like.
As briefly discussed above, two example techniques for obtaining an area associated with a location are illustrated in
In the social media and reviews category, metrics may include a measure indicating an aggregation of social media terms from messages near a location to determine the most common words, topics, or phrases.
At block 1002, the social media data occurring within a specified area is aggregated. For example, process 1000 may aggregate multiple social media messages near an address or in a neighborhood or city to determine which words, topics, or phrases occur most often in a neighborhood. If the location represents an address, process 1000 may look at social media messages within a “walk shed” (the area reachable in a certain time by walking) near an address. Social media APIs allow programmers to retrieve “Tweets”, “check ins”, or other social media data such as online reviews and ratings with their associated latitude and longitude. Social media data includes data retrieved from TWITTER services, FOURSQUARE services, GOOGLE PLUS services, YELP services, and the like. The following pseudocode demonstrates an example aggregation of social media phrases from TWITTER services.
At block 1010, social media messages are accessed. Retrieval of the social data is achieved by interacting with an API provided by the social media organization. For example, tweets that have occurred within the identified area are requested. In Table 1 above, a call to getTweets( ) is performed to get social media data occurring within the specified area.
At block 1012, the messages are analyzed. Statistical or sentiment analysis may be performed on the content of the social media messages. In some embodiments, authors of some tweets are filtered in order to avoid capturing automatically-generated tweets. In addition, in some embodiments, n-grams from a collection of tweets of a single user are aggregated together so that each unique n-gram from a user is counted only once. An n-gram is a set of n contiguous values from a sequence. For example, “method for” is a 2-gram of “system and method for discovering”. The makeNGrams( ) function above in the pseudocode from Table 1 returns all contiguous values of size n or less from the input. For example, a call to makeNGrams(‘and pedestrians need quality’, 2) returns [‘and’, ‘pedestrians’, ‘need’, ‘quality’, ‘and pedestrians’, ‘pedestrians need’, ‘need quality’].
At block 1014, an optional filter may be applied to improve the quality of the analysis performed in block 1012. For example, in some embodiments, frequent posters such as bots may be filtered out, or offensive or non-interesting information may be filtered out. To ensure data quality, the output of the makeNGrams( ) function may be filtered to return the unique set of n-grams from the input. In addition, a location associated with the messages may be determined. The function filterNGrams( ) removes elements from the input that contain certain strings that are identified to be filtered. For example, some of the filtered strings may include mundane values like “the”, “of”, and “and”, as well as phrases that are not considered valuable or worth displaying to users, such as profanity. Instances of the filtered n-grams are counted to determine how many times an n-gram appeared across multiple tweets. Finally, n-grams appearing fewer than minCount times are removed from the list because they are not used commonly enough to constitute a trend in the area.
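The behavior described for makeNGrams( ) may be sketched in Python as follows. This is an illustrative sketch rather than the pseudocode of Table 1; the name make_ngrams and the choice to split on whitespace are assumptions.

```python
def make_ngrams(text, n):
    """Return every run of 1..n contiguous words in text, shortest runs first."""
    words = text.split()  # assumption: words are separated by whitespace
    ngrams = []
    for size in range(1, n + 1):
        # Slide a window of `size` words across the message.
        for start in range(len(words) - size + 1):
            ngrams.append(" ".join(words[start:start + size]))
    return ngrams


print(make_ngrams("and pedestrians need quality", 2))
# ['and', 'pedestrians', 'need', 'quality',
#  'and pedestrians', 'pedestrians need', 'need quality']
```

The output matches the example call given above for a 2-gram limit.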
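The filtering and counting steps of block 1014 may be sketched as follows. The sketch is illustrative only: the names filter_ngrams and count_terms, the blocked-word list, the word-level matching rule, and the sample data are assumptions, and the input is assumed to be n-grams already grouped by author.

```python
from collections import Counter

# Hypothetical filter list; a real list would also hold profanity and the like.
BLOCKED_WORDS = {"the", "of", "and"}


def filter_ngrams(ngrams, blocked=BLOCKED_WORDS):
    """Drop any n-gram containing a blocked word (assumed matching rule)."""
    return [g for g in ngrams if not any(w in blocked for w in g.split())]


def count_terms(ngrams_by_user, min_count):
    """Count each n-gram once per user, then keep those seen min_count+ times."""
    counts = Counter()
    for ngrams in ngrams_by_user.values():
        # set() ensures a user repeating a phrase contributes only one count.
        counts.update(set(filter_ngrams(ngrams)))
    return {g: c for g, c in counts.items() if c >= min_count}


# Hypothetical n-grams already extracted from three users' messages.
sample = {
    "user1": ["farmers market", "farmers market", "the market"],
    "user2": ["farmers market", "coffee"],
    "user3": ["coffee", "of coffee"],
}
terms = count_terms(sample, min_count=2)
print(sorted(terms.items()))  # [('coffee', 2), ('farmers market', 2)]
```

Counting per user rather than per message keeps a single prolific poster from creating an apparent trend.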
At block 1004, output from process 1000 may be provided. Continuing with the example for the pseudocode in Table 1 above, the results from getSocialTerms( ) function may be output to a user or may be provided to process 200 for further comparisons with other locations. For example, the output may be rendered on a display using a font size that is dependent on the number of times the n-gram was seen for an area. Briefly, turning to
The result from process 800 may also generate a metric indicating when a neighborhood is most active. For example, the result of the analysis may indicate whether a neighborhood is more active during the day, at night, or whether the area has a higher level of social media activity during all times. In one embodiment, for example, “Tweets” from the TWITTER service or “check-ins” from the FOURSQUARE service may be used to determine how active a neighborhood is. Social media activity may be normalized by the area contained by a neighborhood or the population within a given boundary, radius, or walk shed. In addition, sentiment analysis may be performed and used to detect a general “mood” of a neighborhood. For example, there are well-known algorithms and software packages that can infer sentiment from text, such as Python NLTK (Natural Language Toolkit) and a text mining module for the statistical programming language R. The sentiment of social media in a neighborhood might be positive or negative, or it might be happy, sad, angry, etc. During the comparison process of different locations, the present neighborhood similarity tool may use sentiment of neighborhoods to identify similarities with locations.
The social media metrics may then be combined with the metrics from other categories (e.g., built environment category, and people and jobs category). Once each location is analyzed to determine N dimensions for that location, comparisons between different locations may be performed to determine similarities.
At block 1202, a pairwise distance between multidimensional data points for the two locations is determined. As discussed above, the difficulty with determining the similarity of locations is the numerous dimensions which may vary between the locations. There are a number of well-established mathematical and statistical techniques for determining pairwise distance between multidimensional data points including Euclidean distance, squared Euclidean distance, Manhattan distance, and Cosine similarity. These distance functions compute a numerical value that can be used to determine how “close” two multidimensional points are. Distance functions are one of the parameters used by clustering algorithms. Clustering algorithms identify clusters among complex data, and the computed clusters indicate which items in a dataset are similar. The selection of a clustering algorithm is tied closely to the data being clustered. For the problem of determining location similarity, the neighborhood similarity tool may apply hierarchical clustering algorithms, centroid-based algorithms, density-based clustering algorithms, or the like. Specific examples from these classes of algorithms include the k-means algorithm, DBSCAN, and OPTICS. Table 2 illustrates example pseudocode for calculating the squared Euclidean distance between all of the addresses in one city versus another city.
In the pseudocode in Table 2, scaledDifference is a mathematical expression selected based on the property, and additiveProperties and subtractiveProperties are sets of metrics that are used to measure similarity, such as a walk score metric, a transit score metric, a population density metric, or the like. Subtractive properties are metrics that make the distance smaller (i.e. bring two addresses closer together). For example, each shared term that appears in the social data for a pair of addresses may be used to decrease the total distance. The amount by which the distance is decreased is controlled by the propertyContribution function, which can be tuned to provide the desired amount of impact for each type of property. In most cases the scaled difference is simply the difference between the property value for address1 and address2. Some types of property, however, may need to be normalized so that values contributed from different types of properties will be of similar magnitudes. For example, values for the walk score metric range from 0 to 100, whereas home values range from five-figure numbers to values in the millions. So that differences in home prices (differences potentially in the millions) do not eclipse walk score metrics (differences potentially in the tens), the neighborhood similarity tool normalizes home prices.
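The interplay of additive properties, subtractive properties, and the propertyContribution function may be sketched as follows. This is not a reproduction of Table 2; the Python names, the fixed contribution of 5.0 per shared term, and the sample addresses are all assumptions chosen for illustration.

```python
def scaled_difference(prop, value1, value2):
    """Plain difference; a property needing normalization would override this."""
    return value1 - value2


def property_contribution(prop, shared_count):
    """Hypothetical tuning: each shared item removes a fixed 5.0 of distance."""
    return 5.0 * shared_count


def address_distance(addr1, addr2, additive, subtractive):
    """Squared Euclidean distance over additive properties, reduced by overlap."""
    distance = 0.0
    for prop in additive:
        distance += scaled_difference(prop, addr1[prop], addr2[prop]) ** 2
    for prop in subtractive:
        shared = len(set(addr1[prop]) & set(addr2[prop]))
        distance -= property_contribution(prop, shared)
    return distance


# Hypothetical addresses with two scored metrics and a list of social terms.
a = {"walk_score": 90, "transit_score": 80, "social_terms": ["coffee", "farmers market"]}
b = {"walk_score": 87, "transit_score": 76, "social_terms": ["coffee", "brunch"]}

d = address_distance(a, b, ["walk_score", "transit_score"], ["social_terms"])
print(d)  # 20.0  (9 + 16 from the scores, minus 5 for one shared term)
```

Because subtractive properties reduce the total, a distance computed this way can fall below zero when two addresses share many terms.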
At block 1204, properties may be optionally normalized. In one embodiment, a softmax transformation may be applied. For example, for two home prices, p1 and p2, the following function for scaledDifference may be employed:
Normalization may be performed between cities too. As mentioned above, the different scales of metrics may require some normalization, using techniques like softmax, to get the desired effect. Metrics of the same type across cities may also differ in scale. The neighborhood similarity tool normalizes these metrics between cities in order to create accurate comparisons. For example, the average rent in a cheap New York City neighborhood may be the same as the average rent in the most expensive neighborhood of a smaller city. It would be incorrect to call these neighborhoods similar. A variety of techniques may be used to normalize these different metrics across neighborhoods. For example, when computing distances the neighborhood similarity tool computes distances using metrics, such as a walk score metric, a bike score metric, a transit score metric, a median income, a median rent, a population density, a job density, and/or social data metric. Of these, median income and median rent may not be directly comparable from city to city. To make median income and median rent comparable between cities, those metrics may be normalized by the median metrics in their cities. This has the effect of changing the median metric for all neighborhoods into a value that is a multiple of the containing city's median metric. For example, if a city has a median income of $50,000 and a set of neighborhoods have median incomes of $32,000, $40,000, $60,000, and $90,000, the scaled median incomes for those neighborhoods are 0.64, 0.8, 1.2, and 1.8, respectively.
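The city-median normalization in the preceding example may be sketched as follows; the function name is hypothetical, and the figures are those given above.

```python
def scale_by_city_median(neighborhood_values, city_median):
    """Express each neighborhood value as a multiple of its city's median."""
    return [value / city_median for value in neighborhood_values]


scaled = scale_by_city_median([32_000, 40_000, 60_000, 90_000], 50_000)
print(scaled)  # [0.64, 0.8, 1.2, 1.8]
```

After scaling, a value of 1.2 means "20% above the city median" in any city, which is what allows neighborhoods in cities with very different income or rent levels to be compared.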
At block 1206, similar data within each data set is identified by comparing metrics. In one embodiment, this may be achieved using pairwise distances computed by the computeCityToCityAddressDistances( ) function to determine the set of similar addresses between city1 and city2 by applying a threshold that separates similar from dissimilar. Table 3 illustrates example pseudocode.
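The thresholding step may be sketched as follows. This is not a reproduction of Table 3; the function name, the dictionary layout assumed for the pairwise distances, and the sample addresses, distances, and threshold are hypothetical.

```python
def similar_pairs(distances, threshold):
    """Return address pairs with distance below threshold, closest first."""
    close = [(pair, d) for pair, d in distances.items() if d < threshold]
    return [pair for pair, _ in sorted(close, key=lambda item: item[1])]


# Hypothetical output of a pairwise distance step, keyed by
# (address in city1, address in city2).
pairwise = {
    ("12 Pine St", "400 Oak Ave"): 18.0,
    ("12 Pine St", "9 Harbor Rd"): 240.0,
    ("77 Elm St", "400 Oak Ave"): 6.5,
}
print(similar_pairs(pairwise, threshold=50.0))
# [('77 Elm St', '400 Oak Ave'), ('12 Pine St', '400 Oak Ave')]
```

The threshold is a tuning choice: a lower value returns fewer, closer matches, and a higher value returns more, looser matches.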
To compute the distance between all neighborhoods in one city versus another city, the distance computation function in Table 4 could be used.
There are multiple data points that can be compared for identified locations. Each data point is associated with some metric. The metrics may be used in their raw form or may be normalized to provide more accurate comparisons. Table 5 illustrates an example set of data points for comparing neighborhoods.
Currently, real estate apps and sites may recommend similar homes or apartments to their users. For example, if a user is looking at a 3 bedroom 2 bathroom home that is 2,200 square feet and costs $250,000, the real estate app or site may recommend other homes with similar characteristics. These characteristics might include price, number of beds and baths, square footage, year built, amenities such as a pool, view, large yard, etc., and other metrics such as the floor area ratio of the home (footprint of the home to lot size), style of the home, age of the home, etc. These characteristics may be referred to as home amenity similarities. However, two seemingly similar houses or apartments could be located in very different types of neighborhoods, thereby making the homes seem dissimilar to a user. The present neighborhood similarity tool uses a technique to discover which locations are actually similar to each other based not only on the home and/or apartment similarity, but also with respect to the locations of each. This technique thereby enhances the accuracy of recommendations for similar locations. The location of a home may be deemed similar based on the walk shed of the home (area reachable in a certain walking time), a radius around the home, the neighborhood the home is in, or the like. The pseudocode in Table 6 may be used to find homes that are similar to a specified home based on location similarity and home amenity similarity.
The compareHomeToHomes function may be used to find homes that are similar to a specified home based on location similarity and home amenity similarity. As with the location distance algorithms described above, home distance has some terms that add to the distance and some terms that subtract from the distance. For example, differences in square footage and number of bedrooms may add to the distance, and the presence of some amenities in both houses, such as a fireplace, could subtract from the distance. The closer the distance, the more similar the homes.
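The combination of location distance and home amenity distance may be sketched as follows. This is not a reproduction of Table 6; the field names, the scaling of square footage by 1,000, the 0.5 reduction per shared amenity, and the sample homes are assumptions made for illustration, and the location distance for each candidate is assumed to have been computed beforehand.

```python
def home_distance(home1, home2, location_distance):
    """Location distance plus amenity differences, minus shared amenities."""
    distance = location_distance
    distance += (home1["bedrooms"] - home2["bedrooms"]) ** 2
    # Assumed scaling: square footage in thousands, so it does not dominate.
    distance += ((home1["sqft"] - home2["sqft"]) / 1000.0) ** 2
    shared = set(home1["amenities"]) & set(home2["amenities"])
    distance -= 0.5 * len(shared)  # assumed contribution per shared amenity
    return distance


def compare_home_to_homes(target, candidates, location_distances):
    """Return candidate ids ordered from most to least similar to target."""
    ranked = [
        (home_distance(target, c, location_distances[c["id"]]), c["id"])
        for c in candidates
    ]
    return [home_id for _, home_id in sorted(ranked)]


target = {"id": "t", "bedrooms": 3, "sqft": 2200, "amenities": ["fireplace"]}
candidates = [
    {"id": "c1", "bedrooms": 3, "sqft": 2100, "amenities": ["fireplace"]},
    {"id": "c2", "bedrooms": 3, "sqft": 2200, "amenities": []},
    {"id": "c3", "bedrooms": 5, "sqft": 3200, "amenities": ["fireplace"]},
]
# Hypothetical precomputed neighborhood distances from the target's location.
location_distances = {"c1": 4.0, "c2": 1.0, "c3": 0.5}

print(compare_home_to_homes(target, candidates, location_distances))
# ['c2', 'c1', 'c3']
```

In this sample, home c3 sits in the most similar neighborhood but ranks last because its size and bedroom count differ most from the target.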
Once the set of data points is compared for the identified locations, the neighborhood similarity tool outputs the results. The output can take many different forms. The goal of the neighborhood similarity tool is to help people find new places to live that match the places they know or like or to help people find new places to visit that match places they have enjoyed in the past. A variety of visualizations and user interfaces may be used to help people find the similar places.
In another embodiment, the output may include a similarity score that is calculated based on similarities between neighborhoods. For example, neighborhoods that are almost identical might have a similarity score of 100 and neighborhoods that are completely different may have a score of 0. The similarity score may be based on a normalization of the Euclidean similarity distance calculated between neighborhoods and may be expressed as a number between 0 and 100, a percentage (e.g., 57% similar), a text label such as “very similar”, or the like. The similarity score may be determined by taking the Euclidean distance, which is a raw number of arbitrary scale, and transforming it into a more understandable score between 0 and 100. For example, the range of distances may be split into three groups and scored conditionally: 1) score 100 if distance <0; 2) score computed by function if 0<=distance <upper; and 3) score 0 if upper <=distance. The upper argument may be the first distance value that is deemed to have a score of 0. One exemplary linear function to compute scores for the middle group is:
Some web and mobile applications know your “home” location based on past behavior (e.g., GPS traces or user-entered home and work locations). Mobile interfaces of this nature could automatically suggest similar neighborhoods when you are in a new place. For example, upon launch, a mobile app could detect your current location, compare that against your stored home location, and automatically select similar locations to visit in a new city. This scenario could be useful for traveling or home shopping. Search interfaces can be used to help people find neighborhoods similar to neighborhoods they know. For example, a website, web page, or app about moving to a city could allow a user to find neighborhoods in that city that are similar to neighborhoods that they already know.
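The exemplary function itself is not reproduced in this text. One linear mapping consistent with the three groups described above (a score of 100 at a distance of 0, falling to 0 at the upper distance) may be sketched as follows; the formula, function name, and sample values are an assumption offered for illustration only.

```python
def similarity_score(distance, upper):
    """Map a raw distance to a 0-100 score using the three groups above."""
    if distance < 0:
        return 100.0  # group 1: distance below zero
    if distance >= upper:
        return 0.0    # group 3: at or beyond the zero-score distance
    # Group 2 (assumed linear form): 100 at distance 0, approaching 0 at upper.
    return 100.0 * (1.0 - distance / upper)


print(similarity_score(-3.0, 50.0))  # 100.0
print(similarity_score(25.0, 50.0))  # 50.0
print(similarity_score(50.0, 50.0))  # 0.0
```

A score from this mapping could then be shown as a number, as a percentage, or bucketed into a text label such as “very similar”.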
While the foregoing written description of the invention enables one of ordinary skill to make and use a neighborhood similarity tool as described above, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the described embodiments, methods, and examples herein. Thus, the invention as claimed should therefore not be limited by the above described embodiments, methods, and examples, but by all embodiments and methods within the scope and spirit of the claimed invention.
This application claims priority under 35 U.S.C. Section 119(e) to U.S. Provisional Application Ser. No. 62/008,490, filed Jun. 5, 2014, entitled “Neighborhood Similarity,” the disclosure of which is hereby incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
62008490 | Jun 2014 | US