In computing, a Uniform Resource Locator (URL) is a Uniform Resource Identifier (URI) that specifies where an identified resource is available and provides a mechanism for retrieving it. For example, a URL can be a unique identity given to a web page by the creator of a website hosting the web page. URLs are defined in a standard format which typically specifies a scheme or protocol, a domain name or Internet Protocol (IP) address, a path of the resource to be fetched or the program to be run, a query string and an optional fragment identifier. Increasingly, URLs contain condensed text that is highly relevant to the topic of the web pages they correspond to. They can be seen as a valuable source of information about the topic of a web page in many applications.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The keyword extraction technique described herein extracts keywords from URLs in web logs (e.g., server logs that contain a series of URL entries requested by a user, typically in reverse chronological order). The technique leverages the content and the structure of URLs to extract relevant keywords. In one embodiment, a URL is first divided into multiple components based on its structure. A set of keywords is extracted from each component of the URL independently with the help of a controlled vocabulary. A second set of keywords is generated by forming combinations of terms from different segments of the URL. Only those combinations which are present in the controlled vocabulary are retained as keywords. Finally, the keywords are scored with a function that takes into account a wide set of features.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of the keyword extraction technique, reference is made to the accompanying drawings, which form a part thereof, and which show by way of illustration examples by which the keyword extraction technique described herein may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
The following sections provide an overview of the keyword extraction technique, as well as exemplary processes and an exemplary architecture for practicing the technique. Details of various embodiments of the keyword extraction technique are also provided.
1.1 Overview of the Technique
The keyword extraction technique described herein extracts keywords from URLs. The technique uses the content and the structure of URLs to extract relevant keywords. These keywords can then be used in various applications, such as, for example, on-line advertising and on-line content filtering.
1.2 URL Structure
Since the present keyword extraction technique uses the URL structure in extracting keywords, some explanation of URL structure is useful. A URL's format is based on Unix file path syntax, where forward slashes are used to separate directory or folder and file or resource names. Every URL consists of some of the following: the scheme name (commonly called the protocol), followed by a colon, then, depending on the scheme, a domain name (or, alternatively, an Internet Protocol (IP) address), a port number, the path of the resource to be fetched or the program to be run, a query string, and an optional fragment identifier. The syntax is scheme://domain:port/path?query_string#fragment_id. The keyword extraction technique described herein uses this URL format to extract keywords for web pages, which can be used for various applications. It is not necessary to download a web page in order to extract its keywords, which provides great computational efficiency.
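By way of illustration only, this decomposition can be reproduced with the Python standard library's urllib.parse module; the example URL below is hypothetical:

```python
from urllib.parse import urlparse

# Hypothetical URL following scheme://domain:port/path?query_string#fragment_id
url = "http://realestate.msn.com:80/buying/homes.aspx?city=seattle#listings"
parts = urlparse(url)

print(parts.scheme)    # scheme (protocol): "http"
print(parts.hostname)  # domain name: "realestate.msn.com"
print(parts.port)      # port number: 80
print(parts.path)      # path of the resource: "/buying/homes.aspx"
print(parts.query)     # query string: "city=seattle"
print(parts.fragment)  # fragment identifier: "listings"
```

Note that only the URL string itself is inspected; no page content is fetched, which is the source of the computational efficiency noted above.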
1.3 Exemplary Processes
The identified components are then broken down into segments, as shown in block 104. For example, the authority component is broken into segments by discarding a protocol field and an extension field of the authority component, while the path component is broken into segments by discarding all fields not related to the topic of the web page to which the URL corresponds. The query component is broken into segments by extracting key-value pairs in the query field, and the fragment component is broken into segments by extracting a fragment field. This segmentation will be discussed in greater detail later in this specification.
The segments are then processed by performing text segmentation on the segments to convert URL text into natural language terms, as shown in box 106. For example, in one embodiment, this is done by replacing each delimiter in the URL text with a space to create terms; and then splitting terms commonly found in URLs.
A first set of keywords is then extracted from the segment terms based on a controlled vocabulary, as shown in block 108. Terms in the segments that match the controlled vocabulary are held to belong to the first set of keywords. The controlled vocabulary is a large list of valid terms and phrases that could be extracted from any URL. A second set of keywords is also generated, based on the controlled vocabulary, by forming combinations of terms from different segments of the URL than were used to generate the first set of keywords, as shown in block 110. In one embodiment of the technique, this second set of keywords is extracted by combining pairs of segments of the URL: a keyword is taken from each segment of a pair, the two keywords are concatenated to form a candidate keyword combination, and the candidate is then verified against the controlled vocabulary. Candidate keyword combinations found in the controlled vocabulary are extracted as keywords, and those that are not found are excluded. The keywords extracted from the URL can also be optionally expanded by using an external knowledge source. For instance, with a semantic mapping, “travel” can be expanded to “trip” and “tour”.
As shown in block 112, the relevance of the first and second sets of keywords is then scored based on a set of features, and the scored keywords are output in order of relevance (block 114). In one embodiment of the keyword extraction technique each keyword is scored based on the position of its parent segment, length of the keyword, and length of the parent segment.
The output keywords can then be used in various applications, as shown in block 116. For example, the extracted keywords can be used to match keywords on a web page with keywords provided by advertisers related to advertisements in order to target specific types of advertisements to specific types of websites. It should be noted that it is not necessary to download the web page in order to extract the keywords from a given web page. Alternately, the extracted keywords can be used for content filtering, for example to filter content, such as pornography, by matching keywords extracted from a web page with a list of terms or phrases that are objectionable. The extracted keywords can also be used for search applications by matching the extracted keyword for a web page with search query terms.
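As a minimal sketch of such matching (the keyword sets below are hypothetical), a simple set intersection suffices:

```python
# Hypothetical keyword sets; in practice these come from the extraction
# pipeline and from advertiser or filtering lists.
extracted = {"real estate", "home loans", "seattle"}
advertiser_keywords = {"home loans", "car insurance"}
objectionable_terms = {"gambling", "adult"}

# Advertisement targeting: ads whose keywords match the page's keywords.
matched_ads = extracted & advertiser_keywords
# Content filtering: flag the page if any extracted keyword is objectionable.
should_filter = bool(extracted & objectionable_terms)

print(matched_ads)    # {'home loans'}
print(should_filter)  # False
```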
As shown in
These first and second sets of keywords are then scored based on relevance in order to output an ordered set of scored keywords, as shown in block 210. Various scoring techniques can be used for this purpose. The technique can also generate additional keywords by using an external knowledge source to provide expansion of the keywords by mapping the keywords to other semantically equivalent or related words and phrases.
1.4 Exemplary Architecture
Details for aspects of this architecture will be discussed in the next section.
1.5 Details of Exemplary Embodiments of the Keyword Extraction Technique
Exemplary processes and an exemplary architecture having been discussed, the following sections provide details of various embodiments of the keyword extraction technique.
1.5.1 URL Parsing
URL parsing is one of the first steps in keyword extraction where informative parts in the URL are retained and noisy text is skipped. This is achieved by leveraging the structure of the URL. As discussed previously, URLs generally contain four important components: authority, path, query and fragment. The general extraction of the components from the URL is discussed in greater detail in the paragraphs below. Each of the extracted components is further parsed into segments.
1.5.1.1 Authority
Authority is a necessary component in every URL. It gives the name of the server on which the page representing the URL is hosted. Authority may contain multiple parts, such as protocol, hostname, and domain, separated by dots. Authority always starts with a protocol such as “http” or “https”. Also, the last part in the authority takes one among the values of “com”, “net”, “us”, “org”, and so forth, which broadly indicates the kind of website and is not typically useful in finding relevant keywords. The technique discards both the protocol and the last part of the authority and retains the remaining parts as segments from this component. For example, “http://realestate.msn.com” has the segments “realestate” and “msn”.
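The authority segmentation above can be sketched as follows; this is a minimal illustration, and multi-part extensions such as “co.uk” would require a public-suffix list:

```python
from urllib.parse import urlparse

def authority_segments(url: str) -> list[str]:
    """Discard the protocol and the trailing extension ("com", "net", ...)
    and keep the remaining dot-separated parts as segments."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Drop the last part ("com", "net", "us", "org", ...) when present.
    return parts[:-1] if len(parts) > 1 else parts

print(authority_segments("http://realestate.msn.com"))  # ['realestate', 'msn']
```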
1.5.1.2 Path:
A URL may contain a path field which contains the path to the resource to be fetched. The path field follows the authority in the URL and may contain a list of directories separated by “/”. These directories might represent the categories to which the page corresponding to the URL belongs. Sometimes directories contain non-informative text, like “content” or a series of digits, which has no relation to the topic of the page. These directories are ignored, and the remaining directories constitute the segments for this component. For example, a directory may be ignored if its text is too generic (e.g., “content”, “file”) or non-informative (e.g., “123”, “a”).
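A sketch of this filtering step follows; the stop list and the minimum-length heuristic below are illustrative assumptions, not part of the technique as claimed:

```python
# Assumed stop list of overly generic directory names.
GENERIC_DIRECTORIES = {"content", "file", "index", "pages"}

def path_segments(path: str) -> list[str]:
    """Keep only directories that plausibly relate to the topic of the page,
    discarding generic names, digit runs, and single characters."""
    segments = []
    for directory in path.strip("/").split("/"):
        if not directory or directory.lower() in GENERIC_DIRECTORIES:
            continue  # too generic, e.g. "content", "file"
        if directory.isdigit() or len(directory) < 2:
            continue  # non-informative, e.g. "123", "a"
        segments.append(directory)
    return segments

print(path_segments("/content/travel/123/las-vegas"))  # ['travel', 'las-vegas']
```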
1.5.1.3 Query:
Sometimes URLs point to a web application, such as a search engine or Common Gateway Interface (CGI) scripts. The query field is the query string that is sent as input to these programs. The query field starts with a “?” after the path in the URL. A query field contains key-value pairs with delimiters such as “;” and “&”. Key-value pairs are a set of two linked data items: a key, which is a unique identifier for some item of data; and the value, which is either the data that is identified or a pointer to the location of that data. For example, city=“las vegas”&show=“cirque du soleil” means that the Cirque du Soleil performance is in the city of Las Vegas. Key-value pairs in the query string are retained as segments from this component. Depending on the application, some keys may be important and other keys may be noise.
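The key-value extraction can be sketched as follows, splitting on both “;” and “&” delimiters; the example query string is hypothetical:

```python
import re
from urllib.parse import unquote_plus

def query_segments(query: str) -> list[tuple[str, str]]:
    """Split a query string on ";" or "&" and return decoded key-value pairs."""
    pairs = []
    for item in re.split(r"[;&]", query):
        if "=" in item:
            key, _, value = item.partition("=")
            pairs.append((key, unquote_plus(value)))
    return pairs

print(query_segments("city=las+vegas&show=cirque+du+soleil"))
# [('city', 'las vegas'), ('show', 'cirque du soleil')]
```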
1.5.1.4 Fragment:
The fragment field is the HTML anchor that appears at the end of the URL after the pound sign, “#”. The fragment field is retained as segments from this component.
All the segments derived from the four logical components form the base unit for the keyword extraction technique to operate on.
1.5.2 Controlled Vocabulary
It is difficult to find phrase boundaries in the unstructured text of URLs, as there is no rule on how the text should appear. Existing Natural Language Processing (NLP) tools for phrase identification, such as Named Entity Recognizers (NER) and Part of Speech (POS) taggers, cannot be applied here, as they are trained on free-flowing natural language text. To overcome this challenge, the keyword extraction technique makes use of a controlled vocabulary to identify valid phrases in a URL.
In general, a controlled vocabulary is a large list of valid phrases that can be extracted from any URL. The nature and the size of the controlled vocabulary may vary depending upon the application for which the keywords are used. For example, a general topic identification system can use a generic topic list derived from Wikipedia topics as a controlled vocabulary. A keyword extraction system for advertising may use a list of millions of advertising bid phrases as controlled vocabulary.
1.5.3 Text Segmentation
Prior to keyword extraction, additional processing is required to convert segmented URL text to natural language text. In one embodiment, delimiters such as “-” or “_” are replaced with a space, and attached terms commonly found in URLs are split. For instance, “savinganddebt” will be split into “saving and debt”.
To optimize the relevance of the split terms, each split term is first checked to see if it is present in the controlled vocabulary. If it is not present, the technique tries to search for a valid split present in the controlled vocabulary. Term splitting is performed in an iterative fashion as follows.
1) One more space is introduced into the term (e.g., this can be done by trial and error in an iterative fashion until a match is found in the controlled vocabulary).
2) All possible splits of words with the new space are generated.
3) If one valid split is found, the terms of the valid split are returned.
4) If more than one valid split is found, the sum of frequencies of the individual words in the controlled vocabulary is computed for each valid split, and the terms of the valid split with the maximum sum are returned.
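The four steps above can be sketched as follows; reading “valid split” as one whose words all appear in the controlled vocabulary is an interpretive assumption, and the toy vocabulary of word frequencies is hypothetical:

```python
from itertools import combinations

def split_term(term: str, vocab: dict[str, int]) -> list[str]:
    """Introduce one more space at a time (step 1), generate all splits with
    that many spaces (step 2), and return a single valid split (step 3) or,
    among several, the one with the maximum sum of word frequencies (step 4)."""
    if term in vocab:
        return [term]
    for spaces in range(1, len(term)):
        valid = []
        for cuts in combinations(range(1, len(term)), spaces):
            words, prev = [], 0
            for cut in (*cuts, len(term)):
                words.append(term[prev:cut])
                prev = cut
            if all(word in vocab for word in words):
                valid.append(words)
        if valid:
            return max(valid, key=lambda ws: sum(vocab[w] for w in ws))
    return [term]  # no valid split found; return the term unchanged

frequencies = {"saving": 40, "and": 100, "debt": 30}
print(split_term("savinganddebt", frequencies))  # ['saving', 'and', 'debt']
```

Exhaustively enumerating cut points is acceptable for the short terms found in URLs; longer inputs would call for dynamic programming.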
1.5.4 Keyword Extraction
After text segmentation, keywords are extracted from each segment by scanning the segment against a controlled vocabulary. A phrase from a segment is designated as a keyword if it is present in the controlled vocabulary. In one embodiment of the keyword extraction technique, each segment is scanned from the left, initially with the largest possible phrase, a length of 4 words. If a match is found, the phrase is added to the list of keywords. Otherwise, the length of the phrase is reduced by one, to a length of 3 words, and the previous step is repeated. This process is reiterated until the technique finds a phrase in the controlled vocabulary or is left with the first word in the segment. The technique then moves to the next word in the segment and repeats the same process to find phrases which might be keywords.
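This longest-match-first scan can be sketched as follows; the toy vocabulary is hypothetical:

```python
def extract_keywords(segment: str, vocab: set[str], max_words: int = 4) -> list[str]:
    """Scan a segment from the left, trying the largest possible phrase
    (up to max_words words) and shrinking by one word until a match is
    found or a single word remains, then advance to the next word."""
    words = segment.split()
    keywords = []
    for i in range(len(words)):
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in vocab:
                keywords.append(phrase)
                break  # move on to the next starting word
    return keywords

vocabulary = {"las vegas", "vegas shows", "shows"}
print(extract_keywords("las vegas shows", vocabulary))
# ['las vegas', 'vegas shows', 'shows']
```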
In one embodiment, along with the above keywords, an additional keyword is extracted if the URL is a search engine result page. A user query is extracted from the query component of the URL and output as a stand-alone keyword irrespective of whether the query is present in the controlled vocabulary or not.
1.5.5 Keyword Combinations
Keyword extraction from a URL does not yield many keywords because of the limited amount of text in the URL. One limitation of the keyword extraction process discussed with respect to the extraction of the first set of keywords is that it constructs keywords only from words appearing consecutively in the same segment of the URL. However, it is possible to generate relevant keywords by combining terms from different segments of the URL. To achieve this, the technique implements the following.
First, a set of keywords is extracted from each segment in the URL using the method explained in the extraction step for the first set of keywords. For every pair of segments, candidate keyword combinations are formed by taking a keyword each from the two different segments and concatenating them. These candidate combinations are verified against the controlled vocabulary; those present in the controlled vocabulary are retained as keywords, and the others are discarded. The initial set of keywords extracted from the segments in the previous extraction step and the keywords generated from this combination step form the final set of keywords for a URL.
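A sketch of this combination step follows; the per-segment keywords and the vocabulary are hypothetical, and only one concatenation order is tried, per the description above:

```python
from itertools import combinations

def combine_keywords(segment_keywords: list[list[str]], vocab: set[str]) -> list[str]:
    """For every pair of segments, concatenate one keyword from each and
    keep only the candidates present in the controlled vocabulary."""
    kept = []
    for kws_a, kws_b in combinations(segment_keywords, 2):
        for a in kws_a:
            for b in kws_b:
                candidate = f"{a} {b}"
                if candidate in vocab:
                    kept.append(candidate)
    return kept

vocabulary = {"las vegas shows", "cheap flights"}
print(combine_keywords([["las vegas"], ["shows", "tickets"]], vocabulary))
# ['las vegas shows']
```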
1.5.6 Smart Expansion
In one embodiment, the technique uses smart expansion to expand the keywords extracted from a URL. This embodiment uses an external knowledge source which provides a mapping from keywords to related expansions. For instance, semantically related terms could be created by experts. In such a mapping, “auto insurance” could be mapped to “car insurance”. Expansions can be used during the above-discussed keyword combinations stage. After the initial keyword sets are generated, additional keywords are retrieved and added for all keywords in each set using smart expansions. The rest of the combinations process is carried out as described in the previous section, but on the new sets that include the expansions.
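The expansion step can be sketched as follows; the mapping below is an illustrative stand-in for an expert-built external knowledge source:

```python
def expand_keywords(keywords: list[str], expansions: dict[str, list[str]]) -> list[str]:
    """Add related terms from an external keyword-to-expansions mapping."""
    expanded = list(keywords)
    for keyword in keywords:
        expanded.extend(expansions.get(keyword, []))
    return expanded

mapping = {"auto insurance": ["car insurance"], "travel": ["trip", "tour"]}
print(expand_keywords(["travel", "hotels"], mapping))
# ['travel', 'hotels', 'trip', 'tour']
```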
1.5.7 Relevance Scoring
In one embodiment of the technique, the relevance score of a keyword is computed based on the position of its parent segment(s), the length of the keyword, and the length of the parent segment(s). First, each keyword is assigned a value between 0 and 10, referred to as its level, based on its position in the URL. The value of the level increases as one moves from left to right in the URL. A keyword appearing in the authority has a lower level than a keyword from the query (Fragment > Query > Path > Authority). The level of a keyword k is normalized using the length of the parent segment.
Where k.len is the length of the keyword, k.level is the level of the keyword, and n is the length of the parent segment. If the keyword is a combination of two keywords k1 and k2, then the level of the keyword is normalized as follows.
The final relevance score of a keyword is computed in a range of 0 to 10,000. It is equal to 1,000 times the level of the keyword normalized by the maximum level possible for that URL. The relevance score of a keyword is given by
Depending on the applications the extracted keywords are used for, the relevance score can be further combined with other measures of the keywords. These measures can be obtained while generating the controlled vocabulary. For example, in an advertising application, the number of bidding advertisers, the number of user views, clicks, conversions, or price can all be important measurements to use.
1.5.8 Capturing User Intent with Keyword Extraction from a Referrer URL
In some applications, keywords are extracted every time a user visits a web page in order to infer the user intent. In such scenarios, along with a web page's URL, it is also possible to make use of a referrer URL. The referrer URL is the URL of the previous web page from which the user requested the current page. It gives the context in which the user visited the current page. In one embodiment of the keyword extraction technique, when the referrer URL is available along with the query URL, keywords are extracted from both URLs independently using the extraction method explained above. A final list of keywords is prepared by combining keywords from both URLs. If a keyword originates from both, the instance having the higher score is retained and the other is ignored.
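The merging of the two keyword lists can be sketched as follows; the keyword-to-score maps are hypothetical:

```python
def merge_keywords(page_kws: dict[str, float], referrer_kws: dict[str, float]) -> dict[str, float]:
    """Combine keyword scores from the page URL and the referrer URL,
    keeping the higher score when a keyword originates from both."""
    merged = dict(referrer_kws)
    for keyword, score in page_kws.items():
        merged[keyword] = max(score, merged.get(keyword, score))
    return merged

page = {"las vegas": 7.5, "shows": 4.0}
referrer = {"las vegas": 6.0, "travel": 5.0}
print(merge_keywords(page, referrer))
# {'las vegas': 7.5, 'travel': 5.0, 'shows': 4.0}
```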
The keyword extraction technique described herein is operational within numerous types of general purpose or special purpose computing system environments or configurations.
For example,
To allow a device to implement the keyword extraction technique, the device should have sufficient computational capability and system memory to enable basic computational operations. In particular, as illustrated by
In addition, the simplified computing device of
The simplified computing device of
Storage of information such as computer-readable or computer-executable instructions, data structures, program modules, etc., can also be accomplished by using any of a variety of the aforementioned communication media to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, RF, infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
Further, software, programs, and/or computer program products embodying some or all of the various embodiments of the keyword extraction technique described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer or machine readable media or storage devices and communication media in the form of computer executable instructions or other data structures.
Finally, the keyword extraction technique described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
It should also be noted that any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.