Determining a user's interest can include the observation and tracking of tags, or non-hierarchical keywords or terms assigned to a piece of information. A tag can describe an item and allow it to be found again by browsing or searching. A typical tagging system relies on manual tagging, either by an author of the document or by viewers of the document (e.g., “Web 2.0”). Tagging is infrequently done, so many documents do not have tags, and those documents that are tagged can be tagged inconsistently. Different taggers may apply different sets of tags, and these differences can be difficult to map. Tagging may therefore not allow for sufficient interest-tracking. Tagging can also include training text classifiers to run on a document and taking the concepts whose classifiers produce a threshold score. However, this technique can require a large time commitment and a large budget.
Examples of the present disclosure may include methods, systems, and computer-readable and executable instructions and/or logic. An example method for constructing an analysis of a document may include determining a plurality of features based on the document, wherein each of the plurality of features is associated with a subset of a set of concepts. The example method may also include constructing a set of concept candidates based on the plurality of features, wherein each concept candidate is associated with at least one concept in the set of concepts. Furthermore, the example method may include choosing a subset of the set of concept candidates as winning concept candidates and constructing an analysis that includes at least one concept in the set of concepts associated with at least one of the winning concept candidates.
In the following detailed description of the present disclosure, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration how examples of the disclosure may be practiced. These examples are described in sufficient detail to enable those of ordinary skill in the art to practice the examples of this disclosure, and it is to be understood that other examples may be utilized and that process, electrical, and/or structural changes may be made without departing from the scope of the present disclosure.
Elements shown in the various figures herein can be added, exchanged, and/or eliminated so as to provide a number of additional examples of the present disclosure. In addition, the proportion and the relative scale of the elements provided in the figures are intended to illustrate the examples of the present disclosure, and should not be taken in a limiting sense. References to logical entities in the figures or specification can include embodiments and/or examples in which such entities are not identifiable as single entities as implemented, including examples in which the functions performed by the logical entities are implemented by other components or by the system as a whole.
In this description, the phrase “document” can include any tangible or on-line object with which features may be associated. Methods can include the use of textual documents, that is, documents that consist at least in part of sequences of words in a natural human language, optionally organized into structures such as sentences, paragraphs, sections, chapters, titles, and/or keywords, where features may include words, phrases, word sequences, characters, character sequences, and/or statistics computed based on such features. Features may also include information relating to the relationship of documents to one another, such as hypertext “links” specified by uniform resource locators (URLs). Textual documents can include, without limitation, web pages, newspaper and magazine articles, books, scripts, poems, scholarly papers, catalog descriptions, program guide descriptions, electronic mail (e-mail) messages, blog postings, comments on web pages, status updates and/or comments on social media sites such as Facebook®, Twitter® messages, short message service (SMS) messages, instant messaging (IM) messages, advertisements, computer program source code, computer program documentation, help files, other textual computer files, textual data in computer databases, audio transcripts, and/or depositions.
Documents may also be parts of other documents or collections of documents, where such a collection may be implied by various means such as a document and documents it refers to (e.g., a Twitter message and any web pages referred to by URLs in the Twitter message), documents that are declared or inferred to be related to one another (e.g., multiple web pages that are parts of an overarching article), or documents a user interacts with in a given session of activity. In addition, a document may be a non-textual object that has text associated with it. Examples of such non-textual objects include, without limitation, motion pictures and television shows, with associated scripts, advertising materials, audio transcripts, subtitles, reviews, program guide listings, and/or descriptive web pages on web sites such as Wikipedia and/or the Internet Movie Database (IMDb); songs, with associated lyrics and/or descriptive web pages; computer programs and/or mobile phone apps, with associated product descriptions, reviews, documentation, and/or help files; people, with associated biographies and/or descriptive web pages; and goods and services available for purchase, with associated product descriptions and/or reviews. In some examples, documents may include objects that do not have associated text but from which features may be extracted that can be associated with concepts as required and described below.
From such a document and based at least in part on features associated with it, an analysis of the document can be constructed, where the analysis is an object containing a set of concepts implied as being relevant to the document. Each concept in the analysis can be drawn from a certain (e.g., preferably large) ontology or concept base containing a set of concepts that may be relevant to different documents. In an example, the concept base is considered to be isomorphic to a subset of the set of articles in Wikipedia, with each concept identified with a Wikipedia article. Alternative examples may employ other ontologies, such as the Library of Congress, Dewey Decimal, or Readers' Guide to Periodical Literature classifications, or may employ ontologies created for the purpose of constructing such analyses. In some embodiments, the analysis may also contain a set of categories, which may be hierarchical and which can represent broad topic areas implied as being relevant to the document. In some of these examples, some or all of the concepts may be associated with one or more categories, and these pairings can be referred to as “category paths”.
In some examples, concepts, category paths, and/or categories may be associated with a numeric score or other indication of the degree to which the particular concept, category path, or category is considered to describe the document, ranging from an indication that the concept, category path, or category is merely mentioned in the document to an indication that the document is saliently “about” the concept, category path, or category.
Features, including, without limitation, words and phrases, which do not only give evidence by their presence that a concept or category is descriptive of a document but are themselves taken to refer (possibly ambiguously and/or possibly not in all cases) to concepts or categories may be considered to be “potential concept indicators”, and the process of determining concepts or categories descriptive of a document may involve determining which, if any, concepts and categories are referred to by observed potential concept indicator features. This process of determining a referent for a feature may involve a process (such as method 21414 described below with respect to
The constructed analysis may be used to facilitate many tasks related to the document. For example, it may be used to identify the document as relevant to a user's search, and/or it may be used to determine a placement of the document in an abstract storage hierarchy or on a physical storage device. It may also be used to determine a management policy to apply to the document, and it may be used to identify a user to route the document to (as, for example, by e-mail) or a user to whose attention the document's existence should be brought. The constructed analysis may also be used to identify the document as potentially interesting to a particular user so that the document may be recommended to the user. Such recommendation may take the form of selecting the document (or information related to it) for inclusion in a catalog, magazine, web page, e-mail message, or list. It may be used in the construction and modification of a profile associated with a user who interacts with the document. In such an example, the analysis, optionally along with an indication from the user of a degree to which the user found the document interesting or not, may be used to construct a profile that indicates a degree of belief that the user finds and will find interesting documents associated with certain concepts, category paths, and categories. Such a profile may be used to select other documents as interesting to the user based on the analyses constructed for the other documents.
A subset of the set of concept candidates is chosen as winning concept candidates at 106, and at 108, an analysis that includes at least one concept in the set of concepts associated with at least one of the winning concept candidates is constructed. At least a portion of the concepts associated with the winning concept candidates can be included in an analysis that is constructed at 108. The concepts included in the analysis may also include concepts not associated with concept candidates in the set constructed at 104.
Concept extractor 210 can also include a feature filter 222. Feature filter 222 can remove particular features from the plurality of features or cause multiple features in the plurality of features to be treated as a single feature. Scoring function 216 can also be included in concept extractor 210 and can assign scores to category paths based on associated evidence. These scores can indicate a degree to which a concept was believed to have been mentioned in passing in the document and/or a degree to which the document was believed to have saliently been about the concept or the concept was believed to have been a major topic of discussion in the document.
Concept extractor 210 can further include category path extractor 214 and categorizer 220. Category path extractor 214 determines a set of category paths (and the concepts included in the category paths) that apply to the document using the information about the plurality of features determined at 102 of method 100 and the associated count map, as well as a categorization determined by categorizer 220 based on the features and the count map. Category path extractor 214 also determines evidence associated with each category path. Category path extractor 214 can also model the choice of concepts as an election, in which the features are considered to be voters, and choose a set that matches evidence across the features seen as described below with reference to
Category path extractor 214 can include categorizer 220 that can use merged and deleted features to determine a categorization of the document, which contains a degree to which the document reflects each of various categories. In addition, global tables containing information for categories, concepts, and neighborhoods can be used in the construction of an analysis. A neighborhood can model the likelihood that one concept is mentioned in a document given that other concepts are mentioned, and will be further discussed with respect to
Processing system 230 includes at least one processor 232 configured to execute machine readable instructions stored in a memory system 234. Processing system 230 may also include any suitable number of input/output devices 236, display devices 238, ports 240, and/or network devices 242. Processors 232, memory system 234, input/output devices 236, display devices 238, ports 240, and network devices 242 communicate using a set of interconnections 244 that includes any suitable type, number, and/or configuration of controllers, buses, interfaces, and/or other wired or wireless connections. Components of processing system 230 (for example, processors 232, memory system 234, input/output devices 236, display devices 238, ports 240, network devices 242, and interconnections 244) may be contained in a common housing (not shown) or in any suitable number of separate housings (not shown).
Processing system 230 may execute a basic input/output system (BIOS), firmware, an operating system, a runtime execution environment, and/or other services and/or applications stored in memory 234 (not shown) that includes machine readable instructions that are executable by processors 232 to manage the components of processing system 230 and provide a set of functions that allow other programs (e.g., concept extractor 210) to access and use the components.
Processing system 230 represents any suitable processing device, or portion of a processing device, configured to implement the functions of concept extractor 210 as described herein. A processing device may be a laptop computer, a tablet computer, a desktop computer, a server, or another suitable type of computer system. A processing device may also be a mobile telephone with processing capabilities (i.e., a smart phone), a digital still and/or video camera, a personal digital assistant (PDA), an audio/video device, or another suitable type of electronic device with processing capabilities. Processing capabilities refer to the ability of a device to execute instructions stored in a memory 234 with at least one processor 232.
Each processor 232 is configured to access and execute instructions stored in memory system 234. Each processor 232 may execute the instructions in conjunction with or in response to information received from input/output devices 236, display devices 238, ports 240, and/or network devices 242. Each processor 232 is also configured to access and store data in memory system 234.
Memory system 234 includes any suitable type, number, and configuration of volatile or non-volatile storage devices configured to store instructions (e.g., concept extractor 210) and data (e.g., document 250 and analysis 260). An example of a document 250 includes input object 7102, as will be discussed further herein with respect to
The storage devices of memory system 234 represent computer readable storage media that store computer-readable and computer-executable instructions including concept extractor 210. Memory system 234 stores instructions and data received from processors 232, input/output devices 236, display devices 238, ports 240, and network devices 242. Memory system 234 provides stored instructions and data to processors 232, input/output devices 236, display devices 238, ports 240, and network devices 242. The instructions are executable by processing system 230 to perform the functions and methods of concept extractor 210 described herein. Examples of storage devices in memory system 234 include hard disk drives, random access memory (RAM), read only memory (ROM), flash memory drives and cards, and other suitable types of magnetic and/or optical disks.
Input/output devices 236 include any suitable type, number, and configuration of input/output devices configured to input instructions and/or data from a user to processing system 230 and output instructions and/or data from processing system 230 to the user. Examples of input/output devices 236 include a touchscreen, buttons, dials, knobs, switches, a keyboard, a mouse, and a touchpad.
Display devices 238 include any suitable type, number, and configuration of display devices configured to output image, textual, and/or graphical information to a user of processing system 230. Examples of display devices 238 include a display screen, a monitor, and a projector.
Ports 240 include any suitable type, number, and configuration of ports configured to input instructions and/or data from another device (not shown) to processing system 230 and output instructions and/or data from processing system 230 to another device.
Network devices 242 include any suitable type, number, and/or configuration of network devices configured to allow processing system 230 to communicate across one or more wired or wireless networks (not shown). Network devices 242 may operate according to any suitable networking protocol and/or configuration to allow information to be transmitted by processing system 230 to a network or received by processing system 230 from a network.
In constructing an analysis of a document, concepts and categories are extracted from the document.
Example method 350-1, as illustrated in
At 326, a feature set is extracted from the parsed text. A feature table (e.g., table 212), which can associate with a feature an object that maps between concepts and probabilities, can be utilized to extract the feature set from the parsed text object. The feature table (e.g., table 212) can indicate which words and phrases may be of interest to a user, and which concepts they imply with what probability. The mapping object can encode a probability that an instance of the feature implies the presence of a concept. For example, a mapping object associated with a feature can include the probability that an instance of the word or phrase within a document in some corpus (for example, Wikipedia) is text associated with a hyperlink to an article identified with a particular concept. In some cases a given feature may be associated, with different probabilities, with more than one concept. For example, “President Bush” may refer to the concept “George W. Bush” and also to the concept “George H. W. Bush.” Features can also represent words and phrases that are not associated with links to other web pages or documents. Each of the number of features is characterized based on a content of the text and a location (or locations) of each of the number of features within the parsed text.
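As a non-limiting illustration, the following Python sketch shows one way a feature table could associate features with concept probabilities and be used to find features in parsed text. The dictionary layout, the probability values, and the helper name `extract_features` are assumptions made for illustration and are not taken from table 212.

```python
# Hypothetical feature table: feature text -> {concept: probability that an
# instance of the feature refers to that concept}. All values are illustrative.
FEATURE_TABLE = {
    "president bush": {"George W. Bush": 0.7, "George H. W. Bush": 0.3},
    "basketball": {"Basketball": 0.9},
}


def extract_features(words, max_len=3):
    """Return (start, end, feature, concept_map) for every known word n-gram."""
    found = []
    for start in range(len(words)):
        # Try every n-gram of up to max_len words that begins at `start`.
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            feature = " ".join(words[start:end]).lower()
            if feature in FEATURE_TABLE:
                found.append((start, end, feature, FEATURE_TABLE[feature]))
    return found
```

In this sketch the word positions returned with each feature stand in for the location information that the characterization step may use.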
At 328, a categorization is computed for the features in the feature set. A feature set and/or document can be categorized based on the characterization of each of the number of features. For example, a web page may be determined to be about “sports” or, more specifically, “basketball.” The document may be associated with multiple categories and each such association may have a numerical strength determined. As will be further discussed herein, the document can be analyzed based on the categorization of the document, and an action can be performed based on the document analysis. In some examples, categories are not used and computing a categorization from a feature set at 328 may be omitted.
As previously noted, concepts represent topics that a document (e.g., a web page) can be “about” or that are mentioned in a document. For example, a concept can be identified with a particular Wikipedia® article. A concept can also include, but is not limited to, items in product catalogs, people in directories, web sites, books, and/or tags, among others. Each concept can have a number, and the numbers can be serially assigned. A concept can also have a name and a set of associated categories.
As will be discussed further herein, at 330, overlapping features in a feature set are removed, and a feature count map is computed at 332. Overlapping text can be removed so that each word in the text of the document is part of at most one feature. A count object can be an object that contains a count of the number of times a given feature appears and a weight based on the locations within the parsed text that the feature appears. A feature filter is applied to the feature count map at 334, and an evidence map which will be discussed further herein with respect to
An analysis object is constructed at 337, and the analysis object can include a map from category paths to evidence (e.g., an evidence map), a set of categories that pass a filter, a categorization, a feature set, input sentences, a filter result describing how the feature set was filtered, and a “scale factor” representing a score (e.g., a maximum score) for a category path, as discussed further herein.
A scoring function is applied to each piece of evidence at 338. The evidence can be scored, and this can be done after the category paths have been determined and evidence for them set. The scoring function can include a category component and a concept component. A category path filter can be applied to the category path/evidence map at 340 to determine that some extracted category paths should be excluded from the analysis. Such a determination may be based on the category paths having less than a threshold level of support or less than a threshold amount in common with other category paths in the analysis.
A categorization (e.g. a categorization computed at 328 of
A category can also be given a unique category number 456. For example, category “/Sports/Basketball” 454 may be given a category number 456 of 47, while category “/Sports/Basketball/College and University” 459 may have a different category number 458 (e.g., category number 12). Categories can be numbered sequentially with no number gaps, and categories can be located using their unique number.
A category can also have a parent category. For example, the category “/Sports/Basketball” 454 can have a parent category of “/Sports” 452, the association represented by link 457. Parent category 452 may or may not have a category number, or it may have a category number of zero, as shown at 460, which can indicate that the parent category is not a category that a categorizer can identify or recognize. For example, in
An optional forwarding category can be implemented for numerous reasons. The owner or deployer of the system may feel a decision that a concept is in a subcategory is good enough evidence that it should be considered to be in a higher category. Furthermore, a forwarding category may be more understandable. For example, “/Games/Gambling/Sports/Racing” 462 may be easier to understand as “/Sports/Horse Racing.” A forwarding category may also be used if a certain category is to be “suppressed.” When a category is suppressed, the category is not included by the system in the resulting analysis. A category can be suppressed because it is determined that the system rarely gets the category correct or because it is felt that the presence of the category in an analysis could be embarrassing to a user or a company, among other reasons. For example, category “Pornography” 466 may be a suppressed category, and this status can be indicated by means of specifying a well-known “Suppressed” category 468 as its forwarding category. In alternative examples other means may be used to identify a category as suppressed.
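By way of example only, the parent, forwarding, and suppression relationships described above can be sketched in Python as follows. The class name `Category`, its field names, and the function `resolve` are hypothetical, and using a single marker object for the “Suppressed” category is only one of the possible means of identifying suppression.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Category:
    path: str
    number: int = 0                          # 0: not recognized by the categorizer
    parent: Optional["Category"] = None
    forward_to: Optional["Category"] = None  # optional forwarding category


SUPPRESSED = Category("Suppressed")          # well-known marker category


def resolve(category):
    """Follow forwarding links; return None when the category is suppressed."""
    seen = set()
    while category.forward_to is not None and id(category) not in seen:
        seen.add(id(category))               # guard against forwarding cycles
        category = category.forward_to
    return None if category is SUPPRESSED else category
```

An analysis built on this sketch would report the resolved category in place of the original one and would omit any category that resolves to `None`.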
While
Memory-efficient mapping can occur from concepts to categories and from concepts to names using arrays.
The sizes and layout of the fields within the entries of the arrays may vary in different examples (e.g., based on the natural word size of the machine or the virtual machine presented by the implementation language or based on the number of categories present in the system). In an example in which seven bits suffice to number the categories, it is possible to encode four categories, along with a three-bit discriminant in each entry in the encoded categories array 570, while if more than ten bits are required to identify a category, only two categories may be so encoded. In some examples, some categories (e.g. more common categories) may be represented in the retrieved value using fewer bits than less common categories. In such an example, a single-bit discriminant may be used to identify the case in which the retrieved value specifies an offset and number of categories to be retrieved from the extra categories array. The remaining 31 bits may be broken up into six one-bit fields representing the presence or absence of the six most common categories (e.g., “/Regional/United States” or “/Society/Politics”), three five-bit fields, which can encode up to three categories taken from the 31 next most common categories, and one ten-bit field, encoding up to one instance of any other category. In such a way, up to ten categories may be encoded for a concept without recourse to the extra categories array, provided that all but at most one such category is among a predetermined set of 37 categories.
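The 31-bit layout described above (six one-bit fields, three five-bit fields, and one ten-bit field) can be illustrated with the following Python sketch. The bit positions, the use of zero to mark an empty field, and the identification of categories by a frequency rank are assumptions made for illustration; the fallback to the extra categories array is represented only by returning `None`.

```python
TOP = 6    # most common categories, one presence bit each (bits 0-5)
MID = 31   # next most common categories, five-bit codes 1..31 (0 = empty)


def encode(ranks):
    """Pack category frequency ranks into 31 bits; None if they do not fit."""
    word, mids, other = 0, [], []
    for r in sorted(set(ranks)):
        if r < TOP:
            word |= 1 << r                     # one-bit presence field
        elif r < TOP + MID:
            mids.append(r - TOP + 1)           # five-bit code
        else:
            other.append(r - TOP - MID + 1)    # ten-bit code
    if len(mids) > 3 or len(other) > 1 or (other and other[0] > 1023):
        return None   # a caller would fall back to the extra categories array
    for i, code in enumerate(mids):
        word |= code << (6 + 5 * i)            # bits 6-10, 11-15, 16-20
    if other:
        word |= other[0] << 21                 # bits 21-30
    return word                                # bit 31, the discriminant, stays 0


def decode(word):
    """Recover the sorted list of category ranks from a packed value."""
    ranks = [r for r in range(TOP) if word >> r & 1]
    for i in range(3):
        code = word >> (6 + 5 * i) & 0x1F
        if code:
            ranks.append(code - 1 + TOP)
    code = word >> 21 & 0x3FF
    if code:
        ranks.append(code - 1 + TOP + MID)
    return ranks
```

With this layout, six flagged categories, three five-bit categories, and one ten-bit category account for the maximum of ten categories per entry.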
To support mapping from concepts to names, and in order to decrease memory use, the system may keep the concept names in an external location such as a file and not obtain a given concept's name until the first time it is requested. However, the list of concepts may also be walked through, asking each for its name, which may cause each name to be loaded.
A concept class can include an offline string table to support the loading methods and the mapping from concepts to names.
A byte array can be constructed, and bytes can be read from the file of the character and can be used to fill the byte array. The byte array can be converted to a string, and the result can be added to the cache 692, with the position in the cache replacing a value 695 in the start array 688. In some examples, the cache 692 further includes a “trail”, which keeps track of old values of the start 688 and length 686 arrays. When the cache 692 reaches a particular size, elements can be discarded, with the information in the trail used to undo the corresponding modifications to the start 688 and length 686 arrays, returning them to their original values.
A JSON object can contain two keys, a “tweet” key 7112 and a “pages” key 7114. Either key can be absent. If the tweet key 7112 is present, it can refer to a string 7116 representing the text of a particular Twitter message, and a block can be made from its contents and added to a returned parsed text object. If the pages key 7114 is present, it can refer to a JSON array of JSON objects each descriptive of a particular web page. Each of these objects can contain associations optionally including “title,” 7106 “keywords,” 7110 “description,” and “body” 7108. Examples of a title 7106, keywords 7110, and text body 7108 are illustrated at blocks 7118, 7122, and 7120, respectively. Blocks corresponding to each of these can be seen as part of the corresponding parsed text object in block 7104. Block 7119 corresponding to the title 7118 has a block weight of 5, reflecting a decision that features contained within page titles are five times as important as features contained within similarly-sized other blocks. Similarly, block 7121 corresponding to the keywords 7122 has a block weight of 2.
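As a hedged illustration of the input handling described above, the following Python sketch converts such a JSON object into a list of weighted blocks. The title and keyword weights of 5 and 2 follow the example in the text; the weight of 1 for the other blocks, the dictionary-based block representation, and the function name `parse_input` are assumptions.

```python
import json

# Title and keyword weights follow the example above; the remaining weights
# and the block layout are assumptions made for this sketch.
BLOCK_WEIGHTS = {"title": 5, "keywords": 2, "description": 1, "body": 1}


def parse_input(raw):
    """Turn the JSON input object into a list of weighted text blocks."""
    data = json.loads(raw)
    blocks = []
    if "tweet" in data:                        # either key can be absent
        blocks.append({"kind": "tweet", "weight": 1, "text": data["tweet"]})
    for page in data.get("pages", []):
        for kind, weight in BLOCK_WEIGHTS.items():
            if kind in page:                   # every association is optional
                blocks.append({"kind": kind, "weight": weight, "text": page[kind]})
    return blocks
```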
To better support the extraction and weighting of features, the input text for a block may be split into separate sentences. This splitting may involve using a regular expression or other means to approximate the detection of human natural-language boundaries. Sentences 7124 and 7126 demonstrate two such sentences identified by splitting input text 7120. In some examples, the text of the identified sentences may be less than all of the input text for a block. Different techniques of text splitting may be employed to split different types of input text. For example, rather than splitting into an approximation of natural-language sentences, the keywords 7122 may be split as a comma-separated list resulting in the four “sentence” strings in block 7121. In some examples a piece of input text may be determined to consist of several paragraphs, sections, or other structures and multiple blocks may be created corresponding to the different parts. In some examples, markup tags, such as those used in Hyper-Text Markup Language (HTML) or Extensible Markup Language (XML) may be used to determine sentence or other structure boundaries.
In some examples, the text may be transformed before or after it is split. For example, if the text contains HTML entities, these entities may be converted into the characters or strings they encode, such as by replacing “&amp;” with an ampersand or “&lt;” with a less-than sign. In examples in which the input contains HTML or XML markup, such markup may be removed. In some examples text may be removed as unlikely to provide useful features. This removal may be based on the recognition of a pre-determined list of strings (e.g., “Follow us on Twitter”), by one or more patterns, or by other means.
In some examples, the body text (with or without markup) of a web page may be analyzed to distinguish text considered to be the page's actual content from text determined to be advertising, navigational links, boilerplate, links to other articles, comments, etc., with some of these classes of text being omitted from the resulting parsed text 7104. To try to distinguish content text from framing text, rules may be used to identify and omit text that is considered unlikely to represent natural language sentences. For example, a putative sentence may be omitted if it contains fewer than 20 characters or more than 500 characters or if it contains fewer than two sequences of spaces, indicative of word breaks. In some examples, there may be a maximum number of sentences that a block can contain or other similar limits on the amount of text processed or the number of blocks in a parsed text object.
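The entity conversion, sentence splitting, and sentence filtering rules described above may be sketched in Python as follows. The regular expression used to approximate sentence boundaries and the function names are assumptions; the thresholds of 20 characters, 500 characters, and two sequences of spaces are the ones given in the example.

```python
import html
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # rough sentence boundary (assumed)
SPACE_RUN = re.compile(r" +")


def split_sentences(text):
    """Decode HTML entities, split into rough sentences, drop unlikely ones."""
    text = html.unescape(text)
    kept = []
    for sentence in SENTENCE_END.split(text.strip()):
        if not 20 <= len(sentence) <= 500:
            continue                          # too short or too long
        if len(SPACE_RUN.findall(sentence)) < 2:
            continue                          # fewer than two word breaks
        kept.append(sentence)
    return kept


def split_keywords(text):
    """Treat a comma-separated keyword list as one 'sentence' per keyword."""
    return [part.strip() for part in text.split(",") if part.strip()]
```

A fuller implementation might also remove markup and strings from a pre-determined list before applying these rules.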
As discussed with respect to
To facilitate the efficient recognition of a very large number of potential features, each of the substrings of text represented by an n-gram is converted to a number by a hashing function. In the example, a Mapped Additive String Hashing (MASH) algorithm described in George Forman and Evan R. Kirshenbaum, “Method and System for Processing Text,” U.S. application Ser. No. 12/570,309 (filed Sep. 30, 2009), and/or George Forman and Evan Kirshenbaum, “Extremely Fast Text Feature Extraction for Classification and Indexing”, CIKM '08 can be used. In other examples, strings may be used directly or other hashing methods may be used. Examples of such other hashing methods include, but are not limited to, linear congruential hashes, Rabin fingerprints, and/or cryptographic hashes such as the various message digest algorithms (e.g., MD5) or secure hashing algorithms (e.g., SHA-1).
Returning now to
Within n-grammer 8128 in the example is a mapping array 8129 used to control the MASH hashing algorithm. The array 8129 contains one 64-bit entry for each character in the system's character set. In an alternative example, other numbers of bits may be used. Each character that is to be considered part of a word is associated with a substantially uniformly distributed number, as would be generated by a pseudorandom number generator seeded with a predetermined seed value, with the restriction that if two characters are to be considered equivalent, they are associated with the same value. In the example, uppercase and lowercase letters are considered equivalent, so the array entries associated with “E” 8133 and “e” 8132 contain the same value 8130. Similarly, the presence or absence of accent marks or other diacritics is considered insignificant, so the array entries for “e” 8132 and “é” 8134 contain the same value 8130. In the example, the characters that can be parts of words include letters, numbers, hyphens, slashes, and ampersands. Furthermore, in the example, periods 8138 are considered to be insignificant (e.g., allowing “U.S.A.” and “USA” to be treated as equivalent). This can be signaled by the presence of a predefined “IGNORED” value 8136, different from all word-character values.
Characters that are not intended to be considered as parts of words, such as commas 8142, are associated with a predefined “NON-WORD” value 8140, different from all word-character values. To enumerate all of the n-grams 8144 within an input text 8160, the n-grammer 8128 first enumerates all of the words and keeps track of their starting position, ending position, and hash. To detect and compute the hash for a word using the MASH algorithm, a 64-bit accumulator can be initialized to zero. For each character in the input text, the character is looked up in the mapping array and the associated mapped value is noted. If the mapped value is the NON-WORD value 8140 or if there are no more characters, the current word, if any, has ended. If the accumulator has a value of zero, there was no current word; otherwise, the current word is noted as a word running to the current character's position. In either case, the accumulator is then reset to zero, the current character's position is taken to be the start of the next word, and the next character is processed. If the mapped value is the IGNORED value 8136, the next character is processed. Otherwise, the accumulator is modified by computing a value based on the current value of the accumulator and the mapped value (e.g., by rotating the current value of the accumulator and adding in the mapped value). Once the words are enumerated, n-grams 8144 are constructed from sequences of words up to some maximum length, where the hashes 8146 of multiword n-grams 8144 are computed by combining the hashes of the successive words they contain. In an example, this combination is performed by a different algorithm than was used to form the hashes of the individual words (e.g., by rotating the current value of the accumulator and XORing the hash of the next word).
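The word and n-gram hashing steps described above can be illustrated with the following Python sketch. It is a simplified illustration in the spirit of the description and is not the MASH algorithm of the cited references: the per-character values are derived from a hypothetical seed rather than stored in a mapping array, the IGNORED and NON-WORD markers are sentinel objects rather than reserved numeric values, and the rotation amount of one bit is an assumption.

```python
import random
import unicodedata
from functools import lru_cache

MASK = (1 << 64) - 1
IGNORED, NON_WORD = object(), object()   # sentinels standing in for 8136 and 8140
SEED = 12345                             # hypothetical predetermined seed


@lru_cache(maxsize=None)
def mapped(ch):
    """Map a character to a 64-bit value, or to one of the two sentinels."""
    if ch == ".":
        return IGNORED
    base = unicodedata.normalize("NFD", ch)[0].lower()   # fold case and accents
    if not (base.isalnum() or base in "-/&"):
        return NON_WORD
    return random.Random(SEED * 0x110000 + ord(base)).getrandbits(64)


def rotl(x):
    """Rotate a 64-bit value left by one bit."""
    return ((x << 1) | (x >> 63)) & MASK


def words(text):
    """Yield (start, end, hash) for each word in the text."""
    acc, start = 0, 0
    for pos, ch in enumerate(text + ","):    # trailing non-word char flushes
        value = mapped(ch)
        if value is NON_WORD:
            if acc:                          # zero accumulator: no current word
                yield start, pos, acc
            acc, start = 0, pos + 1
        elif value is not IGNORED:
            acc = (rotl(acc) + value) & MASK   # rotate, then add mapped value


def ngrams(text, max_words=3):
    """Yield (start, end, word_count, hash) for n-grams up to max_words long."""
    ws = list(words(text))
    for i in range(len(ws)):
        acc = 0
        for j in range(i, min(i + max_words, len(ws))):
            acc = rotl(acc) ^ ws[j][2]         # rotate, then XOR next word hash
            yield ws[i][0], ws[j][1], j - i + 1, acc
```

Because equivalent characters share a value and periods are skipped, this sketch gives “U.S.A.” and “usa” the same word hash.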
In alternative examples, each uniform lookup table 9170 has its own associated decoder 9171. In other examples, the uniform map set contains a single uniform lookup table 9170 used for n-grams 8144 of any length. In further alternative examples, other mechanisms are used for the implementation of a feature table (e.g., table 212). Such other mechanisms may include hash tables, associative maps, parallel arrays, B-trees, or databases.
Each uniform lookup table 9170 contains parallel arrays of keys 9166 and values 9172, where the value at a particular index in the value array 9172 corresponds to the key at the same index in the key array 9166 and the elements of the key array are stored in a sorted order. In the example, the keys 9166 are stored in ascending numeric order. A uniform lookup table 9170 provides the ability to determine whether a particular value is a key in the key array 9166, to return the index in the key array 9166 of a value if it exists there, and to return the number at a particular index in the value array 9172.
To determine the index of a number in the key array 9166, a variant of the binary search algorithm can be used. In this variant, the probe point at each iteration is chosen to be
$$\mathrm{low} + (\mathrm{high} - \mathrm{low}) \cdot \frac{H_{\mathrm{target}} - H_{\mathrm{low}}}{H_{\mathrm{high}} - H_{\mathrm{low}}}$$
where low and high are the current bounds on the range being searched, $H_{\mathrm{target}}$ is the value being looked up, and $H_{\mathrm{low}}$ and $H_{\mathrm{high}}$ are the values at positions low and high, respectively, in the key array 9166. In alternative examples, binary search, linear search, or other methods may be used instead of this algorithm. In the example illustrated in
To look up an n-gram (e.g., n-gram 8144), the uniform map set 9164 can obtain the number of words (e.g., 8150) in the n-gram (e.g., n-gram 8144) and can use that as an index into its array of uniform lookup tables. If a corresponding uniform lookup table 9170 exists, it then asks the uniform lookup table 9170 to look up the n-gram's hash (e.g., hash 8146). In this manner, it can determine whether it contains an entry corresponding to the n-gram (e.g., n-gram 8144), and it can also use the index returned by the uniform lookup table 9170 to retrieve, at that time or later, the value associated with the n-gram (e.g., n-gram 8144). To retrieve the value, it identifies the uniform lookup table 9170 associated with the n-gram's (e.g., n-gram 8144) number of words 8150, and obtains from that uniform lookup table 9170 the numeric value associated with the index. It then uses the decoder 9171 to convert this numeric value into a value in the uniform map set's 9164 range type.
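The lookup path described above may be sketched in Python as follows. This is an illustrative sketch only: the class and method names are hypothetical, the per-length tables are held in a dictionary rather than an array, and the probe point is assumed to be a linear interpolation between the key values at the current bounds, rounded down by integer division.

```python
class UniformLookupTable:
    """Parallel arrays of sorted keys and their values (names are hypothetical)."""

    def __init__(self, keys, values):
        pairs = sorted(zip(keys, values))
        self.keys = [k for k, _ in pairs]      # ascending numeric order
        self.values = [v for _, v in pairs]

    def index_of(self, target):
        """Return the index of target in the key array, or -1 if absent."""
        low, high = 0, len(self.keys) - 1
        while low <= high:
            h_low, h_high = self.keys[low], self.keys[high]
            if target < h_low or target > h_high:
                return -1
            if h_low == h_high:
                probe = low
            else:
                # Assumed probe rule: interpolate between the bounding key values.
                probe = low + (target - h_low) * (high - low) // (h_high - h_low)
            if self.keys[probe] == target:
                return probe
            if self.keys[probe] < target:
                low = probe + 1
            else:
                high = probe - 1
        return -1


class UniformMapSet:
    """One lookup table per n-gram word count, plus a decoder for stored values."""

    def __init__(self, tables, decoder):
        self.tables = tables      # dict: word count -> UniformLookupTable
        self.decoder = decoder    # callable: stored number -> range-type value

    def find(self, word_count, ngram_hash):
        table = self.tables.get(word_count)
        return -1 if table is None else table.index_of(ngram_hash)

    def value(self, word_count, index):
        return self.decoder(self.tables[word_count].values[index])
```

Because the keys are hashes that are substantially uniformly distributed, an interpolated probe of this kind tends to land near the target in few iterations, which is one plausible reason to prefer it over a midpoint probe.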
After the n-grams (e.g., n-gram 8144) are enumerated by the n-grammer (e.g., n-grammer 8128), they are looked up in the feature table's (e.g., feature table 212) uniform map set 9164. For any which are found, a feature is created, which contains the n-gram (e.g., n-gram 8144) and the index corresponding to the n-gram (e.g., n-gram 8144) in the corresponding uniform lookup table 9170 in the uniform map set 9164. In the example, these features are associated with the sentences within the parsed text (e.g., text 7104) they are found in to form the feature set extracted at 326 in
Each feature is associated with a mapping, which can be referred to as a feature record, that maps between concepts and probabilities and gives an estimate of the likelihood that an occurrence of a given feature should be taken as implying the existence of a reference to a given concept. Such an estimate may be made based on the fraction of times the corresponding text was used in a given corpus in a way determined to be a reference to the concept. In an example in which the underlying corpus is Wikipedia and concepts are identified with Wikipedia articles, the estimate may be based on the fraction of times that the text associated with the feature, when occurring within Wikipedia, is contained within a hyperlink that points to the article associated with a particular category.
When creating the feature record for feature 10175, a value 10167 is retrieved from the uniform map set (e.g., uniform map set 9164) and interpreted by a decoder 10187 (e.g., decoder 9171 as illustrated in
To interpret a probability value, the probability value is divided by the multiplier, so the probability in feature record 10188 is interpreted as being 52%. In alternative examples, either the threshold value or the multiplier may be a number other than 200, and they may differ from one another. In alternative examples, the mapping between concepts and probabilities may be implemented in different ways, including, without limitation, having the internal concept array 10190 contain references to concept objects rather than concept numbers, having the probability array 10192 contain probability numbers directly rather than multiplied by a multiplier, using a single array of mapping objects, using lists rather than arrays, using a map or hash table rather than parallel arrays, or using a specialized object for the case in which there is only a single concept in the mapping.
When creating the feature record for feature 10177, a value 10169 is retrieved from the uniform map set (e.g., uniform map set 9164) and interpreted by the decoder as a concept/offset value of 12,148 and a probability/length value of 205. Since the probability/length value is greater than the threshold value, the threshold value is subtracted from it and the result, 5, is interpreted as a length value, with the concept/offset value being interpreted as an offset value. The decoder then uses the offset value as an index into its concept probability table 10191 and considers the range 10189 of entries starting at this index and extending based on the length value as referring to feature 10177.
The entries in the concept probability table are interpreted as concept values and probability values as described above. In some examples, probability values are constrained to be less than or equal to the threshold value, while in alternative examples, entries with probability values greater than the threshold value are interpreted recursively as offset values and length values and the corresponding sequences of concepts and probabilities are interpolated. The decoder creates feature record 10172 with an internal concept array 10174 containing concept values from the entries in the range and a parallel internal probability array 10173 containing probability values (e.g., 84, at 10181) from the entries in the range. When interpreting the mapping, each numbered concept mentioned is implied with the probability indicated by the corresponding probability value. For example, concept 1,875 in box 10178 is implied by feature 10177 with a probability of 21%, computed by taking the number 42 in box 10180 and dividing by the multiplier, 200. In the example, the parallel concept and probability arrays are ranked by probability, with the most probable association listed first. In alternative examples, the arrays are in some other order or in no particular order. In further alternative examples, the concept probability table in the decoder does not ensure that the resulting ranges will be in the correct order and the decoder sorts the arrays to put them in the proper order.
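The two interpretations of a retrieved value may be illustrated by the following Python sketch. It assumes that the retrieved value has already been split into its concept/offset field and its probability/length field, that the concept probability table is a list of (concept, probability) pairs, and that both the threshold value and the multiplier are 200. The function name is hypothetical.

```python
THRESHOLD = 200    # values above this are length values, not probabilities
MULTIPLIER = 200   # stored probability = probability * MULTIPLIER

def decode_feature_record(concept_or_offset, prob_or_length, concept_table):
    """Return a list of (concept, probability) pairs for one feature.

    concept_table is a hypothetical stand-in for the decoder's concept
    probability table: a list of (concept, stored_probability) tuples.
    """
    if prob_or_length <= THRESHOLD:
        # Single-concept case: the fields are a concept and a probability.
        entries = [(concept_or_offset, prob_or_length)]
    else:
        # Multi-concept case: the fields are an offset and a length.
        length = prob_or_length - THRESHOLD
        entries = concept_table[concept_or_offset:concept_or_offset + length]
    return [(concept, stored / MULTIPLIER) for concept, stored in entries]
```

For instance, a probability/length value of 205 would be read as a length of 5, and the five table entries starting at the offset would form the feature record. The recursive interpretation mentioned for alternative examples is not covered by this sketch.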
In addition to being able to enumerate its features, a feature set 11194 can return a feature count map (e.g., as illustrated at 332 of
where w is the block weight, l is the block length of the block of sentence sets that the sentence appears in, and the constants are chosen to give a minimum sentence weight of 0.05w for a sentence in a very long block and maximum sentence weight of 0.8w for a sentence in a one-sentence block. In alternative examples, other functions and constants can be used to determine sentence weights. In some alternative examples, different blocks (e.g., blocks created as the result of processing different parts of the input object 7102) may compute sentence weights by different means. In some alternative examples, different sentences within the same block may be associated with sentence weights computed by different means. For example, the first sentence in a block may have constants chosen to weight it higher than subsequent sentences in the block. Alternatively, the function for computing the sentence weight may take into account the ordinal position of the sentence in the block or the block in the parsed text object (e.g., object 7104). In some examples, when constructing a feature count map 11196, some features (e.g., features designated as “filter only”, as described below with respect to
In the example, categorizer 13238 contains an array of category score thresholds 12240, one per category with non-zero category number. In an alternative example, categorizer 13238 may contain a single category score threshold used for all categories or such a category score threshold may be used implicitly. In further alternative examples, there may be several classes of categories, with categorizer 13238 containing or implicitly using different category score thresholds for categories in different classes. For example, there may be one category score threshold value used for all categories deemed to be regional categories and a second category score threshold value used for all categories deemed to be non-regional categories.
From a categorization (e.g., categorization 12226), and in alternative examples from categorizer 13238, it may be possible to obtain a measure for a category, based on the score value associated with the category by the categorization (e.g., categorization 12226) and the category score threshold associated with the category by categorizer 13238, of a degree to which the score value exceeds the category score threshold. In an example, this measure is the ratio of the score value to the category score threshold. In alternative examples, other measures may be used, including, without limitation, the arithmetic difference between the score value and the threshold, the arithmetic difference or ratio of a numerically-adjusted (e.g., by taking a logarithm or other function) score value and the threshold, or the value obtained by treating the threshold value as the mean of a Gaussian probability distribution and computing the cumulative distribution function of this probability distribution up to a point specified by the score value.
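Three of the measures named above may be sketched in Python as follows. The function name and the `method` parameter are hypothetical, and the standard deviation `sigma` of the Gaussian distribution is an assumption, since the description specifies only its mean.

```python
import math

def exceedance(score, threshold, method="ratio", sigma=1.0):
    """Measure how far a category score exceeds its category score threshold."""
    if method == "ratio":
        return score / threshold
    if method == "difference":
        return score - threshold
    if method == "gaussian_cdf":
        # CDF of a Gaussian with mean = threshold, evaluated at the score.
        z = (score - threshold) / (sigma * math.sqrt(2.0))
        return 0.5 * (1.0 + math.erf(z))
    raise ValueError(f"unknown method: {method}")
```

With the ratio measure, a value above 1 indicates that the score exceeds the threshold, while with the Gaussian measure the corresponding dividing value is 0.5.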
The categorizer 13238 can also include a uniform map set 13242 that maps features to weight sets, where a weight set is an association between categories in a subset of the set of categories and floating-point weights indicative of the likelihood that a document containing a given feature should be considered to be described by a given category. The uniform map set 13242 may be implemented in the same manner as the uniform map set 9164 associated with feature table 212, described above with respect to
In the example, a decoder 13239 associated with uniform map set 13242 contains an array 13246 of encoded weight associations, an array 13252 of offsets (or “starts”) into the array 13246 of encoded weight associations, an array 13254 of lengths of ranges within the array 13246 of encoded weight associations, a minimum weight 13256, and a maximum weight 13258. To construct a weight set associated with a given feature, that feature's n-gram is looked up in uniform map set 13242, which results in a numeric value being converted to a weight set by the decoder. To do this in the example, the decoder treats the numeric value as an index into both the array 13252 of offsets and the array 13254 of lengths, which together reference values that define a range 13241 of entries in the array 13246 of encoded weight associations. Each entry in this range is then interpreted as a bit-field containing a category number 13248 and a bit-field containing an encoded weight 13250. The encoded weight may be the desired weight scaled such that a first threshold encoded weight (e.g., the maximum possible encoded weight 13250) value corresponds to a first threshold weight (e.g., the decoder's maximum weight 13258), and a second threshold encoded weight (e.g., the minimum possible encoded weight 13250) corresponds to a second threshold weight (e.g., the decoder's minimum weight 13256). The weight may be determined by dividing the encoded weight 13250 by a scale factor equal to the difference between the threshold encoded weights (e.g., the maximum and minimum possible encoded weights 13250) divided by the difference between the threshold weights (e.g., the maximum weight 13258 and the minimum weight 13256) and then adding in the second threshold weight (e.g., minimum weight 13256).
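The weight decoding described above may be sketched in Python as follows. The bit layout (a 16-bit encoded weight in the low bits, with the category number in the remaining high bits), a minimum possible encoded weight of zero, and all class and function names are assumptions made for illustration.

```python
WEIGHT_BITS = 16                      # assumed width of the encoded-weight bit-field
ENC_MAX = (1 << WEIGHT_BITS) - 1      # maximum possible encoded weight
ENC_MIN = 0                           # assumed minimum possible encoded weight

def pack(category, encoded_weight):
    """Build one encoded weight association (hypothetical helper)."""
    return (category << WEIGHT_BITS) | encoded_weight

class WeightDecoder:
    def __init__(self, associations, starts, lengths, min_weight, max_weight):
        self.associations = associations   # array of encoded weight associations
        self.starts = starts               # offsets into the associations array
        self.lengths = lengths             # lengths of ranges in that array
        self.min_weight = min_weight
        self.max_weight = max_weight

    def weight_set(self, value):
        """Convert a numeric value from the uniform map set into {category: weight}."""
        start, length = self.starts[value], self.lengths[value]
        scale = (ENC_MAX - ENC_MIN) / (self.max_weight - self.min_weight)
        result = {}
        for entry in self.associations[start:start + length]:
            category = entry >> WEIGHT_BITS
            encoded = entry & ENC_MAX
            result[category] = encoded / scale + self.min_weight
        return result
```

Under this layout, an encoded weight of `ENC_MAX` decodes to the maximum weight and an encoded weight of zero decodes to the minimum weight.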
In an alternative example, the decoder contains the scale factor rather than the first threshold weight (e.g., maximum weight 13258). In alternative examples, the decoder may use other means to represent the mapping between features and weight sets and/or between categories and weights within a weight set. In some alternative examples, rather than using an array 13246 of encoded weight associations, the decoder may use two parallel arrays of category numbers (or other means of referring to categories) and weight values (or values from which weight values may be determined). In some alternative examples, the decoder may contain a single array containing references to objects, each of which contains information sufficient to create or identify a single weight set.
To compute the categorization (e.g., categorization 12226) in the example, categorizer 13238 first creates a new categorization (e.g., categorization 12226) with each category in the categorization (e.g., categorization 12226) associated with a category score of zero. In alternative examples, other initial values may be used and these values may differ from category to category. A feature set (e.g., feature set 12234) is then asked to create a feature count map (e.g., map 11196, as described above with respect to
In alternative examples, other forms of adjustment, including dividing by the sum of the feature weights associated with all features in the feature count map (e.g., map 11196), or no adjustment may be used. In alternative examples, the feature count associated with each feature in the feature count map (e.g., map 11196) may be used instead of the feature weight. The weight set, if any, associated with the feature is then obtained from uniform map set 13242. If an associated weight set exists, for each category in the weight set, the associated weight is multiplied by the adjusted feature weight and the resulting value is added to the score associated with the category in the categorization (e.g., categorization 12226). In alternative examples, other methods of categorization may be used to create the categorization (e.g., categorization 12226) including, without limitation, Naïve Bayes methods, Term Frequency*Inverse Document Frequency (TF*IDF) methods, and Support Vector Machines (SVM) methods.
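The score accumulation described above may be sketched in Python as follows. The sketch assumes the alternative adjustment in which each feature weight is divided by the sum of all feature weights, and it represents the feature count map and the weight sets as plain dictionaries. The function and parameter names are hypothetical.

```python
def categorize(feature_weights, weight_sets, categories):
    """Accumulate category scores from weighted features.

    feature_weights: {feature: feature weight from the feature count map}
    weight_sets:     {feature: {category: weight}}, absent if no weight set exists
    categories:      iterable of all category identifiers
    """
    scores = {category: 0.0 for category in categories}   # initial scores of zero
    total = sum(feature_weights.values())
    for feature, weight in feature_weights.items():
        # Assumed adjustment: normalize by the sum of all feature weights.
        adjusted = weight / total if total else 0.0
        for category, category_weight in weight_sets.get(feature, {}).items():
            scores[category] = scores.get(category, 0.0) + category_weight * adjusted
    return scores
```

A feature with no associated weight set contributes to the normalizing sum but adds nothing to any category score.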
The feature set (e.g., set 11194) may include features that textually overlap. For instance, a sentence containing, “Barack Obama's cabinet” may have features matching “Barack,” “Obama,” “Barack Obama,” and “Obama's cabinet.” In some examples, it is desirable to remove features from the feature set (e.g., set 11194 and at 330 in
At 14282, each feature priority object 14260 in the array is considered and loop 14283 is performed, focusing on that feature priority object 14260. At 14284, slots corresponding to positions from the start index 14266 (inclusive) to the end index 14268 (exclusive) of the feature priority object 14260 are checked, these positions reflecting the positions of the words of the feature 14262 associated with the feature priority object 14260. If any of these array slots contain true values, a more-preferred feature has been chosen that overlaps with the feature 14262 associated with the current feature priority object 14260, and control passes to block 14289 and the next iteration of loop 14283. In this way, such a feature is removed from the weighted feature list since it was removed at 14280 and not added back. If none of the slots contain true values, the feature 14262 associated with the current feature priority object 14260 is added back to the weighted feature list at 14286, and each slot in the array considered at 14284 is set to a true value at 14288. Control then passes to block 14290 and the next iteration of loop 14283. When there are no more feature priority objects 14260 in the array, loop 14283 terminates and control passes to block 14291 and the next iteration of loop 14273.
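The slot-marking loop described above may be sketched in Python as follows. Features are represented as (text, start index, end index) tuples over word positions, with the end index exclusive. The ordering used here, longer features first and then earlier features, is an assumption standing in for whatever preference order the feature priority objects are sorted by.

```python
def remove_overlaps(features, word_count):
    """Keep the most-preferred features that do not overlap in word positions."""
    occupied = [False] * word_count          # one slot per word position
    # Assumed preference: longer features first, then earlier start positions.
    ordered = sorted(features, key=lambda f: (-(f[2] - f[1]), f[1]))
    kept = []
    for text, start, end in ordered:
        if any(occupied[start:end]):
            continue                         # overlaps a more-preferred feature
        kept.append((text, start, end))      # add back to the weighted feature list
        for position in range(start, end):
            occupied[position] = True
    return kept
```

Under this assumed ordering, a feature that shares any word position with a longer feature already chosen is dropped, while features covering untouched positions are retained.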
For example, an article may use a person's full name once (e.g., “Michelle Obama”), and then switch to using a shorter form (e.g., “Obama”) as the article progresses. In this example, a page about Michelle Obama may have one or two mentions of “Michelle Obama” and twelve mentions of “Obama,” both of which would show up as features. However, the feature “Obama” on its own may be considered by the system to be more likely to refer to Barack Obama than to Michelle Obama. This may lead the concept extractor (e.g., extractor 210) to erroneously conclude that a page is about Barack Obama. The feature filter (e.g., filter 222) can be used to properly identify names in text, and the feature filter can merge features that consist of a single word into longer features for which the single word is the first or last word. The feature filter (e.g., filter 222) can also take into account prefixes (e.g., titles) and suffixes.
For example, it may decide that references to “Mrs. Obama” should also be merged into those for “Michelle Obama”, even though the former is not a substring of the latter. The feature filter (e.g., filter 222) may also be able to determine that the feature should be discarded as being unlikely to refer to any of the concepts it knows about. For example, if a web page contains references to “Obama” and “Mr. Obama”, both recognized as features known in a feature table (e.g., table 212), the system might be led to conclude that they referred to the concept “Barack Obama”, even though “Barack Obama” is not seen. But if there is a mention of “Joe Obama” in the text, not recognized as a feature (since not in feature table 212), these features may be discarded, as they likely actually refer to Joe Obama, who is not a concept the system knows about. In some examples, the feature filter (e.g., filter 222) may be composed of multiple feature filters. In some examples, the feature filter (e.g., filter 222) may make use of information not contained within the feature count map in making its determinations.
To perform this merging of different ways of referring to named entities, the example feature filter (e.g., filter 222) contains a map from strings to named entity objects representing features determined by the feature filter (e.g., filter 222) to refer to the same named entity. In the example, a named entity object contains a collection of features identified as referring to it, with one of those features identified as being its primary feature. It also contains a set of named entities identified as being its “super-names”, named entities that are longer and may refer to the same concept. It further contains an indication of whether it is a single-word named entity and, if not, its first and last words.
At 15292, each feature in the feature count map is considered and loop 15293 is performed with respect to it. At 15298, the canonical form (e.g., form 8158) of the feature's n-gram (e.g., n-gram 8144) is obtained. In the example, the canonical form is computed based on the sequence of characters covered by the n-gram (e.g., n-gram 8144) in an underlying string (e.g., underlying string 8152), and this underlying string is taken from the sentence in the parsed text object (e.g., parsed text object 7104). Initial and final sequences of characters considered to be non-word characters by the n-grammer (e.g., n-grammer 8128) in a feature table (e.g., feature table 212) are removed. Other maximal sequences of non-word characters are replaced by single spaces. Characters considered to be ignored characters by the n-grammer (e.g., n-grammer 8128) are removed. Letters are converted to their lowercase forms and unaccented characters replace accented characters. At 15302, the canonical form of the n-gram (e.g., n-gram 8144) is split into words to yield an array of strings representing the individual words of the feature.
At 15304, this array of words is analyzed and a subset, which need not be proper, of these words is identified as the “core” of the feature. In an example, the array is scanned from the beginning, and each word is checked against a set of (canonicalized) words considered to be prefixes, including titles (e.g., “dr”, “senator”, etc.) and articles (e.g., “the”, “a”, “an”, etc.), identifying matched words as not being part of the core until a word is found that is not in the set. In an example, the array is scanned from the end, and each word is checked against a set of words (e.g., canonicalized words) considered to be suffixes, including, but not limited to, “st”, “ave”, “jr”, and/or “md”, identifying matched words as not being part of the core until a word is found that is not in the set. In such examples in which the n-grammer (e.g., n-grammer 8128) considers the apostrophe character to be a non-word character, the set of suffixes may contain “s”, to allow, e.g., “Barack Obama” to be considered to be the core of “Barack Obama's” (which canonicalizes to “barack obama s”). In some examples, processing of suffixes may stop once the scan moves to words previously identified as prefixes.
In alternative examples, words from the middle of the string (e.g., words identifiable as middle initials or nicknames) may be identified as not being part of the core. In some examples, information other than the canonical form of the words may be used to identify words to be excluded from the core. In some such examples, the underlying string (including factors such as capitalization and punctuation) may be used. The remaining words are identified as the core of the feature. For example, “The Reverend Dr. Martin Luther King, Jr.'s” may be determined to have a core of “Martin Luther King,” and “Rev. King” may similarly be determined to have a core of “King.” In some examples, if the determined core is empty (e.g., because all words have been determined to be non-core words), the entire initial array of words may be considered to be the core. In some examples, words may be replaced by equivalent words. For example, in examples in which “&” is a possible word, it may be replaced by “and” to allow, e.g., “Tom & Jerry” and “Tom and Jerry” to be determined to have an identical core of “tom and jerry”. In some examples such substitutions may include the replacement of nicknames such as “Bobby” by more commonly official names such as “Robert”. In some examples, stemming algorithms may be used to transform words. In further examples, words or sequences of words determined to be in one language may be replaced by translations into another language.
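The canonicalization and core identification described above may be sketched in Python as follows. The prefix and suffix sets shown are small illustrative samples rather than the sets used in the example, the word-character class is limited to unaccented letters, digits, hyphens, slashes, and ampersands, and the function names are hypothetical.

```python
import re
import unicodedata

PREFIXES = {"the", "a", "an", "dr", "rev", "reverend", "senator", "mr", "mrs"}
SUFFIXES = {"st", "ave", "jr", "md", "s"}   # "s" handles possessives

def canonical_words(text):
    """Lowercase, strip accents, drop periods, and split on non-word characters."""
    decomposed = unicodedata.normalize("NFD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    without_ignored = unaccented.lower().replace(".", "")
    # Runs of non-word characters become single spaces; split() trims the ends.
    return re.sub(r"[^0-9a-z\-/&]+", " ", without_ignored).split()

def core_words(words):
    """Strip leading prefix words and trailing suffix words to find the core."""
    start = 0
    while start < len(words) and words[start] in PREFIXES:
        start += 1
    end = len(words)
    # The suffix scan stops once it reaches words already identified as prefixes.
    while end > start and words[end - 1] in SUFFIXES:
        end -= 1
    core = words[start:end]
    return core if core else list(words)    # an empty core falls back to all words
```

Word substitution (such as replacing “&” by “and”), nickname replacement, stemming, and translation are not covered by this sketch.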
At 15306, the text of the core is used as a key to find a named entity in the feature filter's named entity map. If no such named entity is found, one may be created based on the core text and associated with the core text. The current feature is then added to the named entity's set of features, and control passes to the next iteration of loop 15293 at 15307. In some examples, when a new named entity is to be created, a check is made to see whether the first word of the core is one of a small set of words that have been found to cause problems at the beginning. Similar tests can be made for the last word being disallowed at the end and for any word being disallowed in the middle. If any of these tests pass, the named entity can be considered to have stopwords. For example, “state” may be disallowed at the end because otherwise “Washington” would be seen as an alias for “Washington State,” when these may refer to two different schools. Similarly, “west” may be disallowed at the beginning to avoid “Virginia” being seen as an alias for “West Virginia” and words like “and” and “in” may be disallowed in the middle.
When loop 15293 terminates, at 15294 for each named entity in the named entity map that is not considered to be a single-word named entity, loop 15295 is performed. At 15308, the named entity checks to see whether the named entity map contains named entities associated with either its first or last words. For any such matching named entities, the current named entity is added to the matching named entity's collection of super-names, and control passes to the next iteration of loop 15295 at 15309. In some examples, if the named entity has been determined to have stopwords, it does not perform the check at 15308. In some examples, the named entity keeps track of whether it has stopwords at the beginning or the end and only skips checking for named entities corresponding to its first (respectively, last) word if it has stopwords at the beginning (respectively, end). In alternative examples, the named entity may check for named entities matching longer or other sequences of words within the core of the feature that was responsible for its creation.
When loop 15295 terminates, at 15296 for each named entity in the named entity map, loop 15297 is performed. At 15310, a determination is made as to whether the named entity contains a single super-name. If this is the case, at 15312 that super-name is set up as an alias target as described below. Then, at 15314, the count objects associated in the feature count map 11196 with each of the current named entity's features are added (e.g., by adding counts and weights) to the count object associated in the feature count map 11196 with the super-name's primary feature. Finally, control passes to the next iteration of loop 15297 at 15324.
An example method for setting up a named entity as an alias target, at 15312, is shown in inset 15319. At 15318, one of the named entity's features is chosen as its primary feature. If a primary feature was previously identified for the named entity, subsequent procedures of the method may be omitted. If the named entity has only one feature, it is selected and the subsequent procedures of the method may be omitted. If there is a feature whose text exactly matches the core text which led to the named entity's creation (e.g., without prefix or suffix words having been removed and without transformation), that feature is chosen. Otherwise, the feature with the highest count value associated with it in the feature count map (e.g., map 11196) is chosen. If there is no exact match and more than one feature has the highest count value, one is chosen arbitrarily. In alternative examples, other criteria may be used for choosing the primary feature. In some examples, the chosen primary feature may not be one of the named entity's features. At 15320, a new count object is created, and the count objects associated in the feature count map (e.g., map 11196) with all of the named entity's features are added to it and removed from the feature count map (e.g., map 11196). This combines the count and weight information for all features that have a common core. At 15322, the newly-created count object is associated in the feature count map with the named entity's primary feature.
Returning to 15310, if the determination is made that the named entity does not contain a single super-name, there are two possibilities: either it contains no super-names or it contains more than one super-name. In either case, at 15316, the named entity is set up as an alias target as described above to merge information from all features that have a common core, and control passes to the next iteration of loop 15297 at 15324. In an alternative example, when it is determined that there is more than one super-name, method 15290 may attempt to identify one of the super-names as more likely, for example, by noting that one is associated with substantially higher counts than the others or by noting that one is associated with concepts or categories that have substantially more support than others.
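The grouping, super-name linking, and merging passes described above may be sketched in Python as follows. This is a simplified sketch: counts are plain numbers rather than count objects, stopword handling is omitted, cores are supplied as precomputed tuples of words, and the exact-match test for the primary feature compares the lowercased feature text with the core text. All names are hypothetical.

```python
from collections import defaultdict

def merge_aliases(counts, cores):
    """Merge counts of features that refer to the same named entity.

    counts: {feature text: count}
    cores:  {feature text: tuple of canonical core words}
    """
    # First pass: group features into named entities keyed by core text.
    entities = defaultdict(list)
    for feature, core in cores.items():
        entities[" ".join(core)].append(feature)

    # Second pass: link multiword entities as super-names of single-word
    # entities that match their first or last word.
    super_names = defaultdict(set)
    for key in entities:
        words = key.split()
        if len(words) > 1:
            for edge in {words[0], words[-1]}:
                if edge in entities:
                    super_names[edge].add(key)

    def primary(key):
        # Prefer a feature matching the core text; otherwise the highest count.
        for feature in entities[key]:
            if feature.lower() == key:
                return feature
        return max(entities[key], key=lambda f: counts[f])

    # Third pass: merge into the single super-name if there is exactly one;
    # otherwise merge only the features that share the entity's own core.
    merged = {}
    for key, features in entities.items():
        candidates = super_names.get(key, set())
        target = next(iter(candidates)) if len(candidates) == 1 else key
        name = primary(target)
        merged[name] = merged.get(name, 0) + sum(counts[f] for f in features)
    return merged
```

With a single super-name present, the counts for a bare last name fold into the longer name. With two competing super-names, the bare name remains a separate entry to be resolved in later processing.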
In an example, the feature filter (e.g., feature filter 222) further builds a filter result object 12236 (as in
In an example of method 15290, “The Reverend Dr. Martin Luther King, Jr.”, “Martin Luther King”, “Dr. King”, “King”, and “Martin” can all merge their information under “Martin Luther King.” Possessives, as well as names of newspapers and organizations with and without a leading “The”, may be merged as well. However, if there is an ambiguity, the merging may not take place. For example, if both “Barack Obama” and “Michelle Obama” occur in the text, a bare “Obama” may not be merged with either, and it can remain as a feature to be resolved in later processing.
In an example, the feature filter (e.g., filter 222) uses information about common names to detect situations in which features represent bare first names or bare last names (with or without attached prefixes or suffixes) that may be spurious and delete such features from the feature count map 11196. To support this, a feature table (e.g., table 212) is augmented by a uniform map set that maps from n-grams (and, therefore, features) to sets of objects of an enumerated “use class” type. Among the possible use classes may be “First Name”, for features that represent names used as first or given names, “Last Name”, for features that represent names used as last or family names, and “Initial”, for features that represent single initials.
In some examples, the “Initial” use class may be merged with the “First Name” use class. In some examples, there may be other use classes reflecting uses such as titles, suffixes, and words like “Street” (to allow for recognition that, e.g., “Lincoln Street”, if not recognized in full as a feature, should not be taken as referring to Abraham Lincoln) or “University”. Some features, such as “Frank”, which can be both a first name and a last name, may be associated with more than one use class, while many features will be associated with none. In some examples, features may be included in the feature table (e.g., table 212) solely because they are known to be in one or more use classes. To mark these, they are further associated with a “Filter Only” use class, reflecting that they should not be included in the resulting analysis. When constructing a feature count map (e.g., map 11196) from a feature set (e.g., set 11194), any features marked “Filter Only” are ignored.
When applying the feature filter (e.g., filter 222), a pass is made to identify all of the “questionable” features in the feature set (e.g., set 11194), where a questionable feature is either a (non-filter-only) feature considered to be a “Last Name” that immediately follows a feature considered to be a “First Name” or “Initial” or a (non-filter-only) feature considered to be a “First Name” or “Initial” that is immediately followed by a feature considered to be a “Last Name”. In alternative examples, other rules may be used to determine features to be questionable. To determine which features are questionable, it suffices to process all of the feature set's weighted feature lists. For each list, the features (which do not overlap, having had overlapping features removed at 330 in
In the example, if a feature is questionable, then it, and any feature that merges with it, can be treated as spurious unless there is some extension of it that is also known to be a feature. As an example, if “Obama” is seen, it will likely be taken to refer to “Barack Obama” (unless other evidence on the page leads to another interpretation also associated in the feature table (e.g., table 212) with “Obama”). However, if “Obama”, a known last name, is seen following “Joe”, a known first name, it becomes questionable, and the system defaults to believing that its instances of “Obama” actually refer to “Joe Obama”. On the other hand, if the document also contains “Barack Obama”, then even though there was initial reason to believe that “Obama” might have been spurious, there is also reason to believe that it might not be, and so it may be left as a feature.
To implement this, at 15306, when the feature is added to a named entity, if the feature has been determined to be questionable, the named entity is marked as being questionable. Then, following 15296, another pass is made over the named entities in the named entity map. The features for any questionable named entities are removed from the feature count map (e.g., map 11196). For any such named entity that had been merged into another named entity, the counts would already have been removed, at 15320, and added into other counts, so the only ones that get removed here are those that were not merged, which are precisely the ones that have no observed extension.
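As an illustration only, the pass that identifies questionable features may be sketched in Python as follows. The `Feature` class, the `questionable_indices` function, and the representation of use classes as strings are hypothetical and are not part of the disclosure. The sketch assumes that each weighted feature list has already been reduced to an ordered list of non-overlapping features and that “immediately follows” means adjacency in that list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a recognized feature and its use classes."""
    text: str
    use_classes: frozenset = frozenset()

    def is_a(self, *classes):
        return any(c in self.use_classes for c in classes)

    @property
    def filter_only(self):
        return "Filter Only" in self.use_classes


def questionable_indices(features):
    """Return the positions of questionable features in one ordered list.

    A first-name-like feature directly followed by a last-name feature
    makes both questionable, except that "Filter Only" features are
    never flagged themselves.
    """
    flagged = set()
    for i in range(len(features) - 1):
        left, right = features[i], features[i + 1]
        if left.is_a("First Name", "Initial") and right.is_a("Last Name"):
            if not left.filter_only:
                flagged.add(i)
            if not right.filter_only:
                flagged.add(i + 1)
    return flagged


# "Joe" is in the table only as a filter; "Obama" is a real feature.
joe = Feature("Joe", frozenset({"First Name", "Filter Only"}))
obama = Feature("Obama", frozenset({"Last Name"}))
said = Feature("said")
print(questionable_indices([joe, obama, said]))
```

In this sketch only “Obama” is flagged, because “Joe” is marked “Filter Only”. Whether a flagged feature is ultimately dropped would still depend on whether an extension such as “Barack Obama” is observed, which the sketch does not model.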
The concept extractor (e.g., extractor 200) can take the feature set's (e.g., set 11194) feature count map (e.g., map 11196) and the categorization (e.g., categorization 12226) and identify category paths that characterize the document and associate with each a set of evidence. As discussed above, a category path is an association between a category (possibly in a hierarchical category structure) and a concept. In some examples, a category path may be a determined sequence of categories paired with a concept. Such a sequence may be a chosen path through the parentage hierarchy of a category, where the category hierarchy is a directed acyclic graph. A choice of concepts can be modeled as an election in which concepts are the candidates, and the goal is to choose a set which matches evidence across features seen (viewed as voters in the election each with a number of votes based on the weight associated with it in the feature count map 11196 and with votes allocated, perhaps fractionally, based on the feature record 10174 associated with it by the feature table 212). A consensus may then be found among the chosen concepts as to which categories have the broadest support. In the example, each feature ultimately chooses to support (and become evidence for) at most a single concept. In the example, the consensus also takes into account the likelihood that a candidate concept is part of the consensus based on the other concept candidates that have not yet been eliminated.
In the example, neighborhood 16332 includes several parallel arrays containing information about each of its neighbor concepts, with each neighbor concept associated with a particular index. These arrays include an array of neighbor concept numbers (X) 16334, an array of neighbor probabilities 16336 conditional on the concept (i.e., P(X|C)), an array of positive likelihood ratios
and an array of negative likelihood ratios
In alternative examples, the positive likelihood ratio array 16326 and negative likelihood ratio array 16328 (or their individual slot values) may be constructed as needed. In the example, neighborhood 16332 also includes a base size 16324 indicative of the relative frequency of mention of concept C, which may be based on the number of times the concept was mentioned in the corpus used to generate the neighborhood.
As neighborhood objects can be fairly large and as there may be a large number of concepts (e.g., millions or more) known to the concept extractor (e.g., extractor 200), where only a small fraction of them may be used in any given extraction, it may be beneficial to delay the construction of neighborhood objects (e.g., objects 16332) until needed. To construct neighborhood objects, a number of arrays (or, in alternative examples, similar data structures) may be used. In the example, the arrays can include an array 16330 of 8-bit indicators of the approximate number of occurrences for each concept, an array 16338 of 8-bit counts of the number of neighbor concepts in a concept's neighborhood, and an array 16342 of 32-bit indices into the data array indicating where a concept's neighborhood data starts. For each of these arrays, there is one entry per known concept and the concept's number is used as the index into the array. There can also be an array 16340 of 32-bit data, parsed as 24 bits of neighbor concept number followed by 8 bits of an indicator of the approximate number of co-occurrences between the concept and the neighbor. In alternative examples, different sizes and configurations of the data in these arrays may be used and other data structures may be used to associate the needed data with individual concepts.
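By way of a non-limiting sketch, lazily reading one concept's neighborhood data from such arrays may look as follows in Python. The function names are hypothetical. The sketch assumes that the 24-bit neighbor concept number occupies the high-order bits of each 32-bit entry and the 8-bit indicator the low-order bits, and that the per-concept neighbor count can be turned into an integer by a caller-supplied `decode` function (identity by default). The actual bit order and count encoding are not fixed by the description above.

```python
def unpack_neighbor_entry(entry):
    """Split a 32-bit entry into (neighbor concept number, indicator).

    Assumed layout: high 24 bits = concept number, low 8 bits = indicator.
    """
    concept_number = (entry >> 8) & 0xFFFFFF
    cooccurrence_indicator = entry & 0xFF
    return concept_number, cooccurrence_indicator


def neighborhood_entries(concept, neighbor_counts, start_indices, data,
                         decode=lambda raw: raw):
    """Return the unpacked neighbor entries for one concept, on demand.

    neighbor_counts and start_indices are indexed by concept number;
    data is the shared array of packed 32-bit entries.
    """
    start = start_indices[concept]
    count = decode(neighbor_counts[concept])
    return [unpack_neighbor_entry(e) for e in data[start:start + count]]


# Concept 1 has two neighbors stored at the start of the data array.
data = [(5 << 8) | 1, (9 << 8) | 3]
print(neighborhood_entries(1, [0, 2], [0, 0], data))
```

Keeping the per-concept data in flat arrays indexed by concept number means that nothing is allocated for a concept until its neighborhood is actually requested.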
Since these arrays may be quite large, it is desirable to save memory by encoding indicators for approximate counts for the number of neighbors 16338 and the co-occurrence counts in the data 16340. In the example, these indicators are 8 bits wide and interpretable with respect to an example decode table 17344 shown in
The concept candidate also contains an indicator 18368 of whether the candidate is considered to still be “active” in the election and a current score 18372, indicative of a level of belief given current evidence that the candidate's concept 18354 is mentioned in the document. The concept candidate further contains a set of imputations (discussed below with respect to
Alternative examples may omit some of these components. In particular, examples that do not make use of inter-concept probability, as discussed above with respect to
When loop 20391 terminates, at 20398, for each candidate currently in the election, loop 20399 is performed. In some examples, this is performed by enumerating based on a copy of the set of candidates to ensure that only candidates created during loop 20391 are considered. In some examples, consideration for each candidate at 20398 may be omitted.
At 20402, for each of the first ten concepts in the neighborhood 18364 associated with the current concept candidate (e.g., candidate 18352), loop 20403 is performed. In alternative examples, different numbers of neighboring concepts are used, including all concepts. In some examples, the number of concepts used, when less than all concepts, is different for different current concept candidates. At 20408, an imputation (e.g., imputation 19376) is created based on the current candidate, the neighboring concept, and information associated with the neighboring concept in the current concept's neighborhood (e.g., neighborhood 18364). This imputation (e.g., imputation 19376) refers as its target candidate (e.g., candidate 19378) to the candidate associated with the neighboring concept. If no such candidate exists in the election, one may be created.
Such a newly-created concept candidate will necessarily have no votes from features. In some examples, if no such candidate exists, no imputation is created and control passes to the next iteration of loop 20403. The imputation (e.g., imputation 19376) is added to the current candidate's (e.g., candidate 18352) imputed candidates (e.g., candidates 18360). At 20410, the imputation (e.g., imputation 19376) is added to the imputation's target candidate's (e.g., candidate 19378) imputing candidates (e.g., candidates 18374). At 20412, the features voting for the current candidate (e.g., in the current candidate's vote map 18356) are added to the imputation's target candidate's imputing features (e.g., features 18362). Since the imputing features (e.g., features 18362) are, in the example, a multiset, adding features that already exist in the imputing features (e.g., features 18362) will increase the number of times that they are represented. Control then passes to the next iteration of loop 20403 at 20413.
When loop 20403 terminates, at 20404, for each of the remaining concepts in the neighborhood (e.g., neighborhood 18364) associated with the current concept candidate (e.g., candidate 18352), loop 20405 is performed. In some examples, fewer than all of the remaining neighboring concepts are enumerated. In some examples, consideration for remaining neighbors at 20404 is omitted. At 20406, substantially the same processing takes place as at 20408, but rather than being added to the set of imputed candidates (e.g., set 18360), the created imputation (e.g., imputation 19376) is added to the set of interesting candidates (e.g., set 18370). In this example, loop 20405 does not contain analogues of adding an imputation to a target's imputing candidates at 20410 or adding voters to a neighbor's imputing features at 20412. Control then passes to the next iteration of loop 20405 at 20407. When loop 20405 terminates, control passes to the next iteration of loop 20399 at 20400.
Allowing imputed candidates without feature support can permit candidates to hypothesize a context that could have been mentioned, but was not, or hypothesize a context that was not mentioned in a manner recognizable by the feature table (e.g., table 212). For example, the concepts for Jack Brickhouse, a Chicago Cubs announcer, and Kerry Woods, a later Chicago Cubs player, may not refer to one another in their respective neighborhoods (e.g., neighborhood 16332). However, if both concepts are candidates in the analysis of a document, both candidates may impute a “Chicago Cubs” concept, not explicitly mentioned in the document. By each of them imputing “Chicago Cubs,” it can be determined that Jack Brickhouse is the correct referent of the feature “Brickhouse”.
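The construction of imputations described above may be sketched, purely for illustration, as follows. The `Candidate` and `Imputation` classes and the `build_imputations` function are hypothetical simplifications: probabilities and likelihood ratios are omitted, neighborhoods are plain ordered lists of concept names, and a missing target candidate is always created (one of the options described above).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)
class Candidate:
    concept: str
    votes: dict = field(default_factory=dict)        # feature -> votes
    imputed: list = field(default_factory=list)      # imputations to top neighbors
    interesting: list = field(default_factory=list)  # imputations to other neighbors
    imputing: list = field(default_factory=list)     # imputations targeting this one
    imputing_features: Counter = field(default_factory=Counter)  # multiset


@dataclass(eq=False, repr=False)
class Imputation:
    source: Candidate
    target: Candidate


def build_imputations(election, neighborhoods, top_n=10):
    """Add imputations for every candidate already in the election.

    Iterates over a copy so that candidates created here are not
    themselves enumerated.
    """
    for cand in list(election.values()):
        for i, neighbor in enumerate(neighborhoods.get(cand.concept, [])):
            target = election.setdefault(neighbor, Candidate(neighbor))
            imp = Imputation(cand, target)
            if i < top_n:
                cand.imputed.append(imp)
                target.imputing.append(imp)
                # Multiset: repeated features raise their count.
                target.imputing_features.update(cand.votes.keys())
            else:
                cand.interesting.append(imp)


election = {
    "Jack Brickhouse": Candidate("Jack Brickhouse", {"Brickhouse": 1.0}),
    "Kerry Woods": Candidate("Kerry Woods", {"Kerry Woods": 1.0}),
}
neighborhoods = {
    "Jack Brickhouse": ["Chicago Cubs"],
    "Kerry Woods": ["Chicago Cubs"],
}
build_imputations(election, neighborhoods)
cubs = election["Chicago Cubs"]
print(len(cubs.imputing), dict(cubs.imputing_features))
```

In this sketch the “Chicago Cubs” candidate ends up with no votes of its own but with two imputing candidates and two imputing features, which is the situation the Brickhouse example relies on.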
Candidates whose concepts will be used to describe a page can be determined based on the construction of the election.
At 21416, a set of concept candidates (e.g., 18352) is partitioned into sets containing those concept candidates whose associated vote maps (e.g., map 18356) are empty (“imputed only” candidates) and those concept candidates whose associated vote maps (e.g., map 18356) are non-empty (“remaining” candidates, as discussed above). At 21418, an empty set of winning candidates is constructed.
At 21420, each candidate's initial score (e.g., score 18372) is computed. First candidates with votes (those in the “remaining” set) have their scores initialized to their maximum probability (e.g., probability 18358). Next imputed-only candidates have their scores initialized to the maximum over the candidate's imputing candidates' imputations (e.g., imputation 18374) of the imputations' imputed probability (as described above with respect to
At 21424, while the “remaining candidates” set is not empty, loop 21425 is performed to select, remove, and process candidates. At 21426, for each remaining candidate (e.g., for each candidate in the “remaining candidates” set), loop 21427 is performed to update its current score (e.g., score 18372). At 21428, a determination is made as to whether the current concept candidate is inactive (e.g., has a false active indication 18368 due to having an empty vote map 18356 and an empty imputing candidates set 18374). If this is the case, the candidate is removed from the set of remaining candidates at 21430, and control passes to the next iteration of loop 21427 at 21431. At 21432, a determination is made as to whether the current concept candidate has no associated votes (e.g., has an empty vote map 18356). If this is the case, at 21434, the candidate is removed from the set of remaining candidates and added to the set of imputed-only candidates, and control passes to the next iteration of loop 21427 at 21431. At 21440, a new score is computed for the candidate but not set as the candidate's current score (e.g., score 18372). Details of methods for computing the new score will be given below.
At 21442, a determination is made as to whether the new score is below a threshold (e.g., 0.05). If it is below the threshold, at 21444, the candidate is removed from the set of remaining candidates, and for each of the features voting for it, the vote from that feature to the candidate is removed and the total number of votes for that feature is decreased. If the candidate was removed at 21444, control then passes to the next iteration of loop 21427 at 21431. Otherwise, at 21454, the new score is associated with the current concept candidate in a map. By doing so, each candidate's score can be based on the scores of other candidates after the prior iteration.
When loop 21427 terminates, at 21436, for each imputed-only candidate, loop 21437 is performed. At 21438, a new score is computed for the candidate as the maximum value of the imputed probability of the imputations (e.g., imputation 19376) in the candidate's imputing candidates set (e.g., set 18374) and this score is associated with the candidate in a map. In the example, the same map is used as is used at 21443. In alternative examples, other rules may be used for computing the new score. Control then passes to the next iteration of loop 21437 at 21439.
When loop 21437 terminates, at 21446, the scores associated with candidates at 21443 and 21438 are assigned as the new values of the respective candidates' current scores (e.g., score 18372).
At 21448, a “worst” candidate can be chosen from the imputed only set. The determination that a candidate C1 is worse than a candidate C2 (and therefore more worthy of being chosen) may be based on C1's current score (e.g., score 18372) being less than that of C2. In some examples, if the difference between the current scores is sufficiently small (e.g., less than 0.001), other means of making the determination may be used. In some such examples, the secondary determination may be based on C1's probability (e.g., maximum probability 18358) being less than that of C2. If these probabilities are sufficiently close to one another (e.g., less than 0.05 apart), still further considerations may be used, such as a comparison between C1's vote total (e.g., total 18366) and that of C2. In some examples, the sequence of tests may include the same test both with and without a threshold or with multiple thresholds. In the example, the sequence of tests consists of a comparison of current score, with a threshold of 0.001, a comparison of maximum probability, with a threshold of 0.05, a comparison of vote total, and a comparison of maximum probability, with no threshold. If no test distinguishes two concept candidates, they are considered to be indistinguishable, and either may be chosen as worse.
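The example sequence of tests may be sketched in Python as follows. This is an illustrative sketch only: the `Candidate` class and the `is_worse` and `pick_worst` functions are hypothetical, and the sketch assumes that a lower vote total counts as worse, a direction the description above does not state.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    score: float            # current score
    max_probability: float  # maximum probability
    vote_total: float       # vote total


# (attribute getter, threshold); a threshold of 0.0 means "no threshold".
TESTS = [
    (lambda c: c.score, 0.001),
    (lambda c: c.max_probability, 0.05),
    (lambda c: c.vote_total, 0.0),
    (lambda c: c.max_probability, 0.0),
]


def is_worse(c1, c2):
    """Return True if the first distinguishing test ranks c1 below c2."""
    for get, threshold in TESTS:
        a, b = get(c1), get(c2)
        diff = abs(a - b)
        # A test distinguishes the pair only if the values differ by at
        # least its threshold.
        if diff > 0 and diff >= threshold:
            return a < b
    return False  # indistinguishable: neither is strictly worse


def pick_worst(candidates):
    """Linear scan keeping the worst candidate seen so far."""
    worst = None
    for c in candidates:
        if worst is None or is_worse(c, worst):
            worst = c
    return worst


a = Candidate(score=0.3000, max_probability=0.9, vote_total=5)
b = Candidate(score=0.3005, max_probability=0.6, vote_total=5)
print(is_worse(b, a))  # scores are within 0.001, so probability decides
```

Because thresholded comparisons are not necessarily transitive, the candidate returned by a single scan may depend on enumeration order. The description above permits either of two indistinguishable candidates to be chosen.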
At 21450, the identified worst candidate is removed from the set of remaining candidates. At 21452, for each feature in the worst candidate's vote map (e.g., map 18356), if this is not the sole remaining vote for that feature, the feature's vote for the worst candidate is removed. At 21456, a determination is made as to whether the worst candidate has remaining votes (e.g., votes not removed at 21452). If it does, it is added at 21458 to the set of winning candidates created at 21418. In either case, control passes to the next iteration of loop 21425 at 21459.
Following method 21414, additional candidates may be added, in some examples, to the set of winning candidates from the set of imputed-only candidates. In some such examples, a score is computed for each imputed-only candidate as at 21440 (rather than as at 21438) and this score is compared to a threshold (e.g., the threshold used at 21442). If the score is above the threshold, the candidate is added to the set of winning candidates and its score remembered, as at 21443. When all imputed-only candidates have been processed, the remembered scores are assigned as at 21446.
When a feature is dropped as a voter for a candidate, for example at 21444 or 21452, this can result in the candidate no longer having any votes. As a result, whether the candidate remains active can depend on whether its imputing candidates set (e.g., set 18374) is empty. If it is still active, each of the imputations (e.g., imputation 19376) in the imputed candidates set (e.g., set 18360) can be considered, and the feature can be removed from each imputation's target's (e.g., 19378) imputing features multiset (e.g., multiset 18362). If it is no longer active, the imputations (e.g., imputation 19376) in the imputed candidates set (e.g., set 18360) can be considered, and each imputation's target candidate (e.g., target candidate 19378) can be instructed to remove the imputation. The imputed candidate can do this by removing the imputation from its imputing candidates set (e.g., set 18374), and if this results in it no longer being active, it can further walk its imputed candidates set (e.g., set 18360) and ask that the imputations contained there be removed from their targets. In some examples, when a feature is removed as a voter for a candidate, this may trigger a new computation of the maximum probability (e.g., probability 18358) for that candidate over the remaining features in the candidate's vote map (e.g., map 18356).
In an example, the computation of a new score for a concept candidate (e.g., candidate 18352) at 21440 makes use of a modified version of the likelihood computation of a Naïve Bayes classifier. In a Naïve Bayes classifier, the likelihood ratio for a particular class C given a set of evidence E is computed as the product of a base likelihood ratio
$$\frac{P(C)}{P(\bar{C})} = \frac{P(C)}{1 - P(C)}$$
based on a prior estimate of unconditional probability P(C), and the likelihood ratios of the conditional probability of each piece of evidence e given the class:
$$\Lambda = \frac{P(C \mid E)}{P(\bar{C} \mid E)} = \frac{P(C)}{P(\bar{C})} \prod_{e \in E} \frac{P(e \mid C)}{P(e \mid \bar{C})}$$
Since $P(C \mid E) + P(\bar{C} \mid E) = 1$,
$$P(C \mid E) = \frac{\Lambda}{1 + \Lambda}$$
under the assumption that all e ∈ E are independent of one another.
In the example score computation method, the base prior estimate P(C) of unconditional probability is taken to be the maximum probability (e.g., probability 18358) associated with that candidate, and the evidence is taken to be the presence or absence of support for each imputation in its imputed candidates (e.g., candidates 18360) and interesting candidates (e.g., candidates 18370) sets. In alternative examples, other base prior estimates of unconditional probability may be used. In some examples, the prior estimate may be based on a fraction of documents in some corpus that are determined to be associated with the candidate's concept. In alternative examples, other evidence may be used instead of or in addition to imputations. In some such examples, the evidence may be features in the feature count map.
An imputation from C to a candidate X is considered to be supported if X is active and if there is at least one feature in X's imputing features (e.g., features 18362) that is not also contained in C's vote map (e.g., map 18356). That is, the imputation is supported if there is some feature evidence that leads us to believe that X is present that might not also be evidence for C. When an imputation (e.g., imputation 19376) is supported, the likelihood ratio used in the computation is the imputation's positive likelihood ratio (e.g., ratio 19380) raised to the power of the imputation's probability (e.g., probability 19384). In alternative examples, other likelihood ratios may be used. In some such examples, the imputation's positive likelihood ratio (e.g., ratio 19380) may be used directly. When an imputation (e.g., imputation 19376) is not supported, the likelihood ratio used is the imputation's negative likelihood ratio (e.g., ratio 19386). In alternative examples, other likelihood ratios may be used.
The final score may be computed as P(C|E) above, given the prior probability and evidence likelihood ratios. That is, the likelihood ratio is computed and converted to a conditional probability by dividing the likelihood ratio by one more than the likelihood ratio. In the case when this computation results in an infinite value, the score is taken to be 1.0.
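A minimal Python sketch of this score computation is given below for illustration. The `Imputation` class and the `candidate_score` function are hypothetical names. Whether each imputation is supported is assumed to have been determined beforehand, and the prior is assumed to lie strictly above 0 and at most 1.

```python
import math
from dataclasses import dataclass


@dataclass
class Imputation:
    positive_lr: float  # likelihood ratio used when supported
    negative_lr: float  # likelihood ratio used when not supported
    probability: float  # exponent applied to the positive ratio
    supported: bool


def candidate_score(prior, imputations):
    """Combine a prior with per-imputation likelihood ratios.

    Starts from the base ratio prior / (1 - prior), multiplies in one
    ratio per imputation, and converts the result back to a probability
    as ratio / (1 + ratio). An infinite ratio maps to a score of 1.0.
    """
    if prior >= 1.0:
        return 1.0  # base ratio would be infinite
    ratio = prior / (1.0 - prior)
    for imp in imputations:
        if imp.supported:
            ratio *= imp.positive_lr ** imp.probability
        else:
            ratio *= imp.negative_lr
    if math.isinf(ratio):
        return 1.0
    return ratio / (1.0 + ratio)


# One supported imputation with ratio 4 lifts a 0.5 prior to 4 / 5.
print(candidate_score(0.5, [Imputation(4.0, 0.5, 1.0, True)]))
```

With no imputations the score equals the prior. A supported imputation with a positive likelihood ratio above one raises the score, and an unsupported imputation with a negative likelihood ratio below one lowers it.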
The score for a category candidate 22460 in the example is computed as the product of the categorization vote and the concept vote. In the example, the categorization vote is computed as
where s is the score given to the category 22462 in the categorization 12226, t is the category's threshold according to the categorizer (e.g., categorizer 13238) that constructed the categorization (e.g., categorization 12226), and b and k are parameters. For the expression above, b is the categorization vote for a category whose score is precisely at its threshold, and k is the number of multiples of the threshold that a score would have to reach for the categorization vote to be 1.0. In an example, b=0.8 and k=2.
When loop 23475 completes, at 23482, an empty map from concepts to collections of category paths is created or otherwise obtained. At 23492 the set of known category candidates 22460 is constructed and designated as the set of remaining category candidates. While this set is non-empty, loop 23493 is performed.
At 23484, the best category candidate is chosen from among the remaining category candidates and removed from the set of remaining category candidates. In the example, category candidates 22460 whose categories 22462 are not suppressed are considered better than those whose categories 22462 are suppressed. Otherwise, a sequence of tests is performed until one is found that distinguishes the category candidates. The example sequence prefers category candidates that have higher scores, then higher concept votes (e.g., votes 22472), then more unclaimed voters (e.g., voters 22468), then more voters (e.g., voters 22466), then higher categorization votes (e.g., votes 22470). Category candidates that are the same for all tests are considered to be indistinguishable, and either may be considered better than the other. As with comparing concept candidates, as described above, in alternative examples, tests may include absolute or relative thresholds such that if the difference between two category candidates is less than the threshold, the test does not distinguish the category candidates.
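For illustration, the example preference order can be expressed as a lexicographic sort key, as in the following Python sketch. The `CategoryCandidate` class and the `preference_key` and `pop_best` functions are hypothetical. The sketch uses exact comparisons with no thresholds and takes the score to be the product of the categorization vote and the concept vote.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryCandidate:
    category: str
    suppressed: bool
    categorization_vote: float
    concept_vote: float
    voters: set = field(default_factory=set)
    unclaimed_voters: set = field(default_factory=set)

    @property
    def score(self):
        return self.categorization_vote * self.concept_vote


def preference_key(c):
    """Larger tuples are better; earlier fields take precedence."""
    return (
        not c.suppressed,         # unsuppressed beats suppressed
        c.score,                  # then higher score
        c.concept_vote,           # then higher concept vote
        len(c.unclaimed_voters),  # then more unclaimed voters
        len(c.voters),            # then more voters
        c.categorization_vote,    # then higher categorization vote
    )


def pop_best(remaining):
    """Remove and return the best candidate from a non-empty list."""
    best = max(remaining, key=preference_key)
    remaining.remove(best)
    return best


remaining = [
    CategoryCandidate("Sports", False, 0.8, 0.5),
    CategoryCandidate("Baseball", True, 1.0, 1.0),
    CategoryCandidate("Chicago", False, 0.8, 0.5, {"x"}, {"x"}),
]
print(pop_best(remaining).category)
```

In this sketch “Chicago” is chosen first: “Baseball” has the highest score but is suppressed, and “Chicago” ties with “Sports” on score and concept vote but has more unclaimed voters.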
At 23494, for each concept candidate in the best category candidate's set of voters (e.g., set 22466), loop 23495 is performed. At 23486, a determination is made as to whether the concept candidate is also in the category candidate's set of unclaimed voters 22468. If it is, then at 23496, for each category associated with the concept candidate's associated concept, loop 23497 is performed. At 23498, a determination is made as to whether the current category is the same as the best category candidate's associated category 22462. If it is, control passes to the next iteration of loop 23497 at 23499. At 23502, a determination is made as to whether the current category has the same regionality as the best category candidate's associated category 22462 (e.g., are they both regional categories or both non-regional categories).
In alternative examples, as described above, more or fewer such category classes may be employed. In such examples, the determination may be whether the categories share any classes, all classes, a sufficient number of classes, or some other criterion. If the categories are determined to not have the same regionality, control passes to the next iteration of loop 23497 at 23503. At 23504, the current concept candidate is removed from the set of unclaimed voters 22468 in the category candidate associated with the current category, and that category candidate's concept vote 22472 is updated. Control then passes to the next iteration of loop 23497 at 23503.
Returning to the unclaimed determination at 23486, if the determination is that the concept candidate is not in the unclaimed voters set 22468, at 23488, a determination is made as to whether the category candidate contains enough unclaimed voters to proceed anyway. In the example, a category candidate is considered to have enough unclaimed voters if the size of the unclaimed voters set (e.g., set 22468) is at least half the size of the voters set (e.g., set 22466). In alternative examples, other rules and thresholds may be employed. In alternative examples, the “enough unclaimed” determination at 23488 may be omitted, with control flowing as though the determination had been that the number of unclaimed was insufficient. If it is determined that there are not enough unclaimed voters, control passes to the next iteration of loop 23495 at 23508.
If there are enough unclaimed voters at 23488 or if the current concept candidate is unclaimed and following 23496, at 23490 a new category path object is created combining the category (e.g., category 22462) associated with the best category candidate (e.g., candidate 22460) and the concept (e.g., concept 18354) associated with the current concept candidate (e.g., candidate 18352). A collection of category paths associated with the concept is obtained from the map created at 23482 (creating it, if necessary), and the newly-created category path is added to the collection. Control then passes to the next iteration of loop 23495 at 23508. When loop 23495 terminates, control passes to the next iteration of loop 23493 at 23510.
When loop 25533 terminates, at 25540, for each imputation in the concept candidate's set of imputing candidates (e.g., 18374), loop 25541 is performed. At 25542, for each feature in the vote map (e.g., map 18356) of the current imputation's source candidate (e.g., candidate 19382), loop 25543 is performed. At 25544, a piece of evidence (e.g., evidence 24516) is constructed, substantially as at 25538, but with a count (e.g., count 24524) and a weight (e.g., weight 24520) discounted based on the current imputation (e.g., by multiplying by the current imputation's imputed probability). Control then passes to the next iteration of loop 25543 at 25545. When loop 25543 terminates, control passes to the next iteration of loop 25541 at 25547. When loop 25541 terminates, control passes to the next iteration of loop 25531 at 25553.
When loop 25531 terminates, the associations between category paths and evidence objects may be used as the evidence map (e.g., map 12228) in the constructed analysis (e.g., analysis 12222).
A scoring function (e.g., function 216) can be applied to each evidence object (e.g., object 24506) in the evidence map to annotate it with an overall score (e.g., score 24514 and as illustrated in
As discussed with respect to
The use of an overall score 24514 and a scale factor 12224 results in a scaled score.
The function has a maximum value of one, and is linear up to 2F−F², with a quadratic compression afterwards. When the scale factor is 1 (e.g., when all overall scores 24514 are less than or equal to 1), the entire curve is linear. When the scale factor is two or more, the entire curve is compressed. In between, the curve is mostly linear, but compressed on top, as shown by curve 26554 in
A category path filter can be applied (e.g., as illustrated in
In alternative examples, other criteria may be used to determine that an insufficient number of category paths with the current category exist in the evidence map. If the determination is that the count is less than the threshold, control passes to the next iteration of loop 27559 at 27569. At 27564, the ratio of the categorization score associated with the current category and the categorization threshold associated with the current category is computed. At 27564, a determination is made as to whether this ratio is less than a given threshold (e.g., 1.0). In alternative examples, other criteria may be used to determine that the categorization score for the category is insufficiently high. If the determination is that the ratio is less than the threshold, control passes to the next iteration of loop 27559 at 27569. At 27572, the current category is added to the good category set (e.g., set 12230) in the analysis (or to a collection that will become the good category set 12230 in the analysis) and control passes to the next iteration of loop 27559 at 27569.
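The count test and the ratio test described in this paragraph may be sketched in Python as follows, for illustration only. The `good_categories` function, its argument shapes, and the default `min_paths` value are hypothetical. The sketch covers only these two tests, assumes every category in the evidence map has a positive categorization threshold, and does not reproduce any other tests the category path filter may apply.

```python
from collections import Counter


def good_categories(category_paths, categorization, min_paths=2, min_ratio=1.0):
    """Collect categories that pass a count test and a score-ratio test.

    category_paths: iterable of (category, concept) pairs taken from
        the evidence map.
    categorization: dict mapping category -> (score, threshold).
    min_paths: assumed minimum number of category paths per category.
    min_ratio: minimum ratio of categorization score to threshold.
    """
    counts = Counter(category for category, _concept in category_paths)
    good = set()
    for category, count in counts.items():
        if count < min_paths:
            continue  # too few category paths mention this category
        score, threshold = categorization[category]
        if score / threshold < min_ratio:
            continue  # categorization score is insufficiently high
        good.add(category)
    return good


paths = [
    ("Baseball", "Chicago Cubs"),
    ("Baseball", "Jack Brickhouse"),
    ("Politics", "Barack Obama"),
    ("Travel", "Concept A"),
    ("Travel", "Concept B"),
]
categorization = {
    "Baseball": (0.9, 0.6),
    "Politics": (0.9, 0.5),
    "Travel": (0.2, 0.4),
}
print(good_categories(paths, categorization))
```

In this sketch “Politics” fails the count test and “Travel” fails the ratio test, leaving only “Baseball” in the good category set.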
In alternative examples, method 27556 may be performed in substantially different order. For example, a pass may be made through all of the category paths in the evidence maps, collecting the count and score (e.g., maximum score) for the categories as they are encountered and a second pass made over the categories encountered to determine whether they pass or fail the tests. In alternative examples, some or all of the example tests may be omitted and other tests may be added. In some examples, tests may be made as to whether categories are suppressed or otherwise inherently to be excluded. In alternative examples, a category may be determined to be a good category based on passing fewer than all of the tests. In some examples, rather than collecting “good” categories, the category path filter may collect “bad” categories based on categories failing tests. In some examples, rather than creating a separate collection of good or bad categories, the category path filter may remove categories associated with category paths that fail tests from the evidence map.
Using the collected information, an analysis object can be constructed based on the document, and this analysis object, alone or in combination with other analysis objects obtained by analyzing other documents, can be used in the performance of actions related to the document, to other documents, or to other objects or entities related to the document. Such other objects or entities include, without limitation, users who have (or have not) interacted with the document, who have purchased the document, or who have expressed or been determined to have an opinion about the document, storage locations (including disks, servers, and web sites) that contain or contain references to the document, and information sources (including web sites, blogs, RSS feeds, newspapers, television shows, and authors, including users of Twitter or social media) who make reference to or discuss the document.
Examples of actions that may be performed include, without limitation, classifying the document, recommending the document to a user, including the document in a publication, altering the configuration of a location of the document so as to emphasize the document or make it easier to find, determining a price to charge for accessing the document, determining a location for the document, sending a reference to the document to a user, and determining a management policy to apply to the document. In each of these, “the document” should be read as including other documents, and other objects or entities related to the document.
A document can be further used to synthesize, over a large number of document viewings, a profile that describes sudden interests of a user, long-term interests (e.g., concepts and categories that show up again and again), and other interests. The profile can include the interests of a user, and the profile and document analysis can also be used to personalize content served to the user to increase satisfaction, to recommend content, to decide how similar multiple users' interests are, or to display a graphical representation of a user's interests. The comparison of multiple users' interests can be used for collaborative filtering, among other uses. The graphical representation can be used as a selling feature for devices and other services, among other uses.
The above specification, examples and data provide a description of the method and applications, and use of the system and method of the present disclosure. Since many examples can be made without departing from the spirit and scope of the system and method of the present disclosure, this specification merely sets forth some of the many possible example configurations and implementations.