The present invention relates to automated analysis, of computer-accessible content, to produce demographic data regarding such content. More particularly, the present invention relates to the production of demographic data from the information upon which a search result is based.
The product of a research project, whether performed by manual and/or automated means, can often be expressed as a “result” (or results), where each such result is supported by items drawn from the various content-sources searched. Herein, each result can be referred to as a “result-value” and the items supporting such result-value can be referred to as its “result-base.” The pair, of a result-value and its result-base, can be referred to as a “result-pair.”
Having obtained a result-value, there are many situations in which it is useful to know various demographics about its result-base. An example situation, where such demographics are often useful, is where the research product is a profile. If a collection of sought-for values (i.e., a collection of result-values) has been identified, where each relates back to a common entity (as used herein, an “entity” can refer to virtually anything, regardless of whether the item referred-to is completely abstract or more concrete), the collection can be referred-to as a “profile.”
The utility of a “profile,” for describing entities of various types, is well known: if there is a need to quickly obtain an understanding of a particular entity, the review of a profile, if available, can be an extremely effective tool for doing so.
Some example profiles are as follows:
In general, the faster demographic data can be made available, regarding the result-bases forming the basis of a research project's product, the faster a productive use, of such research product, can be accomplished. Since automated (or largely automated) processes are, in general, faster than those that are manual (or largely manual), there is a need for tools that can automatically generate demographic data regarding such result-bases.
The accompanying drawings, that are incorporated in and constitute a part of this specification, illustrate several embodiments of the invention and, together with the description, serve to explain the principles of the invention:
Reference will now be made in detail to various embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
Please refer to the Glossary of Selected Terms, included at the end of the Detailed Description, for the definition of selected terms used below.
1 Overview
2 Example Search Results
2.1 Generic Search
2.2 Technology Profiling
2.3 Healthcare-related
2.4 Brand Research
3 Determining Demographics
3.1 Overview
3.2 Confidence Distributions
3.3 Combining Confidence Distributions
3.4 Declarant Demographics
4.1 Technology Profiling
4.2 Healthcare-related
4.3 Brand Research
5 Frame-Based Search
5.1 Overview
5.2 Instance Generation
5.3 Instance Merging
5.4 Instance Selection
5.5 Result Presentation
6.1 Technology Profiling
6.2 Healthcare-related
6.3 Brand Research
7 Further Information
7.1 Logical Form
7.2 Frame Extraction Rules
7.3 Features
7.4 Snippet Formation
7.5 Computing Environment
8 Glossary of Selected Terms
9 Summary
Section 2 presents some example products, of research projects, for which demographic data can be useful. In Section 3, several techniques for producing demographics, in an automated fashion, are presented. The techniques of Section 3 can be applied to a research product that has been produced by any technique, so long as the product satisfies the following two conditions:
However, the techniques of Section 3 are particularly useful when applied to an automated frame-based approach and this type of approach is presented in Section 4.
This section presents the following four types of example search results for which demographic determination can be suitable:
2.1 Generic Search
Within a search type, Search Object 111 represents the particular query for which the search is performed. For each of Sections 2.2 to 2.4 below, its Search Object is, respectively:
A “Search Aspect” provides a category, under which a collection of result-values can be organized.
Screen 100 shows that there are two basic “modes” by which a search result can be displayed:
In screen 100, Record View 101 has been selected. For each result-value shown, a number of records, of its corresponding result-base, can also be displayed. In
A definition, of a demographic, is given below in the Glossary of Selected Terms. The particular form of data display, used for the demographics of
Specifically, data grouping 101 contains result-value 121 and result-base 180. Result-value 121, in
Data grouping 102 contains result-value 122 and result-base 190. Result-value 122, in
2.2 Technology Profiling
The production of a profile, regarding a technology, can be useful as part of a technology scouting project. In technology scouting, a technology searcher begins with a problem (call it, in general, “P_1”) and looks for an existing technology (call it, in general, an “ET_1”) to solve or otherwise address P_1. If a technology search process (such as the search techniques discussed in the '122 and '127 Applications) has identified a candidate ET_1, a further evaluation, of the suitability of applying ET_1 to P_1, can be aided by having a profile of ET_1 (where the profile can be produced with the techniques of the '068 Application). If the technology searcher knows the demographics, of each result-base of the profile, the profile can be more useful to the evaluation of ET_1.
Definitions, respectively, for each of these demographic characteristics are as follows:
A brief discussion, of how the demographics depicted in
Specifically, result-pair 201 contains result-value “Hitachi” and result-base 280. Result-pair 202 contains result-value “Sony” and result-base 290. As can be seen, the records of result-base 280 are enumerated by starting at 1 (for the leftmost record) and continuing up to 27. The records of result-base 290 are enumerated by starting at 1 (for the leftmost record) and continuing up to 18.
2.3 Healthcare-Related
Healthcare-related content is a knowledge domain that is of both great importance and vast size. Items sought-for, in a healthcare-related search, can include the following: a treatment for a condition, the causes and/or complications of a condition and the pros and/or cons of a treatment. For a set of records (the result-base) identified as addressing any of these sought-for items (the result-value), understanding the result-base's demographics can be useful.
An example search result, for treatments to the condition “heart attack,” is shown in
These demographic characteristics are the same as those discussed above for a Technology Profile search. Therefore, please refer to Section 2.2 for definitions of them.
A brief discussion, of how the demographics depicted in
Specifically, result-pair 301 contains result-value “Aspirin” and result-base 380. Result-pair 302 contains result-value “Chocolate” and result-base 390. As can be seen, the records of result-base 380 are enumerated by starting at 1 (for the leftmost record) and continuing up to 78. The records of result-base 390 are enumerated by starting at 1 (for the leftmost record) and continuing up to 818.
2.4 Brand Research
A professional of marketing research is constantly seeking to better understand the perception of brands, as seen by members of a relevant market. As part of achieving this, a set of records (i.e., a result-base) can be identified as addressing an important characteristic (i.e., a result-value) of a brand. In a manner similar to that discussed above for a Technology Profile, a collection of result-pairs (each addressing a distinct but important characteristic) can be determined and the results presented in the form of a Brand Profile.
Once a Brand Profile has been determined, a next task can be to better understand the market members responsible for the result-value. Demographics can be very useful to achieving this goal.
Definitions, respectively, for each of these demographic characteristics are as follows:
A brief discussion, of how the demographics depicted in
Specifically, result-pair 401 contains result-value “bad for dentures” and result-base 480. Result-pair 402 contains result-value “removes lipstick” and result-base 490. As can be seen, the records of result-base 480 are enumerated by starting at 1 (for the leftmost record) and continuing up to 27. The records of result-base 490 are enumerated by starting at 1 (for the leftmost record) and continuing up to 18.
3.1 Overview
Having introduced the utility of demographics, for several example research projects, this Section addresses techniques for determining such demographics. Section 3.2 introduces “Confidence Distributions” as a way of representing the application of a demographic characteristic, both at the level of an individual record and for summarizing a population (or result-base) of records. Section 3.3 addresses techniques for combining Confidence Distributions. This is particularly useful for determining the summarizing Confidence Distribution of a population, since it can be produced by combining the Confidence Distributions produced for each individual record. Section 3.4 discusses types of demographic characteristics, and ways in which such demographic characteristics can be determined.
3.2 Confidence Distributions
Determination of a demographic characteristic “DC_1,” with respect to a population “P_1,” involves at least the following two levels of determination:
At the Individual Member level, two main types of results, of applying a demographic characteristic DC_1 to a population member M_1, are addressed:
Total Certainty can be regarded as a sub-variety of Partial Certainty, where the following two limitations apply:
Summarization of the demographic characteristic, at the Whole Population level, depends upon whether individual members of the population have been assigned values with Total or Partial Certainty.
With Total Certainty at the member level, a population can be summarized with a histogram: for each value, from the range of potential values, the number of members assigned such value can be provided. If desired, the histogram can be normalized, so that each value is assigned a number in the range of 0.0 to 1.0 and the assigned numbers sum to 1.0.
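The histogram summarization described above can be illustrated with a minimal sketch. The function name and the example gender assignments are hypothetical and are introduced only for illustration.

```python
from collections import Counter


def summarize_total_certainty(assigned_values, normalize=False):
    """Count the members assigned each value; optionally scale counts to sum to 1.0."""
    counts = Counter(assigned_values)
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Hypothetical population: each record has been assigned exactly one value.
members = ["Female", "Male", "Female", "Female", "Male"]
histogram = summarize_total_certainty(members)
normalized = summarize_total_certainty(members, normalize=True)
```

Here, histogram holds raw counts per value, while normalized holds, for each value, a number in the range of 0.0 to 1.0.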
With Partial Certainty at the member level, summarization for the population is more complex.
If desired, the confidence levels can be normalized such that they all fit within a predetermined range, such as 0.0 to 1.0 (in the manner of probabilities) or 0.0 to 100.0 (like percentages), and the sum of the confidence levels typically equals (but does not exceed) the maximum value of the range.
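One possible way to perform such normalization is sketched below. The sketch assumes that scaling the confidence levels, so that their sum equals the maximum of the range, is acceptable for the application. The function name, and the age ranges used as example values, are hypothetical.

```python
def normalize_confidences(confidences, range_max=1.0):
    """Scale non-negative confidence levels so that they sum to range_max."""
    total = sum(confidences.values())
    if total == 0:
        # No evidence for any value: leave every confidence level at zero.
        return {value: 0.0 for value in confidences}
    return {value: range_max * level / total for value, level in confidences.items()}


# Hypothetical un-normalized confidence levels for an age-range demographic.
raw = {"18-29": 3.0, "30-49": 1.0, "50+": 1.0}
as_probabilities = normalize_confidences(raw)                 # range 0.0 to 1.0
as_percentages = normalize_confidences(raw, range_max=100.0)  # range 0.0 to 100.0
```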
Each of these Confidence Distributions, however, can be interpreted, with respect to a single record, as follows:
3.3 Combining Confidence Distributions
Once the Confidence Distribution has been produced, for each member of a population P_1, any suitable technique can be used to combine such Confidence Distributions into a value or values that appropriately summarize P_1 with respect to a demographic characteristic DC_1. For purposes of example, one combining technique is presented herein.
The combining technique presented herein is depicted graphically in
The combining techniques of
Utilization of multiple Approaches, for determining a same demographic characteristic, can be useful in a variety of situations. For example, the Confidence Distributions of different Approaches can reinforce each other, thus leading to higher net confidence levels in the values identified for a record. Also, the impact of an erroneous Confidence Distribution, from one Approach, can be mitigated by other Approaches producing more accurate Confidence Distributions.
Once multiple Confidence Distributions, resulting from the application of multiple Approaches to a single record M_1, have been combined to produce a single Confidence Distribution, this single Confidence Distribution can be treated, for purposes of combining M_1 with other records, as the single Confidence Distribution for M_1. Thus, in the example discussed above, where CD 1 and CD 2 were each described as representative of different records, respectively, M_1 and M_2, it is possible that each of CD 1 and CD 2 has been produced by some prior combining process, in which the results of multiple Approaches were applied to M_1 and M_2.
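The two levels of combining discussed above (across Approaches for a single record, and then across records) can be sketched as follows. The sketch uses a simple per-value average as the combining technique. This is an assumption made for illustration, chosen as one example of "any suitable technique," and is not necessarily the combining technique depicted in the drawings. All names and confidence values are hypothetical.

```python
def combine_distributions(distributions):
    """Combine Confidence Distributions by averaging the confidence of each value."""
    if not distributions:
        return {}
    values = set().union(*distributions)  # every value seen in any distribution
    count = len(distributions)
    return {
        value: sum(d.get(value, 0.0) for d in distributions) / count
        for value in sorted(values)
    }


# Record M_1: two Approaches each produced a Confidence Distribution.
approach_a = {"Female": 0.9, "Male": 0.1}
approach_b = {"Female": 0.7, "Male": 0.3}
cd_1 = combine_distributions([approach_a, approach_b])

# Record M_2: a single Confidence Distribution.
cd_2 = {"Female": 0.4, "Male": 0.6}

# Whole Population level: combine the per-record distributions.
population_cd = combine_distributions([cd_1, cd_2])
```

Because cd_1 is itself the product of a prior combining step, it is treated, when combined with cd_2, exactly like any other per-record Confidence Distribution.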
The variable current_search_aspect (line 1,
The “for” loop, of lines 3-8,
In the same manner as described above (for the “for” loop of lines 3-8), the “for” loop, of lines 10-15,
For result-pair 401, each of the “for” loops iterates over the 27 records, regarding “Bad for dentures,” to produce Confidence Distributions 470 and 472. Lines 17-18,
In its second iteration, the “for” loop of lines 1-20 sets the current result-pair to 402 (see
Within the general requirement, for a display or visualization that can represent a variation of confidence, as a demographic's values vary, any suitable technique, for data display or visualization, can be used. Some of these display or visualization techniques can include (but are in no way limited to) the following:
3.4 Declarant Demographics
Example demographic characteristics, discussed above, are as follows:
Characteristics 1 and 2 can be put under the more general classification of “Declarant Demographics,” where a Declarant Demographic can be any demographic characteristic regarding the Declarants of a result-base's records. More example Declarant Demographics can include the following:
This section focuses on Approaches (where an “Approach” was introduced above in Section 3.3) for determining Declarant Demographics. The Approaches discussed herein can be used individually, in any combination with each other or in combination with other Approaches not addressed herein. The Approaches discussed herein can be summarized as follows:
Each of these Approaches is now addressed in greater detail.
3.4.1 Linguistic Clue
3.4.1.1 Lexical-to-Demographic Association
The technique of “lexical-to-demographic association,” when applicable to a demographic characteristic “DC_1,” works as follows. If a particular lexical unit is present in a record, there is a certain (above zero) probability that the Declarant of the record has characteristic “DC_1.” Depending on the demographic sought and the lexical unit detected, the probability can range from low or inaccurate (e.g., 0.2) to high or accurate (e.g., 0.9). Even a low level of probability, however, can be useful—particularly if combined with probability information determined from other Approaches to the same demographic.
Lexical-to-demographic association can be used, for example, with regard to determining the geographic location of a Declarant. This is because certain lexical units are known to be more frequently utilized (or, perhaps, only utilized) in certain geographical areas. Thus, if lexical units are included in a record, where such lexical units are indicative of a geographical area “GA1,” there is a certain (above zero) probability that the Declarant of the record is from GA1.
Sources of geographically-indicative language include the web site (www.UrbanDictionary.com) and books (such as “Urban dictionary: fularious street slang defined,” Andrews McMeel Publishing, Kansas City, Mo., 2005) by Aaron Peckham. For example, the word “hyphy” has been associated with the area of Oakland, in the San Francisco Bay Area, CA, U.S.A.
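A minimal sketch of lexical-to-demographic association, for the geographic location demographic, is as follows. The table structure, the function name and the probability of 0.6 are hypothetical. Only the association of “hyphy” with the Oakland area is taken from the discussion above.

```python
import re

# Lexical unit -> (geographical area, probability). The probability is an assumed value.
LEXICAL_TO_AREA = {
    "hyphy": ("Oakland, CA", 0.6),
}


def geographic_clues(record_text):
    """Return (area, probability) for each geographically-indicative lexical unit found."""
    tokens = re.findall(r"[a-z']+", record_text.lower())
    return [LEXICAL_TO_AREA[token] for token in tokens if token in LEXICAL_TO_AREA]


clues = geographic_clues("That show was so hyphy!")
```

Each clue found can then be converted to a Confidence Distribution and combined with the results of other Approaches.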
3.4.1.2 Self-Referential Demographic Identification
For some demographic characteristics, a record can be analyzed for statements wherein the Declarant describes himself or herself as having a sought-for characteristic “DC_1.” If a self-referential statement is found that has the sought-for properties, there is a certain (above zero) probability that the Declarant of the record has the characteristic DC_1. This technique can be referred to herein as “self-referential demographic identification.” As with “lexical-to-demographic association,” the probability can range from low or inaccurate (e.g., 0.2) to high or accurate (e.g., 0.9).
For example, a linguistic rule can be written that triggers upon a Logical Form that satisfies all of the following properties:
Such linguistic rules can be written in the form of “frame extraction rules,” as discussed below in Sections 4-6 and defined in Section 7.2 (“Frame Extraction Rules”). However, the “action” portion of a frame extraction rule, suitable for identification of a demographic characteristic, does not need to produce a frame instance when triggered. Instead, the action needs to indicate, with an appropriate Confidence Distribution, presence of the demographic characteristic.
Self-referential demographic identification can be used, for example, with regard to determining the geographic location of a Declarant. For example, a linguistic rule can be written that triggers upon a Logical Form satisfying all of the following properties:
Self-referential demographic identification can also be used, for example, with regard to determining the gender of a Declarant. For example, a linguistic rule can be written that triggers upon a Logical Form satisfying all of the following properties:
For the linguistic rules presented thus far, the confidence in the presence of the demographic characteristic, if found, is very high (e.g., 1.0 on a scale of 0.0 to 1.0). However, the confidence of a match, by a particular linguistic rule, can vary depending upon the particular lexical unit (or units) that are part of the match. In this case, lexical units associated with the detection of a particular value V_1 (such as “Female”), from the range of values (e.g., Male or Female) that can be assigned by a demographic characteristic DC_1 (e.g., gender) to a member M_1 of its population, can be paired with an appropriate confidence level that V_1 is, in fact, present. Any suitable data format, to represent such pairing, can be used. For purposes of simplicity of exposition herein, the above-given feature set for FEMALE can be expressed as follows:
The above feature set of pairs contains all of the same lexical units as present in the non-paired form, except the lexical unit “secretary” has been added. As can be seen, “secretary” is the one lexical unit shown that is not paired with a confidence level of 1.0. This is because a Declarant, describing himself or herself as a “secretary,” does not lead to Total Certainty (where “Total Certainty” is discussed above in Section 3.2 “Confidence Distributions”) that the Declarant is female (e.g., a confidence level of 0.7 is shown). Depending upon the application, the pairing can be between a lexical unit and a Confidence Distribution. In the case of the gender demographic, since only two values are possible, a Confidence Distribution need only have two values. For example, FEMALE can be expressed as follows (where each Confidence Distribution is ordered with the confidence values for Female, Male):
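For purposes of illustration, the pairing of lexical units with confidence levels, and the resulting Confidence Distribution, can be sketched in code. Several assumptions are made: a regular expression stands in for a linguistic rule operating on a Logical Form; the lexical units “mother” and “wife” are hypothetical members of the feature set (only “secretary,” with its confidence level of 0.7, is taken from the discussion above); and the confidence for Male is assumed to be the remainder of 1.0.

```python
import re

# Lexical unit -> confidence that the Declarant is female.
# "mother" and "wife" are hypothetical entries; "secretary" at 0.7 follows the text.
FEMALE = {"mother": 1.0, "wife": 1.0, "secretary": 0.7}

# Stand-in for a linguistic rule: the Declarant describes himself or herself.
SELF_REFERENCE = re.compile(r"\bI(?: am|'m) an? (\w+)", re.IGNORECASE)


def gender_distribution(record_text):
    """Return a Confidence Distribution over (Female, Male), or None if no rule match."""
    match = SELF_REFERENCE.search(record_text)
    if match is None:
        return None
    lexical_unit = match.group(1).lower()
    if lexical_unit not in FEMALE:
        return None
    female = FEMALE[lexical_unit]
    return {"Female": female, "Male": 1.0 - female}  # assumed complement for Male


partial = gender_distribution("I am a secretary at a law firm.")
total = gender_distribution("I'm a mother of two.")
```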
3.4.2 Content Source Demographics
It is often the case that the producer (or publisher) of a content source keeps demographic data on its content contributors and users. Also, there are companies that specialize in producing demographic data on content providers.
For a record “M_1,” of a result-base, its content producer “C_1” can be identified and the demographics, of such content source, can be accessed. Such demographic information can be used to deduce a Confidence Distribution, for the Declarant of “M_1,” with respect to a particular demographic characteristic “DC_1.”
For example, DC_1 can be gender and the demographic data, for C_1, can be that 90% of its contributors are female while only 10% are male. Thus, in the particular case of record M_1, it can be reasonable to deduce that there is a 90% probability that its Declarant is female.
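The deduction just described can be sketched as follows, under the assumption that the demographic data of a content source is available as a table of contributor shares. The table, the record layout and the function name are hypothetical.

```python
# Hypothetical demographic data, keyed by content source, for the gender demographic.
SOURCE_DEMOGRAPHICS = {
    "C_1": {"Female": 90, "Male": 10},  # percent of contributors
}


def declarant_distribution(record):
    """Deduce a Confidence Distribution for a record's Declarant from its content source."""
    shares = SOURCE_DEMOGRAPHICS.get(record["source"])
    if not shares:
        return None  # no demographic data is available for this content source
    total = sum(shares.values())
    return {value: share / total for value, share in shares.items()}


m_1 = {"id": "M_1", "source": "C_1"}
cd_m_1 = declarant_distribution(m_1)
```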
An example category of content source is online sources, such as Internet web sites. In this case, many web sites compile demographic data on their contributors and users. Also, there are companies that compile demographics across many online content sources.
Example web sites, that compile demographic data on their contributors and users, include:
An example company, that compiles demographics across many online content sources, is www.QuantCast.com, operated by Quantcast Corporation, San Francisco, Calif., U.S.A. Quantcast provides a database wherein a Uniform Resource Locator (or “URL”) can be input and a variety of demographics, describing that URL, are output.
3.4.3 Explicit Declarant Information
If explicit information about the Declarant of a record is available, it can be used to deduce demographics of the record's Declarant.
For example, the Declarant's name can be included as part of a record (sometimes in a specific field where “authors” are identified). Based on a Declarant's name, demographics, such as the Declarant's likely gender and/or age, can be determined.
For example, if a Declarant's name is “Mary,” it can be deduced, with a high level of probability, that the Declarant is female. However, since the name “Mary” has been popular for a long time, and remains popular, it is not useful for deducing the age of the Declarant. Names such as “Gertrude” or “Beatrice” are no longer popular and therefore it can be deduced, with a certain level of probability, that the Declarant is in an older age range (such as 50 years or older).
An example database, that provides detailed information on the popularity of names over a long period of time (e.g., over the past 100 years), is www.BabyNameWizard.com, operated by Laura Wattenberg, Wellesley, Mass., U.S.A., and Generation Grownup, LLC.
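A sketch of name-based deduction is as follows. The table of name clues, and all confidence values in it, are hypothetical and are not drawn from any actual name-popularity database. Consistent with the discussion above, the entry for “Mary” carries no age information.

```python
# First name -> demographic characteristic -> Confidence Distribution (assumed values).
NAME_CLUES = {
    "mary": {
        "gender": {"Female": 0.95, "Male": 0.05},
    },
    "gertrude": {
        "gender": {"Female": 0.95, "Male": 0.05},
        "age": {"under 50": 0.2, "50 or older": 0.8},
    },
}


def clues_from_name(first_name):
    """Return the Confidence Distributions deducible from a Declarant's first name."""
    return NAME_CLUES.get(first_name.strip().lower(), {})


mary = clues_from_name("Mary")
gertrude = clues_from_name("Gertrude")
```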
3.4.4 Machine Learning
Additional linguistic patterns for detection of demographics, that are not amenable to being manually deduced, can be created by the application of automated machine learning procedures. Any suitable machine learning procedures can be used, with such procedures executed on source (or “training”) corpora.
An example type of linguistic pattern, that can be deduced from machine learning, is as follows. It can be determined that the presence of a particular lexical unit, in a record, implies a certain (above zero) probability that the record's Declarant has a particular demographic characteristic. In this situation, machine learning can be used to produce additional lexical-to-demographic associations, as described above in Section 3.4.1.1.
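As a minimal sketch of how such associations could be learned, the following code estimates, by relative frequency over a labeled training corpus, the probability of each demographic value given the presence of a lexical unit. Relative-frequency estimation is used here only as a simple stand-in for “any suitable machine learning procedure.” The function name, the min_count threshold and the training data are hypothetical.

```python
from collections import Counter, defaultdict


def learn_associations(training_records, min_count=2):
    """Estimate P(demographic value | lexical unit present) from (text, value) pairs."""
    unit_counts = Counter()
    pair_counts = defaultdict(Counter)
    for text, value in training_records:
        for unit in set(text.lower().split()):  # count presence, not frequency
            unit_counts[unit] += 1
            pair_counts[unit][value] += 1
    # Keep only lexical units seen often enough to support an estimate.
    return {
        unit: {value: n / total for value, n in pair_counts[unit].items()}
        for unit, total in unit_counts.items()
        if total >= min_count
    }


# Hypothetical training corpus: record text paired with a known Declarant location.
training = [
    ("that was hyphy", "Oakland"),
    ("so hyphy today", "Oakland"),
    ("hyphy crowd tonight", "Elsewhere"),
    ("nice day", "Elsewhere"),
]
associations = learn_associations(training)
```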
Frames and frame-based search systems are discussed extensively in the following patent applications (see citations above): the '122 Application, the '127 Application and the '068 Application. The entirety of each of these applications is incorporated by reference in the present description. However, for purposes of convenience, certain information of such applications is repeated herein.
A key advantage, of a frame-based search system, is that result-pairs can be generated automatically. The following nomenclature can be used herein:
For any of the techniques described in Sections 2 (“Example Search Results”) and 3 (“Determining Demographics”), the result-pairs, result-values and result-bases can be replaced by, respectively, their frame-produced versions. In terms of determining demographics, as addressed above in Section 3 (“Determining Demographics”), the “items” or “records” processed, for purposes of evaluating a demographic characteristic, can be replaced by snippets. For example, with respect to
A generic frame-based search engine (or FBSE) is described below in Section 5 (“Frame-Based Search”) and, more particularly, in Section 5.1.3. The following sub-Sections 4.1 to 4.3 show how to apply this FBSE to each of the three example search areas.
4.1 Technology Profiling
Technology profiling and, more broadly, the profiling of an entity, is addressed extensively in the '068 Application. While the entirety of the '068 Application has been incorporated by reference, for purposes of convenience, certain information of such application is repeated herein.
Described herein are techniques for generating a profile of an entity as it is addressed by a corpus of natural language (or “Source Corpus”). More particularly, the profile is generated by using a set of frames referred to as an “Entity Profile Frame Set.” Each frame, of an Entity Profile Frame Set, shares a role in common, called herein an “Anchor Role.” For each instance produced from an Entity Profile Frame Set, the value assigned to its Anchor Role is called herein an “Anchor Role Value.” The particular entity, that an Anchor Role Value indicates (or maps to) is called herein an “Anchor Entity.”
The profile of an entity (or an “Entity Profile”) is a set of instances (called herein an “Entity Profile Instance Set”) that satisfies the following two properties:
An “Anchor Entity” (as used herein) is an abstraction, defined, in practice, by the range of Anchor Role Values that are understood as indicating a same Anchor Entity.
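The relationship between Anchor Role Values and an Anchor Entity can be illustrated with a short sketch. It shows only the grouping of instances whose Anchor Role Values indicate a same Anchor Entity, and does not attempt to express the full definition of an Entity Profile Instance Set. The role names, the alias table and the example instances are hypothetical.

```python
ANCHOR_ROLE = "technology"  # hypothetical name of the role shared by all frames

# Anchor Entity -> the range of Anchor Role Values understood as indicating it.
ANCHOR_ENTITIES = {
    "lithium-ion battery": {"lithium-ion battery", "li-ion battery", "lithium ion batteries"},
}


def anchor_entity_for(anchor_role_value):
    """Map an Anchor Role Value to its Anchor Entity, or None if it indicates none."""
    normalized = anchor_role_value.strip().lower()
    for entity, values in ANCHOR_ENTITIES.items():
        if normalized in values:
            return entity
    return None


def instances_for_entity(entity, instances):
    """Collect the instances whose Anchor Role Value indicates the given Anchor Entity."""
    return [i for i in instances if anchor_entity_for(i["roles"][ANCHOR_ROLE]) == entity]


instances = [
    {"frame": "Benefits Frame", "roles": {"technology": "Li-ion battery", "benefit": "high energy density"}},
    {"frame": "Benefits Frame", "roles": {"technology": "solar cell", "benefit": "no fuel cost"}},
    {"frame": "Benefits Frame", "roles": {"technology": "lithium ion batteries", "benefit": "low self-discharge"}},
]
profile_instances = instances_for_entity("lithium-ion battery", instances)
```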
With regard to the example technology profiling of Section 2.2 above, and its illustration in
An example Entity Profile Frame Set is presented below in Section 6.1 (“Technology Profiling”). In the remainder of this Section, the operation of this Entity Profile Frame Set is illustrated by pursuing an example Entity Profile Instance Set through the operation of the generic FBSE. For this example, it is assumed that the Entity Profile Instance Set corresponds to an Instance Superset.
4.1.1 Instance Generation and Merging
An Instance Superset, to which Instance Merging can be applied, is assumed to have been already produced and is depicted in
The frames upon which each instance is based are as follows:
After the merging, the following instances, with their instance-mentions, are as follows:
4.1.2 Instance Selection
Specifically, all the instances of
Thus, the instances of
In terms of the example Technology Profiling discussed above in Section 2.2 (“Technology Profiling”) and illustrated in
Each of 730, 731 and 732 corresponds to what is called an “Aspect” in
4.2 Healthcare-Related
4.2.1 Overview
A search engine, specialized for the domain of healthcare-related computer-accessible content, can be referred to as a “healthcare-related search engine” (or “HRSE”). Currently available HRSE's include, for example, the following web sites: PubMed (provided by the United States National Library of Medicine of the National Institutes of Health), WebMD (provided by the WebMD Health Corporation) and Healthline (provided by Healthline Networks, Inc.).
An approach to a frame-based search engine (“FBSE”) is presented in this section (Section 4.2 “Healthcare-related”). More particularly, the principles of frame-based search are applied to the domain of healthcare-related knowledge. The resulting system can be referred to as a frame-based HRSE.
The development of a frame-based HRSE includes the development of a Frame Set (called a “Healthcare Frame Set” or “HFS”) that models concepts of particular importance to people working within the healthcare field (or “healthcare professionals”).
4.2.2 Healthcare Frame Set
This section presents an example HFS, called “HFS52,” that contains 5 frames, with each frame having two roles. Each frame of HFS52 is depicted in
Within each frame of
A set of values, that can be assigned to the roles of an “instance” of a frame, is indicated in
For each frame of
More detailed discussion, of each of the five frames of HFS52, can be found in Section 6.2 (“Healthcare-related”).
4.2.3 Frame-Based HRSE
An approach, to implementing a frame-based HRSE, is as follows: utilize four frame-based search engines (or FBSE's), where each such FBSE has been described, generically, in Section 5.1.3. Each of the four FBSE's accomplishes the following:
Each of the main steps of an FBSE, customized for Healthcare-related search, is described in more detail in the following sub-sections of Section 4.2.
4.2.4 Example Instance Generation
To illustrate the principles of Instance Generation, presented in Section 5.2, this section (Section 4.2.4) presents an example of Instance Generation related to a specific healthcare-related search.
Specifically, the example of Instance Generation relates to a user seeking to find treatments for the condition “heart attack.”
As presented in Section 2.3 (“Healthcare-related”), the results of the search (shown in
According to the Pre-Query Processing of Section 5.2.2, the results of
Applying the Post-Query Processing of Section 5.2.3.1 (“Producing A Query Selective Corpus”), with the FBDB being FBDB (Treatment Frame) and the query being “heart attack,” can produce a Query Selective Corpus that includes snippets 801-804 of
Applying the Post-Query Processing of Section 5.2.3.2 (“Producing An Instance Superset”), to snippets 801-804, can produce an Instance Superset that includes instances 810-815 of
4.2.5 Example Instance Merging
An example Instance Superset (produced above in Section 4.2.4), to which Instance Merging can be applied, is depicted in
After the merging,
Regarding instances 820 and 821, each contains the following instance-mentions:
4.2.6 Example Instance Selection
Specifically, all the instances of
Thus, the instances of
4.2.7 Example Result Presentation
4.2.7.1 Role-Value Oriented Presentation
The role-value oriented approach, to Search Result presentation, can be illustrated with the example Search Result fragment of
An example, of being able to see the snippets forming the basis of a role-value-oriented search result, is shown in
4.2.7.2 Grouping by Role Value Type
For example, in the case of a frame-based HRSE, it can be useful to group the treatments, found for a condition, according to type. Example types, into which treatments can be grouped, include the following: “Drugs and Medications,” “Foods and Plants,” and “Other Treatments.” An example use of these three types, for grouping potential treatments for “heart attack,” is shown in
4.2.7.3 Grouping by Frame Type
In the example of the treatments for a heart attack, as shown in
However, for the example of finding the pros and cons of using aspirin, as shown in part 902 of
Part 903, of
4.3 Brand Research
An FBSE that performs brand research, in the manner of the example of Section 2.4 (“Brand Research”), can be constructed very similarly to a technology profiling FBSE, as discussed above in Section 4.1 (“Technology Profiling”). Rather than producing a profile of a technology, the profile of a “brand” is produced instead.
To determine the pros and cons of a brand, essentially the same techniques can be used as those that were described for finding the pros and cons of a technology.
Finding the pros and cons of a technology is discussed in the following sections:
For any brand-related frame extraction rule, rather than use the feature TECHNOLOGY, as is used, for example, in the example frame extraction rule of Section 6.1.2.1 (“Benefits Frame”), a feature called BRAND can be substituted. It is possible to produce a useful brand research system where the definition, for the BRAND feature, is essentially the same as that given (see, for example, Section 7.3 “Features”) for TECHNOLOGY.
In addition to a Pros Frame and a Cons Frame, additional frames, that can be useful for brand research, include:
This section (i.e., Section 5) addresses how a search result can be produced using frames, where such search result uses the knowledge (or semantics) expressed in the corpus of natural language (or “Source Corpus”) that is searched.
5.1 Overview
5.1.1 Frames
In general, a frame is a structure for representing a concept, wherein such concept is also referred to herein as a “Frame Concept.” A frame specifies a concept in terms of a set of “roles.” Any type of concept can be represented by a frame, as long as the concept can be meaningfully decomposed (or modeled), for the particular application, by a set of roles.
The attribute “role name” stores a label for a role that is unique (at least within its frame). In
A role's value requires some kind of representation, referred to herein as its “role value representation” (in
Depending upon the role, and its function in representing a frame's Frame Concept, a particular “type” (or types) of role value can be assigned to it. Thus, among the full set of values that could otherwise be assigned to a role, a “role type” serves to limit the set of permissible values. The type of a role value can be specified by one or more attributes. In the example of
A set of frames, that serves as the semantic basis of a frame-based search, can be called the search's “Frame Set.” Example Frame Sets, for the example search types discussed herein, are presented below in Section 6 (“Example Frame Sets”).
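As a minimal sketch of the structures described in this section, a frame and a frame instance might be modeled as follows. The class and function names (Role, Frame, FrameInstance, instantiate) are illustrative assumptions, not part of the described system. A role type is reduced here to a tuple of type labels, the role names are borrowed from the Benefit Frame TP of Section 6.1.2.1, and the role values are made-up sample data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    """A role of a frame: a name unique within the frame, plus permitted types."""
    name: str
    types: tuple = ()  # hypothetical type labels limiting the permissible values


@dataclass(frozen=True)
class Frame:
    """A Frame Concept decomposed into a set of roles."""
    name: str
    roles: tuple

    def __post_init__(self):
        names = [r.name for r in self.roles]
        if len(names) != len(set(names)):
            raise ValueError("role names must be unique within a frame")


@dataclass
class FrameInstance:
    """A frame in which a role value has been assigned to each role."""
    frame: Frame
    values: dict = field(default_factory=dict)  # role name -> role value


def instantiate(frame, **role_values):
    """Build an instance, requiring exactly one value per role of the frame."""
    expected = {r.name for r in frame.roles}
    if set(role_values) != expected:
        raise ValueError(f"expected values for roles {sorted(expected)}")
    return FrameInstance(frame, dict(role_values))


benefit_tp = Frame(
    "Benefit Frame TP",
    (Role("Technology_Role", ("TECHNOLOGY",)), Role("Benefit_Role")),
)
# Sample role values are invented for illustration only.
example = instantiate(
    benefit_tp, Technology_Role="fuel cell", Benefit_Role="benefit A"
)
```

A Frame Set, under the same assumptions, could simply be a collection of such Frame objects keyed by name.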
5.1.2 Frame Extraction Rules
A particular “invocation” (see below Glossary of Selected Terms for definition) of a Frame Concept, by a “UNL” (see below Glossary of Selected Terms for definition), can be represented by an “instance” of the frame (also called a “frame instance”). A frame instance is the same as the frame itself, except that, for each role, a value (also referred to herein as a “role value”) has been assigned (in
Identification, of when a frame's Frame Concept is invoked by a UNL, can be determined by a set of linguistic rules, each rule herein called a “frame extraction rule.” A set of frame extraction rules, that all relate to a particular frame, can be called the frame's “Rule Set.” Ideally, a frame's Rule Set is able to detect whenever the frame's Frame Concept is invoked, and thereby produce a frame instance representing each particular use of the Frame Concept. “Frame extraction,” as used herein, refers to the utilization of a frame extraction rule to determine whether a frame is invoked by a UNL.
Example frame extraction rules, for the example search types discussed herein, are presented below in Section 6 (“Example Frame Sets”).
5.1.3 Frame-Based Search Engine
A Frame-Based Search Engine (FBSE), that accepts a user's query and outputs a search result, can be described as operating in three main steps (see
Instance Generation is performed before the steps of Instance Merging or Instance Selection. Instance Merging and Instance Selection, however, can be performed in either order, depending upon the particular application. For example, if the ordering is Instance Merging 1420 followed by Instance Selection 1430, then the input to Instance Merging 1420 is Instance Superset 1405 and the input to Instance Selection 1430 is Merged Superset 1406. Alternatively, if the ordering is Instance Selection 1430 followed by Instance Merging 1420, then the input to Instance Selection 1430 is Instance Superset 1405 and the input to Instance Merging 1420 is Search Result 1404 (with Merged Superset 1406, produced by Instance Merging 1420, serving as the actual search result).
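The two possible orderings can be sketched as follows. This is a minimal illustration only: the function name run_after_generation and the toy merge and select stand-ins are assumptions introduced here, and an instance is reduced to a pair of (input role value, output role value). The sample values are placeholders.

```python
def run_after_generation(instance_superset, query, merge, select, merge_first=True):
    """Apply Instance Merging and Instance Selection in either order."""
    if merge_first:
        merged_superset = merge(instance_superset)   # Instance Superset -> Merged Superset
        return select(merged_superset, query)        # Merged Superset -> Search Result
    search_result = select(instance_superset, query)  # Instance Superset -> Search Result
    return merge(search_result)                       # merged output serves as the result


def toy_merge(instances):
    """Stand-in for Instance Merging: collapse exact duplicates, keeping order."""
    return list(dict.fromkeys(instances))


def toy_select(instances, query):
    """Stand-in for Instance Selection: keep instances whose input role equals the query."""
    return [inst for inst in instances if inst[0] == query]


superset = [
    ("heart attack", "aspirin"),
    ("heart attack", "aspirin"),
    ("condition X", "treatment Y"),
]
merge_then_select = run_after_generation(superset, "heart attack", toy_merge, toy_select)
select_then_merge = run_after_generation(
    superset, "heart attack", toy_merge, toy_select, merge_first=False
)
```

In this toy case both orderings yield the same single instance. With approximate matching of role values, the two orderings may not produce identical results.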
Production of a search result can be accomplished by using, for example, the computing environment described in Section 7.5.
5.2 Instance Generation
5.2.1 Overview
The Instance Superset can be generated in accordance with any suitable technique, depending on the particular application. While the principles described herein can be applied to a small Source Corpus, this section (Section 5.2) will focus upon instance generation where the Source Corpus is large. Small and large Source Corpora are defined as follows:
Instance Generation is described below in conjunction with
5.2.2 Pre-Query Processing
5.2.2.1 Overview
The objective of pre-query processing is to produce a “Frame-Based DataBase” (FBDB) from the Source Corpus. An FBDB means that a Source Corpus has been analyzed for where (if at all) certain concepts are used within it. The concepts, for which the Source Corpus is analyzed, are the Frame Concepts of the Organizing Frames of the FBDB. An FBDB may be produced for just one Frame Concept as represented by one Organizing Frame.
Production of an FBDB means that, at least, an index has been produced. The index permits the fast location of occurrences, in the Source Corpus, of concepts modeled by the Organizing Frames.
Thus, in
Pre-query processing can be divided into the following two main operations:
Each of these operations is described below.
5.2.2.2 UNL by UNL Preprocessing
A large Source Corpus can be processed on a UNL-by-UNL basis (e.g., on a sentence-by-sentence basis) to produce an FBDB. For each UNL (“UNL_current”) processed, each potentially applicable frame extraction rule (“rule_current”) is evaluated for whether it is invoked by UNL_current to produce an instance (“I_current”).
Whether a rule_current is evaluated depends upon the FBDB being generated and the frame(s) such FBDB includes as its Organizing Frames.
To determine whether a rule_current applies to a UNL_current, each UNL_current can be converted, by a semantic parser, into a representation known as “Logical Form.” Logical Form is described in greater detail in below Section 7.1 (“Logical Form”). To present more detailed definitions, of example frame extraction rules, a pseudo-coded representation is defined below in Section 7.2 (“Frame Extraction Rules”).
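The UNL-by-UNL loop can be sketched as follows. This is a minimal illustration under stated assumptions: sentences are split naively on terminal punctuation, and the rule improve_rule is a toy regular-expression stand-in for a frame extraction rule. A rule as described herein would be evaluated against the Logical Form of the UNL rather than against its raw text. All function names are hypothetical.

```python
import re


def split_into_unls(text):
    """Naively split text into sentences (the UNL used here)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def improve_rule(unl):
    """Toy rule: fires on '<X> improves <Y>' and returns an instance, else None."""
    m = re.match(r"^(.+?)\s+improves\s+(.+?)[.!?]?$", unl, re.IGNORECASE)
    if m is None:
        return None
    return {"frame": "Toy Benefit Frame", "input": m.group(1), "output": m.group(2)}


def preprocess(corpus_text, rules):
    """Evaluate each rule_current against each UNL_current.

    Returns a list of (UNL_current, I_current) pairs, one per rule that fired.
    """
    produced = []
    for unl_current in split_into_unls(corpus_text):
        for rule_current in rules:
            i_current = rule_current(unl_current)
            if i_current is not None:
                produced.append((unl_current, i_current))
    return produced


pairs = preprocess("Exercise improves bone density. The sky is blue.", [improve_rule])
```

Each returned pair couples an instance with the UNL that produced it, which is the information a subsequent indexing step would need.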
5.2.2.3 Pre-Query Indexing
For each UNL (“UNL_current”) that produces an instance I_current, as a result of the UNL-by-UNL processing, an amount of content, referred to herein as a “snippet,” that at least includes UNL_current, can be indexed for an FBDB. The index then makes it possible, during post-query processing, to provide a suitably fast response to a user's query.
Design of a Pre-query Indexing process involves the following choices:
For keyword indexing, any kind of conventional keyword index can be produced. For this type of index, each word of each snippet, except for “stop words” (see below Glossary of Selected Terms for definition), can be indexed.
For frame indexing, the following can be performed. Each time a focus UNL causes the invocation of a frame “F_1,” to produce an instance “I_1,” each word, of the role value of I_1's input role, can be indexed.
A frame index is likely to yield fewer snippets, in response to a user's query, than a keyword index. If the topic sought for searching is thoroughly discussed by the Source Corpus (e.g., it is a well-known condition or treatment), then utilization of a frame index will likely yield sufficient results. If the topic sought for search is infrequently discussed by the Source Corpus (e.g., it is a rare condition), then a frame index may not produce sufficient results. For such infrequently-referenced topics, a user may want to apply a keyword index as an addition to, or instead of, a frame index.
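The difference between the two index types can be illustrated with the following sketch. It assumes that each indexed record is a pair of a snippet and the role value of the input role of the instance produced from it. The stop-word list, the function names, and the sample records are assumptions made for illustration. Stop words are also dropped from the frame index here, which is a choice of this sketch.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "of", "for", "is", "to", "and", "in"})


def lexical_units(text):
    """Lower-cased words of text, excluding stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_indexes(records):
    """records: sequence of (snippet, input role value) pairs, one per instance.

    Returns (keyword_index, frame_index), each mapping a word to a set of snippet ids.
    """
    keyword_index = defaultdict(set)
    frame_index = defaultdict(set)
    for snippet_id, (snippet, input_role_value) in enumerate(records):
        for word in lexical_units(snippet):           # every non-stop word of the snippet
            keyword_index[word].add(snippet_id)
        for word in lexical_units(input_role_value):  # only words of the input role value
            frame_index[word].add(snippet_id)
    return dict(keyword_index), dict(frame_index)


# Sample records, invented for illustration.
records = [
    ("Aspirin is used to treat heart attack.", "heart attack"),
    ("Heart attack can damage the heart.", "heart attack"),
]
keyword_index, frame_index = build_indexes(records)
```

With these records, a lookup of "aspirin" finds a snippet in the keyword index but none in the frame index, since "aspirin" never occupies the input role. A lookup of "heart" finds both snippets in either index.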
5.2.2.4 Alternation
Alternation, between UNL-by-UNL Preprocessing and Pre-query Indexing, can vary depending upon the particular application and the particular desired characteristics of the FBSE. For example, essentially all UNL-by-UNL Preprocessing can be completed first, before Pre-query Indexing is begun. As another example approach, pre-query processing can alternate (between UNL-by-UNL Preprocessing and Pre-query Indexing) on a UNL-by-UNL basis.
5.2.3 Post-Query Processing
The objectives of post-query processing (shown within dotted outline 1417 of
Each of these operations is further described below.
5.2.3.1 Producing a Query Selective Corpus
Production of a reduced corpus can also be called production of a query-selective corpus, since a query is the basis by which to select limited content from a Source Corpus.
Once produced, as described above (see Section 5.2.2.3 “Pre-query Indexing”), an index or indexes can be used (by, for example, Corpus Reduction process 1413 of
For example, when used with a keyword index, the query can be decomposed into a set of its constituent lexical units, excepting any stop words (while well known in the art, a definition of stop words is presented in the Glossary). Such set of lexical units is called herein a “constituent lexical unit set.” Standard techniques can then be used, that access the keyword index with each member of the constituent lexical unit set and produce an initial set of snippets for a Query Selective Corpus. In essentially the same manner as for a keyword index, a frame index can be accessed, with the constituent lexical unit set of the query, to produce a Query Selective Corpus.
The snippets of the Query Selective Corpus can be ranked in order of decreasing match quality to the query (e.g., Query 1401). If the Query Selective Corpus is too large, only the first “n” snippets can be kept for further processing. As an example value for “n,” only the first 3,000 snippets can be kept in the Query Selective Corpus.
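A minimal sketch of corpus reduction with a keyword index follows. The measure of match quality used here (the number of distinct query lexical units found in a snippet) is an assumption of the sketch, as are the stop-word list and all function names. The default of 3,000 for n is the example value given above.

```python
import re
from collections import defaultdict

# Illustrative stop-word list.
STOP_WORDS = frozenset({"a", "an", "the", "of", "for", "is", "to", "and", "in"})


def lexical_units(text):
    """Lower-cased words of text, excluding stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_keyword_index(snippets):
    """Map each non-stop word to the ids of the snippets containing it."""
    index = defaultdict(set)
    for snippet_id, snippet in enumerate(snippets):
        for word in lexical_units(snippet):
            index[word].add(snippet_id)
    return index


def query_selective_corpus(query, index, snippets, n=3000):
    """Return at most n snippets, ranked by decreasing match quality to the query."""
    unit_set = set(lexical_units(query))  # the constituent lexical unit set
    scores = defaultdict(int)
    for unit in unit_set:
        for snippet_id in index.get(unit, ()):
            scores[snippet_id] += 1  # assumed quality: count of matched units
    ranked = sorted(scores, key=lambda sid: (-scores[sid], sid))
    return [snippets[sid] for sid in ranked[:n]]


snippets = ["Treatment of a heart attack.", "Resting heart rate.", "Bone density."]
index = build_keyword_index(snippets)
reduced = query_selective_corpus("treatment for heart attack", index, snippets)
```

A frame index could be passed in place of the keyword index without changing query_selective_corpus, since both map a lexical unit to snippet identifiers.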
5.2.3.2 Producing An Instance Superset
A similar process to that described above (see Section 5.2.2.2, “UNL by UNL Preprocessing”), which is applied to a Source Corpus, can be applied (by Instance Generation process 1415) to the Query Selective Corpus. The Query Selective Corpus can be processed on a UNL-by-UNL basis (e.g., on a sentence-by-sentence basis). For each UNL (“UNL_current”) processed, each potentially applicable frame extraction rule (“rule_current”) is evaluated for whether it is invoked by UNL_current to produce an instance (“I_current”). Each I_current produced can be added to the Instance Superset (such as Instance Superset 1405 of
5.3 Instance Merging
5.3.1 General Description
The set of instances to which Instance Merging is applied can be the Instance Superset, if just Instance Generation (as described above in Section 5.2) has been performed. Alternatively, Instance Merging can be applied to a Search Result, if Instance Generation and Instance Selection (described below in Section 5.4) have been accomplished. For either case, in this section (Section 5.3), the input set of instances shall be referred to as the “Instance Set.”
The Instance Merging described in this section assumes that each member, of the Instance Set, has just two roles. A subset of an Instance Set (called “Subset_1”) can have its members merged together when such subset satisfies the following two conditions:
5.3.2 Pseudo-Code
Instance_Set is assumed to have internal state (referred to herein as “sequence-state”) whereby a function, such as “Next_Pair” (line 7), is able to generate a sequence of possible instance pairs of Instance_Set. Instance Merge begins by re-setting the sequence-state of Instance_Set with a call to “Reset_Next_Pair” (line 4).
A “while” loop is then begun (line 7), that continues to execute while Next_Pair returns TRUE. Each call to Next_Pair causes the following. A pair of instances is selected from Instance_Set and assigned to: Instance_1 and Instance_2. For a given set of instances in Instance_Set, Next_Pair is defined as returning a sequence (one pair per invocation) of the possible instance pairs.
First and second tests are then performed which, if both satisfied, result in a merging of Instance_1 and Instance_2 by “Merge_Instances” (line 19). Merge_Instances is defined as replacing its two arguments, in Instance_Set, with the merger of Instance_1 and Instance_2. Such modification of Instance_Set creates the possibility for a new sequence of possible instance pairs. For this reason, following Merge_Instances, the sequence-state is reset by a call to Reset_Next_Pair (line 20).
The first test checks for whether Instance_1 and Instance_2 were produced from the same frame by calling “Same_Frame” (line 10). Same_Frame is defined to return TRUE if the instances are from a same frame. If the first test is satisfied, a second test checks for whether the corresponding roles, of Instance_1 and Instance_2, have sufficiently similar values. The second test is performed by two calls to “Match_Role_Values” (lines 13-14 and 16-17). In the first call to Match_Role_Values (lines 13-14), the role value assigned to the input role of Instance_1 is compared to the role value assigned to the input role of Instance_2. In the second call to Match_Role_Values (lines 16-17), the role value assigned to the output role of Instance_1 is compared to the role value assigned to the output role of Instance_2.
A discussion of Match_Role_Values is presented in the following section.
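The merging procedure just described can be approximated by the following sketch. It is not the pseudo-code referenced above, and its line numbers do not correspond to it. Each instance is assumed to be a dictionary with the hypothetical keys "frame," "input," "output" and "mentions." Restarting the pair enumeration after each merge plays the part of Reset_Next_Pair, and the default matcher is a placeholder for Match_Role_Values.

```python
from itertools import combinations


def match_role_values(a, b):
    """Placeholder matcher: case-insensitive equality of two role values."""
    return a.strip().lower() == b.strip().lower()


def merge_instances(i1, i2):
    """Merge two instances, keeping i1's role values and pooling the mentions."""
    return {
        "frame": i1["frame"],
        "input": i1["input"],
        "output": i1["output"],
        "mentions": i1["mentions"] + i2["mentions"],
    }


def instance_merge(instance_set, match=match_role_values):
    """Repeatedly merge pairs from the same frame whose role values match."""
    instances = list(instance_set)
    merged_something = True
    while merged_something:  # each restart resets the sequence of pairs
        merged_something = False
        for a, b in combinations(range(len(instances)), 2):  # next pair, with a < b
            i1, i2 = instances[a], instances[b]
            if (
                i1["frame"] == i2["frame"]                  # first test: same frame
                and match(i1["input"], i2["input"])         # second test: input roles
                and match(i1["output"], i2["output"])       # second test: output roles
            ):
                instances[a] = merge_instances(i1, i2)
                del instances[b]
                merged_something = True
                break  # the set changed, so enumerate pairs afresh
    return instances
```

The loop terminates because every merge reduces the number of instances by one.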
5.3.3 Matching Role Values
Matching, between role values, depends on the type of representation to be compared. The two main role value representations addressed herein are:
5.3.3.1 Lexical Unit
Matching, between role values, where each role value is treated as one or more lexical units, can proceed as follows.
If the two role values are identical, a match can be indicated.
If one role value is determined to be a substring of the other role value, or if both role values are determined to share a sufficiently substantial substring, a match can be indicated. Any suitable techniques for substring matching, known in the art, can be used. For example, the phrases “the fuel cell technology,” “the fuel cell application” and “the fuel cell software” can all be regarded as sharing a sufficiently substantial substring, such that all can be regarded as referring to a “fuel cell.”
If one role value is determined to be an acronym of the other, or if both role values are determined to be acronyms of a common term, a match can be indicated.
Any suitable techniques for acronym matching, known in the art, can be used. For example, if one role value is “natural language processing,” the role value “NLP” could be regarded as matching.
Each role value RV_1 can be replaced by a set of role values RVS_1, where each member of RVS_1 is believed to mean the same as RV_1, by a process called “lexical expansion.” In lexical expansion, the following operation can be performed on any combination of the lexical unit or units forming RV_1: for each lexical unit, within RV_1, it can be replaced by another lexical unit that is known to be synonymous. For example, if a role value “fuel cell” is to be matched, the lexical unit “battery” could replace the lexical unit “cell.” Such replacement would mean that the role value “fuel battery” could be regarded as matching the role value “fuel cell.”
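The lexical-unit matching tests of this section can be combined as in the following sketch. Several simplifications are assumptions of the sketch: a "sufficiently substantial" shared substring is taken to be a run of at least two consecutive words, articles are ignored, only the case where one value is an acronym of the other is handled, and the synonym table contains just the "cell"/"battery" example. All names are hypothetical.

```python
from itertools import product

ARTICLES = frozenset({"a", "an", "the"})
# Hypothetical synonym table; the single entry mirrors the example in the text.
SYNONYMS = {"cell": {"battery"}}


def words(value):
    """Lower-cased lexical units of a role value, with articles dropped."""
    return [w for w in value.lower().split() if w not in ARTICLES]


def is_acronym(short, long):
    """True if short spells the initials of the words of long."""
    initials = "".join(w[0] for w in words(long))
    return len(short) > 1 and short.lower() == initials


def contains_run(inner, outer):
    """True if the word list inner occurs contiguously inside outer."""
    k = len(inner)
    return k > 0 and any(outer[i:i + k] == inner for i in range(len(outer) - k + 1))


def shares_run(wa, wb, min_words=2):
    """True if both word lists share a run of at least min_words words."""
    runs = {tuple(wa[i:i + min_words]) for i in range(len(wa) - min_words + 1)}
    return any(tuple(wb[j:j + min_words]) in runs for j in range(len(wb) - min_words + 1))


def lexical_expansion(value):
    """All variants of value obtained by swapping any words for known synonyms."""
    options = [sorted({w} | SYNONYMS.get(w, set())) for w in words(value)]
    return {" ".join(choice) for choice in product(*options)}


def match_role_values(a, b):
    """Match on identity, substring, acronym, or any pair of lexical expansions."""
    if is_acronym(a, b) or is_acronym(b, a):
        return True
    for ea in lexical_expansion(a):
        for eb in lexical_expansion(b):
            wa, wb = ea.split(), eb.split()
            if wa == wb or contains_run(wa, wb) or contains_run(wb, wa) or shares_run(wa, wb):
                return True
    return False
```

Under these assumptions, match_role_values("NLP", "natural language processing") and match_role_values("fuel battery", "fuel cell") both indicate a match.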
5.3.3.2 Logical Form
Matching, between two role values, where each role value is represented as (or converted into) a Logical Form, can proceed as follows.
Matching can begin at the root node of each role value. Each corresponding pair of nodes can be selected by traversing each Logical Form in any appropriate order (e.g., depth first or breadth first). Each node, of a pair of corresponding nodes, has a fragment (comprised of one or more lexical units) of the UNL that triggered creation of the Logical Form. In a manner similar to that discussed above (5.3.3.1 “Lexical Unit”), the pair of textual fragments can be compared.
As long as identity, or sufficient identity, is determined between each pair of textual fragments traversed, the two Logical Forms are considered to match.
As an example, two Logical Forms may only be traversed from the root (which can represent a verb) to the direct children (that can represent the object of the verb). For example, the phrases “increased density” and “increased bone density” will appear the same if the Logical Form for each is only compared from the root to the direct child. In each case, the root is the verb “increased” and the object (at the direct child level) is “density.” The modifier “bone,” for the phrase “increased bone density,” appears in the Logical Form at the grandchild level.
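The depth-limited comparison of the above example can be sketched as follows. The node class LFNode and the function logical_forms_match are hypothetical names. The sketch assumes that corresponding nodes are paired by child position and that, within the compared depth, a differing number of children means no match. Textual fragments are compared by case-insensitive equality, although any of the tests of Section 5.3.3.1 could be substituted.

```python
from dataclasses import dataclass, field


@dataclass
class LFNode:
    """A Logical Form node: a textual fragment plus ordered child nodes."""
    text: str
    children: list = field(default_factory=list)


def fragments_same(x, y):
    """Placeholder fragment comparison: case-insensitive equality."""
    return x.lower() == y.lower()


def logical_forms_match(a, b, max_depth=None, same=fragments_same):
    """Compare two Logical Forms from the root down to max_depth levels.

    max_depth=None compares the whole tree; max_depth=1 compares the root
    and its direct children only.
    """
    if not same(a.text, b.text):
        return False
    if max_depth == 0:
        return True  # deeper nodes are deliberately ignored
    if len(a.children) != len(b.children):
        return False
    next_depth = None if max_depth is None else max_depth - 1
    return all(
        logical_forms_match(ca, cb, next_depth, same)
        for ca, cb in zip(a.children, b.children)
    )


increased_density = LFNode("increased", [LFNode("density")])
increased_bone_density = LFNode("increased", [LFNode("density", [LFNode("bone")])])
```

With max_depth=1 the two example forms match, because the modifier "bone" sits at the grandchild level. With no depth limit they do not.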
5.4 Instance Selection
Once a suitable Instance Set has been generated, selection, of those instances corresponding to an input query, can be accomplished. As discussed above, such selection can be accomplished before or after Instance Merging.
The outputs of Instance_Select are as follows:
Instance_Select begins by setting (at line 4) the output set (i.e., Output_Rep_Set), of query (or input role value) representations, to contain just the query passed to the procedure by Input_Rep.
Next, a “while-loop” is begun (line 7). The while-loop iterates through the instances of Input_Instance_Set by successively calling “Next_Instance.” Next_Instance sets Instance_Current to a next instance of Input_Instance_Set, and Next_Instance sets state, associated with Input_Instance_Set, such that, after sufficient calls to Next_Instance, each instance of Input_Instance_Set has been assigned to Instance_Current.
Within each iteration of the while-loop, a “for-loop” is begun (line 9). The for-loop iterates through the query values stored in Output_Rep_Set, setting each such representation to Current_Rep. For each iteration of the for-loop, Match_Role_Values (line 10) compares the role value of the input role of Instance_Current with the query value assigned to Current_Rep. If the values are the same, or sufficiently similar, lines 11-16 are executed. Lines 11-16 perform the following.
The Instance_Current is moved from the Input_Instance_Set to the Output_Instance_Set (line 11). Next, a test is made of whether the two values, just compared by Match_Role_Values, are represented in exactly the same way (line 12). If the two representations are not exactly the same, then the input role value assigned to Instance_Current appears to represent a broadening of the set of possible representations for the query, and lines 13-14 are executed. Lines 13-14 perform the following.
The alternate representation of the query, assigned to Instance_Current, is added to the set Output_Rep_Set (line 13). Also, the iteration of the while-loop, through the instances of Input_Instance_Set, is reset (line 14). The reset is performed because the new representation of the query, added to the set Output_Rep_Set, means that each instance not previously added to Output_Instance_Set, because its input role value did not match the set of query representations of Output_Rep_Set, might now match the newly-added query representation.
Regardless of whether the two values match exactly, once a match has been determined, it is known that the for-loop need no longer be executed, since Instance_Current has already been added to the Output_Instance_Set. Therefore the for-loop is ended (line 16).
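The selection procedure just described can be approximated by the following sketch. It is not the pseudo-code referenced above, and its line numbers do not correspond to it. Each instance is assumed to be a dictionary with a hypothetical "input" key, the matcher toy_match is a stand-in for Match_Role_Values, and the sample values are invented.

```python
def toy_match(a, b):
    """Stand-in matcher: equality, character containment, or acronym, ignoring case."""
    a, b = a.lower(), b.lower()

    def initials(s):
        return "".join(w[0] for w in s.split())

    return a == b or a in b or b in a or a == initials(b) or b == initials(a)


def instance_select(input_instance_set, input_rep, match=toy_match):
    """Return (output instances, output query representations)."""
    remaining = list(input_instance_set)
    output_instances = []
    output_reps = [input_rep]  # starts with just the query itself
    position = 0
    while position < len(remaining):
        instance_current = remaining[position]
        matched = False
        for current_rep in output_reps:
            if match(instance_current["input"], current_rep):
                output_instances.append(remaining.pop(position))  # move to output
                matched = True
                if instance_current["input"] != current_rep:
                    # A new representation broadens the query: record it and rescan.
                    output_reps.append(instance_current["input"])
                    position = 0
                break  # the instance is already selected; stop testing representations
        if not matched:
            position += 1
    return output_instances, output_reps


instances = [
    {"input": "natural language processing software", "output": "result A"},
    {"input": "natural language processing", "output": "result B"},
    {"input": "fuel cell", "output": "result C"},
]
selected, reps = instance_select(instances, "NLP")
```

In this example the first instance does not match the query "NLP" on the first pass. It is selected only after the second instance has added "natural language processing" to the set of query representations, which is the situation the reset is intended to handle.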
5.5 Result Presentation
5.5.1 Overview
Once a Search Result has been produced, any appropriate technique(s) can be used to achieve a more effective presentation, to the user, of the instances of which it is comprised. This Section presents several example techniques. Any combination of the following techniques can be used, depending upon the particular application.
5.5.2 Role-Value Oriented Presentation
It is often useful to present to a user a Search Result that emphasizes the role values of the output roles. The usefulness of this presentation approach arises from the fact that it emphasizes the information the user is seeking. Also, because a single instance can represent multiple records (or snippets) that have matched a query, it also presents a more compact search result that a user can review more quickly.
A role-value oriented presentation, of a search result, can be achieved as follows: for each instance of the Search Result, its output-role role value is displayed (by using some appropriate character string) as a primary “result” of the search. The character string, representative of the role value of an instance's output role, can be referred-to as a “role-value-oriented search result.”
When an instance contains only one output-role role value, its role-value-oriented search result can be the same as its output-role role value. However, where an instance of a search result is comprised of multiple instance-mentions, a common (or summarizing) role-value-oriented search result is needed. Any suitable technique can be used to determine a role-value-oriented search result that represents an appropriate commonality and/or summarization of an instance's multiple output-role role values.
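As one possible technique, offered only as an illustrative assumption, a summarizing role-value-oriented search result can be chosen as the most frequent output-role role value after normalization, with ties broken in favor of the shortest value. The function name and the sample values below are hypothetical.

```python
from collections import Counter


def role_value_oriented_result(output_role_values):
    """Pick one display string to stand for an instance's output-role role values."""
    counts = Counter(v.strip().lower() for v in output_role_values)
    if not counts:
        raise ValueError("an instance must have at least one output-role role value")
    best = max(counts.values())
    # Among the most frequent values, prefer the shortest, then alphabetical order.
    return min((v for v, c in counts.items() if c == best), key=lambda v: (len(v), v))


summary = role_value_oriented_result(
    ["increased bone density", "Increased bone density", "higher bone density"]
)
```

When an instance has a single output-role role value, this function returns that value in normalized form, consistent with the simple case described above.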
Once a user has identified a role-value-oriented search result of particular interest, the user can be provided with an option to view the records (or snippets) on which it is based. For each such snippet displayed, the portion that corresponds to the role-value-oriented search result can be highlighted (or otherwise emphasized). Also, it may be useful to display, in some other highlighted (or emphasized) way, the portion of each snippet that matches the input query.
5.5.3 Grouping by Role Value Type
When viewing a role-value oriented search presentation, it can be useful to group the role-value-oriented search results according to their type.
5.5.4 Grouping by Frame Type
When viewing a search result, of a frame-based HRSE, it can be useful to group each search result according to the frame from which it is instantiated.
6.1 Technology Profiling
6.1.1 Overview
This section describes an example Entity Profile Frame Set, where the entity for profiling is a Technology Candidate. Such Entity Profile Frame Set can be referred to as a “Technology Profile Frame Set.”
Frames are first categorized according to their meta-type, which are:
Within each frame meta-type, each frame definition follows the following format:
For just the Benefits Frame, an example frame extraction rule is also presented in a pseudo-coded form. The pseudo-code format is defined in section 7.2 (“Frame Extraction Rules”). An example frame extraction rule, for each of the other frame types, can be found in the '068 Application.
Since the Anchor Role Value, for each frame of this section, is representative of a technology-type entity, all of the example frame extraction rules use a “feature” (see Section 7.3 “Features,” for definition of feature) called TECHNOLOGY. An example definition for the TECHNOLOGY feature is also presented in Section 7.3. The example definition of TECHNOLOGY is intended to be broad. In this way, when a Source Corpus is subjected to Entity Profile processing, the set of entities with profiles will be broader and more likely to cover a Technology Candidate of the technology searcher.
6.1.2 Action-Centric Type
6.1.2.1 Benefits Frame
6.1.2.1.1 Frame Definition
Name: “Benefit Frame TP”
Technology_Role: <Technology Name>
Benefit_Role: <Benefit Here>
6.1.2.1.2 Discussion
The “Benefit Frame TP” is used to answer the question: “what are the benefits of this technology?” Benefit Frame TP (where “TP,” when used as part of the name for a frame or a frame extraction rule, means Technology Profiling) is used as part of profiling a Technology Candidate. Compared with the Benefit Frame utilized for technology scouting in the '122 and '127 Applications (that is re-presented below in Section 6.2.1 “Benefit Frame”), Benefit Frame TP is simplified. The Instrument and Benefactor roles, of the Benefit Frame, become the Technology_Role in Benefit Frame TP. Of the other roles of Benefit Frame, just the Benefit role is used in Benefit Frame TP.
6.1.2.1.3 Example Rule
An example frame extraction rule for a Benefit Frame TP is shown in
In terms of the pseudo-code of
Once the conditional part of the IMPROVE_Rule_TP has been determined to fire, its action part can do the following:
6.1.2.2 Problems Frame
6.1.2.2.1 Frame Definition
Name: “Problem Frame TP”
Technology_Role: <Technology Name>
Problem_Role: <Problem Here>
6.1.2.2.2 Discussion
The Problem Frame TP is used to answer the question: “what are the problems with this technology?” Problem Frame TP is used as part of profiling a Technology Candidate. Compared with the Problem Frame utilized for “market scouting” in the '122 and '127 Applications (that is re-presented below in Section 6.2.2 “Problem Frame”), Problem Frame TP is simplified. The Adversary and Method roles, of the Problem Frame, become the Technology_Role in Problem Frame TP. Of the other roles of Problem Frame, just the Problem role (specified as Problem_Role) is used in Problem Frame TP.
Problem Frame TP allows one to identify problems, of the Technology Candidate itself, from users of a technology. (To highlight the differences, between the profiling of a technology and the technology scouting process by which candidate technologies can be identified, it is worth noting that the Problem role, of the Benefit Frame of technology scouting, is used to identify technologies that can solve a problem.) Knowledge of a technology's problems can be helpful for such activities as: the design of a new product or the improvement of an existing product.
6.1.3 Relational Type
6.1.3.1 Inventors Frame
6.1.3.1.1 Frame Definition
Name: “Inventors Frame”
Technology_Role: <Technology Name>
Inventors_Role: <Inventor Here>
6.1.3.1.2 Discussion
The value for the Inventor role describes an entity that has developed or contributed to the development of the Technology. Typically the entity is a person. The Inventors Frame is used to answer the question: “who invents the technology?”
6.1.3.2 Experts Frame
6.1.3.2.1 Frame Definition
Name: “Experts Frame”
Technology_Role: <Technology Name>
Experts_Role: <Expert Here>
6.1.3.2.2 Discussion
The value of the Expert role describes a person who has been noted for their expertise in the Technology. The Experts Frame is used to answer the question: “who are the experts on this technology?”
6.1.3.3 Sellers Frame
This frame is used to answer the question: Who makes or sells the technology?
6.1.3.3.1 Frame Definition
Name: “Sellers Frame”
Technology_Role: <Technology Name>
Sellers_Role: <Seller Here>
6.1.3.3.2 Discussion
The value of the Seller role describes an entity that sells the Technology. Typically the entity is a company. The Sellers Frame is used to answer the question: “who makes or sells the technology?”
6.1.3.4 Users Frame
6.1.3.4.1 Frame Definition
Name: “Users Frame”
Technology_Role: <Technology Name>
Users_Role: <User Here>
6.1.3.4.2 Discussion
The value of the User role describes an entity that uses the Technology. Typical entities can include an organization, person or location. The Users Frame is used to answer the question: “who uses this technology?”
6.1.3.5 DerivedProducts Frame
6.1.3.5.1 Frame Definition
Name: “DerivedProducts Frame”
Technology_Role: <Technology Name>
DerivedProducts_Role: <DerivedProduct Here>
6.1.3.5.2 Discussion
The value of the Derived Products (or Products Based On) role describes a product that is based on the Technology. A product can be a branded commercial product such as “TOYOTA PRIUS” or a product category such as “staplers.” The DerivedProducts Frame is used to answer the question: “which products are derived from this technology?”
6.1.4 Categorical Type
6.1.4.1 Descriptor Frame
6.1.4.1.1 Frame Definition
Name: “Descriptor Frame”
Technology_Role: <Technology Name>
Descriptor_Role: <Definition Here>
6.1.4.1.2 Discussion
The Descriptor Frame is used to produce a definition of the Technology Candidate indicated by <Technology Name>.
6.1.4.2 Pros Frame
6.1.4.2.1 Frame Definition
Name: “Pros Frame”
Technology_Role: <Technology Name>
Pros_Role: <Pro Here>
6.1.4.2.2 Discussion
The Pros Frame is an example of a Modifier frame type, within the Categorical frame meta-type. It is used to represent GOOD features of the Anchor Entity and, in the case of a Technology Candidate, favorable modifiers of such technology.
6.1.4.3 Cons Frame
6.1.4.3.1 Frame Definition
Name: “Cons Frame”
Technology_Role: <Technology Name>
Cons_Role: <Con Here>
6.1.4.3.2 Discussion
The Cons Frame is an example of a Modifier Frame type, within the Categorical frame meta-type. It is used to represent BAD features of the Anchor Entity and, in the case of a Technology Candidate, unfavorable modifiers of the technology.
6.2 Healthcare-related
6.2.1 Overview
An example HFS, HFS52, that can be used to produce a frame-based HRSE (Section 4.2), is presented in this section (Section 6.2).
In the following Sections 6.2.2 to 6.2.6, each of the five frames of HFS52 is defined. An example frame extraction rule is also presented in Section 6.2.2 for the Treatment Frame. The frame extraction rule is presented in a pseudo-coded form and is then used to produce an example instance from an example input sentence.
The frame extraction rule pseudo-coded format is defined in Section 7.2. Before being tested against a frame extraction rule, an example input sentence is converted (by a semantic parser) into a representation called “Logical Form.” The Logical Form format used herein is defined in Section 7.1.
When considering example pseudo-coded frame extraction rules, a “feature” can be identified as follows:
6.2.2 Treatment Frame
6.2.2.1 Frame Definition
In general, and as depicted by frame 1003 of
II. Condition
6.2.2.2 Example Rule
An example frame extraction rule for a Treatment Frame is shown in
Line 1 of
First, a node must be found, in the Logical Form matched against the rule, that matches the root Logical Form rule of line 2 of
If the root Logical Form rule is satisfied, there are two mandatory Logical Form rules:
The frame extraction rule of
6.2.3 Cause Frame
In general, and as depicted by frame 101 of
I. Condition
The fact that the values for CONDITION and CAUSE are drawn from a same Condition Lexicon permits a cause to be, itself, input as a condition and its cause found with the Cause Frame. The process, of finding the “cause of a cause,” can be applied indefinitely and/or in conjunction with finding the “effect of an effect” (see Section 3.3 “Effect Frame”).
6.2.4 Effect Frame
In general, and as depicted by frame 102 of
I. Condition
The fact that the values for CONDITION and EFFECT are drawn from a same Condition Lexicon permits an effect to be, itself, input as a condition and its effect found with the Effect Frame. The process, of finding the “effect of an effect,” can be applied indefinitely and/or in conjunction with finding the “cause of a cause” (see Section 3.2 “Cause Frame”).
Rather than using the more generally-known term of “effect,” the healthcare profession generally refers to an effect as being either or both of the following: a “complication” or a “symptom.” If needed for the particular HRSE, specialized lexicons, such as a Complication Lexicon of complications and a Symptom Lexicon of symptoms, can be used to appropriately categorize an effect.
6.2.5 Pro Frame
In general, and as depicted by frame 104 of
I. Treatment
6.2.6 Con Frame
In general, and as depicted by frame 105 of
I. Treatment
6.3 Brand Research
7.1 Logical Form
In general, a Logical Form representation is produced from analysis of a UNL “UNL_current” (where the UNL focused-upon herein is a sentence).
A Logical Form can be produced by what is known as, in the field of natural language processing, a “semantic parser.” A Logical Form is intended to represent the semantics of a UNL_current. For this reason, it is desirable to produce a Logical Form that is, as much as possible, “semantically canonical.” This means the following:
For example, a semantically canonical semantic parser, when given a passive sentence and an active sentence that both express the same meaning, will translate both sentences, as much as possible, into a same Logical Form.
A Logical Form can be represented as a collection of nodes, with each node representing a particular semantic aspect of a UNL_current. Assigned to each node of a Logical Form can be a fragment, of a UNL_current, closely associated with the semantics represented by such node.
If arranged in a tree form, such nodes (with their links) can be referred to as a “logical dependency tree.” Some characteristics, of a dependency tree, are as follows:
Semantic constituents comprise at least the following two types: core and modifier. Core semantic constituents specify key information, such as “who did what to whom.” A core semantic constituent is also called (in the field of natural language processing) an “argument.” Modifier semantic constituents carry information about other aspects of an action, that are optional or are only sometimes applicable.
Three core semantic constituents, and their definitions, follow:
Example modifier semantic constituents, and the types of questions they answer, include the following:
An important type of logical dependency tree (called herein an “ISA” tree) can be generated for what are called, at a more surface level, copula and appositive structures. Both copula and appositive structures refer to sentence forms that define a term (e.g., a noun phrase) by linking it to a definition (e.g., another noun phrase). For copula structures, the linking is performed by a verb (such as “to be” or “to become”). For appositive structures, the linking is indicated by a syntactic symbol (such as the comma) or by trigger words (such as “namely,” “i.e.” or “such as”).
For an ISA dependency tree, the root node is the noun phrase that is being defined. One of the core semantic constituents is an “ISA” node that indicates a definitional noun phrase.
Examples, that help illustrate the above-listed semantic constituents, follow.
Because the Actor and Undergoer are logical, a passive and an active sentence, which both express the same meaning, will have the same Actor and Undergoer. For example, in both of the following sentences, “exercise” is the Actor and “bone density” is the Undergoer:
In both of the following examples (which are in ergative form at a surface level), the Undergoer is “the door”:
For both of the following sentences, “John” is the Actor, “book” is the Undergoer and “Mary” is the Complement:
For the following phrase, “somebody” is the Undergoer and “for something” is the Complement:
The modifier semantic constituent Cause can be identified by searching for particular expressions that are indicative of something being a cause. Such expressions can include: “due to,” “thanks to,” “because of” and “for the reason of.” In one of the above example sentences, depending upon the semantic parser, “by exercise” may also be identified as the Cause for the action “can be enhanced.”
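The search for cause-indicating expressions can be illustrated with a minimal sketch. It assumes a purely surface-level string search over a sentence, rather than the semantic parser discussed herein, and the function name `find_cause` is hypothetical. Only the four expressions listed above are used as triggers.

```python
import re

# Trigger expressions listed in the text as indicative of a Cause.
CAUSE_TRIGGERS = ("due to", "thanks to", "because of", "for the reason of")

# Alternation of the triggers, longest first; word boundaries avoid
# matching inside longer words.
_TRIGGER_RE = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(CAUSE_TRIGGERS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_cause(sentence):
    """Return (trigger, text following the trigger) for the first trigger
    found in the sentence, or None if no trigger is present."""
    match = _TRIGGER_RE.search(sentence)
    if match is None:
        return None
    following = sentence[match.end():].strip().rstrip(".")
    return (match.group(1).lower(), following)
```

A real semantic parser would attach the Cause to a node of the Logical Form; this sketch only shows how the trigger expressions can locate a candidate.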
Each node of a Logical Form, with the exception of the root node, can be represented by the following two parts:
The root node of a Logical Form can be represented by the following two parts:
7.2 Frame Extraction Rules
7.2.1 Overview
As discussed above, a frame extraction rule specifies a pattern that matches against a Logical Form which has been produced from an input statement (i.e., a UNL, such as a sentence). If the frame extraction rule matches, a frame instance is produced.
An overall structure, for a frame extraction rule, is that it expresses a tree pattern for matching against an input Logical Form. In general, a frame extraction rule has two main parts:
For purposes of organization, each frame extraction rule can be given a name.
A frame extraction rule can be expressed as a collection of simpler rules, each such simpler rule referred to herein as a “Logical Form rule.” A Logical Form rule, like the overall frame extraction rule of which it is a part, can also have a conditional part and an action part. Logical Form rules can be of two main varieties: mandatory and optional. For a frame extraction rule to take action, all of its mandatory Logical Form rules must be satisfied. Any optional Logical Form rules, that are also satisfied when all mandatory Logical Form rules are satisfied, can specify additional action that can be taken by the frame extraction rule.
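The interplay of mandatory and optional Logical Form rules can be sketched as follows. This is an illustration under stated assumptions, not the implementation: a Logical Form is simplified to a flat mapping from a semantic-constituent label to its text, and the names `LogicalFormRule`, `apply_frame_rule`, `ExampleFrame`, `RoleA`, `RoleB` and `RoleC` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: a Logical Form is reduced to a flat mapping from a
# semantic-constituent label to the text fragment assigned to that node.
LogicalForm = Dict[str, str]


@dataclass
class LogicalFormRule:
    constituent: str        # conditional part: constituent that must be present
    role: str               # action part: frame role that receives the text
    mandatory: bool = True


def apply_frame_rule(name: str, rules: List[LogicalFormRule],
                     lf: LogicalForm) -> Optional[dict]:
    """Return a frame instance if every mandatory rule is satisfied, else None."""
    roles = {}
    for rule in rules:
        text = lf.get(rule.constituent)
        if text is None:
            if rule.mandatory:
                return None      # one unmet mandatory rule blocks the whole rule
            continue             # an unmet optional rule is simply skipped
        roles[rule.role] = text
    return {"frame": name, "roles": roles}


# Hypothetical rule set: two mandatory rules and one optional rule.
EXAMPLE_RULES = [
    LogicalFormRule("Actor", "RoleA"),
    LogicalFormRule("Undergoer", "RoleB"),
    LogicalFormRule("Time", "RoleC", mandatory=False),
]
```

With this sketch, a Logical Form supplying only an Actor produces no frame instance, while one supplying an Actor and an Undergoer produces an instance whose optional role is left unassigned.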
7.2.2 Pseudo-Coded
In order to further discuss frame extraction rules, in general, it will be useful to present a format for presenting such rules as pseudo-code. An example tree-structured frame extraction rule, shown in the pseudo-code, is presented in
For the pseudo-coded frame extraction rules presented herein, each line (other than the line specifying a name for the frame extraction rule) represents a Logical Form rule. Each Logical Form rule is mandatory, unless the entire line is enclosed in a pair of parentheses.
For the type of Logical Form rule presented herein, its conditional part specifies the conditions under which it is satisfied by a node “n1” of the input Logical Form while its action part specifies the role, of a frame instance, that is assigned the value “n1.”
The conditional part, of a Logical Form rule, can itself be comprised of two sub-parts (both of which must be satisfied by a single node of a Logical Form):
For each Logical Form rule presented herein, its syntax divides it into three parts (from left to right):
As can be seen, the node-based sub-part is separated from the text-based sub-part by a colon symbol, while the text-based sub-part is separated from the action by a right-pointing arrow symbol.
The node-based sub-part can specify either of the following two conditions:
The action specifies a role, of the frame instance created, that is assigned a value as a result of the Logical Form rule being satisfied. The value assigned to a role can comprise the textual part of the Logical Form node that satisfies the rule's node-based sub-part. Additional information, that can comprise the value assigned to a role, includes the following: if the node “n1,” satisfying the node-based sub-part, is the root of a sub-tree, the textual parts of some essential child nodes, of such sub-tree, can be assigned to the role. For example, if n1 is the root of a verb phrase, it is typical for only the core argument structure, of such verb phrase, to be assigned to the role. The core argument structure of a verb phrase typically consists of the verb itself and, possibly, the undergoer and/or complement. Such core verb phrase typically excludes adverbial details, such as time and/or location. Assignment of the selected core textual parts, of a sub-tree's child nodes, is indicated herein by enclosing the role name in square brackets.
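The three-part syntax described above can be illustrated by a small parser for a single pseudo-coded line. This is a sketch under assumptions: the arrow is taken to be written either as `->` or as `→`, the node-based and text-based sub-parts are kept as opaque strings, and the names `ParsedRule` and `parse_rule_line`, as well as the sample lines in the test, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ParsedRule:
    indent: int        # leading spaces, which indicate depth in the tree
    mandatory: bool    # False when the entire line is enclosed in parentheses
    node_part: str     # node-based sub-part (left of the colon)
    text_part: str     # text-based sub-part (between colon and arrow)
    role: str          # role named by the action (right of the arrow)
    core_only: bool    # True when the role name is in square brackets


def parse_rule_line(line: str) -> ParsedRule:
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)
    body = stripped.rstrip()

    # A line wholly enclosed in parentheses is an optional rule.
    mandatory = True
    if body.startswith("(") and body.endswith(")"):
        mandatory = False
        body = body[1:-1].strip()

    # Assumption: the right-pointing arrow is written as "→" or "->".
    arrow = "→" if "→" in body else "->"
    condition, sep, action = body.rpartition(arrow)
    if not sep:
        raise ValueError("missing arrow between condition and action")

    node_part, sep, text_part = condition.partition(":")
    if not sep:
        raise ValueError("missing colon between node-based and text-based sub-parts")

    # Square brackets request only the core textual parts of a sub-tree.
    role = action.strip()
    core_only = role.startswith("[") and role.endswith("]")
    if core_only:
        role = role[1:-1].strip()

    return ParsedRule(indent, mandatory, node_part.strip(),
                      text_part.strip(), role, core_only)
```

The parser records indentation but does not interpret it; the tree structure implied by indentation is a separate concern.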
Regarding the specification of conditions, for matching the node-based sub-part of a Logical Form rule, line 2 of
Typically, only one Logical Form rule, of a frame extraction rule, uses a node-based sub-part that requires its matching node to serve as the sub-tree root. This Logical Form rule can be referred to as the “root Logical Form rule.” The root Logical Form rule can be used as the entry point for a frame extraction rule: it can be tested, for matching against an input Logical Form, before any other Logical Form rules are tested. If the root Logical Form rule does not match, then no further Logical Form rules of the frame extraction rule need be tested.
The text-based sub-part, of a Logical Form rule, specifies a pattern of lexical units and/or features that need to appear in the textual part of a Logical Form node, even if that node already matches the node-based sub-part of the Logical Form rule. A “feature” is represented, in the pseudo-coded frame extraction rules, by any word that is entirely capitalized. Please see section 7.3 (“Features”) for a definition of a feature. The frame extraction rule of
One type of pattern, that can be specified by the text-based sub-part, is a prepositional phrase. In particular, the text-based sub-part can specify that a preposition must be followed by a specific noun or by a feature that represents a collection of nouns. For example, the text-based sub-part of line 6 of
The tree structure, specified by a pseudo-coded frame extraction rule, can be indicated by the indentation of its Logical Form rules and by the use, or non-use, of blank lines between such Logical Form rules. As with specifying the Logical Form itself, greater indentation of a line (i.e., further distance of a line from the left margin) is used herein to indicate a Logical Form rule calling for a node farther from the root.
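The recovery of a tree from indentation can be sketched as follows, using the convention that the parent of a Logical Form rule is the nearest rule above it with a lesser indentation. The sketch assumes that the name line and any blank lines have already been removed, and that indentation uses spaces only; the function name `parent_indices` is hypothetical.

```python
def parent_indices(lines):
    """For each rule line, return the index of its parent line, or None
    for a line with no less-indented line above it."""
    parents = []
    stack = []  # (indent, index) pairs with strictly increasing indent
    for index, line in enumerate(lines):
        indent = len(line) - len(line.lstrip(" "))
        # Discard candidates that are not less indented than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parents.append(stack[-1][1] if stack else None)
        stack.append((indent, index))
    return parents
```

Lines that share a parent under this computation are candidates for a sibling relationship; the additional conditions that distinguish siblings from alternatives are not modeled here.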
A Logical Form rule “LF1” and a Logical Form rule “LF2” specify, respectively, two nodes in a parent and child relationship when LF1 is the first Logical Form rule above LF2 that has a lesser indentation than LF2. For example, in
Logical Form rules “LF1” and “LF2” specify two nodes in a sibling relationship when the following conditions are satisfied:
In certain cases, multiple Logical Form rules can be combined, with an appropriate logical operator, to form one compound Logical Form rule. For example, a group of Logical Form rules can be combined by the XOR operator. In this case, when one, and only one, of the Logical Form rules is satisfied, the compound Logical Form rule is also satisfied.
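The “one, and only one” semantics of the compound rule can be stated in a few lines. This is a minimal sketch; the function name `xor_satisfied` is hypothetical, and each element of `results` is assumed to be the outcome of testing one component Logical Form rule.

```python
def xor_satisfied(results):
    """Return True when exactly one component rule result is true."""
    return sum(1 for result in results if result) == 1
```

Counting is used deliberately: chaining a binary exclusive-or over three true results would yield true, which does not match the “one, and only one” requirement when a group has more than two rules.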
For the pseudo-coded example frame extraction rules presented herein, a pair of Logical Form rules “LF1” and “LF2” are implicitly combined with an XOR operator when the following conditions are satisfied:
7.3 Features
This section presents an example defining set (i.e., a set of lexical units) for each feature utilized in the example benefit frame extraction rules presented herein. As discussed above, a “feature” is represented, in the pseudo-coded frame extraction rules, by any word that is entirely capitalized. A multi-word lexical unit, that is a member of a defining set, is connected with the underscore character.
ABSTRACT_NOUN
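How a feature can be tested against a lexical unit is sketched below. The feature name `EXAMPLE_FEATURE` and the members of its defining set are hypothetical placeholders, not a defining set from this section; the function names are likewise assumptions. The sketch follows the two conventions stated above: a feature is an entirely capitalized word, and a multi-word lexical unit is joined with the underscore character.

```python
# Hypothetical defining set, for illustration only.
FEATURE_SETS = {
    "EXAMPLE_FEATURE": frozenset({"side_effect", "risk"}),
}


def is_feature(token):
    """A pattern token is treated as a feature when it is entirely capitalized."""
    return any(ch.isalpha() for ch in token) and token == token.upper()


def normalize(lexical_unit):
    """Lower-case a lexical unit and join its words with underscores."""
    return "_".join(lexical_unit.lower().split())


def token_matches(pattern_token, lexical_unit, feature_sets=FEATURE_SETS):
    """Match a lexical unit against a literal pattern token or a feature."""
    unit = normalize(lexical_unit)
    if is_feature(pattern_token):
        return unit in feature_sets.get(pattern_token, frozenset())
    return unit == normalize(pattern_token)
```

One limitation of this capitalization test is that a single-letter capitalized word would be read as a feature; a fuller treatment would need an explicit list of feature names.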
7.4 Snippet Formation
As discussed above, a snippet refers to the locality around the match of a frame to a location in computer-accessible content. More specifically, if a match of a frame has occurred in a UNL “UM1,” the snippet comprises a copy of UM1 (also called the “focus” UNL) and may also comprise a copy of additional, surrounding, contextual content.
Choosing an appropriately-sized snippet depends on several factors. First, it can depend upon the UNL by which frame instances are identified (e.g., whether frames are identified within individual sentences or across larger units of text). Second, it can depend upon providing sufficient surrounding context for keyword searching. Third, it can depend upon the amount of text that a user of a search system needs in order to read and evaluate a snippet apart from its original source content.
A specific issue to consider, in determining snippet size, is pronoun resolution. In the context of snippet size determination, the pronoun resolution problem can be stated as follows. If a pronoun occurs in a UNL “U1,” in which a frame instance has been identified, it is desirable that the pronoun's antecedent noun appear in the snippet context that surrounds “U1.” The larger the snippet size, the more likely it is that all pronouns of “U1” will be resolved. Counterbalancing pronoun resolution, however, are such factors as making a snippet small enough for fast comprehension by the searcher.
If the UNL by which frame instances are identified is the sentence, a snippet size of five sentences has been experimentally determined as desirable. Once a frame instance has been identified in a focus sentence “S1,” two sentences before S1 and two sentences after S1 can be added to the snippet to provide sufficient context for S1. While five sentences is a desirable goal, an individual snippet may comprise fewer than five sentences, depending upon the logical organization of the computer-accessible content from which snippets are being extracted. For example, the computer-accessible content may be organized into separate documents. If S1 is at the beginning of a document, two sentences prior to S1 may not be available for addition to the snippet. Similarly, if S1 is at the end of a document, two sentences after S1 may not be available for addition to the snippet.
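The five-sentence window, including its truncation at document boundaries, can be sketched as follows. The sketch assumes that a document has already been split into a list of sentences; the function name `make_snippet` and the parameter `context` are hypothetical.

```python
def make_snippet(sentences, focus_index, context=2):
    """Return the focus sentence with up to `context` sentences on each
    side, clipped to the bounds of the document."""
    if not 0 <= focus_index < len(sentences):
        raise IndexError("focus_index is outside the document")
    start = max(0, focus_index - context)
    end = min(len(sentences), focus_index + context + 1)
    return sentences[start:end]
```

For a focus sentence at the beginning or end of a document, the returned snippet has three sentences rather than five, which corresponds to the boundary cases described above.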
7.5 Computing Environment
Cloud 1530 represents data available via the Internet. Computer 1510 can execute a web crawling program, such as Heritrix, that finds appropriate web pages and collects them in an input database 1500. An alternative, or additional, route for collecting input database 1500 is to use user-supplied data 1531. For example, such user-supplied data 1531 can include the following: any non-volatile media (e.g., a hard drive, CD-ROM or DVD), record-oriented databases (relational or otherwise), an Intranet or a document repository. A computer 1511 can be used to process (e.g., reformat) such user-supplied data 1531 for input database 1500.
Computer 1512 can perform the indexing needed for formation of an appropriate FBDB (for example, an FBDB as discussed in section 5.2.2.3 “Pre-query Indexing”). The indexing phase scans the input database for sentences that refer to an organizing frame, produces a snippet around each such sentence and adds the snippet to the appropriate frame-based database.
Databases 1520 and 1521 represent, respectively, stable “snapshots” of databases 1500 and 1501. Databases 1520 and 1521 can provide stable databases that are available to service search queries entered by a user at a user computer 1533. Such a user query can travel over the Internet (indicated by cloud 1532) to a web interfacing computer 1514 that can also run a firewall program. Computer 1513 can receive the user query and perform a search upon the contents of the appropriate FBDB (e.g., FBDB 1521). The search results can be stored in a database 1502 that is private to the individual user. When a snippet of interest is found in the search results, input database 1520 is available to the user to provide the full document from which the snippet was obtained.
In accordance with what is ordinarily known by those in the art, computers 1510, 1511, 1512, 1513, 1514 and 1533 contain computing hardware, and programmable memories, of various types.
The information (such as data and/or instructions) stored on computer-readable media or programmable memories can be accessed through the use of computer-readable code devices embodied therein. A computer-readable code device can represent that portion of a device wherein a defined unit of information (such as a bit) is stored and/or read.
While the invention has been described in conjunction with specific embodiments, it is evident that many alternatives, modifications, variations and equivalents will be apparent in light of the foregoing description. Accordingly, the invention is intended to embrace all such alternatives, modifications, variations and equivalents, as fall within the scope (both literally and by reason of the doctrine of equivalents) of the appended claims.
As provided for under 35 U.S.C. § 120, this patent claims benefit of the filing date of the following U.S. patent application, herein incorporated by reference in its entirety: “Method and Apparatus For Determining Search Result Demographics,” filed 2010 Apr. 22 (y/m/d), having inventors Michael Jacob Osofsky, Jens Erik Tellefsen, Wei Li, and Ranjeet Singh Bhatia and application Ser. No. 12/765,848. As provided for under 35 U.S.C. § 119(e) and/or § 120, this patent claims benefit of the filing date for the following U.S. patent application(s), which are herein incorporated by reference in their entirety: “Method and Apparatus For Frame-Based Search,” filed 2008 Jul. 21 (y/m/d), having inventors Wei Li, Michael Jacob Osofsky and Lokesh Pooranmal Bajaj and application Ser. No. 12/177,122 (“the '122 Application”); “Method and Apparatus For Frame-Based Analysis of Search Results,” filed 2008 Jul. 21 (y/m/d), having inventors Wei Li, Michael Jacob Osofsky and Lokesh Pooranmal Bajaj and application Ser. No. 12/177,127 (“the '127 Application”); and “Method and Apparatus For Automated Generation of Entity Profiles Using Frames,” filed 2009 Jul. 20 (y/m/d), having inventors Wei Li, Michael Jacob Osofsky and Lokesh Pooranmal Bajaj and App. No. 61/227,068 (“the '068 Application”). This application is related to the following U.S. patent application(s), which are herein incorporated by reference in their entirety: the '122 Application; the '127 Application; and the '068 Application.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 12765848 | Apr 2010 | US |
Child | 14704919 | | US |