1. Field of the Invention
The invention generally relates to retrieving, organizing, and indexing documents, and more particularly to a process for extracting information from large text collections that takes as its query example either a few documents or portions of a few documents.
2. Description of the Related Art
Within this application several publications are referenced by Arabic numerals within brackets. Full citations for these, and other, publications may be found at the end of the specification immediately preceding the claims. The disclosures of all these publications in their entireties are hereby expressly incorporated by reference into the present application for the purposes of indicating the background of the present invention and illustrating the state of the art.
Several real-world problems fall into the category of single-class learning, where training data is available for only a single class. Examples of such problems include the identification of a certain class of web pages from the Internet; e.g., “personal home pages” or “call for papers”[12]. Building training data for such problems can be a particularly arduous task. For example, consider the task of building a classifier to identify pages about IBM. Certainly, not all pages that mention IBM are about IBM. To build such a binary classifier would require a sample of positive examples that characterizes all aspects that can be considered to be about IBM. Constructing a negative class would require a uniform representation of the universal set excluding the positive class[12]. This is too laborious a task to be performed manually.
Information extraction is yet another area in which a significant number of problems fall into the category of single-class learning. Ranging from identifying named-entities to extracting user-expressed opinion from a body of text, information extraction spans a wide array of problems. The information needs of users are diverse and numerous, precluding the creation of significant numbers of labeled examples. For example, consider an oil company's corporate reputation management group interested in monitoring articles about its own and its competitors' image in areas such as diversity in the workplace, oil spill issues, and environmental policies. Obtaining positive and negative labeled data for each such topic is almost impossible. Users are typically willing to provide only a few carefully crafted hand-labeled examples. It is precisely this single-class problem with very few labeled examples that has not been addressed by conventional solutions.
The need for single-class learning has been recognized, and there have been a few previous efforts focusing on learning from positive examples. One conventional approach[9] operates by mapping the data using a kernel and then using the origin as the negative class. In practice this technique has been found to be very sensitive to parameter changes[5], and heuristic modifications have been suggested that place more than just the origin in the negative class. Recent work has described including unlabeled examples in an iterative framework that identifies examples that do not share features with positive examples[12]. These are treated as negative examples to learn a support vector machine. Moreover, these approaches have concentrated on identifying negative examples and using them in a discriminative training framework. The motivation in these approaches has been towards building classifiers that do not degrade in accuracy with the growth in the size of labeled data[12].
Generative modeling approaches have also been applied to the problem of partially labeled data. Unsupervised approaches to modeling use joint distributions over the features to identify clusters in the data. In particular, finite mixture models, whose parameters are learned using the popular expectation maximization (EM) methodology, are used extensively. Another conventional approach[6] modifies the EM methodology to allow for the incorporation of labeled data. This approach can be used with very limited labeled data. A variant of this approach to the single-class problem, but with larger amounts of labeled data, has been described in other solutions as well[4].
Query-by-example (QBE) has been around for a long time; however, the problem has not been treated successfully in the past. Existing methods treat the problem in a simplistic fashion: the most popular technique is nearest neighbor, and beyond nearest neighbor only simple partially supervised methods have been used, without complete success.
Consider the following latent variable model:
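The equation itself does not appear in this text; a standard mixture form consistent with the surrounding description is assumed here:
p(z)=Σa p(a)p(z|a), (1)
where a is a latent class variable and z is an observed data point.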
Such models are useful in classification problems where the latent variable a is interpreted as a class label. Training of this model involves adjusting the parameters of the probability distributions p(z|a) and p(a). This model can be trained effectively using the EM methodology. Next, a derivation of the EM methodology is provided that will subsequently be extended to the invention's multistage methodology. Given a dataset {z1, z2, . . . , zn} of individual observations of z, the log likelihood of the model is:
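This equation is likewise absent from the text; for a mixture of the form (1) it would take the standard form
L=Σi log Σa p(a)p(zi|a). (2)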
The EM methodology is derived by introducing an indicator-hidden variable. Writing the bound, and taking expectations of equation (2), it can be shown that the log-likelihood of the model is bounded from below by the following Q function:
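Assuming the standard form of this bound (the original equation is not shown),
Q=Σi Σa q(a|zi) log[p(a)p(zi|a)], (3)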
where q(a|zi) is equal to p(a|zi). The Q function is proportional to the log-likelihood of the joint distributions log p(a, z). The EM methodology is defined by maximizing Q, instead of the original log-likelihood, in an iterative process comprising the following two steps: (1) E-Step: Compute q(a|zi)=p(a|zi) keeping the parameters fixed; (2) M-Step: Fix q(a|zi) in equation (3) and obtain the maximum likelihood estimate of parameters of p(zi|a) and p(a).
A labeled example, which is also referred to as a seed in the description of the present invention, is a data point that is known to belong to a particular class (topic) of interest. The EM methodology for the model shown in equation (1) is an unsupervised methodology; that is, there are no labeled examples. Thus, a few labeled examples for the class of interest must be introduced. Incorporating this information into the EM methodology results in a semi-supervised version[6]. Again, the EM methodology introduces a hidden variable, and in the E-step the methodology computes the expected value of these hidden variables. For the labeled examples the value of the hidden variable is known. It will be assumed that a=1 is the class of interest. Instead of computing the expected value in the E-step, the semi-supervised methodology simply assigns q(a=1|zseed)=1 and q(a≠1|zseed)=0. This will be referred to as the “seed constraint.”
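By way of illustration only, the following minimal sketch shows how such a semi-supervised EM iteration with the seed constraint could be implemented for a multinomial (bag-of-words) mixture. The two-component model, the Laplace smoothing constant, and all function and variable names are assumptions made for this example rather than details taken from the description above.

    import numpy as np

    def semi_supervised_em(Z, seed_idx, n_classes=2, n_iter=50, alpha=1.0):
        # EM for a multinomial mixture p(z) = sum_a p(a) p(z|a).
        # Z        : (n_points, n_features) matrix of feature counts.
        # seed_idx : indices of the labeled examples, all assumed to belong to class a=1.
        # alpha    : Laplace smoothing constant (an assumption for this sketch).
        n = Z.shape[0]
        rng = np.random.default_rng(0)
        q = rng.dirichlet(np.ones(n_classes), size=n)          # responsibilities q(a|z_i)
        for _ in range(n_iter):
            # Seed constraint: q(a=1|z_seed)=1 and q(a!=1|z_seed)=0.
            q[seed_idx, :] = 0.0
            q[seed_idx, 1] = 1.0
            # M-step: maximum-likelihood estimates of p(a) and p(z|a).
            p_a = q.sum(axis=0) / n
            p_z_a = (q.T @ Z) + alpha
            p_z_a /= p_z_a.sum(axis=1, keepdims=True)
            # E-step: q(a|z_i) proportional to p(a) * prod_j p(feature_j|a)^count_ij.
            log_post = np.log(p_a) + Z @ np.log(p_z_a).T
            log_post -= log_post.max(axis=1, keepdims=True)    # numerical stability
            q = np.exp(log_post)
            q /= q.sum(axis=1, keepdims=True)
        return p_a, p_z_a, q

After convergence, q[:, 1] plays the role of the posterior p(a=1|zi) used to rank unlabeled data points.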
However, with very few labeled examples, seed constraints alone are not sufficient to tackle the above-identified problem. Thus, a more powerful model and methodology are needed. Therefore, due to the limitations of the conventional approaches, there remains a need for a novel QBE process used for single-class learning, which overcomes the problems of the conventional designs.
In view of the foregoing, the invention provides a system for extracting information comprising a query input; a database of documents; a plurality of classifiers arranged in a hierarchical cascade of classifier layers, wherein each classifier comprises a set of weighted training data points comprising feature vectors representing any portion of a document, and wherein the classifiers are operable to retrieve documents from the database matching the query input; and a terminal classifier weighing an output from the cascade according to a rate of success of query terms being matched by each layer of the cascade, wherein each classifier accepts an input distribution of the training data points and transforms the input distribution to an output distribution of the training data points, wherein each classifier is trained by weighing training data points at each classifier layer in the cascade by an output distribution generated by the preceding classifier layer, wherein weights of the training data points of the first classifier layer are uniform, wherein each classifier is trained according to the query input, and wherein the query input is based on a minimum number of example documents. The documents comprise any of text files, images, web pages, video files, and audio files. In fact, the documents comprise a file format capable of being represented by feature vectors.
According to the invention a classifier at each layer in the hierarchical cascade is trained with an expectation maximization methodology that maximizes a likelihood of a joint distribution of the training data points and latent variables. Each layer of the cascade of classifiers is trained in succession from a previous layer by the expectation maximization methodology, wherein the output distribution is used as an input distribution for a succeeding layer. Alternatively, each layer of the cascade of classifiers is trained by successive iterations of the expectation maximization methodology until a convergence of parameter values associated with the output distribution of each layer occurs in succession, wherein the successive iterations comprise a fixed number of iterations.
In another embodiment, all layers of the cascade of classifiers are trained by successive iterations of the expectation maximization methodology until a convergence of parameter values associated with output distributions of all layers occurs, wherein during each step of the iterations, the output distribution of each layer is used to weigh the input distribution of a succeeding layer. The terminal classifier generates a relevancy score associated with each data point, wherein the relevancy score comprises an indication of how closely matched the data point is to the example documents, wherein the relevancy score is computed by combining the relevancy scores generated by classifiers at each layer of the cascade. In an embodiment of the invention, the terminal classifier generates a relevancy score associated with a document, wherein the relevancy score is calculated from relevancy scores of individual data points within the document. Alternatively, each classifier layer generates a relevancy score associated with each data point, wherein the relevancy score comprises an indication of how closely matched the data point is to the example documents. According to another embodiment of the invention, features of the feature vectors comprise words within a range of words located proximate to entities of interest in the document.
In another embodiment, the invention provides a method of extracting information, wherein the method comprises inputting a query; searching a database of documents based on the query; retrieving documents from the database matching the query using a plurality of classifiers arranged in a hierarchical cascade of classifier layers, wherein each classifier comprises a set of weighted training data points comprising feature vectors representing any portion of a document; and weighing an output from the cascade according to a rate of success of query terms being matched by each layer of the cascade, wherein the weighing is performed using a terminal classifier.
The invention works in a novel way by weighing the data at each stage by the output distribution from the previous stage. In the first stage the previous output distribution is assumed to be uniform. In the process, the invention creates nested sets of information, whereby data points further from the core are less related to the topic of interest.
These and other aspects and advantages of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating preferred embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the invention without departing from the spirit thereof, and the invention includes all such modifications.
The invention will be better understood from the following detailed description with reference to the drawings, in which:
The invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the invention. The experiments described and the examples used herein are intended merely to facilitate an understanding of ways in which the invention may be practiced and to further enable those of skill in the art to practice the invention. Accordingly, the experiments and examples should not be construed as limiting the scope of the invention.
As mentioned, there is a need for a novel QBE process used for single-class learning, which overcomes the problems of the conventional designs. Referring now to the drawings and more particularly to
According to an embodiment of the invention each classifier accepts an input distribution of the training data points and transforms the input distribution to an output distribution of the training data points, wherein each classifier is trained by weighing training data points at each classifier layer in the cascade by an output distribution generated by each previous classifier layer, wherein weights of the training data points of the first classifier layer are uniform, wherein each classifier is trained according to the query input, and wherein the query input is based on a minimum number of example documents.
The invention also provides that a classifier at each layer in the hierarchical cascade is trained with an expectation maximization methodology that maximizes a likelihood of a joint distribution of the training data points and latent variables. Each layer of the cascade of classifiers is trained in succession from a previous layer by the expectation maximization methodology, wherein the output distribution is used as an input distribution for a succeeding layer. Alternatively, each layer of the cascade of classifiers is trained by successive iterations of the expectation maximization methodology until a convergence of parameter values associated with the output distribution of each layer occurs in succession, wherein the successive iterations comprise a fixed number of iterations.
In another embodiment, all layers of the cascade of classifiers are trained by successive iterations of the expectation maximization methodology until a convergence of parameter values associated with output distributions of all layers occurs, wherein during each step of the iterations, the output distribution of each layer is used to weigh the input distribution of a succeeding layer. The terminal classifier generates a relevancy score associated with each data point, wherein the relevancy score comprises an indication of how closely matched the data point is to the example documents, and wherein the relevancy score is computed by combining the relevancy scores generated by classifiers at each layer of the cascade.
In an embodiment of the invention, the terminal classifier generates a relevancy score associated with a document, wherein the relevancy score is calculated from relevancy scores of individual data points within the document. Alternatively, each classifier layer generates a relevancy score associated with each data point, wherein the relevancy score comprises an indication of how closely matched the data point is to the example documents.
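The description above does not fix a particular rule for combining the per-layer relevancy scores into the terminal score. The following sketch illustrates one simple possibility, in which a data point's terminal score is the product of its per-layer class-of-interest posteriors and a document's score is the maximum over its data points; both combination rules and the names used are illustrative assumptions.

    def terminal_data_point_score(layer_scores):
        # layer_scores[m] is assumed to be the relevancy score (posterior probability of the
        # class of interest) assigned to the data point by cascade layer m.
        score = 1.0
        for s in layer_scores:
            score *= s          # one simple combination; weighted sums are equally possible
        return score

    def terminal_document_score(data_point_scores):
        # A document is scored by its best-matching data point (an assumption for this sketch).
        return max(data_point_scores) if data_point_scores else 0.0

    # Example: a data point scored by a three-layer cascade, then a document with three data points.
    print(terminal_data_point_score([0.9, 0.8, 0.7]))    # 0.504
    print(terminal_document_score([0.504, 0.10, 0.02]))  # 0.504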
According to the invention, a feature vector is a vector of counts for all the features in a data point. For example, if a data point is a text document, then a feature can be a word, an n-gram, a stemmed word, or another feature used in linguistic tokenization, and a feature vector is a vector of counts of how many times each feature appears in the document. Additionally, the feature vectors may comprise words within a range of words located proximate to certain entities of interest appearing in the documents from which the data points are formed.
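As a small, simplified illustration of such a feature vector (the tokenization here is much cruder than the preprocessing described later, and the vocabulary is invented for the example):

    from collections import Counter

    def feature_vector(text, vocabulary):
        # Count how many times each vocabulary feature (here, a plain lowercased word) appears.
        counts = Counter(text.lower().split())
        return [counts[word] for word in vocabulary]

    vocab = ["chip", "wafer", "micron", "process"]
    print(feature_vector("Intel moves its chip process to a 0.13 micron process", vocab))
    # -> [1, 0, 1, 2]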
The invention provides a semi-supervised query-by-example methodology for single class learning with very few examples. The problem is formulated as a hierarchical latent variable model, which is clipped (edited) to ignore classes not of interest. The model is trained using a multi-stage EM methodology. The multi-stage EM methodology maximizes the likelihood of the joint distribution of the data and latent variables, under the constraint that the distribution of each layer is fixed in successive stages. The invention uses a hierarchical latent variable model and in contrast to conventional approaches, the invention concentrates only on the class of interest. Furthermore, the invention's methodology uses both labeled and unlabeled examples in a unified model. As is further discussed below, under certain conditions, namely when the underlying data have hierarchical structures, the invention's methodology performs better than training all layers in a single stage. Finally, as described below, experiments are conducted to verify the performance of the methodology on several real-world information extraction tasks.
Next, extensions to the simple latent variable model described above are discussed. To begin with, a hierarchical latent variable model followed by a constrained version of this hierarchical model suitable for single-class classification are described.
Consider a two level hierarchical model:
where a0 and a1 are the two levels in the hierarchy. Given the same observed data, the likelihood is:
If the task is only to identify a single class from multiple possibilities, there is a trade-off between the number of hidden classes (and hence the computational cost) and the precision of the chosen class. The mixture model (1) represents not only the required class but also other classes present in the data. Training such a simple model has two significant drawbacks: if the number of components in the mixture model is small, then the chosen class will contain most of the items of interest along with a large number of spurious items; if the number of classes is large, the conventional EM methodology spends much of its computational resources training a large number of classes that are not of interest. Similarly, a full-blown model of the form (4) can be expensive to train due to the combinatorial effect of the hierarchical hidden variables in the E-step.
If the data has a hierarchical structure, it is intuitively plausible that a methodology that progressively “zooms in” on the identified class may be beneficial. This is beneficial because the computing resource is not used for discriminating between other topics not of interest. In particular, if one is interested in a=1 it might appear that a “clipped model” of the following form could be effective:
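The clipped model itself is not reproduced in this text; a plausible form, consistent with the constraints of equations (13)-(15) and the factorization of equation (17), expands only the class of interest a0=1 into subclasses a1 and leaves the remaining top-level classes unrefined:
p(z)=Σa0≠1 p(a0)p(z|a0)+p(a0=1)Σa1 p(a1|a0=1)p(z|a1,a0=1). (6)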
However, this zooming effect cannot be achieved with training in a single stage.
The clipped model (6) is advantageous if the training is performed in the following stagewise fashion: layer m is trained in stage m by fixing all of the layers before m. The log-likelihood of the model
The EM methodology can be generalized to maximize the objective function:
in the M step, where q(a|zi) is fixed in the E step as p(a|zi). The objective function of equation (3) is a special case of this objective function, when q(zi) is uniform over all zi. For the clipped hierarchical model, the E step calculates
q(a0,a1|zi)=p(a0,a1|zi)=p(zi,a0,a1)/p(zi), (9)
subject to the seed constraints, and the M Step maximizes
According to the invention, the multistage EM methodology trains each layer successively while imposing layered constraints on both q and p. In one embodiment of the invention, this training is performed in a two-stage process. In the first stage, a0 is the only latent variable. The M step maximizes:
where q0(zi)=q(zi)=1/N is fixed to the output distribution and q(a0|zi)=p(a0|zi) is calculated in the E step, subject to the seed constraints. The final values to which p and q converge are denoted p0 and q0. These will not be changed in subsequent stages. The second stage involves latent variables a0 and a1. The M step maximizes
with the condition that
q(zi,a0)=q0(zi,a0), (13)
p(a0)=p0(a0), (14)
p(zi|a0)=p0(zi|a0), ∀a0≠1. (15)
In other words, only the part of the model that involves a1 is allowed to change, and q(zi,a0) is regarded as the output distribution of the “expanded data” involving both z and a0. To derive the M step, the objective function is expanded as
On the right hand side of equation (16) since p(a0) is fixed, the first term is constant. Since p(zi|a0) is fixed for a0≠1, the second term is constant. To maximize the third term, consider the following factorization (keeping in mind that a0=1)
q(zi,a0,a1)=q(a0)q(zi|a0)q(a1|zi,a0) (17)
Both q(a0) and q(zi|a0) are fixed from the previous layer. The last factor, subject to seed constraints, is calculated as (E step):
p(a1|zi,a0=1)=p(a1|zi)=q1(a1|zi) (18)
Therefore the M step maximizes
where we have defined q1(zi)=q0(zi|a0=1).
For the multistage EM, when training for the distributions involving a1, the expanded output distribution q(zi,a0) is fixed from the previous layer. In contrast, for the full EM methodology, only q1(zi) is fixed at any time of the training process. This can be generalized to multiple layers. For layer m, the M step computes pm(zi,am) to maximize
where qm(zi)=qm−1(zi|am−1=1) comes from layer m−1 and qm(am|zi)=pm(am|zi), subject to seed constraints, is calculated in the E step.
The basic idea behind the methodology provided by the invention is that by weighing each data point with qm(zi), less emphasis is placed on those zi that are less likely to be in class 1. Again, this is beneficial because computing resources are not used for discriminating between other topics not of interest. The discrimination in layer m can therefore concentrate on finer details that are difficult to address at layer m−1.
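To make this layer-by-layer reweighting concrete, the following sketch runs one weighted EM stage per layer and derives the next layer's data-point weights from the current layer's class-of-interest posterior. The two-component mixture, the fixed stage count, and all names are assumptions made for illustration, not details taken from the text.

    import numpy as np

    def weighted_em_stage(Z, weights, seed_idx, n_classes=2, n_iter=50, alpha=1.0):
        # One stage of the multistage EM: a multinomial mixture in which each data point
        # z_i enters the M-step with weight q_m(z_i). Returns p(a=1|z_i) for every point.
        n = Z.shape[0]
        rng = np.random.default_rng(0)
        q = rng.dirichlet(np.ones(n_classes), size=n)
        for _ in range(n_iter):
            q[seed_idx, :] = 0.0                    # seed constraint: labeled points belong
            q[seed_idx, 1] = 1.0                    # to the class of interest a=1
            wq = weights[:, None] * q               # weigh responsibilities by q_m(z_i)
            p_a = wq.sum(axis=0) / wq.sum()
            p_z_a = (wq.T @ Z) + alpha              # Laplace-smoothed per-class feature counts
            p_z_a /= p_z_a.sum(axis=1, keepdims=True)
            log_post = np.log(p_a) + Z @ np.log(p_z_a).T
            log_post -= log_post.max(axis=1, keepdims=True)
            q = np.exp(log_post)
            q /= q.sum(axis=1, keepdims=True)
        return q[:, 1]

    def multistage_em(Z, seed_idx, n_stages=2):
        # Stage 0 starts from uniform weights; afterwards q_{m+1}(z_i) is proportional to
        # q_m(z_i) * p(a_m=1|z_i), i.e., to the output distribution of the previous layer.
        weights = np.full(Z.shape[0], 1.0 / Z.shape[0])
        posteriors = []
        for _ in range(n_stages):
            post = weighted_em_stage(Z, weights, seed_idx)
            posteriors.append(post)
            weights = weights * post
            weights /= weights.sum()
        return posteriors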
Next, the relationship between the multi-stage EM methodology provided by the invention and boosted density estimation as provided by other approaches[8] is discussed. At each stage m in the multi-stage EM methodology, the model built so far is denoted by:
In a departure from conventional methods[8], the patterns in each successive layer are weighted using the output distribution from the previous layer. That is, the invention weighs the patterns according to how well they performed in the previous layer. This is because, unlike boosted density estimation, one of the objectives of the invention is reducing classification error. The invention's partially supervised methodology is concerned with the classification of a single class, but it also tries to improve that classification by obtaining successively better density estimates for the single class. Another difference between the conventional methods[8] and the invention is that the invention is a learning methodology that uses the weights within the iterations of the EM methodology. More specifically, the weights are used in the M-Step as described in equation (19). The multi-stage EM methodology, expressed in a boosting framework, is indicated below:
As mentioned, information extraction from large text collections is an important problem. Within this class, a particularly interesting problem is that of identifying appropriate topics. In particular, one concern is the problem of identifying topics in relation to specific named-entities, as described in detail below. Often users are interested in information pertaining to specific persons, companies, or places. These names of people, companies, and places have a special place in natural language processing and are called named-entities. The reason for the special treatment is that these are valuable, non-ambiguous, user-defined terms. For example, consider a user who is interested in keeping track of Intel Corporation's strategy to produce cheaper, faster, and thermally more efficient microchips and microprocessors. Ideally, the user should be able to express this query in natural language and the system would respond with the answer.
Recognizing that named-entities are important, unambiguous, user-defined terms, anchored topic retrieval uses the immediate context of these named-entities to determine the topic pertaining to them. Consider, e.g., the portion of a document shown below:
Clearly, the discussion is about Intel moving to a 0.13 micron manufacturing process using a 300 mm wafer, which will reduce Intel's manufacturing cost, increase speed, and produce cooler chips; such a passage would be relevant to the query. However, not all relevant occurrences of Intel will necessarily contain the terms faster, cheaper, and cooler in their context. The complicated semantic nature of the query requires a more sophisticated response.
The anchored topic retrieval problem uses an example, of the sort shown above, as a substitute for the query. The name is derived from the fact that the portion of the document used as a query is anchored on named-entities (Intel in this example). The underlying corpus is processed and every occurrence of the named-entity, with its associated context, is considered a candidate. Formally the problem is described as follows: We are given a set of identified anchors in documents and a query q, which is a small sub-set of the identified anchors. The problem at hand is to classify the remaining anchors as being relevant to the query or not.
Surrounding each anchor is a context. The context is restricted to tokens within l characters on each side of the anchor. The text within the window is tokenized into words. Partial words at the boundaries of the window and stop words are removed. Suffix stemming is performed on each word using the well-known Porter's stemmer. This results in a sequence of tokens. Furthermore, lexical affinities, i.e., pairs of tokens within a window of five tokens of each other, are also used as features. All terms that occur in fewer than three contexts are discarded. The context around each anchor is now represented as a vector of features, each feature being either a token or a bigram.
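A minimal sketch of this preprocessing is given below, assuming NLTK's Porter stemmer and a simple regular-expression tokenizer as stand-ins for the components described; the value of l, the stop-word list, and the boundary-handling rule are placeholders chosen for the example.

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer    # Porter's stemmer, as mentioned above

    STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}   # illustrative list
    stemmer = PorterStemmer()

    def context_features(document, anchor_start, anchor_end, l=200):
        # Take l characters on each side of the anchor, tokenize, and build features.
        window = document[max(0, anchor_start - l): anchor_end + l]
        words = re.findall(r"[A-Za-z0-9.]+", window)
        words = words[1:-1]                 # crudely drop possibly partial words at the boundaries
        tokens = [stemmer.stem(w.lower()) for w in words if w.lower() not in STOP_WORDS]
        features = list(tokens)
        # Lexical affinities: pairs of tokens occurring within five tokens of each other.
        for i, t in enumerate(tokens):
            for u in tokens[i + 1: i + 6]:
                features.append(tuple(sorted((t, u))))
        return Counter(features)

    def prune_rare_features(contexts, min_contexts=3):
        # Discard features that occur in fewer than three contexts.
        df = Counter(f for c in contexts for f in set(c))
        return [Counter({f: n for f, n in c.items() if df[f] >= min_contexts}) for c in contexts]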
Each zi in the anchored topic retrieval problem consists of an anchor xi and a context yi. It is assumed in the model that, conditioned on the latent variable, the anchor and the context are independent. The simple latent variable model for the anchored retrieval problem is therefore written as:
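The equation is not reproduced in this text; given the stated conditional-independence assumption, its natural form would be
p(xi,yi)=Σa p(a)p(xi|a)p(yi|a).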
The hierarchical and the clipped models, mentioned before, can be extended to include xi and yi. The probability model assumed for p(xi|a) is a simple multinomial, where each xi takes on one of X possible unique anchor values. The probability model for p(yi|a) is a multinomial parameterized by a vector with length equal to the size of the dictionary obtained using the previously described preprocessing. The anchored topic retrieval problem has some characteristics peculiar to it. Foremost, limiting the context of every anchor results in a text classification problem in which all documents are of approximately the same length, and the limited context keeps that length short. Moreover, since the context is around a named-entity, it is often true that the topic of discussion is fairly focused. At first glance this might seem like an easy problem. However, the limitation of very few labeled examples increases the difficulty of the task.
Experiments on real-world datasets have been conducted to prove the validity of an embodiment of the invention. The document collection is gathered from the Tech News section of an online website, Cnet, by crawling the site for news articles and extracting them in an XML format. The news articles are mostly about the business aspects of information technology companies. A total of 17,184 documents were retrieved over a period of several weeks. Duplicates were removed using a cosine similarity measure on the frequency of tokens, leaving 5,268 unique documents in the collection. It is not uncommon for a single article to discuss multiple topics, sometimes interleaved. For instance, industry analysts may publish their opinions on several companies or technologies in a single article. These real-world characteristics of the news article collection make this an interesting and challenging testbed for the invention's methodology.
Anchors can be spotted either as named entities, as given patterns of regular expressions, or as an explicitly given set of names. For the experiments with CNet Tech News articles, a commercially available named-entity tagger[7] is used. It identified 7116 unique entities as organizations in various contexts. This list contains many false positives. However, to test the robustness of the methodology, experiments were conducted with several ways of reducing this list. The list was pruned by visual inspection resulting in a list of 2151 unique entities with a total of 87,251 occurrences in the corpus. Since one of the queries used to test the invention's methodology is in the semiconductor manufacturing domain, the list of 2151 entities was pruned and all names that were definitely not semiconductor manufacturers were removed, resulting in a list of 181 different names with a total of 29,253 occurrences.
For the multistage methodology the number of components in the mixture model was kept at two. Laplace's rule is used for smoothing. For comparison, the experiments evaluated the invention against two standard methodologies, namely nearest neighbor[11] and the single-layer latent variable model trained using the partially supervised EM methodology[6]. For both topics, a type of proximity search based on patterns given by domain experts was also tested. The proximity search is performed on the identified anchors with the original (not tokenized) contexts for varying window sizes.
The experiments were conducted in two domains: semiconductor manufacturing and Web Services. The specific topics within these domains can be described as follows: (1) Topic 1 covers steps taken by semiconductor manufacturers to produce cheaper, faster, and thermally more efficient microprocessors and microchips; and (2) Topic 2 covers web service protocols for business process integration. The topics used in these evaluations are chosen specifically to illustrate the advantages and potential pitfalls of using unlabeled documents within the model. Specifically, Topic 1 is chosen to be a broad topic, and the chosen seeds are such that the overall essence of the topic is captured by the entire context. On the other hand, Topic 2 is chosen as a narrow topic; for this topic the existence of specific words and/or phrases is sufficient to indicate whether the anchor belongs to the topic. For semiconductor manufacturing, three anchors occurring in two documents are identified from the corpus as relevant to the query. In web services, three anchors from three documents are selected as seeds. Results produced by the retrieval methodology are manually evaluated by the domain experts, producing the precision-recall results described below.
The results obtained by the methodology on different parameter settings are shown in
The reason for the drop in performance of the hierarchical model for Topic 2 can best be explained as follows. It has been shown before that text data is modeled better in lower-dimensional subspaces. Techniques such as LSI[2] and PLSI[3] have been proposed for this purpose. For unsupervised learning in text, it has been shown that feature selection is important in identifying appropriate underlying topics[10]. It is believed that, when learning with unlabeled data, feature selection is equally significant. The experiments used all features (except for very rare and very common tokens), which increases the variability of the results produced for Topic 2. This effect is less pronounced in Topic 1 due to the fact that the entire context surrounding all the seed anchors is relevant (a fact evident from the poor performance of the proximity pattern search).
The requirement for a topic-specific named-entity list can be a potential drawback. To check the sensitivity of the invention's methodology to the choice of the named-entity tagger, experiments were performed on Topic 1 (chip manufacturing) using both a topic-specific named-entity list (
Information extraction from text is an extremely important problem with some major challenges. In particular it introduces the problem of single-class learning with very few labeled examples. The invention addresses this problem with a novel, clipped hierarchical latent variable model. Further, the invention provides a new variant of the EM methodology to learn the parameters of this model. The results on real-world examples reflect the validity of the invention.
A system in accordance with an embodiment of the invention is shown in
A representative hardware environment for practicing the present invention is depicted in
Thus, the invention provides a technique for organizing and retrieving documents based on a query, whereby a few (a minimum number of) example documents are used as the basis for the query. Then, using the query, the invention applies a cascade of classifiers, which act as filters screening the documents for relevancy against the particular query input. The invention performs the expectation maximization methodology at each level of the cascade of classifiers in order to generate an output for each classifier indicating the relevancy of that particular classifier for the query. The invention arranges the output using a terminal classifier in such a way as to provide a user with the most relevant documents in a database, which match, or most closely match, the query. The invention is able to achieve high-recall querying by using multistage semi-supervised learning in its application of the expectation maximization methodology. In fact, the invention is able to retrieve and sort many documents based on just a few documents as a starting point for the query.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.