Embodiments of the present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings. Embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings.
In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which are shown by way of illustration embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present invention. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of embodiments in accordance with the present invention is defined by the appended claims and their equivalents.
Various operations may be described as multiple discrete operations in turn, in a manner that may be helpful in understanding embodiments of the present invention; however, the order of description should not be construed to imply that these operations are order dependent.
The terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still cooperate or interact with each other.
For the purposes of the description, a phrase in the form “A/B” means A or B. For the purposes of the description, a phrase in the form “A and/or B” means “(A), (B), or (A and B)”. For the purposes of the description, a phrase in the form “at least one of A, B, and C” means “(A), (B), (C), (A and B), (A and C), (B and C), or (A, B and C)”. For the purposes of the description, a phrase in the form “(A)B” means “(B) or (AB)”; that is, A is an optional element.
The description may use the phrases “in an embodiment,” or “in embodiments,” which may each refer to one or more of the same or different embodiments. Furthermore, the terms “comprising,” “including,” “having,” and the like, as used with respect to embodiments of the present invention, are synonymous.
In various embodiments of the present invention, methods, apparatuses, and systems to facilitate the classification of text fragments are provided. More specifically, techniques, systems, and apparatuses for performing implicit and positional contextualization of text fragments are disclosed. This contextualization extracts as much information as possible from a text fragment. In this manner, every available piece of information may be utilized to generate a feature set that can then be classified. Such a classification, for example, may notify a user that the text fragment contains, or conversely does not contain, inappropriate material. The inventive techniques may be implemented in any device suitably configured for receiving text fragments, including but not limited to: cellular devices, smart phones, personal digital assistants (“PDAs”), personal computers, and other networked devices. The invention is not to be limited in this regard.
Referring now to
In the illustrated embodiment, the host device 100, which, as stated previously, may be any device suitably configured for receiving wireless or wired text fragments, receives a text fragment 108. The text fragment 108, in the illustrated embodiment, is a wireless document having a layout structure which includes a title 102 and a body of text 104. Upon receiving the text fragment 108, the host device 100 may generate individual contextualized tokens 106. Contextualized tokens may be generated using implicit contextualization, which will be discussed more fully herein, or positional contextualization. In the illustrated embodiment, the host device 100 utilizes positional contextualization to contextualize each term within the text fragment 108. More specifically, in the illustrated embodiment, the host device 100 generates a contextualized token 106 by effectively pairing a term with its positional context, i.e., title or text. In the illustrated embodiment, the host device 100 ignores punctuation, case, and terms shorter than three characters. In various other embodiments, these guidelines may be modified. The host device 100 may then determine a feature set 110 based on the contextualized tokens 106. The feature set 110 may then be used, in various embodiments, to facilitate classification of the text fragment 108. In the illustrated embodiment, the contextualized token includes a term from the text fragment 108 and its respective positional context. It is contemplated, however, that a contextualized token may include any number of terms and/or any number of respective contexts.
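By way of illustration only, the positional contextualization described above may be sketched in Python as follows. The sketch is not the only possible implementation. The "context:term" token format, the function names `contextualize` and `feature_set`, the sample title and body, and the use of term counts as the feature set are all assumptions made for this example and do not appear in the description.

```python
import re
from collections import Counter

# Terms shorter than this are ignored, as in the illustrated embodiment.
MIN_TERM_LENGTH = 3


def contextualize(text, context):
    """Pair each qualifying term in `text` with its positional context.

    The "context:term" string format is an assumption for illustration.
    """
    # Lowercasing discards case; the pattern keeps only alphanumeric runs,
    # which discards punctuation.
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{context}:{term}" for term in terms if len(term) >= MIN_TERM_LENGTH]


def feature_set(title, body):
    """Build a feature set (here, hypothetically, token counts) from a document."""
    tokens = contextualize(title, "title") + contextualize(body, "text")
    return Counter(tokens)


# Hypothetical document with a title and a body of text.
features = feature_set("Big Sale!", "A sale on big TVs, today only.")
```

In this sketch the term "sale" yields two distinct features, one for the title and one for the text, so a downstream classifier may weigh the same term differently depending on where it appears.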
In various other embodiments, the text fragment 108 may be a Short Message Service (“SMS”) message, a chat message, a Uniform Resource Locator (“URL”), and/or any other form of text received wirelessly or over a wired connection. Additionally, within each of the various embodiments, the text fragment 108 may also utilize formatting characteristics including, but not limited to: layout structures, text formatting, text coloring, punctuation, various case usage, unique number sequences, images, and/or links. In certain embodiments, these characteristics may be used to facilitate positional contextualization of the text fragments. For instance, in one embodiment, a host device 100 may receive a URL and utilize the contexts inherent in a URL, such as: a server, a path, a filename, and a file_type. In another embodiment, a host device 100 may receive an SMS message and utilize contexts that reflect human notions, such as: first_sentence, URL, text, and upper_case_text. In still another embodiment, the host device 100 may receive a chat message and utilize contexts including: URL, text, or upper_case.
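As a non-limiting sketch, the URL embodiment may be illustrated in Python using the standard `urllib.parse` and `posixpath` modules. The way the URL is divided here (directory portion as the path, name without extension as the filename, extension as the file_type), the function names, the "context:term" token format, and the sample URL are assumptions made for this example only.

```python
import posixpath
import re
from urllib.parse import urlsplit


def url_contexts(url):
    """Split a URL into the server, path, filename, and file_type contexts.

    The exact boundaries between contexts are an assumption for illustration.
    """
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    stem, extension = posixpath.splitext(filename)
    return {
        "server": parts.hostname or "",
        "path": directory,
        "filename": stem,
        "file_type": extension.lstrip("."),
    }


def contextualize_url(url):
    """Pair each term of the URL with the context in which it appears."""
    tokens = []
    for context, text in url_contexts(url).items():
        # Ignore punctuation, case, and terms shorter than three characters.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if len(term) >= 3:
                tokens.append(f"{context}:{term}")
    return tokens


# Hypothetical URL used only to exercise the sketch.
tokens = contextualize_url("http://www.example.com/news/sports/scores.html")
```

Under these assumptions, the term "sports" is tagged with the path context while "scores" is tagged with the filename context, even though both occur in the same string.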
Referring now to
In the illustrated embodiment, the host device 100, which, as stated previously, may be any device suitably configured for receiving wireless or wired text fragments, receives a text fragment 208. The text fragment 208, in the illustrated embodiment, is a Hypertext Markup Language (“HTML”) webpage. In various embodiments, the HTML webpage does not display the source code, but rather only the web page. In such instances, the HTML code may remain available to the host device 100. In the illustrated embodiment, the HTML code in text fragment 208 contains the components “<title>” 210, “<h1>” 212, and “<body>” 214. Upon receiving the text fragment 208, the host device 100 may then generate individual contextualized tokens 204. In the illustrated embodiment, the host device 100 utilizes implicit contextualization to contextualize each term within the text fragment 208. More specifically, in the illustrated embodiment, the host device 100 generates a contextualized token 204 by effectively pairing a term with its implicit context, i.e., the title (<title>) 210, emphasis (<h1>) 212, or text (<body>) 214, within the HTML code. In the illustrated embodiment, the host device 100 ignores punctuation, case, and terms shorter than three characters. In various other embodiments, these guidelines may be modified. The host device 100 may then determine a feature set 216 based on the contextualized tokens 204. The feature set 216 may then be used, in various embodiments, to facilitate classification of the text fragment 208. In the illustrated embodiment, the contextualized token includes a term from the text fragment 208 and its respective implicit context. It is contemplated, however, that a contextualized token may include any number of terms and/or any number of respective contexts. In various other embodiments, implicit contextualization may be applied to other document formats, and/or other programming languages.
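For purposes of illustration, the implicit contextualization described above may be sketched in Python using the standard `html.parser` module. The class name `ImplicitContextualizer`, the function `contextualize_html`, the "context:term" token format, the rule that a term takes the context of its innermost enclosing recognized tag, and the sample page are assumptions made for this example; only the mapping of <title>, <h1>, and <body> to title, emphasis, and text comes from the description.

```python
import re
from html.parser import HTMLParser

# Mapping from HTML component to implicit context, per the illustrated embodiment.
TAG_CONTEXT = {"title": "title", "h1": "emphasis", "body": "text"}


class ImplicitContextualizer(HTMLParser):
    """Collects contextualized tokens while parsing HTML (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.open_contexts = []  # stack of contexts for currently open tags
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_CONTEXT:
            self.open_contexts.append(TAG_CONTEXT[tag])

    def handle_endtag(self, tag):
        context = TAG_CONTEXT.get(tag)
        if context and self.open_contexts and self.open_contexts[-1] == context:
            self.open_contexts.pop()

    def handle_data(self, data):
        if not self.open_contexts:
            return  # text outside any recognized component is skipped
        # Assumption: the innermost recognized tag supplies the context.
        context = self.open_contexts[-1]
        # Ignore punctuation, case, and terms shorter than three characters.
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            if len(term) >= 3:
                self.tokens.append(f"{context}:{term}")


def contextualize_html(markup):
    parser = ImplicitContextualizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens


# Hypothetical page used only to exercise the sketch.
page = (
    "<html><head><title>Daily News</title></head>"
    "<body><h1>Top Story</h1><p>Rain is expected today.</p></body></html>"
)
tokens = contextualize_html(page)
```

A stack of open contexts is used here so that a term inside <h1>, which is itself nested inside <body>, is tagged as emphasis rather than text. Other embodiments may instead emit one token per enclosing context.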
Referring to
Referring now to
Referring to
Although certain embodiments have been illustrated and described herein for purposes of description of the preferred embodiment, it will be appreciated by those of ordinary skill in the art that a wide variety of alternate and/or equivalent embodiments or implementations calculated to achieve the same purposes may be substituted for the embodiments shown and described without departing from the scope of the present invention. Those with skill in the art will readily appreciate that embodiments in accordance with the present invention may be implemented in a very wide variety of ways. This application is intended to cover any adaptations or variations of the embodiments discussed herein. Therefore, it is manifestly intended that embodiments in accordance with the present invention be limited only by the claims and the equivalents thereof.
The present application claims priority to U.S. Provisional Application No. 60/800,509, filed May 12, 2006, entitled “Methods and Apparatus for Positional and Implicit Contextualization of Text Fragments into Features,” the entire disclosure of which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
60800509 | May 2006 | US