The invention is directed to the management of content, and more particularly, to multi-lingual search and retrieval for catalogued archives of digital content.
A content management system, such as a Digital Asset Management (DAM) system, is often employed to enable multiple users to store, search, and access content that is owned or licensed by an organization. This content is generally provided as one or more media objects in a digital format, such as pictures, text, videos, graphics, illustrations, images, audio files, fonts, colors, and the like. To make content globally available, it is desirable for users to be able to search for content in a language of their choice. To accommodate multiple languages, a searching system may use multiple search indices, such as one search index for each language. However, creating and maintaining indices in multiple languages is generally time consuming and expensive.
In addition, it is desirable to include multiple categories of metadata about the content that may be searched. Some search systems use only keywords. Such keywords may comprise a controlled vocabulary that uniquely identifies each keyword, and distinguishes meanings when a keyword has multiple meanings. Keywords are an example of structured metadata. However, it is desirable to also enable searching of other categories of metadata, such as captions, titles, paragraphs, date, context, and/or other categories of metadata that may be known about content beyond just keywords. Such categories are sometimes referred to as unstructured metadata. Further, it is desirable to enable searching of all categories in multiple languages. However, creating and maintaining multiple language indices that include multiple categories is generally more time consuming and expensive than creating and maintaining a single language index.
Non-limiting and non-exhaustive embodiments of the present invention are described with reference to the following drawings. In the drawings, like reference numerals refer to like parts throughout the various figures unless otherwise specified.
The invention now will be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific exemplary embodiments by which the invention may be practiced. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Among other things, the invention may be implemented in different embodiments as methods, processes, processor readable mediums, systems, business methods, or devices. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. The following detailed description is, therefore, not to be taken in a limiting sense.
Briefly, the present invention relates to multi-lingual search and retrieval for catalogued archives of digital content. The following example embodiments are generally described in terms of a multi-lingual system that uses English as a primary language. Accordingly, these embodiments generally describe creating an English language database that associates non-English terms with English terms. These embodiments also generally describe methods and systems for evaluating non-English search terms submitted by a user, to determine English search terms that can be used to perform a search for content. Multiple categories of metadata may be searched, including structured and unstructured metadata. Submitted query terms, translated English query terms, structured metadata, unstructured metadata, search content, and/or search results can be weighted or prioritized. For example, the English language query terms themselves can be weighted based on pre-defined priorities. In addition, or alternatively, a match found with structured metadata, such as a keyword, may be given more weight than a match found with unstructured metadata, such as a caption.
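By way of illustration only, the weighting of structured matches over unstructured matches might be sketched as follows. The numeric weights, the `Asset` record, and the `score` and `rank` functions are assumptions introduced for this example; the description does not specify particular weights or a scoring formula.

```python
from dataclasses import dataclass, field

# Hypothetical weights: the text only says a structured match may count for more.
STRUCTURED_WEIGHT = 2.0
UNSTRUCTURED_WEIGHT = 1.0


@dataclass
class Asset:
    asset_id: str
    keywords: set = field(default_factory=set)  # structured metadata
    caption: str = ""                           # unstructured metadata


def score(asset, term):
    """Sum the weights of every metadata category in which the term matches."""
    needle = term.lower()
    total = 0.0
    if needle in {k.lower() for k in asset.keywords}:
        total += STRUCTURED_WEIGHT
    if needle in asset.caption.lower().split():
        total += UNSTRUCTURED_WEIGHT
    return total


def rank(assets, term):
    """Return the ids of matching assets, highest score first."""
    scored = [(score(a, term), a) for a in assets]
    hits = [(s, a) for s, a in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [a.asset_id for _, a in hits]
```

Under these assumed weights, an asset matched on both a keyword and its caption ranks ahead of one matched on the keyword alone, which in turn ranks ahead of one matched only on the caption.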
Reference is now made to
An embodiment of a multi-lingual search and retrieval system generally includes a content catalogue and a translation machine.
In accordance with the present invention, a cataloguing system may include both controlled and uncontrolled metadata. Keywords from a controlled vocabulary have precise meanings, and are represented by unique identifiers. Each controlled term is thus unique within metadata, and represents a precise concept. In one embodiment of the present invention, controlled vocabulary terms are identified by unique IDs.
In another embodiment of the present invention, referenced in the ensuing description, controlled vocabulary terms are uniquely identified by “tag-term” pairings, where the “tag” indicates a context and the term indicates the specific keyword. Thus, the tag-term pair GAN:Turkey, for example, has a tag “GAN” indicating a Generic Animal Name, and a term “Turkey”. It will be appreciated by those skilled in the art that the same term may appear with different tags, since the same term may have multiple contexts. Thus, a completely different “tag-term” pairing would be used to refer to “Turkey” as a country. In uncontrolled free-text metadata, such as titles and captions, the word “turkey” could also appear, but would lack the contextual information found in a controlled vocabulary. Contextual information and/or other meaning limitations can be identified by other unique identifiers, such as numerical codes, flags, pointers, and the like.
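As a minimal sketch, a “tag-term” pairing can be modelled as a two-field record in which the tag and the term together form the identity of a controlled keyword. The `ControlledTerm` type, the `parse_controlled` helper, and the placeholder tag `XX` are hypothetical names chosen for this example; only the `GAN:Turkey` pairing comes from the description.

```python
from typing import NamedTuple, Optional


class ControlledTerm(NamedTuple):
    tag: str   # context, e.g. "GAN" for Generic Animal Name
    term: str  # the keyword itself

    def __str__(self):
        return f"{self.tag}:{self.term}"


def parse_controlled(text: str) -> Optional[ControlledTerm]:
    """Split "TAG:term" into a ControlledTerm; return None for free text."""
    tag, sep, term = text.partition(":")
    if not sep or not tag or not term:
        return None
    return ControlledTerm(tag.strip(), term.strip())


bird = parse_controlled("GAN:Turkey")
# "XX" is a placeholder tag invented for this example, standing in for
# whatever tag the vocabulary actually uses for countries.
country = ControlledTerm("XX", "Turkey")
```

Here `bird` and `country` share the term “Turkey” yet compare as unequal because their tags differ, whereas the free-text word “turkey” carries no tag and parses to `None`.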
Controlled vocabularies also support maintenance of synonyms; i.e., different terms with the same or similar meanings. When synonymous terms exist, one of them is designated as the Preferred term and the others are designated as similar terms, sometimes known as “lead-ins”.
Metadata in a cataloguing system may exist in one or more languages and queries may be formulated in one or more languages. For the sake of clarification and definitiveness,
In accordance with the present invention, a translation system dynamically translates queries with text expressed in a first language, say, Language A, into queries with text expressed in a second language, say, Language B, based on a list of language equivalencies. Generally, language associations are complex, and not simply one-to-one. That is, a term in Language A may have multiple equivalents or similar terms in Language B, or it may not have any equivalents. In some cases, a term in Language A can be expressed in Language B only through a combination of words and phrases. In order to accommodate these and other complexities, the list of language equivalencies is flexible enough to handle a variety of linguistic situations, including compound expressions, as described in detail herein below. As used herein, the term “equivalent” generally means an associated term or terms in another language. The associated term or terms may or may not have a definition identical to that of the original term.
Referring to
To this end, a translation machine 140 mediates between client computer 110 and search engine 120. Translation machine 140 accepts as input a query expressed by the user in Language A and, using a parser 143, a list of equivalencies 145, and a query generator 147, produces as output a corresponding query expressed in Language B. Parser 143 accepts as input a query expressed in Language A and produces as output individual terms and expressions from the input query. Although parser 143 is illustrated as parsing queries expressed in Language A, and the list of equivalencies 145 is illustrated as storing equivalent terms from Language A and Language B, in general parser 143 is used to parse multiple languages, and the list of equivalencies 145 stores many language equivalencies. It will be appreciated by those skilled in the art that query generator 147 may also re-format the user's query to conform to a standard query language such as SQL. The query output by translation machine 140 is suitable as input for search engine 120. Search results may be returned in Language B or may be processed in a similar manner to provide at least some of the results data in Language A.
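The three stages of the translation machine can be sketched, under simplifying assumptions, as three small functions. The function names, the whitespace tokenization, the rule of joining alternatives with OR and terms with AND, and the omission of terms that have no equivalent are all assumptions made for this illustration and are not prescribed by the description.

```python
def parse(query):
    """Parser stand-in: split the query into lower-cased terms."""
    return query.lower().split()


def lookup(term, equivalencies):
    """Return the Language B equivalents listed for a Language A term."""
    return equivalencies.get(term, [])


def generate(equivalent_groups):
    """Query-generator stand-in: OR within a term, AND across terms."""
    clauses = []
    for group in equivalent_groups:
        if not group:
            continue  # assumption: terms with no equivalent are left out
        clauses.append("(" + " OR ".join(group) + ")")
    return " AND ".join(clauses)


def translate(query, equivalencies):
    """Run the parse, lookup, and generate stages in order."""
    groups = [lookup(term, equivalencies) for term in parse(query)]
    return generate(groups)


# A tiny illustrative list of equivalencies (Spanish to English).
EQUIVALENCIES = {
    "caballo": ["GAN:Horse", "horse"],
    "arena": ["sand"],
}
```

For example, `translate("caballo arena", EQUIVALENCIES)` yields `(GAN:Horse OR horse) AND (sand)`.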
To further clarify the description of the examples below, Language A, the user's query language, will be referred to henceforth as a non-English language (more precisely, a non-US-English language), and Language B, the catalogue language, will be referred to henceforth as the English language (more precisely, the US-English language).
Reference is now made to
It will be appreciated by those skilled in the art that an English equivalent term may be ambiguous in its meaning. For example, a search for the French term “dinde” would be translated into English as “turkey”. Since “dinde” refers only to the bird, and not to the country, it is desirable to limit the English equivalent to the controlled vocabulary term; namely, GAN:Turkey. Otherwise, the results retrieved may include irrelevant items. Equivalencies may therefore be limited to unique controlled values only, such as the unique “tag-term” combination. For non-ambiguous terms, the equivalency may include both the controlled value and its free-text equivalent. For example, the Spanish term “caballo” may be listed as being equivalent to “GAN:Horse”, or equivalent to “horse”, or equivalent to both forms. A search for digital content corresponding to GAN:Horse; namely, only those items associated with that controlled keyword, is narrower than a search for digital content corresponding to “horse”; i.e., those items with the word “horse” mentioned anywhere in controlled or uncontrolled metadata. Depending on the meaning of the non-English term, either form, or both, may be appropriate.
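The difference in scope between a controlled search and a free-text search can be illustrated with a toy catalogue. The three catalogue records, their captions, and the two search functions below are invented for this example and do not reflect any actual catalogue contents or search engine interface.

```python
# Invented sample records: controlled keywords plus an uncontrolled caption.
CATALOGUE = [
    {"id": 1, "keywords": {"GAN:Horse"}, "caption": "Mare grazing at dawn"},
    {"id": 2, "keywords": set(), "caption": "Rocking horse in a nursery"},
    {"id": 3, "keywords": {"GAN:Turkey"}, "caption": "Bird on a farm"},
]


def search_controlled(tag_term):
    """Match only items catalogued with the exact controlled keyword."""
    return [item["id"] for item in CATALOGUE if tag_term in item["keywords"]]


def search_free_text(word):
    """Match the word anywhere in controlled or uncontrolled metadata."""
    needle = word.lower()
    hits = []
    for item in CATALOGUE:
        in_keywords = any(
            needle == keyword.split(":", 1)[-1].lower()
            for keyword in item["keywords"]
        )
        in_caption = needle in item["caption"].lower().split()
        if in_keywords or in_caption:
            hits.append(item["id"])
    return hits
```

In this sketch, `search_controlled("GAN:Horse")` returns only item 1, while `search_free_text("horse")` also returns item 2, the rocking horse, which may or may not be what the user intended.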
Often words appear in queries that are less significant than other words, and may be dropped from a user's search query in order to improve the search results. Such words are referred to herein as “noise words”. Reference is now made to
Reference is now made to
When parser 143 of
pomme de terre=potato
If a French query includes the word “de” as part of the multi-word “pomme de terre”, then this multi-word is translated into the English “potato”. Otherwise, if the French query includes the word “de” but not as part of a multi-word, then the word “de” is dropped by translation machine 140.
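The interplay between multi-words and noise words can be sketched as a parser that tries the longest multi-word match first and drops a noise word only when it stands alone. The function name `parse_terms`, the whitespace tokenization, and the greedy longest-match strategy are assumptions for this example; the description does not state how the parser resolves such cases internally.

```python
def parse_terms(query, multiwords, noise_words):
    """Split a query into terms, keeping multi-words whole and dropping noise."""
    tokens = query.lower().split()
    # Longest multi-word, in tokens, bounds how far ahead to look.
    max_len = max((len(m.split()) for m in multiwords), default=1)
    terms = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in multiwords:
                terms.append(candidate)
                i += n
                matched = True
                break
        if matched:
            continue
        # A single token is kept unless it is a noise word.
        if tokens[i] not in noise_words:
            terms.append(tokens[i])
        i += 1
    return terms
```

With `multiwords={"pomme de terre"}` and `noise_words={"de"}`, the query “salade de pomme de terre” parses to the terms “salade” and “pomme de terre”: the first “de” is dropped, and the second survives as part of the multi-word.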
It may be appreciated from
Reference is now made to
Reference is now made to
As shown in
The present invention further provides a capability for a user to import external files including inter alia non-primary language dictionaries, and to create and import user-defined “complex equivalencies” as described in detail hereinbelow with reference to
Referring back to
Reference is now made to
dinde=GAN:Turkey
where the French word “dinde” is equivalent to the controlled English word GAN:Turkey.
Data 710, controlled equivalency data 720 and one or more external files including user-defined equivalencies 730, are used to generate the list of equivalencies 145, whereby foreign words are listed with English equivalents from controlled vocabulary 123, free-text, or both. Specifically, the list of equivalencies 145 may be additionally populated (i) by adding English lead-in terms to the controlled terms from list 720, (ii) by adding user-defined equivalencies, such as from an external dictionary, and (iii) by adding complex equivalencies, as described with respect to
Referring back to
arena, sand
with comma-separated terms would indicate (incorrectly) that both “arena” and “sand” are equivalents of the Spanish word “arena”; whereas an entry
arena=sand
with an equals sign indicates (correctly) that only “sand” is an equivalent of the Spanish word “arena”.
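A reader for entries of this form might look like the following sketch. It assumes that the non-English term sits to the left of the equals sign and that multiple equivalents on the right are separated by commas; the function names and the decision to reject entries lacking an equals sign are choices made for this example.

```python
def parse_entry(line):
    """Parse 'foreign=equiv1, equiv2' into (foreign, [equiv1, equiv2])."""
    foreign, sep, right = line.partition("=")
    foreign = foreign.strip()
    if not sep or not foreign:
        # Without "=", it is unclear which term is foreign and which is English.
        raise ValueError(f"entry needs 'term=equivalents': {line!r}")
    equivalents = [e.strip() for e in right.split(",") if e.strip()]
    return foreign, equivalents


def load_entries(lines):
    """Build a {foreign term: [equivalents]} table, skipping blank lines."""
    table = {}
    for line in lines:
        if not line.strip():
            continue
        foreign, equivalents = parse_entry(line)
        table.setdefault(foreign, []).extend(equivalents)
    return table
```

Under these assumptions, the entry `arena=sand` yields the Spanish term “arena” with the single equivalent “sand”, whereas the comma-separated entry `arena, sand` is rejected rather than silently misread.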
In accordance with an embodiment of the present invention, non-English terms may be flagged as “Do Not Search”. For example, some non-English terms may be obtained from an external dictionary, and a language expert may determine that certain non-English terms should not be associated with certain English terms, to avoid irrelevant search results. Multi-lingual translations generator 520 is instructed not to include such terms as non-English equivalents in the list of equivalencies 145. Referring back to
As described hereinabove, when synonymous English terms are used to catalogue digital content, one of them is designated as being a Preferred term, and the others are designated as being “lead-in” terms. For example, the English expression “terrorist attack” is a lead-in to the English Preferred term “act of terrorism”. It may be appropriate to include lead-ins as equivalencies when the Preferred terms appear in the list of equivalencies 145. For example, the French expression “acte de terrorisme” is equivalent to the English expression “act of terrorism”. Since “terrorist attack” is a lead-in to “act of terrorism”, it may be appropriate to add an equivalency between the French “acte de terrorisme” and the English “terrorist attack” in list 145; i.e., the entry
acte de terrorisme=act of terrorism, terrorist attack
may be generated in list 145.
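The generation of such an entry can be sketched as an expansion step over the list of equivalencies. The function name, the dictionary layout, and the `flagged` set, which stands in for a per-lead-in decision on whether to include it, are hypothetical and serve only to illustrate the idea.

```python
def expand_with_lead_ins(equivalencies, lead_ins, flagged):
    """Append flagged lead-ins after each Preferred term that has them.

    equivalencies: {foreign term: [English Preferred terms]}
    lead_ins:      {English Preferred term: [lead-in terms]}
    flagged:       set of lead-ins approved for inclusion (hypothetical)
    """
    expanded = {}
    for foreign, english_terms in equivalencies.items():
        result = []
        for term in english_terms:
            if term not in result:
                result.append(term)
            for lead_in in lead_ins.get(term, []):
                if lead_in in flagged and lead_in not in result:
                    result.append(lead_in)
        expanded[foreign] = result
    return expanded


base = {"acte de terrorisme": ["act of terrorism"]}
lead_in_table = {"act of terrorism": ["terrorist attack"]}
expanded = expand_with_lead_ins(base, lead_in_table, {"terrorist attack"})
```

Here `expanded` maps “acte de terrorisme” to both “act of terrorism” and “terrorist attack”; passing an empty `flagged` set leaves the original entry unchanged.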
In accordance with an embodiment of the present invention, each English lead-in to a Preferred English term may be flagged as “Include Lead-In in List of Equivalencies”. Lead-in terms may be individually accessed for flagging within the Termulator user interface shown in
As mentioned hereinabove, data in the searchable catalogue may include controlled vocabulary keywords 123 with unique meanings, or free-text 127. When an English term is ambiguous (e.g., Turkey), it is desirable to limit an equivalency to the controlled value only. For non-ambiguous terms, equivalencies should also include free-text values. For example, if translation machine 140 receives as input a Spanish query with the term “caballo”, it may include “GAN:Horse” (i.e., the unique controlled vocabulary term) or “horse” (i.e., free-text), or both, within its English query output.
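The rule stated above, controlled value only for ambiguous terms and controlled value plus free-text otherwise, might be expressed as follows. The `ambiguous` parameter is a hypothetical marker supplied by whoever maintains the vocabulary, and deriving the free-text value by lower-casing the term is likewise an assumption of this example.

```python
def equivalency_entry(foreign, controlled, ambiguous):
    """Build one 'foreign=equivalents' entry from a controlled tag-term.

    ambiguous: hypothetical marker; True limits the entry to the controlled
    value, False also adds the free-text form of the term.
    """
    _tag, _, term = controlled.partition(":")
    equivalents = [controlled]
    if not ambiguous:
        equivalents.append(term.lower())
    return f"{foreign}={', '.join(equivalents)}"
```

Thus `equivalency_entry("dinde", "GAN:Turkey", True)` produces `dinde=GAN:Turkey`, while `equivalency_entry("caballo", "GAN:Horse", False)` produces `caballo=GAN:Horse, horse`.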
In accordance with an embodiment of the present invention, controlled vocabulary keywords may be flagged as “Include Tag in Equivalency File”. Button 670 from the Termulator interface in
It will be appreciated by those skilled in the art that the data processing flow illustrated in
TABLE I summarizes the various control flags described hereinabove, used by multi-lingual translations generator 520 in generating the list of equivalencies 145, in accordance with an embodiment of the present invention. The control flags in TABLE I are used to automate generation of a complete list of equivalencies 145 from a smaller list provided by a vocabulary expert or imported from an outside source. These sets of control flags are designated as control flags 529 in
Reference is now made to
(TDS:winter AND PICT:landscape) OR (winter AND landscape), and the German term “bahntunnel” is equivalent to the Boolean expression (tracks OR train) AND tunnel.
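One way a Boolean equivalent might be carried into a generated English query is to wrap each equivalent in parentheses so that its internal AND and OR operators keep their grouping. The function below, the AND combination across terms, and the `schnee` entry are assumptions made for this illustration; only the `bahntunnel` expression is taken from the description.

```python
def generate_query(terms, equivalencies):
    """Combine per-term Boolean equivalents into one English query string."""
    clauses = []
    for term in terms:
        expression = equivalencies.get(term)
        if expression is None:
            continue  # assumption: terms without an equivalent are skipped
        # Parentheses preserve the grouping inside each compound equivalent.
        clauses.append(f"({expression})")
    return " AND ".join(clauses)


COMPOUND = {
    "bahntunnel": "(tracks OR train) AND tunnel",
    "schnee": "snow",  # hypothetical entry added for the example
}
```

Without the outer parentheses, combining “bahntunnel” with a second term could change the meaning of the expression, since the trailing `AND tunnel` would no longer be clearly scoped to the first term.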
It is noted that equivalents can contain references to unique controlled vocabulary terms, such as TDS:Winter, and to general free-text, such as “winter”. (In the current embodiment of the invention, TDS is a controlled “tag” that refers to “Time, Day, or Season”.) When translation machine 140 encounters expressions such as those above in the list of equivalencies 145, it incorporates the Boolean logic into the English query generated by query generator 147. Compound equivalencies, such as those above, may be imported automatically into the list of equivalencies 145 by a user-defined complex expression file 740, as indicated in
Reference is now made to
Reference is now made to
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made to the specific exemplary embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
This application claims the benefit of U.S. Provisional Patent Application 60/886,649 filed Jan. 25, 2007; the contents of which are hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20080275691 A1 | Nov 2008 | US |
Number | Date | Country | |
---|---|---|---|
60886649 | Jan 2007 | US |