This application claims priority to and the benefit of Korean Patent Application No. 2007-129155, filed Dec. 12, 2007, the disclosure of which is incorporated herein by reference in its entirety.
1. Field of the Invention
The present invention relates to a system and method for searching for documents, and more particularly, to a system and method for searching for document format and text information based on a search policy set by an administrator.
2. Discussion of Related Art
Conventional document search or categorization systems mainly employ a machine learning mechanism of the Artificial Intelligence (AI) field. In general, a supervised learning mechanism using learning data to which category information has already been attached is most frequently used. It is known that the performance of a document categorization system is enhanced when a conventional learning algorithm is used. However, to achieve the enhanced performance, a sufficient amount of learning data must be manually categorized by a person. In addition, a user cannot search for a specific document format that he/she wants using such document categorization technology.
Conventional web services for searching for a document over the Internet may be roughly classified into two types. One is a classified-list type such as Yahoo, and the other is a query-based engine type such as Altavista, HotBot, etc., which is more general. Both types have databases including reproductions of some webpages or other resources. A classified-list-type categorization method provides systematic sorting categories or an arrangement of resources linked in very complex layers. A query-based engine operates according to a search algorithm based on text input by a user. In general, the classified-list type may also support search based on queries about a category name and a resource name, and the query-based engine service may also provide categorized results. However, both types merely perform fragmentary search based on keyword or link information.
The present invention is directed to providing a system and method for enabling an administrator or user to more thoroughly search for a desired document according to a search policy based on document format and text information not included in a conventional document search system.
One aspect of the present invention provides a system for searching for a document based on a policy, the system including: a document database for storing document files; a document format and text filter for extracting document format information and text information from a document newly stored in the document database and adding the extracted information to the document database; a document format policy module for setting a document format search policy according to an instruction from an administrator; a document text policy module for setting a document text search policy according to an instruction from the administrator; a document format information search module for searching for a document having a document format matching the set document format search policy in the document database; and a document text information search module for searching for a document having a text matching the set document text search policy in the document database.
Another aspect of the present invention provides a method of searching for a document based on a policy, the method including: receiving at least one of a document format search policy and a text search policy from an administrator; monitoring whether or not a new document is stored in a document database; when the new document is stored, extracting document format information and text information from the new document and adding the extracted information to the document database; and searching for a document having at least one of document format information and text information matching the search policy in the document database.
The above and other objects, features and advantages of the present invention will become more apparent to those of ordinary skill in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
Hereinafter, exemplary embodiments of the present invention will be described in detail. However, the present invention is not limited to the embodiments disclosed below, but can be implemented in various forms. The following embodiments are described in order to enable those of ordinary skill in the art to embody and practice the present invention.
The document database 110 stores document files of various formats, which are collected online. Types of document files collected according to an exemplary embodiment of the present invention may be HWP 3.x, Wordian, 2000 and later; Microsoft Word 95, 97, 2000 and XP; Microsoft Powerpoint 95, 97, 2000 and XP; Microsoft Excel 95, 97, 2000 and XP; Haansoft Hangul 2.x, 3.x, 96, 97, Wordian and 2002; Adobe Acrobat 4.x and 5.x (supporting Portable Document Format (PDF) 1.x); Rich Text Format (RTF); Handysoft Arirang (HWD); a Hypertext Markup Language (HTML) document; a Mime HTML (MHT) document; a text document; a Moving Picture Experts Group (MPEG) layer 3 (MP3) tag; a ZIP file; an OpenOffice document file; and so on. However, the present invention is not limited to these document files.
The document format and text filter 120 extracts format information and text information from a document stored in the document database 110; generates a format file containing the extracted format information and a text file containing the extracted text information; and adds the extracted information to the document database 110. The document format information contained in the format file may include a document title, a writer, header/footer information, a page number, and so on. The text information contained in the text file includes text information in the body of the document.
The document format policy setting module 130 sets, modifies and deletes a document format search policy according to an instruction from the administrator, and the document text policy setting module 140 sets, modifies and deletes a text search policy.
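By way of non-limiting illustration, the setting, modifying and deleting of a search policy described above may be sketched as follows. The class name PolicyStore, its method names, and the representation of a policy as a dictionary are assumptions made only for this illustration and are not taken from the embodiment.

```python
class PolicyStore:
    """Hypothetical in-memory store for search policies, keyed by policy name."""

    def __init__(self):
        self._policies = {}

    def set_policy(self, name, policy):
        """Register a new policy; duplicates are rejected so 'set' and 'modify' stay distinct."""
        if name in self._policies:
            raise ValueError(f"policy {name!r} already exists")
        self._policies[name] = dict(policy)

    def modify_policy(self, name, **changes):
        """Update fields of an existing policy."""
        if name not in self._policies:
            raise KeyError(name)
        self._policies[name].update(changes)

    def delete_policy(self, name):
        """Remove a policy; raises KeyError if it does not exist."""
        del self._policies[name]

    def get_policy(self, name):
        """Return a copy of the stored policy."""
        return dict(self._policies[name])
```

In such a sketch, one store instance could hold document format search policies and another could hold text search policies, mirroring the two policy setting modules.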
The document format information search module 150 searches for a document having document format matching the document format search policy set by the administrator, in the document database, and then provides the search result to the administrator. The document text information search module 160 searches for a document matching the text search policy set by the administrator and then provides the search result to the administrator. Although not shown in
The main modules will be described in further detail below with reference to
The document format and text filter 120 stores document format information, such as <header>, <format . . . >, <footer>, <page number>, etc., on the document “A.doc,” together with basic information on a writer, a time of writing, etc., of the document, in a file “A_doc.form,” which is a format file. In addition, information on the entire text, such as “1. Introduction . . . ”, included in the body, is stored in a file “A_doc.txt”, which is a text file.
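By way of non-limiting illustration, the derivation of the format file name and the text file name from the document name, and the separation of format information from body text, may be sketched as follows. The function names, the dictionary keys, and the assumption that the document has already been parsed into a dictionary are hypothetical; the embodiment specifies only the example names "A_doc.form" and "A_doc.txt".

```python
from __future__ import annotations

from pathlib import PurePath

# Hypothetical keys for format information; the embodiment names a title, a writer,
# a time of writing, header/footer information and a page number.
FORMAT_KEYS = ("title", "writer", "written_at", "header", "footer", "page_number")


def derived_names(document_name: str) -> tuple[str, str]:
    """Map a document name to its (format file, text file) names, e.g. A.doc -> A_doc.form, A_doc.txt."""
    path = PurePath(document_name)
    extension = path.suffix.lstrip(".")
    base = f"{path.stem}_{extension}" if extension else path.stem
    return f"{base}.form", f"{base}.txt"


def split_document(parsed: dict) -> tuple[dict, str]:
    """Separate format information from the body text of an already parsed document."""
    format_info = {key: parsed[key] for key in FORMAT_KEYS if key in parsed}
    return format_info, parsed.get("body", "")
```

For the document "A.doc", derived_names returns the pair "A_doc.form" and "A_doc.txt", matching the file names given above.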
An example of a document type that can be set by the document format policy setting module 130 is shown in a table below.
1. Introduction →5 points
I was born with a historical mission in this country.- - - - -•••.
<Figure>
4. Conclusion →5 points
Thanks for listening.
→ total weight 10 points (threshold value: 7 points)
Since the total weight of 10 points is higher than the threshold value of 7 points, the corresponding document matches text policy 3. A table below is an example of a policy that can be set using the document text policy setting module 140 according to an exemplary embodiment of the present invention.
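By way of non-limiting illustration, the weighted matching of the above example may be sketched as follows. The function names, the use of plain substring matching, and the treatment of a total equal to the threshold value as a match are assumptions made for this illustration; the embodiment states only the weights, the total and the threshold value.

```python
from __future__ import annotations


def score_text(text: str, weights: dict[str, int]) -> int:
    """Sum the points of every weighted phrase that appears in the text."""
    return sum(points for phrase, points in weights.items() if phrase in text)


def matches_policy(text: str, weights: dict[str, int], threshold: int) -> bool:
    """Assumption: a document matches when its total weight reaches the threshold value."""
    return score_text(text, weights) >= threshold


# Weights and threshold value taken from the example of text policy 3.
weights = {"1. Introduction": 5, "4. Conclusion": 5}
document_text = (
    "1. Introduction\n"
    "I was born with a historical mission in this country.\n"
    "4. Conclusion\n"
    "Thanks for listening.\n"
)
total = score_text(document_text, weights)             # 5 + 5 = 10
matched = matches_policy(document_text, weights, 7)    # True, since 10 >= 7
```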
A document database is monitored (step 520), and it is determined whether or not a new file is stored in the document database (step 530).
When a new document file is stored in the document database, document format and text information is extracted from the new document and is further added to the document database (step 540).
A document having document format information matching the set document format search policy is searched for in the document database (step 550).
In a similar way, a document having text information matching the set text search policy is searched for in the document database (step 560).
The search result is provided to the administrator, and thus the administrator can actively and constantly complement previously set document format and text search policies.
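By way of non-limiting illustration, steps 520 to 560 may be sketched as follows using an in-memory stand-in for the document database. The class and function names, the queue of pending documents used to detect newly stored files, the exact-match comparison of format fields, and the keyword-based text search are all assumptions made for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    name: str
    format_info: dict
    text: str


@dataclass
class DocumentDatabase:
    pending: list = field(default_factory=list)   # newly stored documents, not yet filtered
    records: list = field(default_factory=list)   # extracted format and text information

    def store(self, name: str, parsed: dict) -> None:
        self.pending.append((name, parsed))


def process_new_documents(db: DocumentDatabase) -> int:
    """Steps 520 to 540: detect new documents, extract information, add it to the database."""
    processed = 0
    while db.pending:
        name, parsed = db.pending.pop(0)
        format_info = {key: value for key, value in parsed.items() if key != "body"}
        db.records.append(Record(name, format_info, parsed.get("body", "")))
        processed += 1
    return processed


def search_by_format(db: DocumentDatabase, format_policy: dict) -> list[str]:
    """Step 550: documents whose format information contains every field of the policy."""
    return [
        record.name
        for record in db.records
        if all(record.format_info.get(key) == value for key, value in format_policy.items())
    ]


def search_by_text(db: DocumentDatabase, keyword: str) -> list[str]:
    """Step 560: documents whose body text contains the keyword."""
    return [record.name for record in db.records if keyword in record.text]
```

In this sketch, the lists returned by search_by_format and search_by_text stand in for the search result that is provided to the administrator.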
The present invention automatically extracts format and text information from a document and performs document search according to a search policy, based on document format and text, set by an administrator. It thereby enables the administrator to more thoroughly search for a desired document.
While the invention has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
10-2007-0129155 | Dec 2007 | KR | national |