This is a U.S. national stage application of PCT Application No. PCT/CN2019/118991 under 35 U.S.C. 371, filed Nov. 15, 2019 in Chinese, claiming priority to Chinese Patent Application No. 201910447448.0, filed May 27, 2019, all of which are hereby incorporated by reference.
The present invention relates to the technical field of service computing and, in particular, to a service packaging method based on web page segmentation and a search algorithm.
With the development of the Internet, service providers tend to present their service data through web pages. However, while web pages are convenient for readers, they restrict developers' use of the underlying source data. A service packaging system packages the data in web pages into a service and provides a RESTful API for calling that service, so that developers can use the data during development.
Web page block segmentation technology analyzes and processes existing web page documents. Specifically, it segments a whole web page into multiple blocks containing information data, so as to support functions such as advertisement removal and main information extraction. The main approaches include page block segmentation based on node entropy, page block segmentation based on visual features, and page block segmentation based on content distance. Web page block segmentation technology has been widely used in various fields of the Internet industry.
A service is a collection of APIs with multiple attributes; it belongs to a specific service class and is provided by a developer or a class of developers.
An API is a set of predefined functions designed to give applications and developers access to a set of routines of a certain piece of software or hardware without having to access the source code or understand the details of its inner workings. An API has multiple input and output attributes, belongs to a specific developer, and is subordinate to a specific service.
A web crawler (also known as a web spider or web robot, and more commonly as a web chaser in the FOAF community) is a program or script that automatically crawls information on the World Wide Web according to certain rules. Other less commonly used names include ant, auto-index, simulator, and worm.
The object of the present invention is to provide a service packaging method based on web page segmentation and search algorithm. The present invention greatly increases the efficiency of acquiring data by a user.
To realize the object of the present invention, the present invention provides the following technical solution:
a service packaging method based on web page segmentation and a search algorithm, comprising the following steps:
a service extraction stage, comprising dynamic packaging and/or static packaging; for dynamic packaging, parsing a dynamic web page, tagging forms that possibly exist in parsed dynamic form information, and tagging and defining, by a user, desired forms among the forms that possibly exist; for static packaging, parsing a static web page, blocking and tagging parsed static forms, and selecting and defining, by the user, desired blocks, and filling in a name, description information and an extraction rule of a service;
and a service calling stage, inputting, by the user, related information for calling a service, and generating, by a back end system, a respective service according to the received related information for calling the service and according to an extraction rule, and returning the service to a front end.
The present invention provides a service packaging method based on web page segmentation and a search algorithm. The method automatically analyzes a page and packages it into a service through a module that requires only several clicks and a small amount of input. It also generates crawler rules and returns the corresponding structured data according to the user's requirements, which greatly improves the efficiency of data acquisition by the user.
In order to explain the technical solutions of the embodiments of the present invention more clearly, the following will briefly introduce the drawings that need to be used in the embodiments of the present invention. Obviously, the drawings described below are only some embodiments of the present invention. For those of ordinary skill in the art, without creative work, other drawings can be obtained based on the drawings.
For a better understanding of the purpose, technical solutions and advantages of the present invention, the following is a further detailed description of the present invention in combination with the attached drawings and embodiments. It should be understood that the embodiments described herein are intended only to explain the present invention and do not limit the claimed scope of the present invention.
The present invention provides a service packaging method based on web page segmentation and a search algorithm, comprising a service extraction stage and a service calling stage. Service extraction is a module by which a packager can package a web page into a service with several clicks and a small amount of input. Service calling refers to calling a packaged service, and provides several parameters to satisfy input and screening requirements. These parameters include uniform parameters as well as specific parameters generated by different web pages.
Taking the page http://www.ceic.ac.cn/history as an example, the running rule of the service packaging method based on web page segmentation and search algorithm is explained.
As shown in
Stage 1: Service Extraction
Service extraction mainly contains two functions: static packaging and dynamic packaging. Static packaging applies to web pages in which data is directly presented; dynamic packaging applies to web pages that require the input of certain query content and the clicking of a button to present data.
For a user, service extraction only requires a few clicks and filling in the service description information. Service extraction implicitly applies one of two different extraction rules depending on the web page. For static web pages, in which the data is directly presented in the page, static web page extraction is performed; for dynamic web pages, in which the user needs to enter input and click to present data, dynamic web page extraction is performed.
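The text does not say how the system decides between the two extraction paths. As a minimal sketch, assuming (hypothetically) that a page counts as dynamic when it contains a form with at least one input control, the choice could be made as follows. The names `FormDetector` and `choose_packaging` are illustrative and do not come from the original.

```python
from html.parser import HTMLParser


class FormDetector(HTMLParser):
    """Records whether the page contains a <form> holding an input control."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_query_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in ("input", "select", "textarea"):
            self.has_query_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def choose_packaging(html: str) -> str:
    """Return 'dynamic' if the page appears to need form input, else 'static'.

    Assumption: presence of a form with an input control marks a dynamic page.
    """
    detector = FormDetector()
    detector.feed(html)
    return "dynamic" if detector.has_query_form else "static"
```

A real system would likely need a richer heuristic, since some pages render query controls outside a `form` tag.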
The dynamic packaging comprises the following steps:
The JSON file is parsed as follows:
Each element in the *forms is parsed as follows, such as forms[0]:
The log information has the format "Logging + timestamp → + what is being processed". If the procedure succeeds, the last line is "200 + JSON file address of the web page table rule + web page screenshot address"; if it fails, the last line is "503 + Procedure failed, please retry!", such as:
The position of each tagged element needs to be obtained during tagging. JavaScript's getBoundingClientRect function is used to obtain the width and height of the element and its position with respect to the image.
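A caller of the parsing step would need to read the final log line to learn whether parsing succeeded. The sketch below assumes, since the exact delimiter is not recoverable from the text, that the fields on the final line are separated by whitespace. `ParseOutcome` and `read_final_log_line` are hypothetical names, and the paths in the usage example are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseOutcome:
    ok: bool
    rule_json_address: Optional[str] = None
    screenshot_address: Optional[str] = None
    message: Optional[str] = None


def read_final_log_line(line: str) -> ParseOutcome:
    """Interpret the last log line as a success (200) or failure (503) record.

    Assumption: the status code and the fields are separated by whitespace.
    """
    status, _, rest = line.strip().partition(" ")
    if status == "200":
        parts = rest.split()
        if len(parts) != 2:
            raise ValueError("expected a rule JSON address and a screenshot address")
        return ParseOutcome(True, parts[0], parts[1])
    if status == "503":
        return ParseOutcome(False, message=rest)
    raise ValueError(f"unrecognized status: {status!r}")
```

For example, `read_final_log_line("503 Procedure failed, please retry!")` yields a failed outcome whose message is the retry prompt.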
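getBoundingClientRect reports an element's rectangle relative to the browser viewport, so mapping it onto a screenshot may require adding the scroll offset and applying the screenshot's pixel scale. The following is a minimal sketch of that conversion, assuming the rectangle has already been sent to the back end as a dictionary with `left`, `top`, `width`, and `height` keys. The `scroll_x`, `scroll_y`, and `scale` parameters are assumptions about how the screenshot was taken, not details from the original.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float


def rect_to_image_box(rect: dict, scroll_x: float = 0.0,
                      scroll_y: float = 0.0, scale: float = 1.0) -> Box:
    """Convert a viewport-relative rectangle into screenshot pixel coordinates.

    rect: dict with 'left', 'top', 'width', 'height' (as in a DOMRect).
    scroll_x / scroll_y: page scroll offsets when the rect was measured.
    scale: screenshot pixels per CSS pixel (assumed uniform).
    """
    return Box(
        x=(rect["left"] + scroll_x) * scale,
        y=(rect["top"] + scroll_y) * scale,
        width=rect["width"] * scale,
        height=rect["height"] * scale,
    )
```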
The static packaging comprises the following steps:
The following is an example of a service information file:
The format of the file is described as follows:
The "type" field takes one of three values: text, img and link, representing text, picture and hyperlink types respectively. parent_id is used by the back end to identify the element when it is queried in the service calling stage.
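The constraint on "type" can be illustrated with a small validation sketch. Only `type` and `parent_id` are named in the text; the `name` field and the class name `ServiceField` are hypothetical, and the type of `parent_id` is assumed to be a string.

```python
from dataclasses import dataclass
from typing import Optional

# The three element types named in the description.
ALLOWED_TYPES = {"text", "img", "link"}


@dataclass
class ServiceField:
    name: str                        # hypothetical field label
    type: str                        # one of: text, img, link
    parent_id: Optional[str] = None  # identifier used when the element is queried

    def __post_init__(self):
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"type must be one of {sorted(ALLOWED_TYPES)}, got {self.type!r}")
```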
An example and description of a crawler rule JSON file are as follows:
Text information needs to be located based on both css_selector and rank information. img covers both image information in an img tag and background image information in the CSS, which are extracted according to different rules: if the image is carried by an img tag, the link address of the img tag is extracted; if the image is a background image, the background-image property in the CSS of the element where the image is located is searched and the corresponding link is extracted.
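The two image rules and the selector-plus-rank lookup can be sketched as below. This is an illustration under assumptions: `rank` is taken to be a zero-based index among the elements matched by css_selector, the element's CSS is assumed to be available as a style string, and the function names are hypothetical.

```python
import re
from typing import Optional, Sequence

# Matches url(...) inside a background or background-image declaration.
_BG_URL = re.compile(
    r"background(?:-image)?\s*:[^;]*?url\(\s*['\"]?([^'\")\s]+)['\"]?\s*\)",
    re.IGNORECASE,
)


def pick_by_rank(matches: Sequence[str], rank: int) -> str:
    """Pick one element among those matched by a CSS selector.

    Assumption: rank is a zero-based position in document order.
    """
    return matches[rank]


def extract_image_link(tag: str, attrs: dict, style: str = "") -> Optional[str]:
    """Return the image link for an element, or None if there is none.

    img tags yield their src attribute; other elements yield the URL of
    their CSS background image, if one is declared in `style`.
    """
    if tag.lower() == "img":
        return attrs.get("src")
    match = _BG_URL.search(style)
    return match.group(1) if match else None
```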
The background file can parse the page according to this parsing rule.
The service generation back end needs to store the service information and the extraction rule defined by the user in its own database, and to query the service information again when the service is called.
In this example, the generated service API address is: URL://call_service/79, which indicates that the generated service ID is 79. This interface follows the REST API conventions, so the user can use it to query, call, delete, and modify the service information.
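The store-then-look-up behavior can be sketched with an in-memory SQLite database. The table layout, the column names, and the choice of SQLite are assumptions for illustration; the text does not specify which database or schema is used.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) services table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS services ("
        "id INTEGER PRIMARY KEY, info TEXT NOT NULL, rule TEXT NOT NULL)"
    )
    return conn


def save_service(conn: sqlite3.Connection, info: dict, rule: dict) -> int:
    """Persist service information and its extraction rule; return the service ID."""
    cur = conn.execute(
        "INSERT INTO services (info, rule) VALUES (?, ?)",
        (json.dumps(info), json.dumps(rule)),
    )
    conn.commit()
    return cur.lastrowid


def load_service(conn: sqlite3.Connection, service_id: int):
    """Fetch the stored information and rule when the service is called."""
    row = conn.execute(
        "SELECT info, rule FROM services WHERE id = ?", (service_id,)
    ).fetchone()
    if row is None:
        raise KeyError(service_id)
    return json.loads(row[0]), json.loads(row[1])
```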
Stage 2: Service Calling
Service calling refers to calling a packaged service, and provides several parameters to satisfy input and screening requirements. These parameters include uniform parameters as well as specific parameters generated by different web pages.
The user can call the corresponding service by checking the service information and writing a RESTful API request.
The specific steps are as follows:
The query parameters include each of the form input options, which are used to perform the input query, and each of the returned result fields, so that the user can screen the returned results by the corresponding parameter. In addition, the _max_page parameter sets the maximum number of pages to capture, which addresses pagination in the web page; the default is 5 pages of data.
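One way to read this is that the back end separates an incoming query into three groups: the page limit, the form inputs, and the result filters. The sketch below assumes the back end already knows the form field names from the extraction stage, and that result rows are flat dictionaries compared as strings. `split_query` and `filter_rows` are hypothetical names.

```python
from typing import Any, Dict, Iterable, List, Tuple

DEFAULT_MAX_PAGE = 5  # default number of pages stated in the description


def split_query(params: Dict[str, str],
                form_fields: Iterable[str]) -> Tuple[int, Dict[str, str], Dict[str, str]]:
    """Split query parameters into (max_page, form inputs, result filters)."""
    form_names = set(form_fields)
    max_page = int(params.get("_max_page", DEFAULT_MAX_PAGE))
    inputs = {k: v for k, v in params.items() if k in form_names}
    filters = {k: v for k, v in params.items()
               if k != "_max_page" and k not in form_names}
    return max_page, inputs, filters


def filter_rows(rows: List[Dict[str, Any]],
                filters: Dict[str, str]) -> List[Dict[str, Any]]:
    """Keep only the rows whose fields equal every requested filter value."""
    return [row for row in rows
            if all(str(row.get(key)) == value for key, value in filters.items())]
```

Nested result parameters would need a lookup that walks into sub-records; this sketch keeps rows flat for brevity.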
For example, the url called by the user is:
Here, _max_page is the maximum number of pages to capture. weidu1 and jingdu2 are the input query parameters of the form; Magnitude (M) and link_5. reference position are the returned result parameters of the service, where "link_5. reference position" refers to the reference position parameter under link_5.
The meaning of the above link is: capture the paged content, up to 7 pages, with the input parameter weidu1 set to 30 and jingdu2 set to 20; when outputting, select and display only the data with magnitude (M)=3.5 and link_5. reference position=Gengma County, Lincang City in Yunnan Province.
If the parameters contain Chinese characters, they must be encoded in UTF-8.
If the form needs to be filled in, the form content is filled in according to the query parameter values supplied by the user. The supported input box types are the form elements supported by HTML5, such as:
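In practice, this means percent-encoding the UTF-8 bytes of each parameter when building the calling URL. The sketch below uses Python's standard `urllib.parse`. The base address and the `place` parameter in the usage example are placeholders, not values from the original.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_call_url(base: str, params: dict) -> str:
    """Append query parameters to a service address, percent-encoding as UTF-8."""
    query = urlencode(params, quote_via=quote, encoding="utf-8")
    return f"{base}?{query}"


def decode_query(url: str) -> dict:
    """Decode the query string back into a dict mapping each name to a list of values."""
    return parse_qs(urlsplit(url).query, encoding="utf-8")
```

For example, `build_call_url("http://example.com/call_service/79", {"_max_page": 7, "place": "云南"})` produces an ASCII-only URL, and `decode_query` recovers the original Chinese value.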
Wherein an example of the returned result is as follows:
The service packaging method based on web page segmentation and a search algorithm has been described above. The packaging method aims to analyze any type of web page and automatically parse out the main information that may exist in the page; to block that information and parse out the format of each block, so that after a simple modification by the user the page can be converted into a service that can be called directly; and to return the formatted, structured data that the user needs. In addition, the present invention provides a dynamic form query function: if a dynamic form exists in the page, the form query boxes can be converted into query parameters for the user. Compared with a traditional crawler, the present invention can automatically analyze the page, generate crawler rules, and return the corresponding structured data according to the user's requirements. Therefore, the present invention greatly increases the efficiency of acquiring data by a user.
The foregoing is merely illustrative of the preferred embodiments of the present invention and it should be understood that the embodiments described above are only the most preferable embodiments of the present invention and are not intended to be limiting of the present invention, and various changes and modifications may be made by those skilled in the art. Any modifications, equivalent substitutions, improvements, and the like within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

| Number | Date | Country | Kind |
|---|---|---|---|
| 201910447448.0 | May 2019 | CN | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2019/118991 | 11/15/2019 | WO | |

| Publishing Document | Publishing Date | Country | Kind |
|---|---|---|---|
| WO2020/238070 | 12/3/2020 | WO | A |

| Number | Date | Country |
|---|---|---|
| 20220245203 A1 | Aug 2022 | US |