The present disclosure relates to the field of Internet technology, and specifically to a simple reflex intelligent agent for crawling literature data and a method of crawling literature data.
Technology literature data not only reflect the academic accomplishments of a researcher, but are also a core indicator for assessing the institutional strength of universities and colleges. With the passage of time and the development of Internet technology, technology literature data have grown explosively, and the impact factors of academic journals change dynamically. Efficiently obtaining technology literature data in real time to support disciplinary assessment and scholar profiling has therefore become an urgent problem to be solved.
Conventional web crawlers are designed to simulate user actions in a browser and automatically extract valuable web data for the user from a specific website. Because data acquisition by a web crawler consumes the same website resources as a real user's access, crawling a website that stores a huge amount of technology literature data, such as Web of Science, consumes far more resources than a real user's access.
Conventional strategies for dealing with the anti-crawler measures of the Web of Science website mainly rely on manual operations, such as manually reducing the access frequency of web crawler tools, resetting the IP address of web crawlers, and performing human-computer verification by hand. Manual operation not only requires staff to have certain professional knowledge and competence, but also consumes a lot of time, which in turn affects the speed, accuracy and comprehensiveness of obtaining technology literature data.
In summary, there is an urgent need for a simple reflex intelligent agent and method for crawling literature data to solve the problems in the prior art.
An object of the present disclosure is to provide a simple reflex intelligent agent for crawling literature data and a method of crawling literature data, with the following specific technical solutions:
A simple reflex intelligent agent for crawling literature data includes a performance module, an environment module, a sensing module, and an actuator module.
Preferably, an expression for the comprehensiveness indicator is as follows:
Preferably, an expression for the accuracy indicator is as follows:
Preferably, an expression for the performance objective function is as follows:
Preferably, an expression for the environment collection is as follows:
Preferably, the sensing module continuously monitors the system time and the number of journals in the environment collection with the following expression:
Preferably, the simple reflex intelligent agent further includes a storage module, configured for storing crawled literature data and log information during crawling of the literature data.
In addition, the present disclosure further includes a method for crawling literature data, which is applied in the above-mentioned simple reflex intelligent agent to crawl the literature data. When the sensing module monitors a change in the system time and the number of journals, the actuator module sets a target based on the performance objective function constructed by the performance module and automatically crawls the literature data.
Application of the technical solutions of the present disclosure has the following beneficial effects:
The present disclosure implements literature data crawling by constructing a simple reflex intelligent agent for crawling literature data. The simple reflex intelligent agent can achieve comprehensive and accurate literature data crawling by establishing a comprehensiveness indicator and an accuracy indicator of literature data, constructing a performance objective function based on the comprehensiveness indicator and the accuracy indicator, and setting targets based on the performance objective function via an actuator module.
In addition to the purposes, features and advantages described above, the present disclosure has other purposes, features and advantages. The present disclosure will be described in further detail below with reference to the drawings.
The accompanying drawings, which form part of this application, are used to provide a further understanding of the present disclosure, and the schematic embodiments of the disclosure and the description thereof are used to explain the present disclosure and do not constitute an improper limitation of the present disclosure. In the accompanying drawings:
Conventional strategies for dealing with the anti-crawler measures of Web of Science mainly rely on manual operations, such as manually reducing the access frequency of web crawler tools, resetting the IP address of web crawlers, and performing human-computer verification by hand. Manual operation not only requires staff to have certain professional knowledge and competence, but also consumes a lot of time, which in turn affects the speed, accuracy and comprehensiveness of obtaining technology literature data.
To overcome the deficiencies of the above-mentioned related art, the present disclosure provides a simple reflex intelligent agent and a method for crawling literature data, which solve the technical problems that existing web crawlers for technology literature data require manual intervention, crawl data incompletely, and crawl data with low accuracy.
Embodiments of the disclosure are described in detail below in conjunction with the accompanying drawings, but the disclosure may be implemented in various different ways as defined and covered by the claims.
As shown in
Herein, the paper crawling performance module 101 is configured to construct a paper information crawling performance objective function, and the paper information crawling performance objective function is constructed by: taking the number of the published papers of journals in the Web of Science database as a benchmark to construct a paper information crawling comprehensiveness indicator of the paper intelligent agent 100; analyzing field information included in each paper in the Web of Science database to construct a paper information crawling accuracy indicator of the paper intelligent agent 100; establishing the paper information crawling performance objective function based on the comprehensiveness indicator and the accuracy indicator.
The field information of the paper in this embodiment includes literature title, literature type, language, keywords, abstract, references, reference quantity, digital object identifier, author, corresponding author's address, Research ID, publication name, publisher, publication date, etc.
The paper crawling environment module 102 is configured to analyze the number of the published papers of journals and the periodic characteristics of Web of Science database updates, and to construct a paper information environment collection for the paper intelligent agent 100.
The paper crawling sensing module 103 continuously monitors whether the system time and the number of journals in the operating environment of the paper intelligent agent 100 have been changed.
The paper crawling actuator module 104 is configured to automatically crawl the paper information in the operating environment of the paper intelligent agent 100.
The paper information storage module 105 is configured to store the crawled paper information and log information during the crawling process.
Further, the expression for the paper information crawling comprehensiveness indicator is as follows:
Where ARp is the paper information crawling comprehensiveness indicator to evaluate the automatic crawling of the paper intelligent agent 100 on the paper information, xi denotes the number of papers in journal i automatically crawled by the paper intelligent agent 100, ci is the number of papers of the journal i published in a time span ti, and ‖⋅‖₂² denotes the squared 2-norm distance function. The closer the values of xi and ci are to each other, the closer the number of papers in the journal i automatically crawled by the paper intelligent agent 100 is to the number of the published papers of the journal i in the Web of Science database. The paper information automatically crawled by the paper intelligent agent 100 is more comprehensive as the value of ARp decreases.
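By way of non-limiting illustration, the behavior described for ARp can be sketched in Python. The exact expression is not reproduced in this text, so the sketch assumes that ARp is the squared 2-norm of the per-journal differences between xi and ci, divided by the squared 2-norm of ci so that the value does not depend on database size. The normalization, the function name `comprehensiveness_indicator`, and the sample counts other than 373 are assumptions of this sketch.

```python
from typing import Sequence


def comprehensiveness_indicator(crawled: Sequence[int],
                                published: Sequence[int]) -> float:
    """Squared 2-norm distance between crawled counts x_i and published counts c_i.

    Assumption of this sketch: the distance is divided by the squared 2-norm
    of the published counts so that the result is scale-free.
    """
    if len(crawled) != len(published):
        raise ValueError("one count per journal is required in both sequences")
    squared_gap = sum((x - c) ** 2 for x, c in zip(crawled, published))
    scale = sum(c ** 2 for c in published)
    # Fall back to the raw squared gap when every published count is zero.
    return squared_gap / scale if scale else float(squared_gap)


# Hypothetical counts for three journals: a complete crawl, then a crawl
# that missed a few papers in two of the journals.
complete = comprehensiveness_indicator([373, 120, 58], [373, 120, 58])
partial = comprehensiveness_indicator([370, 120, 50], [373, 120, 58])
print(complete, partial)
```

Under these assumptions a complete crawl yields zero, and the value grows as the crawled counts drift away from the published counts, which matches the stated interpretation that a smaller ARp means a more comprehensive crawl.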
Further, the expression for the paper information crawling accuracy indicator is as follows:
Where ACp is the paper information crawling accuracy indicator to evaluate the automatic crawling of the paper intelligent agent 100 on the paper information, p(i,j) denotes the j-th literature data of the journal i automatically crawled by the simple reflex intelligent agent, [p(i,j)] denotes the number of fields included in the literature data p(i,j), and β denotes the number of fields of literature data in the Web of Science database. For example, see Table 1: in 2021, each paper in the Web of Science database included 70 fields of information, such as literature title, literature type, language, keywords, etc., i.e., β=70.
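By way of non-limiting illustration, the relationship between [p(i,j)] and β can be sketched in Python. The exact expression for ACp is not reproduced in this text, so the sketch assumes that ACp is the mean fraction of missing fields over all crawled records. That form, the names `field_count` and `accuracy_indicator`, and the placeholder field names are assumptions of this sketch; only β=70 comes from the description.

```python
from typing import Mapping, Optional, Sequence

BETA = 70  # number of fields per paper record in 2021, per the description


def field_count(record: Mapping[str, Optional[str]]) -> int:
    """[p(i,j)]: number of fields actually populated in one crawled record."""
    return sum(1 for value in record.values() if value not in (None, ""))


def accuracy_indicator(records: Sequence[Mapping[str, Optional[str]]],
                       beta: int = BETA) -> float:
    """Mean fraction of missing fields over all crawled records (assumed form)."""
    if not records:
        raise ValueError("at least one crawled record is required")
    return sum((beta - field_count(r)) / beta for r in records) / len(records)


# Hypothetical records: one with all 70 fields, one with 7 fields left empty.
full_record = {f"field_{k}": "value" for k in range(BETA)}
partial_record = {**full_record, **{f"field_{k}": "" for k in range(7)}}
print(accuracy_indicator([full_record]), accuracy_indicator([partial_record]))
```

With this assumed form, a record carrying every field contributes zero, so a smaller value corresponds to a more accurate crawl.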
Further, the expression for the paper information crawling performance objective function is as follows:
Where p is the paper information crawling performance objective function to evaluate the automatic crawling of the paper intelligent agent 100 on the paper information. The paper intelligent agent 100 automatically crawls the paper information more comprehensively and accurately as the value of p decreases.
Further, the expression for the paper information environment collection is as follows:
Where Sp denotes the paper information environment collection, ti is the time span over which the paper information of the journal i has been updated in the Web of Science database, ci is the number of published papers of the journal i in the time span ti, and N is the number of journals in the Web of Science database. For example, the value of N was 12,424 in 2021, which means that the Web of Science database stored a total of 12,424 journals, and for the 23rd journal, PRL (Pattern Recognition Letters), a total of 373 papers were published during 2021, i.e., t23=2021 and c23=373.
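By way of non-limiting illustration, the environment collection Sp can be held as a mapping from the journal index i to its pair (ti, ci). The class name `JournalEntry`, the variable `paper_environment`, and the helper `total_published` are hypothetical names introduced for this sketch; only the PRL values come from the description.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class JournalEntry:
    time_span: int    # t_i: time span of the update (a year in the example)
    paper_count: int  # c_i: papers published by journal i within t_i


# S_p as a mapping from journal index i to its (t_i, c_i) pair.
# Only the PRL entry comes from the description; a full collection
# would hold one entry for each of the N journals.
paper_environment: Dict[int, JournalEntry] = {
    23: JournalEntry(time_span=2021, paper_count=373),
}


def total_published(environment: Dict[int, JournalEntry]) -> int:
    """Benchmark paper count that a complete crawl should reach."""
    return sum(entry.paper_count for entry in environment.values())


print(len(paper_environment), total_published(paper_environment))
```

Keeping the entries immutable means that a change in the environment is always expressed as a new entry, which keeps the comparison performed by the sensing module simple.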
Further, the sensing module continuously monitors the change in the system time and the number of journals in the environment collection with the following expression:
Where Mp is used to reflect the change in the system time and the number of journals, T denotes the current system time monitored by the sensing module, and N* is the latest number of journals in the Web of Science database monitored by the sensing module. When the current system time monitored by the sensing module is greater than the time span of the journal update, or a new journal is added to the Web of Science database, Mp>0. Mp>0 therefore indicates a change in the system time or the number of journals.
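By way of non-limiting illustration, the trigger condition described for Mp can be sketched in Python. The exact expression is not reproduced in this text, so the sketch assumes that Mp adds the positive part of T minus ti over all journals to the positive part of N* minus N. That form and the name `change_signal` are assumptions of this sketch; it only preserves the stated property that Mp>0 when the system time exceeds an update time span or a new journal has been added.

```python
from typing import Sequence


def change_signal(current_time: int,
                  update_times: Sequence[int],
                  latest_journal_count: int) -> int:
    """Assumed form of M_p: positive when time has advanced or journals were added.

    current_time          -- T, the system time seen by the sensing module
    update_times          -- t_i for each of the N journals in the collection
    latest_journal_count  -- N*, the latest number of journals in the database
    """
    known_journals = len(update_times)  # N
    time_lag = sum(max(current_time - t, 0) for t in update_times)
    new_journals = max(latest_journal_count - known_journals, 0)
    return time_lag + new_journals


# Hypothetical collection of two journals, both last updated for 2021.
print(change_signal(2021, [2021, 2021], 2))  # nothing changed
print(change_signal(2022, [2021, 2021], 2))  # system time moved on
print(change_signal(2021, [2021, 2021], 3))  # one journal was added
```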
Further, this embodiment also discloses a literature data crawling method, in particular a paper crawling method, applying the paper intelligent agent 100 as described above to crawl paper information. When the sensing module monitors a change in the system time and the number of journals, the actuator module sets a target based on the performance objective function constructed by the performance module and automatically crawls the paper information in the operating environment of the paper intelligent agent 100.
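By way of non-limiting illustration, the method above follows the condition-action pattern of a simple reflex agent and can be sketched in Python. The function name `reflex_step`, the retry loop, and the `max_attempts` limit are assumptions of this sketch, and the `sense` and `crawl` callables stand in for the sensing module and the actuator module; the default target of 0.02 is the value used in this embodiment.

```python
from typing import Callable, Optional


def reflex_step(sense: Callable[[], float],
                crawl: Callable[[], float],
                target: float = 0.02,
                max_attempts: int = 5) -> Optional[float]:
    """Apply one condition-action rule: crawl only if a change was sensed.

    sense -- returns the change signal; a positive value means a change
    crawl -- performs one crawl and returns the performance objective value
    Returns None when nothing changed, otherwise the last objective value.
    """
    if sense() <= 0:
        return None  # no change in the environment, so no action is taken
    score = crawl()
    attempts = 1
    # Assumed behavior: repeat the crawl until the target is met or the
    # attempt budget is exhausted.
    while score > target and attempts < max_attempts:
        score = crawl()
        attempts += 1
    return score


# Hypothetical stand-ins: a sensed change and a crawler that improves per try.
objective_values = iter([0.5, 0.1, 0.01])
result = reflex_step(lambda: 1.0, lambda: next(objective_values))
print(result)
```

The agent keeps no internal model of the environment: its action depends only on the current percept, which is what distinguishes a simple reflex agent from model-based designs.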
The paper crawling method disclosed in this embodiment constructs a paper crawling performance objective function by means of the paper information crawling accuracy indicator and the paper information crawling comprehensiveness indicator, which ensures that the paper information is crawled accurately and comprehensively, reduces manual intervention, and increases the efficiency in crawling the paper information.
Further, this embodiment employs the above-described paper intelligent agent 100 to crawl paper information data covering the five years from 2017 to 2021 from the Web of Science database.
As detailed in Table 2, the actuator module sets the target of p≤0.02 for this crawl, and none of the reported crawling failure rates exceeds 0.02.
As shown in
Herein, the impact factor crawling performance module 201 is configured to construct an impact factor crawling performance objective function, and the impact factor crawling performance objective function is constructed by: taking the number of journals in the Web of Science database as a benchmark to construct an impact factor crawling comprehensiveness indicator of the impact factor intelligent agent 200; analyzing impact factor change of journals in the Web of Science database to construct an impact factor crawling accuracy indicator of the impact factor intelligent agent 200; and establishing the impact factor crawling performance objective function based on the comprehensiveness indicator and the accuracy indicator.
The impact factor crawling environment module 202 is configured to analyze the impact factor value and update frequency of the journal, and to construct an impact factor environment collection of the impact factor intelligent agent 200.
The impact factor crawling sensing module 203 continuously monitors whether the system time and the number of journals in the operating environment of the impact factor intelligent agent 200 have been changed.
The impact factor crawling actuator module 204 is configured to automatically crawl the impact factor in the operating environment of the impact factor intelligent agent 200.
The impact factor storage module 205 is configured to store the crawled impact factor and log information during the crawling process.
Further, the expression for the impact factor crawling comprehensiveness indicator is as follows:
Where ARf is the comprehensiveness indicator to evaluate the automatic crawling of the impact factor intelligent agent 200 on the impact factor, N′ denotes the number of journal impact factors crawled automatically by the impact factor intelligent agent 200, and ‖⋅‖₂² denotes the squared 2-norm distance function. The closer the values of N′ and N are to each other, the closer the number of journal impact factors automatically crawled by the impact factor intelligent agent 200 is to the number of journal impact factors in the Web of Science database. The journal impact factor automatically crawled by the impact factor intelligent agent 200 is more comprehensive as the value of ARf decreases.
Further, the expression for the impact factor crawling accuracy indicator is as follows:
Where ACf is the accuracy indicator to evaluate the automatic crawling of the impact factor intelligent agent 200 on the journal impact factor, yi denotes the value of the impact factor of the journal i crawled automatically by the impact factor intelligent agent 200, and ei denotes the value of the impact factor of the journal i in the Web of Science database. The closer yi is to ei, the more accurate the crawled journal impact factor is. The journal impact factor automatically crawled by the impact factor intelligent agent 200 is therefore more accurate as the value of ACf decreases.
Further, the expression for the impact factor crawling performance objective function is as follows:
Where f is the impact factor crawling performance objective function to evaluate the automatic crawling of the impact factor intelligent agent 200 on the impact factor. The journal impact factor automatically crawled by the impact factor intelligent agent 200 is more comprehensive and accurate as the value of f decreases.
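By way of non-limiting illustration, the impact factor indicators can be sketched in Python. The exact expressions are not reproduced in this text, so the sketch assumes that ARf is the squared difference between N′ and N divided by N squared, that ACf is the squared 2-norm of the differences between yi and ei divided by the squared 2-norm of ei, and that f is their sum. These forms, the function name `impact_factor_objective`, and the second journal entry are assumptions of this sketch; only the PRL value 4.757 comes from the description.

```python
from typing import Mapping


def impact_factor_objective(crawled: Mapping[int, float],
                            reference: Mapping[int, float]) -> float:
    """Assumed form of f: comprehensiveness term plus accuracy term.

    crawled   -- y_i, crawled impact factor per journal index
    reference -- e_i, impact factor per journal index in the database
    """
    if not reference:
        raise ValueError("the reference collection must contain a journal")
    n, n_crawled = len(reference), len(crawled)  # N and N'
    comprehensiveness = (n_crawled - n) ** 2 / n ** 2
    squared_error = sum((y - reference[i]) ** 2
                        for i, y in crawled.items() if i in reference)
    scale = sum(e ** 2 for e in reference.values())
    accuracy = squared_error / scale if scale else squared_error
    return comprehensiveness + accuracy


# Journal 23 (PRL) uses the value from the description; journal 24 is hypothetical.
reference = {23: 4.757, 24: 2.0}
print(impact_factor_objective({23: 4.757, 24: 2.0}, reference))  # complete crawl
print(impact_factor_objective({23: 4.757}, reference))           # one journal missed
```

Under these assumptions a complete and exact crawl yields zero, and the value can be compared directly against a target such as f≤0.02.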
Further, the expression for the impact factor environment collection is as follows:
Where Sf denotes a collection of external environments in which the impact factor intelligent agent 200 operates, τi is a time span over which the impact factor of the journal i is updated in the Web of Science database, ei is a value for the impact factor of the journal i over the time span τi, and N is the number of journals in the Web of Science database. For example, the value of N is 12424 in 2021, which means that the Web of Science database stores a total of 12424 journals, and for the 23rd journal, PRL (Pattern Recognition Letters), its impact factor is updated every 12 months and it has an impact factor of 4.757 in 2021, i.e., τ23=12 and e23=4.757.
Further, the sensing module continuously monitors the change in the system time and the number of journals in the environment collection with the following expression:
Where Mf is used to reflect the change in the system time and the number of journals, and when Mf>0, it indicates a change in the system time and the number of journals.
Further, this embodiment also discloses a literature data crawling method, in particular an impact factor crawling method, applying the impact factor intelligent agent 200 as described above to crawl the impact factor. When the sensing module has monitored a change in the system time and the number of journals, the actuator module sets a target based on the performance objective function constructed by the performance module and automatically crawls the impact factor.
Further, in this embodiment, if the sensing module monitors Mf>0, the actuator module is activated and automatically crawls the impact factors of journals in the Web of Science database based on the impact factor environment collection, with the target of f≤0.02.
As shown in Table 3, in this embodiment, journal impact factor data of a total of five years from 2017-2021 from the Web of Science database are crawled.
As can be seen through Table 3, the percentage of impact factor crawling failures is zero. It can be seen that journal impact factor crawling according to the embodiment ensures the stability and comprehensiveness of the crawling results.
It can be clearly understood by those skilled in the art that, for convenience and conciseness of description, the above division of the functional modules is taken only as an example. In practical application, the functions can be allocated to different functional modules as required; that is, the internal structure of the intelligent agent can be divided into different functional modules. The integrated modules can be realized in the form of hardware or software functional units. In addition, the specific name of each functional module is only for conveniently distinguishing the modules from each other, and is not used to limit the scope of protection of the present disclosure.
As shown in
The processing unit 310 is configured to execute programs, instructions, or code stored in the memory 320 in order to accomplish the operation of the various modules or steps discussed herein. For example, the steps and operations discussed herein may be executed or implemented by the processing unit 310 via the communication unit 330. The communication unit 330 may be a transceiver or other suitable interface to implement the relevant operations discussed herein. The processing unit 310, via the communication unit 330, may access a network such as, for example, the Web of Science website, and crawl literature data from the Web of Science website by running the programs, instructions, or code stored in the memory 320.
For example, the processing unit 310 may include one or more central processing units (CPUs) or general-purpose processors with one or more processing cores, although other types of processors may also be used.
In some embodiments, the memory 320 is further configured to store information about the crawled papers, the impact factors, and log information during the crawling process.
The foregoing is merely a preferred embodiment of the present disclosure and is not intended to limit the disclosure; those skilled in the art may make various changes and variations to the present disclosure. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present disclosure shall be included in the protection scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202310086593.7 | Feb 2023 | CN | national |
This application is a continuation-in-part of and claims priority to International Patent Application No. PCT/CN2023/100350 filed on Jun. 15, 2023, which application claims the benefit and priority of Chinese Patent Application No. 202310086593.7 filed with the China National Intellectual Property Administration on Feb. 9, 2023, and entitled “simple reflex intelligent agent for crawling literature data and method of crawling literature data”. The two applications are incorporated by reference herein in their entirety as part of the present application.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2023/100350 | Jun 2023 | WO |
Child | 18777105 | | US |