This invention relates to data processing and analysis and, more particularly, to electronically generating items of original content from structured and unstructured data. The invention concerns the unique use of natural language understanding and the automatic incorporation of feedback for automatic content creation.
Organizations today routinely process and analyze large amounts of data from varied internal and external sources. For example, an analyst at a Wall Street firm preparing a report on the impact of US foreign policy on international investments might review thousands of pages from more than a dozen sources to create an original piece combining analytical results with expert opinion. Even individual authors must assimilate, understand, and analyze data in their everyday responsibilities. For example, a blog writer for TechCrunch might review hundreds of reports to write an article on how the use of blockchain and cryptocurrency is disrupting large banks. The field of Big Data is moving from the variety, volume, and velocity of data to the veracity of data. As the amount of data available to an organization or individual increases, it becomes more and more difficult to separate noise from useful content. Data processing software is typically unable to function properly as data becomes noisy and highly voluminous. The use of templates and forms creates limitations that commoditize the usefulness of the output. The fields of Artificial Intelligence, machine learning, and related cutting-edge technologies can be used to compensate for some of these limitations. Amid all these complexities, the need of organizations and individuals to produce original content from vast volumes of data is ever increasing.
In some implementations, a system includes a Cognitive Memory Augmented Network (“CAMN”) that uses advanced methods of cognitive search, content summarization, and feedback assimilation to produce machine-generated original content. The CAMN can ingest data from both structured and unstructured sources and organize it in a neural network. Methods of generic and custom decomposition are used to ensure that the data sources are broken down inside the CAMN into individual elements of reusable data. The Cognitive Gateway Interface (“CGI”) ensures that the data available inside the CAMN is accessible to various processes such as cognitive search, content extraction, and summarization. Finally, a feedback mechanism is used to ingest human thought, utilize Artificial Intelligence and machine learning, and convert such feedback into original content that is introduced into the output of the overall system.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of embodiments. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the embodiments.
One or more specific embodiments of the present invention will be described below. In an effort to provide a concise description of these embodiments, all features of an actual implementation may not be described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
The CGI module 101 can be accessed through Cognitive Search Methods 107. For example, an organization might use Module 101 for Content Summarization 108. As another example, an individual might use Module 101 for Original Content Creation 109.
As used herein, devices, including those associated with the system 100 and any other device described herein, may exchange information via any communication network which may be one or more of a Local Area Network (“LAN”), a Metropolitan Area Network (“MAN”), a Wide Area Network (“WAN”), a proprietary network, a Public Switched Telephone Network (“PSTN”), a Wireless Application Protocol (“WAP”) network, a Bluetooth network, a wireless LAN network, and/or an Internet Protocol (“IP”) network such as the Internet, an intranet, or an extranet. Note that any devices described herein may communicate via one or more such communication networks.
The system 100 may store information into and/or retrieve information from various data stores (e.g., cloud systems 104, data lakes 106, documents 105, and proprietary databases 103), which may be locally stored or reside remote from the CAMN 102. Although a single CAMN 102 is shown, any number of such elements may be included according to some embodiments.
A user may access the system 100 via a remote device (e.g., a Personal Computer (“PC”), tablet, or smartphone) to view information about and/or manage operational information in accordance with any of the embodiments described herein. In some cases, an interactive graphical user interface display may let an operator or administrator define and/or adjust certain parameters via the remote device (e.g., to define how data sources should be accessed) and/or provide or receive automatically generated recommendations or results associated with the system 100.
Structured data 201 refers to data stored in a key-value structure. For example, 201 is composed of document metadata 203, a knowledge graph and concept tree map 204, and historical log data 205. For example, 203 might include the title of a document. The user provides some metadata for each document in their data set; this data is used by Module 102 to generate training data for the customized cognitive search and summarization engine. For example, the knowledge graph and concept tree map 204 may include an Entity-Dictionary, meaning the user has provided the terms and vocabulary frequent in their domain, along with their synonyms, which are used by System 102 for micro and macro understanding of the data. For example, historical log data 205 might include a search-query-log, meaning the user provides a log of search terms that have been used to search within the data. This data is used by Module 102 to generate training data for customized cognitive search. For example, 205 might also include a search-query-result-log pairing search terms with the correct results in the data for each term. This data may likewise be used by Module 102 for generating training data for customized cognitive search. A sketch of such structured inputs follows.
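By way of illustration only, the following Python sketch shows one possible key-value representation of the structured data 201. All field names and values are hypothetical assumptions, not part of the disclosure.

```python
# Hypothetical key-value representations of structured data 201;
# every field name and value here is illustrative, not from the disclosure.

document_metadata = {            # 203: per-document metadata supplied by the user
    "doc_id": "doc-0001",
    "title": "Impact of US Foreign Policy on International Investments",
    "source": "internal-report",
}

entity_dictionary = {            # 204: frequent domain terms with their synonyms
    "blockchain": ["distributed ledger", "DLT"],
    "cryptocurrency": ["crypto", "digital currency"],
}

search_query_log = [             # 205: historical queries, optionally with known-good results
    {"query": "blockchain disruption banks", "relevant_doc_ids": ["doc-0001"]},
    {"query": "foreign policy investment risk", "relevant_doc_ids": []},
]
```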
Unstructured data 202 refers to original documents, media files, and Uniform Resource Locators (“URLs”) to be parsed or crawled. It can arrive in many different formats and must be further decomposed and extracted to obtain meaningful data. For example, extraction and decomposition might be used to process the unstructured data. Module 102 is further built of two types of cognition entities: Models, which have macro understanding of the user data, and Cognitive Objects, which have micro understanding of the input data. Macro understanding in 102 is based on custom Neural Network based Deep Learning models. Micro understanding is the extraction of individual entities, facts, or relationships from the text. For example, this is useful for extracting acronyms and their definitions, extracting citation references to other documents, extracting key entities depending on the corpus domain, extracting facts and metadata from full text when they are not separately tagged in the web page, or extracting entities with sentiment (e.g., positive sentiment toward a product or company). For example, micro understanding is performed with syntactic analysis of the text, meaning that word order and usage are important, as in the sketch below.
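By way of illustration only, the following Python sketch shows one form of micro understanding: extracting acronyms and their definitions with a simple syntactic pattern. The disclosure does not specify the extraction method; this regex heuristic is an assumption.

```python
import re

# A minimal sketch of acronym/definition extraction via a syntactic pattern.
# Matches text like: Cognitive Memory Augmented Network ("CAMN")

def extract_acronyms(text):
    pattern = re.compile(r'((?:[A-Z][a-z]+\s+){1,6})\(["“]?([A-Z]{2,})["”]?\)')
    return {m.group(2): m.group(1).strip() for m in pattern.finditer(text)}

print(extract_acronyms('a Cognitive Memory Augmented Network ("CAMN") to use ...'))
# {'CAMN': 'Cognitive Memory Augmented Network'}
```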
The Cognitive Search 107 is an important feature of System 100 that allows an organization to use Module 101 to access Module 102 for accessing data. 102 generates training data for a cognitive search model with macro-level understanding, and this system has four methods: (1) Hi-fi data: a portion of the search-query-result-log data; (2) Mid-fi data: using search-query-log data and one unsupervised search model (BM25), the system generates mid-fi data; (3) Mid mid-fi data: using extracted noun-phrase data and one unsupervised search model (BM25), the system generates mid mid-fi data; and (4) Weak mid-fi data: using document metadata (e.g., titles) as search query terms and one unsupervised search model (BM25), the system generates weak mid-fi data. A sketch of this weak-supervision step appears below. For example, 102 can further analyze and augment content with additional methods such as Augmentation with Noun-phrase Extraction. In this method, the system crawls through all the documents in the extracted text corpus and extracts every noun phrase. It can then filter and sort the phrases by number of occurrences. This data will be used by 102 for generating training data for customized cognitive search. For example, 102 might also use Augmentation with an Entity-Dictionary using domain-specific terms. In this method, starting from the extracted noun phrases, the system uses available databases such as Wikidata and WordNet to generate a list of common terms and their synonyms in order to obtain domain-specific terms. In one implementation, in order to perform the cognitive search, 102 can operate with a mixture of two architectures, taking both phrase/keyword match and semantic match into consideration between the query and the document.
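By way of illustration only, the following Python sketch shows how mid-fi training pairs might be generated from a search-query-log using an unsupervised BM25 model. The rank_bm25 package is used as an assumed BM25 implementation; the function name and the top_n cutoff are illustrative.

```python
from rank_bm25 import BM25Okapi  # assumed third-party BM25 implementation

# Sketch: score each historical query against the corpus with unsupervised
# BM25 and keep the top-ranked documents as weakly labeled positives.

def generate_mid_fi_pairs(query_log, documents, top_n=5):
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    pairs = []
    for query in query_log:
        scores = bm25.get_scores(query.lower().split())
        ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
        for doc_idx in ranked[:top_n]:
            pairs.append((query, doc_idx, scores[doc_idx]))  # weak positive label
    return pairs
```

The same routine applies to the mid mid-fi and weak mid-fi methods by substituting extracted noun phrases or document titles for the logged queries.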
In some implementations, 107 can use a phrase match architecture. In order to do a phrase/keyword match, 107 represents inputs as vectors: each word is represented in an ‘N’ dimensional space in which words that are similar lie close to each other, while words that differ in meaning lie far apart. For example, the query for the phrase match architecture can be represented by the cosine similarity between the words in the query and the words in the document. To represent the input, 107 obtains vector representations of the words in the query and in the document and computes the cosine similarity between them: a value closer to ‘1’ means similar, while a value closer to ‘0’ means they are not similar. A sketch of this computation follows. For example, some input representations do not take into account contextual information from neighboring words. Having a contextual representation helps because, even if the words themselves are not similar, the neighboring words might provide information relevant to the query.
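By way of illustration only, the following Python sketch computes the phrase-match signal as a cosine similarity matrix between query-word and document-word embeddings. The function name is illustrative.

```python
import numpy as np

# Sketch: cosine similarity between each query word vector and each document
# word vector, yielding entries near 1 for similar words and near 0 otherwise.

def cosine_similarity_matrix(query_vecs, doc_vecs):
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T   # shape: (num_query_words, num_doc_words)
```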
In some implementations, 107 can utilize a semantic match architecture. The phrase match network captures the information between query words and document words, but it fails to capture the overall contextual information that flows across large documents. The contextual flow of information across passages might change, and at times the query may be more abstract and contain no exact phrases matching the existing document. In such cases, it is important to obtain contextual information about the document. In some implementations, 107 is designed as another neural network that obtains a semantic understanding of the text. One reason for using ngraphs is that some words in the query might not have embeddings or vector representations at inference time and would otherwise be represented as vectors of zeros; with ngraphs, even a word that was never seen while training the network can still be represented. ‘Ngraphs’ are another way of representing words using subword information, where every word is represented using subsets of its characters. In some implementations, the ngraphs used have a maximum length of five and the top 2,000 ngraphs are chosen to represent the words, as in the sketch below.
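By way of illustration only, the following Python sketch builds character ngraphs of length up to five and selects the 2,000 most frequent as a vocabulary. The boundary markers and minimum length are assumptions; the disclosure specifies only the maximum length and vocabulary size.

```python
from collections import Counter

# Sketch: break each word into character n-grams (length <= 5) and keep the
# 2,000 most frequent across the corpus as the sparse representation vocabulary.

def char_ngrams(word, max_len=5):
    padded = f"#{word}#"   # boundary markers are an assumption
    return [padded[i:i + n]
            for n in range(2, max_len + 1)
            for i in range(len(padded) - n + 1)]

def build_ngraph_vocab(corpus_words, vocab_size=2000):
    counts = Counter(g for w in corpus_words for g in char_ngrams(w))
    return {g: idx for idx, (g, _) in enumerate(counts.most_common(vocab_size))}
```

Because an unseen word still decomposes into known ngraphs, it receives a nonzero representation at inference time.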
In some implementations, the semantic match architecture is composed of: (a) a Query Network that takes the query as a sparse matrix represented using ngraphs and performs convolution to extract meaningful information from the query; (b) a Document Network that takes as input a sparse matrix constructed using ngraphs and performs a convolution operation to extract meaningful information from the document; and (c) a Contextual Similarity Network that takes input from the query network and the document network, namely representations of the query and the document in an embedding space. For example, to find the similarity between the query and the document, a Hadamard product is performed, and the resulting information is aggregated using fully connected networks, as in the sketch below. In some implementations, the network is trained using both the cosine similarity network and the ngraph network, and the loss depends on the weights assigned to the cosine similarity (phrase match network) and the semantic similarity (ngraph network).
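By way of illustration only, the following condensed PyTorch sketch shows the three components: convolutional query and document networks over ngraph inputs, a Hadamard product of the resulting embeddings, and fully connected aggregation into a score. All layer sizes, kernel sizes, and pooling choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the semantic match architecture; layer sizes are assumptions.

class SemanticMatchNet(nn.Module):
    def __init__(self, ngraph_vocab=2000, embed_dim=128):
        super().__init__()
        self.query_net = nn.Sequential(              # (a) Query Network
            nn.Conv1d(ngraph_vocab, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(), nn.AdaptiveMaxPool1d(1))
        self.doc_net = nn.Sequential(                # (b) Document Network
            nn.Conv1d(ngraph_vocab, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(), nn.AdaptiveMaxPool1d(1))
        self.scorer = nn.Sequential(                 # (c) Contextual Similarity Network
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, query_ngraphs, doc_ngraphs):
        # inputs: (batch, ngraph_vocab, sequence_length) sparse ngraph matrices
        q = self.query_net(query_ngraphs).squeeze(-1)   # (batch, embed_dim)
        d = self.doc_net(doc_ngraphs).squeeze(-1)       # (batch, embed_dim)
        sim = q * d                                     # Hadamard product
        return self.scorer(sim)                         # relevance score
```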
In some implementations, 107 will utilize a faster-running architecture. For example, in order to run the network faster, the ReLU calculation performed at the end of the cosine similarity network is modified. In some implementations, convolution is performed over a dynamic document size rather than a fixed document size. In other implementations, the calculation for the phrase match network is changed from 32-bit to 16-bit floating point to achieve the faster architecture, as sketched below.
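By way of illustration only, the following Python sketch casts the network sketched above and its inputs to 16-bit floats. Half precision generally requires GPU support; tensor shapes are illustrative.

```python
import torch

# Sketch: cast the model and inputs from 32-bit to 16-bit floating point.
# SemanticMatchNet is the illustrative class sketched above.

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SemanticMatchNet().to(device).half().eval()
query = torch.randn(1, 2000, 8, device=device).half()    # dummy ngraph input
doc = torch.randn(1, 2000, 200, device=device).half()
with torch.no_grad():
    score = model(query, doc)
```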
In order to create input for the search network, 107 needs a data structure that holds a vector representation (word embedding) and a query inverse document frequency for each word for the phrase match architecture. For the semantic match architecture, a data structure holding ngraphs is required to create sparse input representations of documents and queries. The search architecture interacts with 102 to fetch the word embeddings, query inverse document frequencies, and ngraphs needed to create inputs for the phrase match and semantic match architectures. The user interacts with 102 through 101 to send a query request to the search network; the search network takes the query as input and returns the top 10-50 documents to 102. The search network also maintains a threshold: if a document's score falls below the threshold, that document is not sent to 102 (see the sketch below). Whenever new data comes in, 102 parses the data and creates new word embeddings, query inverse document frequencies, and ngraphs. Module 102 can also feed feedback data to the search network: it finds the query-document pairs that were marked as not relevant/junk or as highly relevant/relevant, creates new feedback data from them, and improves the search model by training it on this feedback data.
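By way of illustration only, the following Python sketch shows the result-selection step: ranking scored documents, truncating to the top-k, and dropping any below a score threshold. The function name, cutoff, and threshold value are illustrative assumptions.

```python
# Sketch: return the top-k scored documents, filtered by a minimum score.

def select_results(scored_docs, top_k=50, threshold=0.5):
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked[:top_k] if score >= threshold]

results = select_results([("doc-1", 0.91), ("doc-2", 0.34), ("doc-3", 0.77)])
# [('doc-1', 0.91), ('doc-3', 0.77)]
```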
Most recent state-of-the-art architectures represent the words in the query and document as dense representation vectors. In most current implementations, the input is represented as a cosine similarity matrix between query terms and document terms. The input is then fed to a convolution network that finds phrase matches (e.g., trigram, bigram, or unigram matches) between the query and documents. The drawback of the phrase match architecture is that it fails to capture the semantic relationship between the query and documents if the query is abstract or long. To overcome this issue, we developed a new architecture that addresses the drawbacks of the phrase match architecture: rather than using word embeddings to represent words directly, we use them to create sentence embeddings. In some implementations, Smooth Inverse Frequency (“SIF”) is used to represent sentences as 300-dimensional vectors. One advantage of SIF is that it uses a weighted average of word vectors to represent sentences, and it has been shown to be on par with other state-of-the-art sentence embedding models. In some implementations, the query is represented as a 300-dimensional vector and each sentence in the document is represented as a 300-dimensional vector. In some implementations, cosine similarity is used to find the similarity between the query and the sentences in the document, and the top-k sentences are chosen, for example with k equal to 10. In some implementations, 107 takes the best-matching top-k sentences and passes them to a fully connected layer to find relevant patterns and score each document. This architecture is simple, gives much better performance, and is fast: using it, for example, 100K documents can be processed in 1.3-1.4 seconds, and in some implementations, after optimization, 100K documents can be processed in less than one second. A sketch of the SIF computation follows.
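By way of illustration only, the following Python sketch computes SIF sentence embeddings in the standard manner (Arora et al., 2017): a frequency-weighted average of word vectors, followed by removal of the first principal component. The parameter names and the default a=1e-3 are conventional choices, not taken from the disclosure.

```python
import numpy as np

# Sketch of SIF sentence embeddings. `word_vecs` maps words to 300-d vectors;
# `word_freq` maps words to relative corpus frequencies.

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3, dim=300):
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        words = [w for w in sent.lower().split() if w in word_vecs]
        if words:
            weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
            vecs = np.array([word_vecs[w] for w in words])
            emb[i] = (weights[:, None] * vecs).mean(axis=0)
    # remove the projection onto the first principal component
    u, _, _ = np.linalg.svd(emb.T @ emb)
    pc = u[:, :1]
    return emb - emb @ pc @ pc.T
```

The query and each document sentence are embedded the same way, cosine similarity selects the top-k sentences, and those are passed to the fully connected scoring layer described above.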
Although specific hardware and data configurations have been described herein, note that any number of other configurations may be provided in accordance with some embodiments of the present invention (e.g., some of the information associated with the databases described herein may be combined or stored in external systems). Moreover, although some embodiments are focused on particular types of integration services and microservices, any of the embodiments described herein could be applied to other types of applications. Moreover, the displays shown herein are provided only as examples, and any other type of user interface could be implemented.
The present invention has been described in terms of several embodiments solely for the purpose of illustration. Persons skilled in the art will recognize from this description that the invention is not limited to the embodiments described, but may be practiced with modifications and alterations limited only by the spirit and scope of the appended claims.
The present application claims the benefit of U.S. Provisional Patent Application No. 62/719,708 entitled “ELECTRONICALLY GENERATING ITEMS WITH ORIGINAL CONTENT” and filed Aug. 20, 2018. The entire content of that application is incorporated herein by reference.