The present disclosure relates to a machine learning process for cleaning of data, and more specifically to modifying data based on a limited set of examples using Artificial Intelligence.
Knowledge graphs, also known as semantic networks, represent connections between real-world entities, such as objects, events, situations, or concepts, by illustrating the relationship between them. This information is usually stored in a graph database and visualized as a graph structure, prompting the term knowledge “graph.” Knowledge graphs have three main components: nodes, edges, and labels. Any object, place, or person can be a node, which is identified with a label. An edge defines the relationship between the nodes.
In order for a knowledge graph to accurately represent the real-world relationships between nodes, the data used to construct the knowledge graph needs to be consistently formatted.
Additional features and advantages of the disclosure will be set forth in the description that follows, and in part will be understood from the description, or can be learned by practice of the herein disclosed principles. The features and advantages of the disclosure can be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the disclosure will become more fully apparent from the following description and appended claims, or can be learned by the practice of the principles set forth herein.
Disclosed are systems, methods, and non-transitory computer-readable storage media which provide a technical solution to the technical problem described. A method for performing the concepts disclosed herein can include: receiving data at a computer system, wherein the data has a plurality of rows; receiving, from a user at the computer system, a description of a task associated with the data; receiving, from the user at the computer system, a plurality of example transformations, and input and output labels; combining, via at least one processor of the computer system, the task description together with the plurality of example transformations and input and output labels, resulting in a prompt; executing, via the at least one processor, a machine learning model, wherein the prompt is an input to the machine learning model, and wherein output of the machine learning model comprises an algorithm for executing the task; and executing, via the at least one processor, the task on the data using the algorithm.
A system configured to perform the concepts disclosed herein can include: at least one processor; and a non-transitory computer-readable storage medium having instructions stored which, when executed by the at least one processor, cause the at least one processor to perform operations comprising: receiving data, wherein the data has a plurality of rows; receiving, from a user, a description of a task associated with the data; receiving, from the user, a plurality of example transformations; combining the task description together with the plurality of example transformations, resulting in a prompt; executing a machine learning model, wherein the prompt is an input to the machine learning model, and wherein output of the machine learning model comprises an algorithm for executing the task; and executing the task on the data using the algorithm.
A non-transitory computer-readable storage medium configured as disclosed herein can have instructions stored which, when executed by a computing device, cause the computing device to perform operations which include: receiving data, wherein the data has a plurality of rows; receiving, from a user, a description of a task associated with the data; receiving, from the user, a plurality of example transformations; combining the task description together with the plurality of example transformations, resulting in a prompt; executing a machine learning model, wherein the prompt is an input to the machine learning model, and wherein output of the machine learning model comprises an algorithm for executing the task; and executing the task on the data using the algorithm.
Various embodiments of the disclosure are described in detail below. While specific implementations are described, this is done for illustration purposes only. Other components and configurations may be used without parting from the spirit and scope of the disclosure.
Systems configured as disclosed herein obtain, clean, and transform data. The cleaned and transformed data may then be used for various purposes, such as input into a knowledge graph. In some embodiments, the data from the knowledge graph may be further utilized, for example, in web page creation, publications, etc. Exemplary, non-limiting sources of the data can include a web crawler, API (Application Programming Interface) requests, webhooks (a method of augmenting or altering the behavior of a web page or web application with custom callbacks, operating, for example, via a “Push to API” source option), and custom TypeScript functions. The system may determine one or more data sources. The data sources may be configured by a user. The system disclosed herein can then obtain the source data and perform transformations (“transforms”) on the source data, thereby cleaning the data, at which point the clean data can be loaded into a knowledge graph, or used in another manner.
In one example, a machine learning model may be used to transform or clean the data. “Few-shot data cleaning” is a broad class of applications that involve transforming or modifying text data given very few examples (hence the term, “few-shot”). Generally, “few examples” means “fewer than about ten.” For example, a model may be given three examples where lower-case text is converted to upper-case text (e.g., “a” to “A”). The model learns the transform given only the three examples. The model can then apply this transformation to any arbitrary text. Transformation tasks are also often accompanied by a short description of the task, for example, “Convert each piece of text to upper-case.” The model being trained in this manner can be any machine learning model configured to operate in this manner. Non-limiting examples of machine learning models capable of operating in this manner include autoregressive language models using a transformer network for language prediction, such as the Generative Pre-trained Transformer 3 (GPT-3) model.
Consider the following example. On a retailer product page of a website, the product price is listed as “From $449 or $18.70/mo. per month for 24 mo.” The full price is to be extracted from the website and stored in a price field of a knowledge graph. The required format for “price” in the knowledge graph may be “USD 449.00”. The system allows the user to set up a single transform, resulting in the system extracting and converting the “From $449 or $18.70/mo. per month for 24 mo.” listing into “USD 449.00”. The system can use the transform to do the same for other websites or webpages which have a similar data structure, without needing to be retrained. To do this, the system uses a “Data Cleaning Transform,” where the system is provided with a few examples of the data cleaning to perform, then applies the same pattern to an entire column of data, automatically cleaning the data according to the defined transform.
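The input and output of this example can be made concrete with a hand-written stand-in for the transform. The sketch below is not the disclosed machine learning approach; it is a fixed rule, written only to show the target behavior, and it assumes that the first dollar amount in the listing is the full price. The function name `to_usd_price` and the regular expression are hypothetical.

```python
import re
from decimal import Decimal

# Matches the first dollar amount, e.g. "$449" or "$1,299.50".
_PRICE_PATTERN = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")


def to_usd_price(listing):
    """Return the first dollar amount in `listing` as 'USD <amount>', or None if absent."""
    match = _PRICE_PATTERN.search(listing)
    if match is None:
        return None
    # Assumption: the first dollar amount in the listing is the full price.
    amount = Decimal(match.group(1).replace(",", ""))
    return f"USD {amount:.2f}"


listing = "From $449 or $18.70/mo. per month for 24 mo."
print(to_usd_price(listing))  # USD 449.00
```

A fixed rule of this kind has to be rewritten for every new listing format, which is the manual effort the example-driven transform is intended to avoid.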
The system may provide the option to create a new transform when configuring the system to import data (regardless of data source). In this option, examples of inputs and expected outputs are provided. The particular data within the imported data to be cleaned, for example a column of data, is identified, and the system performs the data transform. The data may be specified by a user or automatically identified by the system. An exemplary workflow may include specifying:
The column that contains the data on which to perform the transform;
Text description of the transform (i.e., the task description);
A few example inputs and corresponding expected outputs for the transform:
And, in exemplary embodiments, input and output labels for the transform:
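The items in this workflow can be collected into a single record before the transform is run. The following is a minimal sketch of one possible representation; the class names `ExampleTransform` and `TransformSpec`, the field names, and the `missing_fields` helper are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class ExampleTransform:
    example_input: str
    expected_output: str


@dataclass
class TransformSpec:
    column: str                                   # column containing the data to transform
    task_description: str                         # text description of the transform
    examples: list = field(default_factory=list)  # ExampleTransform items
    input_label: str = ""                         # optional, e.g. "Product URL"
    output_label: str = ""                        # optional, e.g. "Product Category"

    def missing_fields(self):
        """List the names of required items that have not been provided yet."""
        missing = []
        if not self.column.strip():
            missing.append("column")
        if not self.task_description.strip():
            missing.append("task_description")
        if not self.examples:
            missing.append("examples")
        return missing


spec = TransformSpec(
    column="price",
    task_description="Extract the full price and format it as USD.",
    examples=[
        ExampleTransform("From $449 or $18.70/mo. per month for 24 mo.", "USD 449.00"),
    ],
    input_label="Listing text",   # illustrative label
    output_label="Price",         # illustrative label
)
print(spec.missing_fields())  # []
```

Treating the labels as optional mirrors the statement that they are used in exemplary embodiments rather than in every configuration.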
In some configurations, the system can provide a preview option, where the transform may be tested on a subset of data, before applying the transform to all data. Likewise, in some configurations the transform can be applied to the data within the knowledge graph database.
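A preview of this kind can be sketched as applying the transform to only the first few rows and returning the input/output pairs for inspection. The function name `preview_transform`, the default limit of five rows, and the use of `str.upper` as a stand-in transform are assumptions made for illustration.

```python
from itertools import islice


def preview_transform(transform, rows, limit=5):
    """Apply `transform` to at most `limit` rows and return (input, output) pairs."""
    return [(row, transform(row)) for row in islice(rows, limit)]


rows = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]

# str.upper stands in for the learned transform.
preview = preview_transform(str.upper, rows, limit=3)
for original, transformed in preview:
    print(f"{original!r} -> {transformed!r}")
```

Because only a slice of the rows is consumed, the remaining data is left untouched until the transform is accepted.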
Obtaining the data can be done via different mechanisms. In some cases the system can use APIs or crawl websites and extract HTML (HyperText Markup Language). While in some cases the JSON (JavaScript Object Notation) and HTML extracted from the website can be well structured (and not need cleaning), often the data will need additional cleansing, parsing, or formatting in order to match the expected field formats, for example, of a knowledge graph (such as formatting of dates/times/etc.). The system disclosed herein can significantly speed up and improve the current computer process of cleaning and transforming the data, allowing the extracted data to be transformed into a desired format easily and more flexibly, using natural language instructions.
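As an illustration of the kind of formatting such fields may require, the sketch below normalizes a few date spellings to a single ISO 8601 format using fixed rules. It is a conventional, hand-coded approach shown for contrast, not the disclosed technique; the list of candidate formats and the function name `to_iso_date` are assumptions, and month names are parsed according to the default (English) locale.

```python
from datetime import datetime

# Candidate input formats are assumptions chosen for this illustration.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%B %d, %Y", "%d %b %Y", "%Y-%m-%d")


def to_iso_date(text):
    """Return `text` as an ISO 8601 date (YYYY-MM-DD), or None if no format matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


for raw in ["03/15/2023", "March 15, 2023", "15 Mar 2023", "sometime soon"]:
    print(raw, "->", to_iso_date(raw))
```

Each additional source format requires another hand-written rule, whereas the example-driven transform is described as being configured with natural language instructions.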
Non-limiting examples of sources for data can include a web-site/web-page crawler; YOUTUBE, VIMEO, or other video storage/distribution websites; TWITTER, FACEBOOK, INSTAGRAM, LINKEDIN, TIK-TOK, or other social media platforms; and platforms such as ZENDESK, HUBSPOT, GURU, CONFLUENCE, GREENHOUSE, and DRUPAL. The ability to select a source of data and add that data to a knowledge graph may be referred to as a “connector.”
The description next turns to the specific examples provided by the figures.
In some embodiments input and output labels may be utilized. The input and output labels may be a short text description that describes the input and output data, respectively. For example an input label may be “Product URL” and an output label may be “Product Category.”
Using the task description 102, input and output labels, and the example transforms 104, the system creates a prompt 106. In some configurations, the created prompt 106 is a string aggregation of the task description 102, input and output labels, and the example transforms 104. In other configurations, the created prompt 106 is reduced or formatted prior to aggregation. The resulting prompt 106 is input into a machine learning model 108, which performs natural language processing on the prompt 106, parsing out the task description 102 and the examples 104, with the output being a task algorithm 110. The task algorithm 110 represents the attempt of the model 108 to use the task description 102 and the examples 104 to perform the described task. The system then retrieves data 114 from one or more sources 112, such as websites, databases, and API feeds. The task algorithm 110, when executed 116 using the data 114, modifies and cleans the data, resulting in revised data 118.
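The string aggregation described above can be sketched as follows. The layout of the prompt (one labeled line per input and output, blank lines between examples) and the function name `create_prompt` are assumptions, as is the choice to append the value to be transformed as a final, incomplete example; the disclosure does not specify the exact prompt format.

```python
def create_prompt(task_description, input_label, output_label, examples, new_input):
    """Aggregate the task description, labels, and examples into a single prompt string."""
    parts = [task_description.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"{input_label}: {example_input}")
        parts.append(f"{output_label}: {example_output}")
        parts.append("")
    # End with an open output slot that the model is expected to complete.
    parts.append(f"{input_label}: {new_input}")
    parts.append(f"{output_label}:")
    return "\n".join(parts)


prompt = create_prompt(
    task_description="Convert each piece of text to upper-case.",
    input_label="Text",                 # illustrative label
    output_label="Upper-case text",     # illustrative label
    examples=[("a", "A"), ("graph", "GRAPH"), ("few-shot", "FEW-SHOT")],
    new_input="knowledge",
)
print(prompt)
```

The resulting string would then be passed to whichever machine learning model the system is configured to use; that call is omitted here because it depends on the model provider.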
In one example, embodiments of the invention may be integrated into an ETL (Extract, Transform, Load) system. ETL systems are tools used to transfer data from one source to another.
The illustrated transform user interface 200 also allows the user to input example transforms 210 that the system uses to form the transformation algorithm. In each example transform 210 an example input 212 and an example output 214 may be received, where the example output 214 is the output the user would like to see if the example input 212 were transformed. In some embodiments, the task description, input label, output label, and three example inputs and outputs are needed. The user interface 200 can also contain a validation portion 216, which provides the user the option to validate outputs 218 prior to accepting the transformation outputs as “clean” and/or prior to adding the outputs to a knowledge graph. Once the inputs are finalized, the “Apply” button may be selected to start the process, for example as described above. In exemplary embodiments, starting the process transforms the data included in the preview table, and the transform is applied to all data when the model is run to process the data to be sent to the knowledge graph.
In the third row 310, 312, the exemplary inputs 310 and outputs 312 appear to indicate that any brackets “[ ]” with text between the brackets (e.g., “[Input 1]”) result in an empty bracket (e.g., “[ ]”) being returned, whereas an empty bracket results in “[This is my new title]” being returned.
In the fourth row 314, 316, the exemplary inputs 314 and outputs 316 illustrate the ability of the system to form more complex, conditional transforms. In this example 314, 316, if an input has text before a comma, that pre-comma text is the output. If there is no text before a comma, the text after the comma becomes the output.
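Written out by hand, the conditional rule described for this row corresponds to logic such as the following. This is only a sketch of the rule that the model is expected to infer from the examples; the function name `comma_transform`, the sample inputs, and the handling of inputs without any comma are assumptions.

```python
def comma_transform(text):
    """Return the text before the first comma, or the text after it if nothing precedes it."""
    before, separator, after = text.partition(",")
    if not separator:
        # No comma present: the described rule does not cover this case, so the
        # input is returned with surrounding whitespace trimmed (an assumption).
        return text.strip()
    before = before.strip()
    return before if before else after.strip()


for sample in ["Boston, MA", ", MA", "Boston"]:
    print(repr(sample), "->", repr(comma_transform(sample)))
```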
In the fifth and final row 318, 320, the exemplary inputs 318 and outputs 320 illustrate the ability to insert brackets around each different point of data, effectively parsing the data from a string into individual components which are ready for additional analysis, manipulation, or entry into the knowledge graph.
Per box 506, the prompt may be created and customized for each task, and sent to the model. The model utilizes the prompt for the transformation, for example as described in connection with
The model is applied to the selected data, per box 508. In this example, the selected data is a column of data extracted from the data source. The model may be called for each cell in the column to transform the data in that column. In some examples, the model may be retrained between uses or iterations. For example, the model may be retrained based on the results generated after some of or each of the data in the cells in the column is transformed, or after the entire column is processed, etc. The extracted data may be in spreadsheet format, e.g., in columns and rows, but may be in other formats also. Thus, different data may be designated for transformation, such as one or more rows or columns, or different data fields from the extracted data.
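Calling the model once per cell of a selected column can be sketched as below. The rows are represented as dictionaries keyed by column name, and `stand_in_model` is a placeholder for the call to the machine learning model; both choices, along with the function name `transform_column`, are assumptions made for illustration.

```python
def transform_column(rows, column, model):
    """Return new rows in which `model` has been applied to each cell of `column`."""
    transformed = []
    for row in rows:
        updated = dict(row)                   # leave the source rows untouched
        updated[column] = model(row[column])  # one model call per cell
        transformed.append(updated)
    return transformed


def stand_in_model(cell):
    """Placeholder for a call to a machine learning model; here it only upper-cases."""
    return cell.upper()


rows = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
cleaned = transform_column(rows, "name", stand_in_model)
print(cleaned)
```

Copying each row before updating it keeps the extracted data available for comparison against the transformed output, which is convenient for validation.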
Per box 510, the transformed data outputs are provided to the destination. In the described embodiment, the destination is a knowledge graph, and the data is loaded into the knowledge graph. However, other output destinations are also contemplated. Another example may be to provide an answer to frequently asked questions. The transform may intake a question, generate a prompt based on the question, and generate content for the response as the output. The generated response may or may not include information that appeared in the prompt. In additional examples, the transform may generate a biography based on the input of certain facts; generate a description based on the input of certain information; generate a translation of certain information; and/or generate a review response based on a review.
In some configurations, the plurality of example transformations can include: an input for a transformation; and an output for the transformation.
In some configurations, the plurality of example transformations number three.
In some configurations, the description of the task is prose. In such configurations, the illustrated method can further include executing, via the at least one processor, natural language processing (NLP) on the description of the task, resulting in parsed text, wherein the prompt further comprises the parsed text.
In some configurations, the illustrated method can further include receiving, at the computer system, feedback regarding accuracy of the execution of the task on the data using the algorithm; and retraining, via the at least one processor, the machine learning model using the feedback.
In some configurations, the machine learning model is a GPT-3 (Generative Pre-trained Transformer 3) model.
With reference to
The system bus 710 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. A basic input/output system (BIOS) stored in ROM 740 or the like, may provide the basic routine that helps to transfer information between elements within the computing device 700, such as during start-up. The computing device 700 further includes storage devices 760 such as a hard disk drive, a magnetic disk drive, an optical disk drive, tape drive or the like. The storage device 760 can include software modules 762, 764, 766 for controlling the processor 720. Other hardware or software modules are contemplated. The storage device 760 is connected to the system bus 710 by a drive interface. The drives and the associated computer-readable storage media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computing device 700. In one aspect, a hardware module that performs a particular function includes the software component stored in a tangible computer-readable storage medium in connection with the necessary hardware components, such as the processor 720, bus 710, display 770, and so forth, to carry out the function. In another aspect, the system can use a processor and computer-readable storage medium to store instructions which, when executed by a processor (e.g., one or more processors), cause the processor to perform a method or other specific actions. The basic components and appropriate variations are contemplated depending on the type of device, such as whether the device 700 is a small, handheld computing device, a desktop computer, or a computer server.
Although the exemplary embodiment described herein employs the hard disk 760, other types of computer-readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, digital versatile disks, cartridges, random access memories (RAMs) 750, and read-only memory (ROM) 740, may also be used in the exemplary operating environment. Tangible computer-readable storage media, computer-readable storage devices, or computer-readable memory devices, expressly exclude media such as transitory waves, energy, carrier signals, electromagnetic waves, and signals per se.
To enable user interaction with the computing device 700, an input device 790 represents any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech and so forth. An output device 770 can also be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems enable a user to provide multiple types of input to communicate with the computing device 700. The communications interface 780 generally governs and manages the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
The technology discussed herein refers to computer-based systems and actions taken by, and information sent to and from, computer-based systems. One of ordinary skill in the art will recognize that the inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single computing device or multiple computing devices working in combination. Databases, memory, instructions, and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.
Use of language such as “at least one of X, Y, and Z,” “at least one of X, Y, or Z,” “at least one or more of X, Y, and Z,” “at least one or more of X, Y, or Z,” “at least one or more of X, Y, and/or Z,” or “at least one of X, Y, and/or Z,” is intended to be inclusive of both a single item (e.g., just X, or just Y, or just Z) and multiple items (e.g., {X and Y}, {X and Z}, {Y and Z}, or {X, Y, and Z}). The phrase “at least one of” and similar phrases are not intended to convey a requirement that each possible item must be present, although each possible item may be present.
The various embodiments described above are provided by way of illustration only and should not be construed to limit the scope of the disclosure. Various modifications and changes may be made to the principles described herein without following the example embodiments and applications illustrated and described herein, and without departing from the spirit and scope of the disclosure. For example, unless otherwise explicitly indicated, the steps of a process or method may be performed in an order other than the example embodiments discussed above. Likewise, unless otherwise indicated, various components may be omitted, substituted, or arranged in a configuration other than the example embodiments discussed above.