Synthetic data is generated for various reasons. For example, synthetic data may be generated for use in performance benchmarking in place of actual customer data to ensure privacy for customers. Depending on the implementation, a workload utilizing synthetic data may require thousands, millions, or greater numbers of data points. Furthermore, some workloads require synthetic data that includes coherent sentences, such as simulating a doctor's recommendation to a patient or comments the patient made to the doctor.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments are described herein for synthetic data generation utilizing generative artificial intelligence and scalable data generation tools. In an aspect, a prompt comprising a domain is provided to a generative artificial intelligence (AI) model (e.g., a large language model (LLM) or other type of generative AI model). A data parameter associated with the domain that specifies a boundary for synthetic values in a column of data is received from the generative AI model. An argument comprising the data parameter is provided to a scalable data generation tool (SDGT) configured to generate data based on the data parameter. Scaled data is received from the SDGT. The scaled data comprises a column of synthetic data values wherein each synthetic data value is within the boundary specified by the data parameter. An emulated workload is caused to utilize synthetic data comprising the scaled data to generate a performance benchmark for the domain.
In a further aspect, synthetic sentence data is generated. Responsive to the prompt provided to the generative AI model, training data generated by the LLM based on the prompt is received. A lightweight model is trained to generate synthetic sentences based on the received training data. Synthetic sentence data is received from the lightweight model. The synthetic sentence data is appended to the scaled data to generate the synthetic data.
In a further aspect, a schema file comprising metadata associated with the domain is received. The prompt is generated based on the schema file.
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments and, together with the description, further serve to explain the principles of the embodiments and to enable a person skilled in the pertinent art to make and use the embodiments.
The subject matter of the present application will now be described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.
The following detailed description discloses numerous example embodiments. The scope of the present patent application is not limited to the disclosed embodiments, but also encompasses combinations of the disclosed embodiments, as well as modifications to the disclosed embodiments. It is noted that any section/subsection headings provided herein are not intended to be limiting. Embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, embodiments disclosed in any section/subsection may be combined with any other embodiments described in the same section/subsection and/or a different section/subsection in any manner.
Embodiments of the present disclosure relate to generation of synthetic data, e.g., for use in benchmark testing a system or data platform. Synthetic data, or “synthetic seed data,” may be used to represent real datasets rather than utilizing actual customer data for various reasons, e.g., to hide confidential information, to protect customer privacy, etc. A user or an organization utilizes synthetic data to test how a system or data platform operates on datasets. The amount of synthetic data required for testing such systems or data platforms may include thousands, millions, or greater numbers of data points. Furthermore, users or organizations may operate in various domains or fields (e.g., a healthcare domain, a finance domain, a technology domain, an energy domain, etc.). In this context, the user or organization may desire testing their system or data platform against synthetic data related to their domain.
Embodiments of the present disclosure leverage generative artificial intelligence (AI) models to generate data parameters for use in generating synthetic data, and in particular synthetic data related to a domain. A generative AI model is a model that generates content that is complex, coherent, and/or original. For instance, a generative AI model can create sophisticated sentences, lists, ranges, tables of data, images, essays, and/or the like. An example of a generative AI model is a language model. A language model is a model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens. In this context, a "token" is an atomic unit that the model trains on and makes predictions on. A token may be a word, a character (e.g., an alphanumeric character, a blank space, a symbol, etc.), or a sub-word (e.g., a root word, a prefix, or a suffix). In other types of models (e.g., image-based models), a token may represent another kind of atomic unit (e.g., a subset of an image).
A large language model (LLM) is a language model that has a high number of model parameters. For instance, an LLM may have millions, billions, trillions, or even greater numbers of model parameters. Model parameters of an LLM are the weights and biases the model learns during training. An LLM is (pre-)trained using self-supervised learning and/or semi-supervised learning. For instance, an LLM may be trained by exposing the LLM to (e.g., large amounts of) text (e.g., predetermined datasets, books, articles, text-based conversations, webpages, transcriptions, forum entries, and/or any other form of text and/or combinations thereof). Training data may be provided from a database, from the Internet, from a system, and/or the like. Furthermore, an LLM may be fine-tuned using Reinforcement Learning from Human Feedback (RLHF), where the LLM is provided the same input twice to produce two different outputs, and a user ranks which output is preferred. In this context, the user's ranking is utilized to improve the model. Further still, an LLM may be trained to perform in various styles, e.g., as a completion model (a model that is provided a few words or tokens and generates words or tokens to follow the input), as a conversation model (a model that provides an answer or other type of response to a conversation-style prompt), as a combination of a completion and conversation model, or as another type of LLM model.
Some implementations of LLMs are transformer-based LLMs (e.g., the family of generative pre-trained transformer (GPT) models). A transformer is a neural network architecture that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings (e.g., without relying on convolutions or recurrent neural networks). Additional details regarding transformer-based LLMs are described with respect to FIG. 13.
In some implementations of synthetic data generation, generative AI, such as LLMs, is used to generate the synthetic seed data to represent real datasets rather than utilizing actual customer data (e.g., due to privacy regulations, confidential information, etc.). However, as noted above and depending on the implementation, a workload utilizing synthetic data may require thousands, millions, or greater numbers of data points. Generative AI alone may require a long period of time and/or a large number of resources to generate synthetic data at this scale. In an alternative implementation of synthetic data generation, a scalable data generation tool (SDGT) is used to quickly generate large amounts of synthetic data. However, an SDGT may lack customizability without direct user input, thereby requiring a user to write lengthy arguments that may take a long time to produce for a workload. Furthermore, a new or unique workload would require a new argument to be written for the SDGT.
Methods, systems, and computer-readable storage media described herein leverage generative AI models to generate data parameters that are provided to an SDGT for use in generating synthetic data. For example, in an embodiment, a prompt related to a domain is provided to a generative AI model (such as an LLM). The LLM generates data parameters associated with the domain. Each data parameter specifies a boundary for synthetic values in to-be-generated synthetic data. Example data parameters include, but are not limited to, a range data parameter that specifies a first range subset (e.g., a range of values to select as a start of a range) and a second range subset (e.g., a range of values greater than or subsequent to the first range subset to select as the end of the range), a numeric data parameter that specifies a range of numbers, and a categorical data parameter that specifies a categorical list of elements. An argument comprising at least one of the data parameters is provided to an SDGT configured to generate data based on the data parameter. The SDGT generates scaled data comprising a column of synthetic data values. Each of the synthetic data values is within a boundary specified by the data parameter in the argument. By leveraging a generative AI model and an SDGT in this manner, embodiments described herein are able to generate large amounts of synthetic data catered to a particular domain at a faster rate and utilizing fewer compute resources.
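To illustrate this flow, consider the following minimal Python sketch; the function names, the fixed bounds, and the use of random sampling in place of an actual SDGT are illustrative assumptions rather than an implementation of the embodiments:

import random

# get_min_max( ) stands in for a call to the generative AI model; in an actual
# embodiment it would prompt the model for a data parameter for the domain.
def get_min_max(domain, column):
    return (0.0, 120.0)  # hypothetical bounds for a "patient age" column

# generate_column( ) stands in for the SDGT; every generated value falls
# within the boundary specified by the data parameter.
def generate_column(bounds, num_rows):
    min_val, max_val = bounds
    return [random.uniform(min_val, max_val) for _ in range(num_rows)]

bounds = get_min_max("healthcare", "patient age")
scaled_data = generate_column(bounds, 1_000_000)

Because the generative AI model is consulted once per data parameter rather than once per data value, the row count may scale to millions without additional model calls.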
In some implementations, a user or organization may require synthetic data that includes coherent sentences, such as simulating a doctor's recommendation to a patient or comments the patient made to the doctor. Embodiments described herein may generate "synthetic sentence data" in various ways. For instance, and as described with respect to FIGS. 7-9D, a lightweight model may be trained on example sentences generated by a generative AI model and then utilized to generate synthetic sentences at scale.
Systems, devices, and apparatuses may be configured in various ways for generating synthetic data utilizing generative AI. For example, FIG. 1 shows a system for generating synthetic data utilizing generative AI, in accordance with an example embodiment.
Storage 126 stores data used by and/or generated by computing device 102, model server 120, SDGT server 122, emulator server 124, and/or components thereof and/or services executing thereon. For instance, as shown in FIG. 1, storage 126 stores synthetic data 128 generated by synthetic data generator 112.
Computing device 102 may be any type of stationary or mobile processing device, including, but not limited to, a desktop computer, a server, a mobile or handheld device (e.g., a tablet, a personal data assistant (PDA), a smart phone, a laptop, etc.), an Internet-of-Things (IoT) device, etc. In accordance with an embodiment, computing device 102 is associated with a user (e.g., an individual user, a group of users, an organization, a family user, a customer user, an employee user, an admin user (e.g., a service team user, a developer user, a management user, etc.), etc.). Computing device 102 is configured to execute an application 110 and a synthetic data generator 112. As shown in FIG. 1, computing device 102 communicates with model server 120, SDGT server 122, and emulator server 124 (e.g., over one or more networks).
Model server 120, SDGT server 122, and emulator server 124 are network-accessible servers (or other types of computing devices). In accordance with an embodiment, one or more of model server 120, SDGT server 122, and emulator server 124 are incorporated in a network-accessible server set (e.g., a cloud-based environment, an enterprise network server set, and/or the like). Furthermore, as shown in FIG. 1, model server 120 executes generative AI model 104, SDGT server 122 executes SDGT 106, and emulator server 124 executes workload emulator 108.
Application 110 comprises an application configured to utilize synthetic data generator 112 to generate synthetic data and cause the execution of workloads. For example, application 110 may be an application for benchmark testing data platforms and/or generating synthetic data for use in benchmark testing data platforms. Application 110 in accordance with an embodiment sends information ("schema information") to synthetic data generator 112 to cause the generation of synthetic data. Alternatively, application 110 receives schema information from another computing device (not pictured in FIG. 1).
Synthetic data generator 112 is configured to generate synthetic data (e.g., synthetic data 128) and cause emulation of workloads. Synthetic data generator 112 may be a service executed by computing device 102 or implemented by application 110. Optionally, logic for performing some or all of the functions of synthetic data generator 112 may be imported into a computer program (e.g., as a library). As shown in FIG. 1, synthetic data generator 112 comprises a prompter 114 and a data handler 116.
Data handler 116 comprises logic for obtaining data parameters generated by generative AI model 104 (e.g., from prompter 114, from generative AI model 104 (e.g., as an API call response), from storage 126, and/or the like), generating an argument comprising one or more data parameters generated by generative AI model 104, transmitting an argument to SDGT 106 (e.g., as an API call of SDGT 106), receiving responses from SDGT 106 (e.g., as an API call response), causing workload emulator 108 to generate a performance benchmark utilizing data generated by SDGT 106, training a lightweight model to generate synthetic sentence data (e.g., as described further with respect to FIGS. 7-9D), and/or the like.
Generative AI model 104 is configured to generate data parameters based on a received prompt. Generative AI model 104 may be any type of generative AI model capable of generating data parameters based on prompts received from synthetic data generator 112. In accordance with an embodiment, generative AI model 104 is an LLM. Generative AI model 104 may be trained using public information (e.g., information collected and/or scrubbed from the Internet) and/or data stored by an administrator of model server 120 (e.g., stored in memory of model server 120 and/or an external data store). In accordance with an embodiment, generative AI model 104 is an "off the shelf" model trained to generate complex, coherent, and/or original content based on (e.g., any) prompt. In an alternative embodiment, generative AI model 104 is a specialized model trained to generate data parameters for a domain based on prompts. Example code excerpts including prompts and API calls transmitted to generative AI model 104 are described with respect to FIGS. 5A-5G.
SDGT 106 is configured to generate scaled data based on a received argument. In accordance with an embodiment, SDGT 106 is a non-AI scalable data generation tool. SDGT 106 includes logic for generating scaled data based on data parameters included in received arguments. In accordance with an embodiment, SDGT 106 comprises one or more functions that generate scaled data based on one or more data parameters. For instance, SDGT 106 in accordance with an embodiment comprises a function that selects a number and/or date from a range parameter. In another example embodiment, SDGT 106 comprises a function that selects (e.g., randomly) an element (e.g., text, a phrase, a number, etc.) from a categorical list of elements. SDGT 106 is configured to generate hundreds, thousands, millions, or even greater numbers of synthetic data values based on received arguments. In accordance with an embodiment, each synthetic data value is within a boundary specified by a respective data parameter. In accordance with an embodiment, SDGT 106 generates scaled data as a table of synthetic data values, wherein columns of synthetic data values correspond to respective data parameters generated by generative AI model 104. An example table of synthetic data values is described with respect to FIG. 10.
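As a non-limiting illustration, such generation functions might resemble the following Python sketch; the function names and signatures are hypothetical stand-ins for the functions of SDGT 106:

import random
from datetime import datetime, timedelta

# Hypothetical stand-in for a function that selects a datetime from a range parameter.
def pick_from_range(start, end):
    span = (end - start).total_seconds()
    return start + timedelta(seconds=random.uniform(0, span))

# Hypothetical stand-in for a function that selects (e.g., randomly) an element
# from a categorical list of elements.
def pick_from_categorical(elements):
    return random.choice(elements)

# Columns scale to large row counts without further calls to a generative AI model.
departments = [pick_from_categorical(["Cardiology", "Oncology", "Pediatrics"])
               for _ in range(100000)]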
Workload emulator 108 is configured to emulate workloads utilizing synthetic data to generate a performance benchmark for a domain. In accordance with an embodiment, workload emulator 108 performs and/or otherwise manages a load test to evaluate the performance of a system and/or data platform based on synthetic data generated by synthetic data generator 112 (e.g., synthetic data 128). A workload may comprise one or more tasks and/or sub-tasks the system and/or data platform is to perform. By utilizing synthetic data, the performance of the system and/or data platform may be tested without exposing confidential and/or private information. Furthermore, by utilizing synthetic data generator 112, the synthetic data is catered to the domain the system and/or data platform is used for, thereby increasing the accuracy and quality of performance benchmarks.
Synthetic data generator 112 may be configured to generate synthetic data in various ways, in embodiments. For example, FIG. 2 shows a block diagram of synthetic data generator 112 interfacing with generative AI model 104, SDGT 106, and workload emulator 108, and FIG. 3 shows a flowchart 300 of a process for generating synthetic data, in accordance with example embodiments.
Flowchart 300 begins with step 302. In step 302, a prompt comprising a domain is provided to an LLM. For example, prompter 114 provides a prompt 226 comprising a domain to generative AI model 104. Prompter 114 in accordance with an embodiment is configured to (e.g., automatically) generate prompt 226 based on schema information 224 received from application 110. Alternatively, prompter 114 obtains schema information from a data store (e.g., storage 126 of FIG. 1).
In step 304, a data parameter associated with the domain is received from the LLM. The data parameter specifies a boundary for synthetic values in a column of data. For example, argument provider 220 receives response 228 comprising a data parameter associated with the domain specified in prompt 226. Alternatively, prompter 114 receives response 228 and provides the data parameter included therein to argument provider 220. In either scenario, the data parameter specifies a boundary for synthetic values in a column of data to be generated by SDGT 106. In accordance with an embodiment, argument provider 220 receives multiple data parameters at a time from generative AI model 104. Data parameters may be dependent on each other (e.g., start and end data parameters for generating a range of dates, minimum and maximum values for numeric data, two or more categorical lists of elements that depend on each other, etc.).
In step 306, an argument comprising the data parameter is provided to an SDGT configured to generate data based on the data parameter. For example, argument provider 220 provides an argument 230 comprising the data parameter received in response 228 to SDGT 106 to cause SDGT 106 to generate data based on the data parameter. In accordance with an embodiment, argument 230 is provided to SDGT 106 in an API call. In accordance with another embodiment, SDGT 106 is a library of code imported into synthetic data generator 112 and argument 230 is a function call of the library. In accordance with an embodiment, argument 230 comprises multiple data parameters (e.g., dependent data parameters, ranges of data parameters, etc.). Furthermore, argument 230 in accordance with an embodiment specifies how many rows of data SDGT 106 is to generate based on the provided data parameter(s). Additional details regarding providing arguments to an SDGT are described with respect to FIGS. 6A-6D.
In step 308, scaled data comprising a column of synthetic data values is received from the SDGT. Each synthetic data value is within the boundary specified by the data parameter. For example, synthetic data handler 222 receives scaled data 232 from SDGT 106. Scaled data 232 comprises one or more columns of synthetic data values generated by SDGT 106. The synthetic data values generated based on arguments comprising data parameters are within the boundary specified by the respective data parameter. In some embodiments, scaled data 232 comprises data generated without data parameters generated by generative AI model 104. In some embodiments, SDGT 106 generates columns of data and appends them to generate scaled data 232 as a table based on multiple arguments and/or sub-arguments provided by argument provider 220. Alternatively, synthetic data handler 222 receives scaled data 232 and appends the columns to a table maintained by synthetic data handler 222 to generate synthetic data (e.g., synthetic data 128). In accordance with an embodiment, synthetic data handler 222 stores generated synthetic data in a data store (e.g., in storage 126 of FIG. 1).
In step 310, an emulated workload is caused to utilize synthetic data comprising the scaled data to generate a performance benchmark for the domain. For example, synthetic data handler 222 provides a workload call 234 to workload emulator 108 to cause workload emulator 108 to emulate a workload utilizing synthetic data comprising scaled data 232 to generate a performance benchmark 238 for the domain included in prompt 226. In accordance with an embodiment, performance benchmark 238 is provided to application 110 for further analysis, review, and/or display in a GUI of application 110.
Synthetic data generator 112 may operate in various ways to cause an SDGT to generate scaled data, in embodiments. For instance, FIG. 4A shows a flowchart 400A of a process for causing an SDGT to generate scaled data based on a range data parameter, in accordance with an embodiment.
Flowchart 400A includes step 402. In step 402, the SDGT is caused to select a value within a first range subset and a value within a second range subset to generate the scaled data. For example, argument provider 220 provides argument 230 to SDGT 106 to cause SDGT 106 to select a value within a first range subset and a value within a second range subset to generate scaled data 232. In this context, the data parameter included in argument 230 is a range parameter that specifies the first and second range subsets. The range subsets may be ranges of any type of numeric data (e.g., datetimes, integers, decimals, etc.). Further details regarding causing SDGT 106 to select values within first and second range subsets are described with respect to FIG. 6B.
As described with respect to FIG. 4A, an SDGT may be caused to generate scaled data based on a range data parameter. Alternatively, or additionally, an SDGT is caused to generate scaled data based on a categorical data parameter. For instance, FIG. 4B shows a flowchart 400B of a process for causing an SDGT to generate scaled data based on a categorical data parameter, in accordance with an embodiment.
Flowchart 400B includes step 404. In step 404, the SDGT is caused to select an element from a list of elements to generate the scaled data. For example, argument provider 220 provides argument 230 to SDGT 106 to cause SDGT 106 to select an element from a list of elements to generate scaled data 232. In this context, the data parameter included in argument 230 specifies a categorical list of elements. Further details regarding causing SDGT 106 to select from a categorical list of elements are described with respect to FIGS. 6C and 6D.
Synthetic data generator 112 is described as utilizing a generative AI model (e.g., generative AI model 104) and an SDGT (e.g., SDGT 106) to generate synthetic data. In accordance with an embodiment, synthetic data generator 112 comprises logic that, when executed, transmits API calls to (and/or executes functions of) generative AI model 104 to generate data parameters and/or to SDGT 106 to generate scaled data. To better understand the operation of synthetic data generator 112 interfacing with generative AI model 104 to generate data parameters, synthetic data generator 112 of FIG. 2 is described below with respect to example code fragments shown in FIGS. 5A-5G.
Depending on the implementation, synthetic data generator 112 may generate synthetic data for a domain based on categories (e.g., topics) included in schema information (e.g., a schema file). Alternatively, synthetic data generator 112 leverages generative AI model 104 to generate categories. Consider the example code fragment 500A shown in FIG. 5A, which includes a function 502 that, when executed, causes generative AI model 104 to generate a list of topics for a domain.
Prompt block 504 creates a prompt to be provided to generative AI model 104. In prompt block 504, a prompt is defined that tasks generative AI model 104 with generating a comma separated list of twenty topics that a dataset may include for a particular category. In other words, the prompt requests generative AI model 104 to generate topics for a domain. To generate the prompt, a system prompt template for a system role is generated using the SystemMessagePromptTemplate.from_template(template) function. A human prompt template for a human role is also generated, which includes a declaration of a “human_template”:
human_template = "{company_type}"
In this declaration, "{company_type}" refers to the domain input in the function call of function 502. The system prompt template and the human prompt template are combined in a chat prompt template declared as "chat_prompt."
In chain definition 506, an LLMChain defines the generative AI model that will be called to generate the list of categories (e.g., generative AI model 104), the prompt that will be provided to the model, and that the output is to be in a comma separated list. In run block 508, the chain runs with "company_type" passed as an argument thereof. This causes the company_type in a call of function 502 to be passed into the human prompt of the chat prompt created in prompt block 504. For instance, if function 502 were called using an execution of "generate_company_data_topics(generative AI model 104, healthcare)", a prompt would be provided to generative AI model 104 to cause generative AI model 104 to generate a comma separated list of 20 topics a dataset for a healthcare domain might have.
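The following Python sketch illustrates a function in the style of function 502, prompt block 504, chain definition 506, and run block 508, assuming the LangChain library is used as suggested by the function names above; the exact prompt wording is an illustrative assumption:

from langchain.chains import LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

def generate_company_data_topics(llm, company_type):
    # Prompt block: a system prompt tasking the model with generating topics,
    # combined with a human prompt carrying the domain.
    template = ("Generate a comma separated list of 20 topics that a dataset "
                "for the given type of company might include. Only print the list.")
    system_prompt = SystemMessagePromptTemplate.from_template(template)
    human_prompt = HumanMessagePromptTemplate.from_template("{company_type}")
    chat_prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

    # Chain definition and run: the domain is passed into the human prompt.
    chain = LLMChain(llm=llm, prompt=chat_prompt)
    output = chain.run(company_type=company_type)
    return [topic.strip() for topic in output.split(",")]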
Thus, an example code fragment for generating data topics for a domain has been described with respect to FIG. 5A.
Synthetic data generator 112 of FIG. 2 may also utilize generative AI model 104 to determine minimum and maximum values for a datatype. For example, consider code fragments 500B and 500C shown in FIGS. 5B and 5C, respectively, which include a function 510 for determining the minimum and maximum values of a datatype passed thereto as an argument.
Prompt block 512 defines a prompt that requests a minimum and maximum value of the datatype passed to function 510. Prompt block 512 includes an API call to generative AI model 104. In the context of prompt block 512, the API call does not use prompt pre-processing (e.g., the pre-processing in prompt block 504 of FIG. 5A).
Split block 514, post-processing blocks 516A and 516B, and test block 518 prepare the response generated by generative AI model 104 in an expected format. For instance, split block 514 splits the response from generative AI model 104 based on commas in the response. Since prompt block 512 prompted generative AI model 104 to generate the output as a list, the output should be a first number and a second number separated by a comma. Post-processing blocks 516A and 516B loop through the output of split block 514 to clean the text until only numbers within the text remain. Test block 518 tests the post-processed output to see if two values (e.g., a minimum value and a maximum value) remain. If so, test block 518 outputs the values.
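A minimal Python sketch of this post-processing follows; the regular expression used for cleaning is an illustrative assumption standing in for post-processing blocks 516A and 516B:

import re

def parse_min_max(response):
    parts = response.split(",")                  # split block 514
    cleaned = []
    for part in parts:                           # post-processing blocks 516A and 516B
        digits = re.sub(r"[^0-9.\-]", "", part)  # clean the text until only numbers remain
        if digits:
            cleaned.append(float(digits))
    if len(cleaned) == 2:                        # test block 518
        return cleaned[0], cleaned[1]            # (minimum value, maximum value)
    return None                                  # unexpected format; caller may retry

print(parse_min_max(" -2147483648, 2147483647 "))  # prints (-2147483648.0, 2147483647.0)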
In embodiments, a datatype is assigned to categories (e.g., the categories generated utilizing code fragment 500A, categories indicated in a schema file, and/or the like). In some embodiments, a schema file indicates the datatype assigned to a category. Alternatively, synthetic data generator 112 and/or generative AI model 104 assigns the datatype to the category.
Code fragments 500B and 500C have been described with respect to determining minimum and maximum values for datatypes; however, it is also contemplated herein that synthetic data generator 112 may be configured to determine a minimum and maximum value for a particular category. In this context, function 510 may be modified to accept the datatype and the category name as arguments and the prompt in prompt block 512 may be modified to prompt generative AI model 104 to determine a minimum and maximum value for the particular category. For instance, an example modified prompt is provided as follows:
prompt = "what is the min and max value of " + category_type + "? Provide the values in " + datatype + " format. Only print the values and print them in a list."
In this context, “category_type” refers to the category (e.g., a category of column) that the values are to be generated for. For instance, as a non-limiting example, suppose category_type is the weight of patients. In this context, the prompt asks generative AI model 104 to determine a minimum value and a maximum value of a patient's weight. By tailoring minimum and maximum values based on category types, synthetic data generator 112 improves the relevancy of data values in synthetic data. For instance, the minimum and maximum values for a patient's weight may be different than the minimum and maximum values for a patient's height or blood pressure.
As discussed herein, synthetic data generator 112 of FIG. 2 may generate range data parameters (e.g., ranges of dates). For example, consider code fragment 500D shown in FIG. 5D, which includes a function 520 for generating a range date parameter utilizing generative AI model 104.
Prompt block 522 creates a prompt to be provided to generative AI model 104. In prompt block 522, a prompt is defined that tasks generative AI model 104 with generating three consecutive datetimes in a "month/day/year hours:minutes:seconds" format. Prompt block 522 includes similar prompt pre-processing as prompt block 504 of code fragment 500A of FIG. 5A.
Call block 524 includes a definition of the variable "chain" as an LLMChain function call to generative AI model 104 with the chat prompt of prompt block 522 and an output format as a comma separated list. Call block 524 also includes a call that executes the LLMChain function and passes a month and a year as arguments. This causes the month and year (January 2023 in FIG. 5D) to be passed into the human prompt of the chat prompt created in prompt block 522, which causes generative AI model 104 to generate three consecutive datetimes within the specified month and year.
Code fragment 500D is described with respect to generating a range date parameter based on a month and date provided as arguments in call block 524. However, it is also contemplated herein that the prompt in prompt block 522 may be modified to generate range date parameters based on other time and/or date information as well (e.g., based on a particular day, a range of days, a year, a range of years, a range of hours, a time of day). Furthermore, while the format of the datetimes in code fragment 500D include the month, day, year, hour, minute, and seconds of the datetime, embodiments described herein may use datetimes with less or more information (e.g., only dates, only time, without the year, without seconds, etc.). Moreover, while code fragment 500D is described with respect to generating a range data parameter for a range of dates, embodiments described herein are not so limited. For instance, synthetic data generator 112 may include logic that generates a range data parameter for any type of numeric range. In this context, an argument can be passed to an SDGT to generate synthetic data comprising a beginning value for a range (e.g., a start value or a minimum value) and an ending value for a range (e.g., an end value or a maximum value).
As discussed herein, synthetic data generator 112 of FIG. 2 may generate categorical data parameters. For example, consider code fragment 500E shown in FIG. 5E, which includes a function 530 for generating a categorical list of elements utilizing generative AI model 104.
Prompt block 532 creates a prompt to be provided to generative AI model 104. In prompt block 532, a prompt is defined that tasks generative AI model 104 with generating a categorical list of elements. The prompt specifies the number of elements to be generated for a category based on the number in the argument of function 530 (e.g., the number assigned to the "num_cat" variable) and the maximum length (in characters) for the word specified in the argument of function 530 (e.g., the number assigned to the "max_length" variable). In accordance with an embodiment, the variables num_cat and max_length are specified in schema information provided to synthetic data generator 112. In this context, num_cat and/or max_length may be assigned to the specific category or in general (e.g., to all categories). Alternatively, the variables num_cat and/or max_length are assigned to a category by generative AI model 104 (e.g., in cases where generative AI model 104 generates categories for columns of data). In another alternative, synthetic data generator 112 comprises logic to determine a number of elements to be generated for a category and/or the maximum length of an element. Prompt block 532 comprises similar prompt pre-processing as prompt block 504 of FIG. 5A and prompt block 522 of FIG. 5D.
Call block 534 includes a definition of the variable "chain" as an LLMChain function call to generative AI model 104 with the chat prompt of prompt block 532 and an output format as a comma separated list. Call block 534 also includes a call that executes the LLMChain function and passes a category type as an argument. This causes the category type in the argument of function 530 ("random topic" in FIG. 5E) to be passed into the human prompt of the chat prompt created in prompt block 532, which causes generative AI model 104 to generate a categorical list of elements for the category.
In some embodiments, post-processing steps may be performed to refine the categorical list generated by executing function 530 of FIG. 5E. For example, consider code fragment 500F shown in FIG. 5F.
Call block 538 includes a function call to function 530 of code fragment 500E, as described with respect to FIG. 5E, and assigns the output to a "word_options" variable. In accordance with an embodiment, call block 538 requests more elements than the number to be generated (e.g., as a buffer against empty strings in the output).
Post-processing block 540 checks for empty strings in word_options and adds non-empty strings to a final list assigned to a "my_word_list" variable. Words are added to the my_word_list variable until there are no more words in the word_options variable or the total number of words added to the my_word_list variable is equal to the number of elements to be generated ("num_cat" in code fragment 500F).
Test block 542 performs a double check for empty strings in case the buffer was insufficient and returns the list stored in the my_word_list variable. In test block 542, if there are empty strings in my_word_list, an "add_words( )" function is called, passing my_word_list, the generative AI model, the number of empty strings, the category type, and the maximum length of an element as arguments. The add_words( ) function invokes another execution of function 530 and appends the output to the my_word_list variable. After test block 542 updates the my_word_list variable, another check is made to see if any of the strings are empty. If not, the while loop is broken and the return statement returns the my_word_list variable. Otherwise, the loop repeats and the add_words( ) function is called again.
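The following Python sketch paraphrases the logic of post-processing block 540 and test block 542; generate_list( ) is a hypothetical stand-in for function 530, and the buffer size is an assumption:

def build_word_list(generate_list, llm, num_cat, category_type, max_length):
    # Request extra elements as a buffer against empty strings in the output.
    word_options = generate_list(llm, num_cat + 5, category_type, max_length)

    my_word_list = []                      # post-processing block 540
    for word in word_options:
        if word.strip():                   # add only non-empty strings
            my_word_list.append(word.strip())
        if len(my_word_list) == num_cat:
            break

    while len(my_word_list) < num_cat:     # test block 542: buffer was insufficient
        missing = num_cat - len(my_word_list)
        extra = generate_list(llm, missing, category_type, max_length)  # add_words( )
        my_word_list.extend(w.strip() for w in extra if w.strip())

    return my_word_list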
As discussed herein, synthetic data generator 112 of FIG. 2 may generate categorical lists of elements that depend on one another. For example, consider code fragment 500G shown in FIG. 5G, which includes a function 544 for generating pairs of elements utilizing generative AI model 104.
Prompt block 546 creates a prompt to be provided to generative AI model 104. In prompt block 546, a prompt is defined that tasks generative AI model 104 with generating pairs of words based on instructions received in an API call. The prompt specifies a maximum length (in characters) for the words specified in the argument of function 544 (e.g., the number assigned to the "max_length" variable). Max_length may be specified in schema information provided to synthetic data generator 112, assigned by synthetic data generator 112, and/or assigned by generative AI model 104 (e.g., in the generation of a category). Prompt block 546 comprises similar prompt pre-processing as prompt block 504 of FIG. 5A and prompt block 522 of FIG. 5D.
Call block 548 includes a definition of the variable "chain" as an LLMChain function call to generative AI model 104 with the chat prompt of prompt block 546 and an output format as a comma separated list. Call block 548 also includes a definition for the variable "info" as a string including the number of requested element pairs and keywords passed as arguments of function 544. Furthermore, call block 548 includes a call that executes the LLMChain function and passes the variable info as an argument. This causes the number of requested element pairs and keywords in the argument of function 544 to be passed into the human prompt of the chat prompt created in prompt block 546, which causes generative AI model 104 to generate pairs of values as pairs in a tuple. For instance, suppose the number of element pairs requested was twenty and the keywords were diseases and symptoms. In this context, execution of "chain.run(info)" causes generative AI model 104 to generate twenty pairs of diseases and corresponding symptoms.
Post-processing block 550 processes the output of call block 548 and splits it into two separate lists. In particular, the function "clean_two_columns( )" cleans the output (e.g., removing punctuation, performing other cleaning post-processing functions, etc.) and the function "convert_two_columns( )" splits the list of values generated by generative AI model 104 into two columns of data. In this context, each two successive values in the list form a pair that depend on each other, wherein execution of the function convert_two_columns( ) returns the pairs of values as two separate lists that are used as a dictionary, where values of the first column are keys to corresponding values in the second column. For instance, with reference to the value_pairs dictionary, a value in the second column may be retrieved by executing the expression:
value_pairs[requestedkey]
where requestedkey is the disease (e.g., the "key" of the value_pairs dictionary). For instance, if a disease "Disease A" was mapped to a symptom "Symptom A", executing the expression value_pairs["Disease A"] would return "Symptom A". A further example implementing such a dictionary is described with respect to FIG. 6D.
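A minimal Python sketch of this two-column post-processing follows; the names clean_two_columns( ) and convert_two_columns( ) come from post-processing block 550, but the bodies shown here are illustrative assumptions:

def clean_two_columns(raw):
    # Remove stray punctuation and whitespace from the comma separated output.
    return [item.strip().strip(".") for item in raw.split(",") if item.strip()]

def convert_two_columns(values):
    # Each two successive values form a dependent pair: (key, value).
    keys = values[0::2]    # first column, e.g., diseases
    vals = values[1::2]    # second column, e.g., corresponding symptoms
    return dict(zip(keys, vals))

value_pairs = convert_two_columns(
    clean_two_columns("Disease A, Symptom A, Disease B, Symptom B"))
print(value_pairs["Disease A"])  # prints "Symptom A"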
While code fragment 500G of FIG. 5G is described with respect to generating two dependent categorical lists, embodiments described herein are not so limited; for instance, larger groups of dependent elements may be generated in a similar manner.
Thus, example code fragments for generating categories and data parameters have been described with respect to synthetic data generator 112 of FIG. 2. Example code fragments for causing an SDGT (e.g., SDGT 106) to generate columns of scaled data based on such data parameters are described as follows with respect to FIGS. 6A-6D.
As discussed with respect to FIG. 2, argument provider 220 provides arguments comprising data parameters to SDGT 106 to cause SDGT 106 to generate scaled data. For example, consider code fragment 600A shown in FIG. 6A, which includes a function 602 for generating a column of numeric data based on a numeric data parameter.
Initialization block 604 initializes the variable "sdgt" as the function call to SDGT 106, initializes the "values" variable as an empty array, sets the value of the "min_val" variable to the first value of the numeric data parameter passed to function 602 as an argument thereof (e.g., "−2147483648" in FIG. 6A), and sets the value of the "max_val" variable to the second value of the numeric data parameter.
Call block 606 generates a random set of numbers between the minimum and maximum values of the numeric data parameter utilizing SDGT 106. For instance, for each of the number of rows of data to be generated (e.g., 50 in FIG. 6A), call block 606 causes SDGT 106 to select a random number between min_val and max_val and appends the selected number to the values array.
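The following Python sketch illustrates the logic of initialization block 604 and call block 606; random.randint( ) is an illustrative stand-in for the number-selection function of SDGT 106:

import random

def generate_numeric_column(numeric_parameter, num_rows):
    min_val, max_val = numeric_parameter   # initialization block 604
    values = []
    for _ in range(num_rows):              # call block 606: one value per row
        values.append(random.randint(min_val, max_val))  # within the boundary
    return values

column = generate_numeric_column((-2147483648, 2147483647), 50)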
Thus, code fragment 600A has been described with respect to generating a column of numeric data based on a data parameter. The example shown in FIG. 6A generates a column of integers; however, columns of other numeric datatypes (e.g., decimals, datetimes, etc.) may be generated in a similar manner.
As discussed with respect to FIGS. 4A and 5D, synthetic data generator 112 may generate range data parameters (e.g., consecutive datetimes). For example, consider code fragment 600B shown in FIG. 6B, which generates columns of dependent start and end datetimes utilizing SDGT 106.
Parameter generation block 612 generates range data parameters for SDGT 106. As shown in FIG. 6B, parameter generation block 612 obtains consecutive datetimes generated by generative AI model 104 (e.g., by calling function 520 of FIG. 5D) for use as first and second range subsets.
Argument block 614 is configured to generate the "start" datetime for a range of dates and argument block 616 is configured to generate the "end" datetime for the range. For each row to be generated (50 in FIG. 6B), argument block 614 causes SDGT 106 to select a datetime within the first range subset as the start datetime, and argument block 616 causes SDGT 106 to select a datetime within the second range subset (i.e., subsequent to the start datetime) as the end datetime.
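A minimal Python sketch of argument blocks 614 and 616 follows, assuming the range date parameter comprises three consecutive datetimes d1 < d2 < d3 generated by the model; the uniform sampling is an illustrative assumption:

import random
from datetime import datetime, timedelta

def random_between(a, b):
    return a + timedelta(seconds=random.uniform(0, (b - a).total_seconds()))

def generate_start_end_columns(d1, d2, d3, num_rows):
    starts, ends = [], []
    for _ in range(num_rows):
        starts.append(random_between(d1, d2))  # argument block 614: first range subset
        ends.append(random_between(d2, d3))    # argument block 616: second range subset
    return starts, ends  # each end datetime is subsequent to its start datetime

starts, ends = generate_start_end_columns(
    datetime(2023, 1, 1), datetime(2023, 1, 15), datetime(2023, 1, 31), 50)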
While code fragment 600B is described with respect to generating two columns with dependent start and end dates, embodiments described herein are not so limited. For instance, synthetic data generator 112 may include logic that utilizes SDGT 106 to generate range-dependent numeric data values of any type. In this context, an argument can be passed to SDGT 106 to generate synthetic data comprising a beginning value for a range (e.g., a start value or a minimum value) and an ending value for the range (e.g., an end value or a maximum value).
As discussed with respect to FIGS. 4B and 5E, synthetic data generator 112 may generate categorical data parameters. For example, consider code fragment 600C shown in FIG. 6C, which generates a column of categorical data utilizing SDGT 106.
Parameter generation call 622 generates a categorical data parameter for SDGT 106. As shown in FIG. 6C, parameter generation call 622 causes generative AI model 104 to generate a categorical list of elements (e.g., by calling function 530 of FIG. 5E).
Check block 624 is configured to check if the categorical list of elements generated by generative AI model 104 has been used in another table. As shown in FIG. 6C, check block 624 compares the generated list to categorical lists used in previously generated tables.
Argument block 626 is configured to place a function call to SDGT 106 and return the column of synthetic data values. As shown in FIG. 6C, argument block 626 provides an argument comprising the categorical list to SDGT 106, which selects (e.g., randomly) an element from the list for each row of the column.
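The following Python sketch illustrates check block 624 and argument block 626; the used_lists bookkeeping and the handling of a duplicate list are illustrative assumptions (an embodiment may instead regenerate the list):

import random

used_lists = []  # categorical lists already used in other tables

def generate_categorical_column(elements, num_rows):
    if elements in used_lists:                       # check block 624
        raise ValueError("categorical list already used in another table")
    used_lists.append(elements)
    # Argument block 626: an element is selected from the list for each row.
    return [random.choice(elements) for _ in range(num_rows)]

column = generate_categorical_column(["Cardiology", "Oncology", "Pediatrics"], 50)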
As discussed with respect to FIG. 5G, synthetic data generator 112 may generate categorical lists that depend on one another. For example, consider code fragment 600D shown in FIG. 6D, which generates two columns of dependent categorical data utilizing SDGT 106.
Parameter generation call 640 generates two categorical lists that are dependent on one another. As shown in FIG. 6D, parameter generation call 640 causes generative AI model 104 to generate pairs of elements (e.g., by calling function 544 of FIG. 5G) that are converted into a dictionary of value pairs.
Column generation block 642 is configured to generate the columns of dependent data. As shown in FIG. 6D, column generation block 642 causes SDGT 106 to select a key from the first categorical list for each row of a first column and, for each selected key, inserts the corresponding value from the second categorical list into a second column.
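A minimal Python sketch of column generation block 642 follows, assuming value_pairs is the dictionary produced by convert_two_columns( ) as described with respect to FIG. 5G:

import random

def generate_dependent_columns(value_pairs, num_rows):
    keys = list(value_pairs)
    first_column = [random.choice(keys) for _ in range(num_rows)]  # e.g., diseases
    second_column = [value_pairs[key] for key in first_column]     # matching symptoms
    return first_column, second_column

diseases, symptoms = generate_dependent_columns(
    {"Disease A": "Symptom A", "Disease B": "Symptom B"}, 50)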
Thus, example code fragments for generating columns of scaled data have been described with respect to SDGT 106 and synthetic data generator 112 of FIGS. 1 and 2.
As described elsewhere herein, in some implementations, a user or an organization may require synthetic data that includes coherent sentences, such as simulating a doctor's recommendation to a patient or comments the patient made to the doctor. Embodiments of the present disclosure may be configured to generate "synthetic sentence data" that simulates sentences that would appear in a real dataset. Systems, devices, and apparatuses described herein may perform in various ways to generate synthetic sentence data. For instance, FIG. 7 shows a block diagram of synthetic data generator 112 configured to generate synthetic sentence data utilizing a lightweight model, and FIG. 8 shows a flowchart 800 of a process for generating synthetic sentence data, in accordance with example embodiments.
To better illustrate embodiments for generating synthetic data comprising synthetic sentence data, flowchart 800 of FIG. 8 is described below with respect to synthetic data generator 112 of FIG. 7.
Flowchart 800 begins with step 802. In step 802, training data generated by the LLM based on the prompt is received responsive to the prompt provided to the LLM. For example, as shown in FIG. 7, prompter 114 provides a prompt 726 to generative AI model 104, and model trainer 738 receives training data 728 generated by generative AI model 104 based on prompt 726.
In step 804, a lightweight model is trained to generate synthetic sentences based on the received training data. For example, model trainer 738 of FIG. 7 trains a lightweight model 740 to generate synthetic sentences based on training data 728.
In step 806, synthetic sentence data is received from the lightweight model. For example, model trainer 738 of FIG. 7 receives synthetic sentence data generated by lightweight model 740 and provides the synthetic sentence data to synthetic data handler 222.
In step 808, the synthetic sentence data is appended to the scaled data to generate the synthetic data. For example, synthetic data handler 222 of FIG. 7 appends the synthetic sentence data to scaled data 232 to generate the synthetic data.
Subsequent to step 808, synthetic data generated by synthetic data handler 222 appending synthetic sentence data to scaled data 232 may be utilized in various ways. For instance, the synthetic data may be provided in a workload call 736 to workload emulator 108 to cause workload emulator 108 to emulate a workload utilizing the synthetic data comprising scaled data 232 and the appended synthetic sentence data (e.g., to generate a performance benchmark for the domain included in prompt 726). Alternatively, or additionally, the synthetic data is provided to application 110 via transmission 238 for further analysis, review, and/or display in a GUI of application 110 (not shown in FIG. 7).
Synthetic data generator 112 is described with respect to FIGS. 7 and 8 as utilizing a lightweight model (e.g., lightweight model 740) to generate synthetic sentence data.
Depending on the implementation, synthetic data generator 112 may utilize generative AI model 104 to generate training data for lightweight model 740 (e.g., training data 728) in various ways. For example, FIG. 9A shows a code fragment 900A including a function 902 for generating training data for a lightweight model, in accordance with an embodiment.
Open statement 904 sets a target file path for the training data. In open statement 904, an "open( )" function is called with "file_path" and "w" as arguments. "file_path" is the path and name of the file the training data is to be saved to, as passed to function 902. "w" indicates the file is to be opened for writing. If the file does not exist in the file path, the open( ) function creates the file.
Data generation block 906 comprises logic for generating training data 728. Data generation block 906 comprises prompt block 912, call block 914, and post-processing block 916. In prompt block 912, a prompt template is defined for prompting generative AI model 104 to generate comments a person in a particular subject would tell a user. In prompt block 912, a subject is passed to the prompt via "{subject}". Alternatively, embodiments of code fragment 900A comprise prompt pre-processing similar to that shown in prompt block 504 of code fragment 500A, as described with respect to FIG. 5A.
In call block 914, a call is placed to generative AI model 104 utilizing the llm_chain.predict( ) function and passing the argument "subject=company_type". In this context, the domain passed to function 902 is passed to llm_chain.predict( ), which inserts the domain into the prompt of prompt block 912 and transmits the prompt to generative AI model 104. Post-processing block 916 includes logic that cleans punctuation in a response received from generative AI model 104 and writes the output to the opened file.
As shown in code fragment 900A, the operation of data generation block 906 is repeated for the number of example sentences to be generated. In this context, a separate call is made to generative AI model 104 for each sentence generated for training data 728. Alternatively, the prompt in prompt block 912 prompts generative AI model 104 to generate the requested number of separate comments and return them in a list. In this context, the list is written to the open file. By prompting generative AI model 104 to generate multiple comments at once, the number of calls made to generative AI model 104 is reduced.
In close statement 910, the file opened in open statement 904 is closed and the generation of training data 728 is complete. In this context, training data 728 is saved in the file path. The file path may be a location in memory of computing device 102 or an external data store (e.g., storage 126 of FIG. 1).
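The following Python sketch illustrates function 902 under the blocks described above; the prompt wording and the cleanup performed in post-processing block 916 are illustrative assumptions:

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

def generate_training_data(llm, company_type, file_path, num_examples):
    prompt = PromptTemplate(
        input_variables=["subject"],
        template="Write one comment a person working in {subject} would tell a user.",
    )
    llm_chain = LLMChain(llm=llm, prompt=prompt)

    f = open(file_path, "w")                     # open statement 904
    for _ in range(num_examples):                # data generation block 906
        comment = llm_chain.predict(subject=company_type)  # call block 914
        f.write(comment.strip().strip('"') + "\n")         # post-processing block 916
    f.close()                                    # close statement 910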
Code fragment 900A is described with respect to generating training data representing sentences a person (e.g., a professional, an employee, etc.) in a domain would tell a user (e.g., a customer). However, training data may be generated for generating other types of sentence data (e.g., sentences a user would tell a person, sentences one person would tell another person, sentences a user would tell another user, and/or the like). In this context, prompt block 912 of code fragment 900A would be modified to include a template for generating such a sentence. In accordance with another embodiment, code fragment 900A comprises logic for selecting one of several types of prompts to generate (e.g., a first prompt similar to the prompt shown in prompt block 912, a second prompt for a sentence a user would tell a person, a third prompt for a sentence a person would tell another person, a fourth prompt for a sentence a user would tell another user, etc.). In this context, function 902 is modified to accept an argument that indicates which of the prompts is to be generated. Alternatively, function 902 is configured to generate training data for each of the prompts. In this alternative embodiment, training data for each prompt is saved as a separate set of training data.
Embodiments of synthetic data generator 112 may be configured to train lightweight model 740 in various ways. For example, FIG. 9B shows a code fragment 900B for training lightweight model 740 based on training data 728, in accordance with an embodiment.
In open block 920, the file that training data 728 was saved to is opened and its contents are read. In training block 922, a markovify.Text( ) function is called to train lightweight model 740 based on training data 728. In the example shown in FIG. 9B, lightweight model 740 is a Markov chain model generated utilizing the markovify library.
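A minimal Python sketch of open block 920 and training block 922 follows; the file path is an illustrative assumption:

import markovify

with open("training_data.txt") as f:   # open block 920: read the training data
    text = f.read()

markov_model = markovify.Text(text)    # training block 922: train lightweight model 740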
Synthetic data generator 112 may be configured to generate synthetic sentence data utilizing lightweight model 740 in various ways, in embodiments. For example, FIG. 9C shows a code fragment 900C including a function 926 for generating a column of synthetic sentences utilizing lightweight model 740, in accordance with an embodiment.
Column initialization statement 928 initializes a column ("comments") that synthetic sentences generated by lightweight model 740 are stored in. Generation block 930 comprises logic for generating comments utilizing lightweight model 740. As shown in FIG. 9C, generation block 930 comprises a comment length statement 934, a sentence generation statement 936, a check loop 938, and an append statement.
In sentence generation statement 936, a call is made to lightweight model 740 utilizing the markov_model.make_short_sentence( ) function, passing the character length defined in comment length statement 934 (or, alternatively, the maximum character length passed through function 926 or a default maximum character length). The function markov_model.make_short_sentence( ) causes lightweight model 740 to generate a sentence no longer than the passed character length. Check loop 938 verifies lightweight model 740 returned a sentence and, if not, places another call to lightweight model 740. An append statement appends the generated sentence to the column initialized in column initialization statement 928. The operation of generation block 930 is repeated until the number of comments generated is equal to the number passed through function 926. Return statement 932 returns the column of comments.
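The following Python sketch illustrates function 926; the default maximum character length is an illustrative assumption, and markov_model is the model trained in the sketch above:

def generate_comments(markov_model, num_comments, max_chars=100):
    comments = []                            # column initialization statement 928
    while len(comments) < num_comments:      # generation block 930
        sentence = markov_model.make_short_sentence(max_chars)  # statement 936
        if sentence is not None:             # check loop 938: retry if no sentence
            comments.append(sentence)        # append statement
    return comments                          # return statement 932

comments_column = generate_comments(markov_model, 50)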
In accordance with an embodiment, synthetic data generator 112 is configured to check if a lightweight model has already been trained or if training data for a lightweight model has already been generated for a domain. Synthetic data generator 112 may be configured to check for existing trained lightweight models and/or training data in various ways, in embodiments. For example, FIG. 9D shows a code fragment 900D including a function 942 for checking for existing training data and/or trained lightweight models, in accordance with an embodiment.
File path definition block 944 defines file paths that existing training data and trained models may be saved to and where new training data and trained models are to be saved to. For instance, “lakehouse_root_path” is a variable that defines the root file path for saved files, “data_file” defines a file name for training data, “file_path” defines a file path for data_file, “model_name” defines a file name for a trained model, and “modelpath” defines a file path for model_name. While file path definition block 944 illustrates a default root file path, in accordance with an embodiment, code fragment 900D includes logic for passing a root file path as an argument of function 942. Alternatively, synthetic data generator 112 determines a root file path (e.g., based on schema information received from application 110 and/or included in a schema file).
Training data check block 946 comprises logic for checking if training data has already been generated for a domain. The logic checks the file_path for an existing data_file. If data_file does not exist, a call is placed to function 902 of code fragment 900A of FIG. 9A to generate the training data and save it to file_path.
Comment generation block 950 comprises logic for generating synthetic sentence data utilizing a trained version of lightweight model 740. The trained model is loaded from model_path. A call is made to function 926 of code fragment 900C of FIG. 9C to generate a column of synthetic sentence data utilizing the loaded model.
As discussed herein, synthetic data generator 112 of FIG. 1 generates synthetic data (e.g., synthetic data 128). For instance, FIG. 10 shows a table 1000 of synthetic data values generated for a healthcare domain, in accordance with an embodiment.
Table 1000 includes a plurality of columns with headings 1002 and synthetic data values 1004. In accordance with an embodiment, headings 1002 are included in schema information provided to prompter 114 of synthetic data generator 112 (e.g., as schema information 224, a schema file, and/or the like). Alternatively, prompter 114 transmits a prompt to generative AI model 104 to generate headings 1002. For instance, in accordance with an embodiment, prompter 114 calls function 502 of code fragment 500A of FIG. 5A to generate headings 1002.
Synthetic data generator 112 utilizes generative AI model 104 and SDGT 106 to generate the first seven columns. The Patient # column represents a key for the rows in table 1000. Synthetic data generator 112 may be configured to utilize SDGT 106 and/or lightweight model 740 to generate any number of rows, as described elsewhere herein. The Check-In Time and Check-Out Time columns represent dependent start and end datetimes. In accordance with an embodiment, synthetic data generator 112 generates data parameters for these columns by calling function 520 of code fragment 500D of FIG. 5D.
Synthetic data generator 112 utilizes generative AI model 104 and lightweight model 740 to generate the Doctor's Comments column. In accordance with an embodiment, synthetic data generator 112 prompts generative AI model 104 to generate training data for what a doctor would say to a patient during an exam. For instance, synthetic data generator 112 may generate the training data by calling function 902 of code fragment 900A of FIG. 9A, train lightweight model 740 based on the training data (e.g., as described with respect to FIG. 9B), and generate the Doctor's Comments column utilizing lightweight model 740 (e.g., by calling function 926 of code fragment 900C of FIG. 9C).
Prompts may be generated in various ways. For instance, a user may interact with a user interface of synthetic data generator 112 (e.g., via application 110 and/or computing device 102 of FIG. 1). Alternatively, prompts are generated based on schema files. For example, FIG. 11 shows a block diagram including a schema analyzer 1104 and a prompt generator 1106, and FIG. 12 shows a flowchart 1200 of a process for generating prompts based on schema files, in accordance with example embodiments.
Flowchart 1200 begins with step 1202. In step 1202, a schema file comprising metadata associated with a domain is received. For example, schema analyzer 1104 of FIG. 11 receives a schema file comprising metadata associated with a domain.
In step 1204, a prompt is generated based on the schema file. For example, prompt generator 1106 of FIG. 11 generates a prompt based on the schema file (e.g., for providing to generative AI model 104).
As noted herein, the embodiments of the present disclosure utilize a generative artificial intelligence model to generate data parameters to be used as input for a scalable data generation tool in the generation of synthetic data for use in performance benchmarking. A generative AI model is a model that generates content that is complex, coherent, and/or original. For instance, a generative AI model can create sophisticated sentences, lists, ranges, tables of data, images, essays, and/or the like. Examples of generative AI models include, but are not limited to, language models (e.g., large language models (LLMs)), generative adversarial networks (GANs), variational autoencoders (VAEs), multimodal models, and/or other generative AI models as understood by one of ordinary skill in the relevant art(s) having benefit of this disclosure.
Embodiments described herein have been described with respect to language models such as LLMs. A language model is a model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens. In this context, a "token" is an atomic unit that the model trains on and makes predictions on. A token may be a word, a character (e.g., an alphanumeric character, a blank space, a symbol, etc.), or a sub-word (e.g., a root word, a prefix, or a suffix). In other types of models (e.g., image-based models), a token may represent another kind of atomic unit (e.g., a subset of an image).
A large language model (LLM) is a language model that has a high number of model parameters. For instance, an LLM may have millions, billions, trillions, or even greater numbers of model parameters. Model parameters of an LLM are the weights and biases the model learns during training. An LLM is (pre-)trained using self-supervised learning and/or semi-supervised learning.
Some implementations of LLMs are transformer-based LLMs (e.g., the family of generative pre-trained transformer (GPT) models, Pathways Language Model (PaLM), Large Language Model Meta AI (LLaMA), BigScience Large Open-science Open-access Multilingual Language Model (BLOOM), and/or the like). A transformer is a neural network architecture that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings (e.g., without relying on convolutions or recurrent neural networks). Examples of transformer-based LLMs utilized by embodiments described herein may be implemented as described with respect to FIG. 13.
Depending on the implementation, transformer-based LLMs may comprise an encoder and/or a decoder. For instance, as shown in FIG. 13, an LLM 1300 comprises an encoder 1302 and a decoder 1304. Encoder 1302 comprises an embedding layer 1306, a positional encoding layer 1308, and a plurality of encoding layers, each of which comprises a self-attention sub-layer 1320, a normalization layer 1322, a feed forward sub-layer 1324, and a normalization layer 1326.
Embedding layer 1306 receives input 1346 and outputs input embeddings 1348. Input 1346 is a sequence of tokens and embedding layer 1306 utilizes learned embeddings to convert input 1346 to a vector of dimension d_model. In accordance with an embodiment, the learned embeddings include a weight matrix multiplied by the square root of the model's dimension (d_model).
Positional encoding layer 1308 receives input embeddings 1348 and outputs encoded input embeddings 1350. Encoded input embeddings 1350 are in a vector form (also referred to as an “input vector”). In accordance with an embodiment, LLM 1300 does not include recurrence or convolution. In this context, positional encoding layer 1308 is utilized to inject relative and/or absolute position of tokens in input embeddings 1348. The positional embeddings are summed with input embeddings 1348 to generate encoded input embeddings 1350. Positional encodings may be learned, fixed, or another type of positional encoding as understood by a person ordinarily skilled in the relevant art having benefit of this disclosure.
Encoded input embeddings 1350 output by positional encoding layer 1308 either flow to self-attention sub-layer 1320 or “skip” self-attention sub-layer 1320 via a residual connection to normalization layer 1322. In implementations, residual connections improve convergence of training results of LLM 1300 by allowing data to “skip” through some of the layers (or sub-layers) of encoder 1302 and/or decoder 1304.
Self-attention sub-layer 1320 applies an attention function to received encoded input embeddings 1350 (e.g., the portion of encoded input embeddings 1350 that did not skip self-attention sub-layer 1320) to generate attended output 1352. Self-attention sub-layer 1320 computes attended output 1352 as a weighted sum of values, where the weight assigned to each value is computed as a compatibility function of the query with the corresponding key. In accordance with an embodiment, the value, the query, and the key are vectors of an input vector (e.g., of encoded input embeddings 1350) projected through trained weights (e.g., a value weight, a query weight, and a key weight). Example attention functions include, but are not limited to, additive attention, dot-product attention, and scaled dot-product attention. In embodiments, self-attention sub-layer 1320 attends to previous and subsequent embeddings for a particular value when computing attended output 1352.
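For reference, scaled dot-product attention, noted above as an example attention function, may be expressed as follows, where Q, K, and V denote the query, key, and value vectors obtained by projecting an input vector X through the trained query, key, and value weights (denoted W^Q, W^K, and W^V), and d_k denotes the dimension of the keys:

Attention(Q, K, V)=softmax((Q·K^T)/sqrt(d_k))·V, where Q=X·W^Q, K=X·W^K, V=X·W^V

Dividing the dot products by sqrt(d_k) keeps the softmax from saturating when the key dimension is large, which distinguishes scaled dot-product attention from (unscaled) dot-product attention.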
In accordance with an embodiment, self-attention sub-layer 1320 utilizes multi-head attention, wherein multiple attention sub-layers run in parallel and respective outputs of the multiple attention sub-layers are concatenated to generate attended output 1352. In a further embodiment of multi-head attention, each of the parallel attention sub-layers utilizes different learned linear projections to a dimension of the values, queries, and keys. Since the dimension of each parallel attention sub-layer is reduced, the total computational cost of multi-head attention is similar to that of single-head attention with full dimensionality.
Normalization layer 1322 receives residual encoded input embeddings 1350 and attended output 1352 and generates normalized output 1354. Normalized output 1354 is a function of LayerNorm1(x+Sublayer1(y)), where “x” is the residual encoded input embeddings 1350 (e.g., embeddings that skip self-attention sub-layer 1320), “y” is the encoded input embeddings 1350 received by self-attention sub-layer 1320, Sublayer1( ) is a function implemented by self-attention sub-layer 1320 (e.g., an attention function and (e.g., optionally) any other concatenation, linearization, or other post-processing implemented by self-attention sub-layer 1320), and Sublayer1(y) is the output of self-attention sub-layer 1320 (i.e., attended output 1352). Similar to encoded input embeddings 1350, normalized output 1354 output by normalization layer 1322 may flow to feed forward sub-layer 1324 or “skip” feed forward sub-layer 1324 via a residual connection to normalization layer 1326.
Feed forward sub-layer 1324 receives (e.g., a portion of) normalized output 1354 and generates forward output 1356. In accordance with an embodiment, feed forward sub-layer 1324 is a position-wise fully connected feed-forward network. In a further embodiment, feed forward sub-layer 1324 is implemented using two linear layers with a Rectified Linear Unit (ReLU) activation function in between.
Normalization layer 1326 operates similar to normalization layer 1322, in that normalization layer 1326 receives residual normalized output 1354 and forward output 1356 and generates encoder output 1358. Encoder output 1358 is a function of LayerNorm2(x+Sublayer2(y)), where "x" is the portion of residual normalized output 1354 (e.g., normalized output that skips feed forward sub-layer 1324), "y" is the portion of residual normalized output 1354 received by feed forward sub-layer 1324, Sublayer2( ) is a function implemented by feed forward sub-layer 1324, and Sublayer2(y) is the output of feed forward sub-layer 1324 (i.e., forward output 1356).
Decoder 1304 transforms a sequence of embeddings into a new sequence, possibly with a different length. Decoder 1304 comprises an embedding layer 1312, a positional encoding layer 1314, a plurality of decoding layers 1316 (“decoding layers 1316”), and a generator 1318. Each decoding layer of a decoder comprises a number of sub-layers. For instance, as shown in
Embedding layer 1312 receives input 1360 and outputs output embeddings 1362. Input 1360 is a sequence of tokens and embedding layer 1312 utilizes learned embeddings to convert input 1360 to a vector of dimension d_model to generate output embeddings 1362. In accordance with an embodiment, the learned embeddings utilized by embedding layer 1312 are the same as the embeddings used by embedding layer 1306.
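The embedding lookup may be sketched as indexing into a learned table (randomly initialized here, reusing rng from the sketches above); the vocabulary size and token ids are illustrative.

```python
vocab_size, d_model = 100, 16
E = rng.normal(size=(vocab_size, d_model))  # learned embedding table (random stand-in)
tokens = np.array([3, 17, 42])              # a sequence of token ids
output_embeddings = E[tokens]               # (3, 16): one d_model vector per token
```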
Positional encoding layer 1314 receives output embeddings 1362 and outputs encoded output embeddings 1364. Encoded output embeddings 1364 are in a vector form. In accordance with an embodiment, LLM 1300 does not include recurrence or convolution. In this context, positional encoding layer 1314 is utilized to inject information about the relative and/or absolute position of tokens in output embeddings 1362. The positional embeddings are summed with output embeddings 1362 to generate encoded output embeddings 1364. Positional encodings may be learned, fixed, or another type of positional encoding as understood by a person ordinarily skilled in the relevant art having benefit of this disclosure. Similar to encoded input embeddings 1350 described above, encoded output embeddings 1364 may flow to masked self-attention sub-layer 1328 or “skip” masked self-attention sub-layer 1328 via a residual connection to normalization layer 1330.
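As one common fixed choice (not required by the embodiments), sinusoidal positional encodings may be computed and summed with the embeddings, continuing the sketch above:

```python
def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings (one non-limiting choice)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)           # even dimensions use sine
    enc[:, 1::2] = np.cos(angles)           # odd dimensions use cosine
    return enc

encoded_output_embeddings = output_embeddings + sinusoidal_positions(3, 16)
```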
Masked self-attention sub-layer 1328 receives (e.g., a portion of) encoded output embeddings 1364 and generates masked attended output 1366. Masked self-attention sub-layer 1328 operates in a similar manner to self-attention sub-layer 1320 with the following difference: masked self-attention sub-layer 1328 is configured to (e.g., only) attend to embeddings that are prior to the token being predicted. Furthermore, output embeddings 1362 are offset by one position. In this manner, predictions made by decoder 1304 depend (e.g., only) on known outputs at positions prior to the predicted output. Masked self-attention sub-layer 1328 may utilize attention functions and/or multi-head attention techniques similar to those described with respect to self-attention sub-layer 1320.
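A minimal sketch of the masking difference (reusing NumPy from the sketches above): positions after each query are assigned -inf scores so that, after the softmax, they receive zero attention weight.

```python
def masked_self_attention(X, W_q, W_k, W_v):
    """Self-attention that attends only to positions at or before each query."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # positions after the query
    scores = np.where(future, -np.inf, scores)                # mask future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # future weights become zero
    return weights @ V
```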
Normalization layer 1330 receives residual encoded output embeddings 1364 and masked attended output 1366 and generates normalized output 1368. Normalized output 1368 is a function of LayerNorm3(x+Sublayer3(y)), where “x” is the residual encoded output embeddings 1364 (e.g., embeddings that skip masked self-attention sub-layer 1328), “y” is the encoded output embeddings 1364 received by masked self-attention sub-layer 1328, Sublayer3( ) is a function implemented by masked self-attention sub-layer 1328 (e.g., an attention function and (e.g., optionally) any other concatenation, linearization, or other post-processing implemented by masked self-attention sub-layer 1328), and Sublayer3(y) is the output of masked self-attention sub-layer 1328 (i.e., masked attended output 1366). Normalized output 1368 may flow to cross-attention sub-layer 1342 or “skip” cross-attention sub-layer 1342 via a residual connection to normalization layer 1344.
Cross-attention sub-layer 1342 receives encoder output 1358 and (e.g., a portion of) normalized output 1368 and generates cross-attended output 1370. Cross-attention sub-layer 1342 operates in a manner similar to self-attention sub-layer 1320 with the following differences. The query vector of cross-attention sub-layer 1342 is the vector of normalized output 1368 projected through a trained query weight for cross-attention sub-layer 1342. The value and key vectors of cross-attention sub-layer 1342 are the vector of encoder output 1358 projected through respective trained value and key weights for cross-attention sub-layer 1342. In this context, cross-attention sub-layer 1342 utilizes an attention function to compute cross-attended output 1370 as a sum of values (which are from the encoder) weighted by the outcome of a function of the query (which is from the decoder) and the key (which is from the encoder). In other words, computation of cross-attended output 1370 depends on the encoder (which evaluates tokens at all positions) and the decoder (which evaluates tokens at positions before a predicted outcome). Cross-attention sub-layer 1342 may utilize any of the attention functions and/or multi-head attention techniques described with respect to self-attention sub-layer 1320 to generate cross-attended output 1370.
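The query/key/value split of cross-attention may be sketched as follows (continuing the NumPy sketches above), with the queries projected from the decoder side and the keys and values projected from the encoder output:

```python
def cross_attention(decoder_x, encoder_out, W_q, W_k, W_v):
    """Queries from the decoder; keys and values from the encoder output."""
    Q = decoder_x @ W_q                          # decoder side supplies the queries
    K, V = encoder_out @ W_k, encoder_out @ W_v  # encoder side supplies keys and values
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # encoder values weighted by query-key match
```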
Normalization layer 1344 receives residual normalized output 1368 and cross-attended output 1370 and generates normalized output 1372. Normalized output 1372 is a function of LayerNorm4(x+Sublayer4(y)), where “x” is the residual normalized output 1368 (e.g., embeddings that skip cross-attention sub-layer 1342), “y” is the normalized output 1368 received by cross-attention sub-layer 1342, Sublayer4( ) is a function implemented by cross-attention sub-layer 1342 (e.g., an attention function and (e.g., optionally) any other concatenation, linearization, or other post-processing implemented by cross-attention sub-layer 1342), and Sublayer4(y) is the output of cross-attention sub-layer 1342 (i.e., cross-attended output 1370). Normalized output 1372 may flow to feed forward sub-layer 1334 or “skip” feed forward sub-layer 1334 via a residual connection to normalization layer 1336.
Feed forward sub-layer 1334 receives (e.g., a portion of) normalized output 1372 and generates forward output 1374. In accordance with an embodiment, feed forward sub-layer 1334 is configured in a similar manner as feed forward sub-layer 1324 to generate forward output 1374.
Normalization layer 1336 receives residual normalized output 1372 and forward output 1374 and generates normalized output 1376. Normalized output 1376 is a function of LayerNorm5(x+Sublayer5(y)), where “x” is the portion of residual normalized output 1372 (e.g., normalized output that skips feed forward sub-layer 1334), “y” is the portion of normalized output 1372 received by feed forward sub-layer 1334, Sublayer5( ) is a function implemented by feed forward sub-layer 1334, and Sublayer5(y) is the output of feed forward sub-layer 1334 (i.e., forward output 1374).
Generator 1318 receives normalized output 1376 and generates output probabilities. As shown in
Softmax layer 1340 receives transformed output 1378 and generates output probabilities 1380. Softmax layer 1340 converts the values in transformed output 1378 into a probability distribution to generate output probabilities 1380. In accordance with an embodiment, output probabilities 1380 is a vector in which each element is the probability of a particular token being chosen.
Sampler 1382 receives output probabilities 1380 and infers from output probabilities 1380 the next token in a sequence. For example, as shown in
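By way of non-limiting illustration (reusing rng from the sketches above), the generator and sampler stages may be sketched as a linear projection to vocabulary logits, a softmax, and a draw from the resulting distribution; the vocabulary size is an arbitrary illustrative choice.

```python
def output_probabilities(h, W_vocab):
    """Linear projection to vocabulary logits, then softmax to a distribution."""
    logits = h @ W_vocab
    p = np.exp(logits - logits.max())
    return p / p.sum()

vocab_size = 100
probs = output_probabilities(rng.normal(size=16), rng.normal(size=(16, vocab_size)))
sampled_token = rng.choice(vocab_size, p=probs)  # sample the next token
greedy_token = int(np.argmax(probs))             # or take the most probable token
```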
In some embodiments, an LLM is a “decoder only” LLM. In this context, the LLM does not include an encoder or a cross-attention portion. In a “decoder only” implementation, an LLM may include a subset of decoder 1304. For instance, in a non-limiting example of such an LLM, the LLM comprises embedding layer 1312, positional encoding layer 1314, a plurality of reduced decoding layers, and generator 1318. In this example, each of the reduced decoding layers comprises masked self-attention sub-layer 1328, normalization layer 1330, feed forward sub-layer 1334, and normalization layer 1336. Since there is no encoder in this implementation, cross-attention sub-layer 1342 and normalization layer 1344 are omitted. In this context, feed forward sub-layer 1334 and normalization layer 1336 receive normalized output 1368 (i.e., instead of normalized output 1372).
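Reusing the helpers from the earlier sketches (masked_self_attention, add_and_norm, feed_forward, and encoded_output_embeddings), one reduced decoding layer of such a “decoder only” LLM may be sketched as:

```python
def reduced_decoding_layer(x, attn_weights, ffn_weights):
    """Masked self-attention and FFN, each followed by add & norm; no
    cross-attention sub-layer because there is no encoder output to attend to."""
    x = add_and_norm(x, masked_self_attention(x, *attn_weights))
    return add_and_norm(x, feed_forward(x, *ffn_weights))

attn_w = tuple(rng.normal(size=(16, 16)) for _ in range(3))
layer_out = reduced_decoding_layer(encoded_output_embeddings, attn_w, (W1, b1, W2, b2))
```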
LLM 1300 has been described with respect to multiple feed forward sub-layers and normalization layers. In accordance with an embodiment, feed forward sub-layers 1324 and 1334 are identical in configuration. Alternatively, feed forward sub-layer 1324 varies in configuration from feed forward sub-layer 1334. In accordance with another embodiment, two or more of normalization layer 1322, normalization layer 1326, normalization layer 1330, normalization layer 1336, and/or normalization layer 1344 are identical in configuration. Alternatively, each of the normalization layers varies in configuration from the others.
As noted herein, the embodiments described, along with any circuits, components and/or subcomponents thereof, as well as the flowcharts/flow diagrams described herein, including portions thereof, and/or other embodiments, may be implemented in hardware, or hardware with any combination of software and/or firmware, including being implemented as computer program code configured to be executed in one or more processors and stored in a computer readable storage medium, or being implemented as hardware logic/electrical circuitry, such as being implemented together in a system-on-chip (SoC), a field programmable gate array (FPGA), and/or an application specific integrated circuit (ASIC). A SoC may include an integrated circuit chip that includes one or more of a processor (e.g., a microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or further circuits and/or embedded firmware to perform its functions.
Embodiments disclosed herein may be implemented in one or more computing devices that may be mobile (a mobile device) and/or stationary (a stationary device) and may include any combination of the features of such mobile and stationary computing devices. Examples of computing devices in which embodiments may be implemented are described as follows with respect to
Computing device 1402 can be any of a variety of types of computing devices. For example, computing device 1402 may be a mobile computing device such as a handheld computer (e.g., a personal digital assistant (PDA)), a laptop computer, a tablet computer, a hybrid device, a notebook computer, a netbook, a mobile phone (e.g., a cell phone, a smart phone, etc.), a wearable computing device (e.g., a head-mounted augmented reality and/or virtual reality device including smart glasses), or other type of mobile computing device. Computing device 1402 may alternatively be a stationary computing device such as a desktop computer, a personal computer (PC), a stationary server device, a minicomputer, a mainframe, a supercomputer, etc.
As shown in
A single processor 1410 (e.g., central processing unit (CPU), microcontroller, a microprocessor, signal processor, ASIC (application specific integrated circuit), and/or other physical hardware processor circuit) or multiple processors 1410 may be present in computing device 1402 for performing such tasks as program execution, signal coding, data processing, input/output processing, power control, and/or other functions. Processor 1410 may be a single-core or multi-core processor, and each processor core may be single-threaded or multithreaded (to provide multiple threads of execution concurrently). Processor 1410 is configured to execute program code stored in a computer readable medium, such as program code of operating system 1412 and application programs 1414 stored in storage 1420. The program code is structured to cause processor 1410 to perform operations, including the processes/methods disclosed herein. Operating system 1412 controls the allocation and usage of the components of computing device 1402 and provides support for one or more application programs 1414 (also referred to as “applications” or “apps”). Application programs 1414 may include common computing applications (e.g., e-mail applications, calendars, contact managers, web browsers, messaging applications), further computing applications (e.g., word processing applications, mapping applications, media player applications, productivity suite applications), one or more machine learning (ML) models, as well as applications related to the embodiments disclosed elsewhere herein. Processor(s) 1410 may include one or more general processors (e.g., CPUs) configured with or coupled to one or more hardware accelerators, such as one or more NPUs and/or one or more GPUs.
Any component in computing device 1402 can communicate with any other component according to function, although not all connections are shown for ease of illustration. For instance, as shown in
Storage 1420 is physical storage that includes one or both of memory 1456 and storage device 1490, which store operating system 1412, application programs 1414, and application data 1416 according to any distribution. Non-removable memory 1422 includes one or more of RAM (random access memory), ROM (read only memory), flash memory, a solid-state drive (SSD), a hard disk drive (e.g., a disk drive for reading from and writing to a hard disk), and/or other physical memory device type. Non-removable memory 1422 may include main memory and may be separate from or fabricated in a same integrated circuit as processor 1410. As shown in
One or more programs may be stored in storage 1420. Such programs include operating system 1412, one or more application programs 1414, and other program modules and program data. Examples of such application programs may include, for example, computer program logic (e.g., computer program code/instructions) for implementing generative AI model 104, SDGT 106, workload emulator 108, application 110, synthetic data generator 112, prompter 114, data handler 116, argument provider 220, synthetic data handler 222, model trainer 738, lightweight model 740, schema analyzer 1104, and/or prompt generator 1106, as well as any of flowcharts or interaction diagrams 300A, 300B, 400, 500, and/or any individual steps thereof, as well as any of code fragments 500A, 500B, 500C, 500D, 500E, 500F, 500G, 600A, 600B, 600C, 600D, 900A, 900B, 900C, and/or 900D, and/or any individual code statements and/or blocks thereof.
Storage 1420 also stores data used and/or generated by operating system 1412 and application programs 1414 as application data 1416. Examples of application data 1416 include web pages, text, images, tables, sound files, video data, and other data, which may also be sent to and/or received from one or more network servers or other devices via one or more wired or wireless networks. Storage 1420 can be used to store further data including a subscriber identifier, such as an International Mobile Subscriber Identity (IMSI), and an equipment identifier, such as an International Mobile Equipment Identifier (IMEI). Such identifiers can be transmitted to a network server to identify users and equipment.
A user may enter commands and information into computing device 1402 through one or more input devices 1430 and may receive information from computing device 1402 through one or more output devices 1450. Input device(s) 1430 may include one or more of touch screen 1432, microphone 1434, camera 1436, physical keyboard 1438 and/or trackball 1440 and output device(s) 1450 may include one or more of speaker 1452 and display 1454. Each of input device(s) 1430 and output device(s) 1450 may be integral to computing device 1402 (e.g., built into a housing of computing device 1402) or external to computing device 1402 (e.g., communicatively coupled wired or wirelessly to computing device 1402 via wired interface(s) 1480 and/or wireless modem(s) 1460). Further input devices 1430 (not shown) can include a Natural User Interface (NUI), a pointing device (computer mouse), a joystick, a video game controller, a scanner, a touch pad, a stylus pen, a voice recognition system to receive voice input, a gesture recognition system to receive gesture input, or the like. Other possible output devices (not shown) can include piezoelectric or other haptic output devices. Some devices can serve more than one input/output function. For instance, display 1454 may display information, as well as operating as touch screen 1432 by receiving user commands and/or other information (e.g., by touch, finger gestures, virtual keyboard, etc.) as a user interface. Any number of each type of input device(s) 1430 and output device(s) 1450 may be present, including multiple microphones 1434, multiple cameras 1436, multiple speakers 1452, and/or multiple displays 1454.
One or more wireless modems 1460 can be coupled to antenna(s) (not shown) of computing device 1402 and can support two-way communications between processor 1410 and devices external to computing device 1402 through network 1404, as would be understood to persons skilled in the relevant art(s). Wireless modem 1460 is shown generically and can include a cellular modem 1466 for communicating with one or more cellular networks, such as a GSM network for data and voice communications within a single cellular network, between cellular networks, or between the mobile device and a public switched telephone network (PSTN). Wireless modem 1460 may also or alternatively include other radio-based modem types, such as a Bluetooth modem 1464 (also referred to as a “Bluetooth device”) and/or Wi-Fi modem 1462 (also referred to as a “wireless adaptor”). Wi-Fi modem 1462 is configured to communicate with an access point or other remote Wi-Fi-capable device according to one or more of the wireless network protocols based on the IEEE (Institute of Electrical and Electronics Engineers) 802.11 family of standards, commonly used for local area networking of devices and Internet access. Bluetooth modem 1464 is configured to communicate with another Bluetooth-capable device according to the Bluetooth short-range wireless technology standard(s) such as IEEE 802.15.1 and/or managed by the Bluetooth Special Interest Group (SIG).
Computing device 1402 can further include power supply 1482, LI receiver 1484, accelerometer 1486, and/or one or more wired interfaces 1480. Example wired interfaces 1480 include a USB port, IEEE 1394 (FireWire) port, a RS-232 port, an HDMI (High-Definition Multimedia Interface) port (e.g., for connection to an external display), a DisplayPort port (e.g., for connection to an external display), an audio port, and/or an Ethernet port, the purposes and functions of each of which are well known to persons skilled in the relevant art(s). Wired interface(s) 1480 of computing device 1402 provide for wired connections between computing device 1402 and network 1404, or between computing device 1402 and one or more devices/peripherals when such devices/peripherals are external to computing device 1402 (e.g., a pointing device, display 1454, speaker 1452, camera 1436, physical keyboard 1438, etc.). Power supply 1482 is configured to supply power to each of the components of computing device 1402 and may receive power from a battery internal to computing device 1402, and/or from a power cord plugged into a power port of computing device 1402 (e.g., a USB port, an A/C power port). LI receiver 1484 may be used for location determination of computing device 1402 and may include a satellite navigation receiver such as a Global Positioning System (GPS) receiver or may include other type of location determiner configured to determine location of computing device 1402 based on received information (e.g., using cell tower triangulation, etc.). Accelerometer 1486 may be present to determine an orientation of computing device 1402.
Note that the illustrated components of computing device 1402 are not required or all-inclusive, and fewer or greater numbers of components may be present as would be recognized by one skilled in the art. For example, computing device 1402 may also include one or more of a gyroscope, barometer, proximity sensor, ambient light sensor, digital compass, etc. Processor 1410 and memory 1456 may be co-located in a same semiconductor device package, such as being included together in an integrated circuit chip, FPGA, or system-on-chip (SoC), optionally along with further components of computing device 1402.
In embodiments, computing device 1402 is configured to implement any of the above-described features of flowcharts herein. Computer program logic for performing any of the operations, steps, and/or functions described herein may be stored in storage 1420 and executed by processor 1410.
In some embodiments, server infrastructure 1470 may be present in computing environment 1400 and may be communicatively coupled with computing device 1402 via network 1404. Server infrastructure 1470, when present, may be a network-accessible server set (e.g., a cloud-based environment or platform). As shown in
Each of nodes 1474 may, as a compute node, comprise one or more server computers, server systems, and/or computing devices. For instance, a node 1474 may include one or more of the components of computing device 1402 disclosed herein. Each of nodes 1474 may be configured to execute one or more software applications (or “applications”) and/or services and/or manage hardware resources (e.g., processors, memory, etc.), which may be utilized by users (e.g., customers) of the network-accessible server set. For example, as shown in
In an embodiment, one or more of clusters 1472 may be co-located (e.g., housed in one or more nearby buildings with associated components such as backup power supplies, redundant data communications, environmental controls, etc.) to form a datacenter, or may be arranged in other manners. Accordingly, in an embodiment, one or more of clusters 1472 may be a datacenter in a distributed collection of datacenters. In embodiments, exemplary computing environment 1400 comprises part of a cloud-based platform.
In an embodiment, computing device 1402 may access application programs 1476 for execution in any manner, such as by a client application and/or a browser at computing device 1402.
For purposes of network (e.g., cloud) backup and data security, computing device 1402 may additionally and/or alternatively synchronize copies of application programs 1414 and/or application data 1416 to be stored at network-based server infrastructure 1470 as application programs 1476 and/or application data 1478. For instance, operating system 1412 and/or application programs 1414 may include a file hosting service client configured to synchronize applications and/or data stored in storage 1420 at network-based server infrastructure 1470.
In some embodiments, on-premises servers 1492 may be present in computing environment 1400 and may be communicatively coupled with computing device 1402 via network 1404. On-premises servers 1492, when present, are hosted within an organization's infrastructure and, in many cases, physically onsite of a facility of that organization. On-premises servers 1492 are controlled, administered, and maintained by IT (Information Technology) personnel of the organization or an IT partner to the organization. Application data 1498 may be shared by on-premises servers 1492 between computing devices of the organization, including computing device 1402 (when part of an organization) through a local network of the organization, and/or through further networks accessible to the organization (including the Internet). Furthermore, on-premises servers 1492 may serve applications such as application programs 1496 to the computing devices of the organization, including computing device 1402. Accordingly, on-premises servers 1492 may include storage 1494 (which includes one or more physical storage devices such as storage disks and/or SSDs) for storage of application programs 1496 and application data 1498 and may include one or more processors for execution of application programs 1496. Still further, computing device 1402 may be configured to synchronize copies of application programs 1414 and/or application data 1416 for backup storage at on-premises servers 1492 as application programs 1496 and/or application data 1498.
Embodiments described herein may be implemented in one or more of computing device 1402, network-based server infrastructure 1470, and on-premises servers 1492. For example, in some embodiments, computing device 1402 may be used to implement systems, clients, or devices, or components/subcomponents thereof, disclosed elsewhere herein. In other embodiments, a combination of computing device 1402, network-based server infrastructure 1470, and/or on-premises servers 1492 may be used to implement the systems, clients, or devices, or components/subcomponents thereof, disclosed elsewhere herein.
As used herein, the terms “computer program medium,” “computer-readable medium,” “computer-readable storage medium,” and “computer-readable storage device,” etc., are used to refer to physical hardware media. Examples of such physical hardware media include any hard disk, optical disk, SSD, other physical hardware media such as RAMs, ROMs, flash memory, digital video disks, zip disks, MEMs (microelectronic machine) memory, nanotechnology-based storage devices, and further types of physical/tangible hardware storage media of storage 1420. Such computer-readable media and/or storage media are distinguished from and non-overlapping with communication media and propagating signals (do not include communication media and propagating signals). Communication media embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wireless media such as acoustic, RF, infrared, and other wireless media, as well as wired media. Embodiments are also directed to such communication media that are separate and non-overlapping with embodiments directed to computer-readable storage media.
As noted above, computer programs and modules (including application programs 1414) may be stored in storage 1420. Such computer programs may also be received via wired interface(s) 1480 and/or wireless modem(s) 1460 over network 1404. Such computer programs, when executed or loaded by an application, enable computing device 1402 to implement features of embodiments discussed herein. Accordingly, such computer programs represent controllers of the computing device 1402.
Embodiments are also directed to computer program products comprising computer code or instructions stored on any computer-readable medium or computer-readable storage medium. Such computer program products include the physical storage of storage 1420 as well as further physical storage types.
A system for generating synthetic data for use in performance benchmarking is described herein. The system comprises a processor circuit and a memory device. The memory device stores program code to be executed by the processor circuit. The program code comprises a synthetic data generator configured to: provide a prompt comprising a domain to a large language model (LLM); receive, from the LLM, a data parameter associated with the domain that specifies a boundary for synthetic values in a column of data; provide an argument comprising the data parameter to a scalable data generation tool configured to generate data based on the data parameter; receive, from the scalable data generation tool, scaled data comprising a column of synthetic data values, each synthetic data value within the boundary specified by the data parameter; and cause an emulated workload to utilize synthetic data comprising the scaled data to generate a performance benchmark for the domain.
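By way of non-limiting illustration, the following Python sketch emulates the described flow end to end. The functions ask_llm and run_sdgt are hypothetical placeholders (a canned response and a uniform sampler, respectively), not APIs defined by this disclosure; the domain, column name, and boundaries are likewise illustrative.

```python
import json
import random

def ask_llm(prompt):
    """Placeholder generative AI call: returns canned data parameters as JSON."""
    return json.dumps([{"name": "heart_rate", "min": 40, "max": 180}])

def run_sdgt(rows, columns):
    """Placeholder scalable data generation tool: uniform values within each boundary."""
    return {c["name"]: [random.uniform(c["min"], c["max"]) for _ in range(rows)]
            for c in columns}

params = json.loads(ask_llm("List column data parameters for the healthcare domain as JSON."))
scaled = run_sdgt(rows=1000, columns=params)
assert all(40 <= v <= 180 for v in scaled["heart_rate"])  # every value within the boundary
# scaled would then be supplied to an emulated workload to produce a performance benchmark
```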
In a further embodiment of the foregoing system, the data parameter is a range data parameter that specifies a first range subset and a second range subset and the argument provided to the scalable data generation tool causes the scalable data generation tool to select a value within the first range subset and a value within the second range subset to generate the scaled data.
In a further embodiment of the foregoing system, the data parameter specifies a categorical list of elements and the argument provided to the scalable data generation tool causes the scalable data generation tool to select an element from the list of elements to generate the scaled data.
In a further embodiment of the foregoing system, the synthetic data comprises synthetic sentence data.
In a further embodiment of the foregoing system, the synthetic data generator is further configured to: responsive to the prompt provided to the LLM, receive training data generated by the LLM based on the prompt; train a lightweight model to generate synthetic sentences based on the received training data; receive, from the lightweight model, the synthetic sentence data; and append the synthetic sentence data to the scaled data to generate the synthetic data.
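A hedged sketch of this sentence path follows: a trivial Markov-chain generator stands in for the lightweight model, and the canned training sentences stand in for LLM-generated training data; neither choice is prescribed by the embodiments.

```python
import random
from collections import defaultdict

training = ["the patient should rest", "the patient should drink fluids"]  # stand-in LLM output

chain = defaultdict(list)
for sentence in training:                       # "train" the lightweight model
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)

def generate_sentence(start="the", max_len=8):
    """Walk the chain to emit a synthetic sentence."""
    words = [start]
    while words[-1] in chain and len(words) < max_len:
        words.append(random.choice(chain[words[-1]]))
    return " ".join(words)

synthetic_sentences = [generate_sentence() for _ in range(5)]  # appended to the scaled data
```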
In a further embodiment of the foregoing system, the synthetic data generator is further configured to: receive a schema file comprising metadata associated with the domain; and generate the prompt based on the schema file.
In a further embodiment of the foregoing system, the scalable data generation tool is a non-artificial-intelligence scalable data generation tool.
A method for generating synthetic data is described herein. The method comprises: providing a prompt comprising a domain to a large language model (LLM); receiving, from the LLM, a data parameter associated with the domain that specifies a boundary for synthetic values in a column of data; providing an argument comprising the data parameter to a scalable data generation tool configured to generate data based on the data parameter; receiving, from the scalable data generation tool, scaled data comprising a column of synthetic data values, each synthetic data value within the boundary specified by the data parameter; and causing an emulated workload to utilize synthetic data comprising the scaled data to generate a performance benchmark for the domain.
In a further embodiment of the foregoing method, the data parameter is a range data parameter that specifies a first range subset and a second range subset and said providing the argument to the scalable data generation tool causes the scalable data generation tool to select a value within the first range subset and a value within the second range subset to generate the scaled data.
In a further embodiment of the foregoing method, the data parameter comprises a categorical list of elements and said providing the argument to the scalable data generation tool causes the scalable data generation tool to select an element from the list of elements to generate the scaled data.
In a further embodiment of the foregoing method, the synthetic data comprises synthetic sentence data.
In a further embodiment of the foregoing method, the method further comprises: responsive to said providing the prompt to the LLM, receiving training data generated by the LLM based on the prompt; training a lightweight model to generate synthetic sentences based on the received training data; receiving, from the lightweight model, the synthetic sentence data; and appending the synthetic sentence data to the scaled data to generate the synthetic data.
In a further embodiment of the foregoing method, the method further comprises: receiving a schema file comprising metadata associated with the domain; and generating the prompt based on the schema file.
In a further embodiment of the foregoing method, the scalable data generation tool is a non-artificial-intelligence scalable data generation tool.
A computer-readable storage medium encoded with program instructions that, when executed by a processor circuit, perform a method is described herein. The method comprises: providing a prompt comprising a domain to a large language model (LLM); receiving, from the LLM, a data parameter associated with the domain that specifies a boundary for synthetic values in a column of data; providing an argument comprising the data parameter to a scalable data generation tool configured to generate data based on the data parameter; receiving, from the scalable data generation tool, scaled data comprising a column of synthetic data values, each synthetic data value within the boundary specified by the data parameter; and causing an emulated workload to utilize synthetic data comprising the scaled data to generate a performance benchmark for the domain.
In a further embodiment of the foregoing computer-readable storage medium, the data parameter is a range data parameter that specifies a first range subset and a second range subset and said providing the argument to the scalable data generation tool causes the scalable data generation tool to select a value within the first range subset and a value within the second range subset to generate the scaled data.
In a further embodiment of the foregoing computer-readable storage medium, the data parameter comprises a categorical list of elements and said providing the argument to the scalable data generation tool causes the scalable data generation tool to select an element from the list of elements to generate the scaled data.
In a further embodiment of the foregoing computer-readable storage medium, the synthetic data comprises synthetic sentence data.
In a further embodiment of the foregoing computer-readable storage medium, the method further comprises: responsive to said providing the prompt to the LLM, receiving training data generated by the LLM based on the prompt; training a lightweight model to generate synthetic sentences based on the received training data; receiving, from the lightweight model, the synthetic sentence data; and appending the synthetic sentence data to the scaled data to generate the synthetic data.
In a further embodiment of the foregoing computer-readable storage medium, the method further comprises: receiving a schema file comprising metadata associated with the domain; and generating the prompt based on the schema file.
In a further embodiment of the foregoing computer-readable storage medium, the scalable data generation tool is a non-artificial-intelligence scalable data generation tool.
References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
In the discussion, unless otherwise stated, adjectives modifying a condition or relationship characteristic of a feature or features of an implementation of the disclosure, should be understood to mean that the condition or characteristic is defined to within tolerances that are acceptable for operation of the implementation for an application for which it is intended. Furthermore, if the performance of an operation is described herein as being “in response to” one or more factors, it is to be understood that the one or more factors may be regarded as a sole contributing factor for causing the operation to occur or a contributing factor along with one or more additional factors for causing the operation to occur, and that the operation may occur at any time upon or after establishment of the one or more factors. Still further, where “based on” is used to indicate an effect being a result of an indicated cause, it is to be understood that the effect is not required to only result from the indicated cause, but that any number of possible additional causes may also contribute to the effect. Thus, as used herein, the term “based on” should be understood to be equivalent to the term “based at least on.”
Numerous example embodiments have been described above. Any section/subsection headings provided herein are not intended to be limiting. Embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, embodiments disclosed in any section/subsection may be combined with any other embodiments described in the same section/subsection and/or a different section/subsection in any manner.
Furthermore, example embodiments have been described above with respect to one or more running examples. Such running examples describe one or more particular implementations of the example embodiments; however, embodiments described herein are not limited to these particular implementations.
Further still, example embodiments have been described with respect to LLMs; however, it is also contemplated herein that embodiments may utilize other types of generative AI models (e.g., a generative adversarial network (GAN), a variational autoencoder (VAE), a multimodal model, and/or the like). For instance, an implementation of the described systems and/or methods may leverage a multimodal model that inputs and/or outputs more than one modality. For example, an alternative embodiment utilizes a multimodal generative AI model that generates text and images from a prompt. As a non-limiting example, in a healthcare domain scenario, the multimodal generative AI model may generate images representing X-rays, CT scans, MRIs, or other images related to the healthcare domain.
Moreover, according to the described embodiments and techniques, any components of systems, computing devices, servers, applications, synthetic data generators, generative AI models, SDGTs, workload emulators, lightweight models, storages, and/or their functions may be caused to be activated for operation/performance thereof based on other operations, functions, actions, and/or the like, including initialization, completion, and/or performance of the operations, functions, actions, and/or the like.
In some example embodiments, one or more of the operations of the flowcharts described herein may not be performed. Moreover, operations in addition to or in lieu of the operations of the flowcharts described herein may be performed. Further, in some example embodiments, one or more of the operations of the flowcharts described herein may be performed out of order, in an alternate sequence, or partially (or completely) concurrently with each other or with other operations.
The embodiments described herein and/or any further systems, sub-systems, devices and/or components disclosed herein may be implemented in hardware (e.g., hardware logic/electrical circuitry), or any combination of hardware with software (computer program code configured to be executed in one or more processors or processing devices) and/or firmware.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various changes in form and detail can be made therein without departing from the spirit and scope of the embodiments. Thus, the breadth and scope of the embodiments should not be limited by any of the above-described example embodiments, but should be defined only in accordance with the following claims and their equivalents.