A software system is built upon a source code “base,” which typically depends on and/or incorporates many independent software technologies, such as programming languages (e.g. Java, C++), frameworks, shared libraries, run-time environments, etc. Each software technology may evolve at its own speed, and may include its own branches and/or versions. Each software technology may also depend on various other technologies. Source code bases, or simply “code bases,” tend to be large. There are often teams of programmers and/or engineers involved in updating a large code base in a process that is sometimes referred to as “migrating.”
When a team member makes change(s) to source code file(s) of a code base, they may provide, or may be required to provide, a note indicating an intent (referred to herein as “code change intent” or “change intent”) behind the changes. In version control systems (“VCS”) with atomic multi-change commits, a set of code change(s) and corresponding code change intent(s) that are “committed” to the VCS in a single act may be referred to as a “change list,” a “patch,” a “change set,” or an “update.” Team members may also indicate code change intents using other means, such as comments embedded within source code and delimited from the source code using special characters, such as “//”, “#”, and so forth.
Because they are often under considerable pressure and/or time constraints, code base migration team members may place low priority on composing descriptive change intents, e.g., when they commit updated source code to the code base. For example, different team members may describe vastly different code changes with similar and/or ambiguous code change intents. Likewise, different team members may describe similar code changes with vastly different (at least syntactically) code change intents. Consequently, someone who consults information associated with change list(s) in order to gain a high level understanding of changes made to a code base during a migration may be confronted with numerous change list entries that are repetitive, redundant, vague, and/or ambiguous.
Techniques are described herein for learning and utilizing mappings between changes made to source code and regions of latent space associated with source code change intents that motivated those source code changes. In some implementations, one or more machine learning models may be trained to generate embeddings based directly or indirectly on changes made to source code snippets. These embeddings may capture semantic and/or syntactic properties of the source code change(s), as well as aspects of the user-provided comments. For example, in some implementations, a "change list," "change set," "update," or "patch" may identify changes made to source code during a single commit to a version control system. For instance, the change list may include before and after source code snippets (e.g., showing the changes made), as well as one or more human-composed comments ("change list entries") that explain the intent(s) behind the changes. Various features of the change list, such as changes made, human-composed comments, etc., may be processed to generate an embedding that captures the change(s) made to the source code along with the intent(s) behind the change(s).
In some implementations, these embeddings may take the form of “reference” embeddings that represent previous change lists associated with changes made to source code previously. In some implementations, these reference embeddings map the previous change lists to a latent space. These reference embeddings may then be used to identify change intents for various purposes, such as for presentation as a condensed code base migration summary, for automatic pre-generation of a code change intent for a programmer ready to commit an updated source code snippet to a code base, for locating source code changes based on desired code change intents, and so forth.
As a non-limiting example of how a machine learning model configured with selected aspects of the present disclosure may be trained, in some implementations, a first version source code snippet (e.g., version 1.1.1) may be obtained from a change list and used to generate a data structure such as an abstract syntax tree ("AST"). The AST may represent constructs occurring in the first version source code snippet, such as variables, objects, functions, etc., as well as the syntactic relationships between these components. Another AST may be generated for a second version source code snippet (e.g., 1.1.2), which may be a next version or "iteration" of the first version source code snippet. The two ASTs may then be used to generate one or more data structures, such as one or more change graphs, that represent one or more changes made to update the source code snippet from the first version to the second version. In some implementations, one change graph may be generated for each change to the source code snippet during its evolution from the first version to the second version.
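As a non-limiting illustration of these data structures, the Python sketch below parses two versions of a snippet with the standard ast module and derives a simplified "change graph" as the difference between their flattened node descriptors. The helper names and the set-based graph format are assumptions made for brevity; a production implementation would likely employ a richer tree-differencing algorithm.

```python
# A minimal sketch, assuming Python source and a set-based change-graph
# format invented here for brevity (real systems may use richer tree diffs).
import ast

def ast_nodes(source: str) -> set:
    """Flatten an AST into descriptors of its constructs (type + name)."""
    return {
        f"{type(node).__name__}:{getattr(node, 'name', getattr(node, 'id', ''))}"
        for node in ast.walk(ast.parse(source))
    }

def change_graph(before_src: str, after_src: str) -> dict:
    """Represent a change as the AST nodes removed and added between versions."""
    before, after = ast_nodes(before_src), ast_nodes(after_src)
    return {"removed": before - after, "added": after - before}

v1 = "def total(x, y):\n    return x + y\n"
v2 = "def grand_total(x, y):\n    return x + y\n"  # e.g., "change function name"
print(change_graph(v1, v2))
# {'removed': {'FunctionDef:total'}, 'added': {'FunctionDef:grand_total'}}
```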
Once the change graph(s) are created, they may be used as training examples for training a machine learning model such as a graph neural network ("GNN"). In some implementations, the change graph(s) (or embeddings generated therefrom) may be applied as input across the machine learning model to generate corresponding source code change embeddings. In some implementations, the change graph(s) may be labeled with information, such as change intents, that is used to map the changes to respective regions in the latent space. For example, a label "change variable name" may be applied to one change, another label, "change function name," may be applied to another change, and so on. In some implementations, these labels may be obtained from change list entries provided when the underlying change lists were committed to the VCS, or from comments embedded in source code.
As more and more change graphs are input across the machine learning model, these labels may be used as part of a loss function that determines whether comparable changes are clustering together properly in the latent space. If an embedding generated from a change of a particular change type (e.g., “change variable name”) is not sufficiently proximate to other embeddings of the same change type (e.g., is closer to embeddings of other change types), the machine learning model may be trained, e.g., using techniques such as triplet loss.
This training process may be repeated over numerous training examples until the machine learning model is able to accurately map change graphs, and more generally, data structures representing source code changes, to regions in the latent space near other, syntactically/semantically similar data structures. In some implementations, training techniques such as triplet loss may be employed to ensure that source code changes of the same change intent are mapped more closely to each other than to source code changes of different change intents.
In some implementations, the training process may involve grouping change lists into clusters based on their underlying code change intents that motivated the respective source code changes. In some implementations, natural language processing may be performed on code change intents (e.g., embedded comments, change list entries) to identify change lists having semantically/syntactically similar code change intents. Consequently, each cluster includes any number of different change lists. In some implementations, natural language processing may be used to summarize and/or normalize code change intents within each cluster, e.g., to generate a cluster-level or cluster-wide code change intent that captures all of the individual distinct code change intents of the cluster. In other implementations, rather than using natural language processing to group change lists having similar code change intents into clusters, source code snippets may be grouped into clusters using graph matching, e.g., on ASTs generated from the source code snippets.
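As a rough sketch of the natural language processing alternative described above, the following groups a few illustrative change intents using TF-IDF features and k-means clustering from scikit-learn; the sample intents, the cluster count, and the choice of vectorizer are illustrative assumptions rather than requirements of this disclosure.

```python
# A rough sketch using TF-IDF + k-means; intents and k are illustrative.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

intents = [
    "rename variable for clarity",
    "rename variable to match style guide",
    "link to new encryption library",
    "link to more secure encryption library",
]
X = TfidfVectorizer().fit_transform(intents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for intent, label in zip(intents, labels):
    print(label, intent)  # similar intents tend to land in the same cluster
```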
Various training techniques may be employed to learn code change embeddings representing the plurality of change lists. For example, in some implementations, change graphs may be sampled from different clusters to train a machine learning model such as a neural network using techniques such as triplet loss. For example, triplet loss may involve: selecting an anchor (or "baseline") change list from a change list cluster; sampling, as a positive or "truthy" input, another change list from the same change list cluster; and sampling, as a negative or "falsy" input, a change list from a different change list cluster. Triplet loss training may then be used to ensure that source code change embeddings generated from change lists having syntactically and/or semantically similar underlying code change intents are closer to each other than to source code change embeddings generated from change lists having different code change intents.
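The triplet arrangement just described might be sketched in PyTorch as follows; the linear layer is a placeholder for whatever encoder (e.g., a GNN) maps change graphs to vectors, and the input features are random stand-ins.

```python
# A sketch of one triplet-loss step; the linear "encoder" and random
# features are placeholders for a real change-graph model such as a GNN.
import torch
import torch.nn as nn

encoder = nn.Linear(64, 16)               # placeholder change-graph encoder
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor   = encoder(torch.randn(1, 64))    # anchor change list from cluster A
positive = encoder(torch.randn(1, 64))    # another change list from cluster A
negative = encoder(torch.randn(1, 64))    # change list from cluster B

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # nudges same-intent embeddings closer than different-intent ones
```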
Once the code change embeddings are learned—e.g., the neural network is trained to generate accurate embeddings—these embeddings may be used to train one or more natural language processing models to generate and/or predict code change intents. These natural language processing models may include various flavors of neural networks, such as feed-forward neural networks, recurrent neural networks, long short-term memory (“LSTM”) networks, gated recurrent unit (“GRU”) networks, transformer networks, bidirectional encoder representations, and any other network that can be trained to generate code change intents based on code change embeddings.
Once the code change embeddings are learned (e.g., a neural network is trained to generate code change embeddings from change graphs) and the natural language processing model(s) are also trained, these models may be used after or during an update of a to-be-updated software system code base for a variety of purposes. In some implementations, the trained machine learning model(s) may be used to generate a high-level summary of changes made to the code base. This high-level summary may not have the redundant and/or repetitive entries of a "brute force" change list that simply includes all change list entries made by any team member who updated a code file. Rather, clustering semantically similar code changes together under a single change intent has the practical effect of deduplicating change intents in the change list.
Additionally or alternatively, in some implementations, the trained machine learning model may be used to automatically generate change list entries for programmers, who may be able to accept verbatim and/or edit the automatically-generated change list entries. For example, a change graph may be generated based on a change a programmer made to a source code file. The change graph may be applied as input across the model to generate an embedding in the latent space. This embedding will be proximate to other, semantically similar reference embeddings generated from past code changes. Change intent(s) associated with those proximate reference embeddings may then be used to generate a change intent for the programmer's updated source code file that is aligned with the change intent of the other semantically similar reference embeddings. In this way, it is possible to enforce or influence programmers to use best practices when composing change intents, as much of the work may be done for them.
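A minimal sketch of this intent lookup, assuming an in-memory table of reference embeddings keyed by intent and an illustrative confidence threshold, might look like the following.

```python
# A minimal sketch; the reference table, vectors, and threshold are
# illustrative assumptions, not part of the disclosure.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

references = {  # change intent -> reference embedding
    "change variable name": np.array([0.9, 0.1, 0.0]),
    "change function name": np.array([0.1, 0.9, 0.0]),
}

def suggest_intent(new_embedding, threshold=0.8):
    intent, score = max(
        ((i, cosine(new_embedding, v)) for i, v in references.items()),
        key=lambda pair: pair[1],
    )
    # High confidence: offer verbatim; otherwise let the programmer edit.
    return intent if score >= threshold else None

print(suggest_intent(np.array([0.85, 0.2, 0.1])))  # -> change variable name
```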
In some implementations, a method performed by one or more processors is provided that includes: applying data indicative of a change made to a source code snippet as input across a machine learning model to generate a new source code change embedding in a latent space; identifying one or more reference source code change embeddings in the latent space based on one or more distances between the one or more reference source code change embeddings and the new source code change embedding in the latent space, wherein each of the one or more reference source code change embeddings is generated by applying data indicative of a change, made to a reference first version source code snippet to yield a reference second version source code snippet, as input across the machine learning model; based on the identified one or more reference embeddings, identifying one or more code change intents; and creating an association between the source code snippet and the one or more code change intents.
In various implementations, the method may further include receiving an instruction to commit the change made to the source code snippet to a code base. In various implementations, at least the applying is performed in response to the instruction to commit the change made to the source code snippet to the code base. In various implementations, creating the association comprises automatically generating a change list entry based on one or more of the code change intents.
In various implementations, the method may further include automatically inserting, into the source code snippet, an embedded comment indicative of one or more of the code change intents. In various implementations, the data indicative of the change made to the source code snippet comprises an abstract syntax tree (“AST”). In various implementations, the data indicative of the change made to the source code snippet comprises a change graph. In various implementations, the machine learning model comprises a graph neural network (“GNN”).
In another aspect, a method implemented using one or more processors may include: obtaining data indicative of a change between a first version source code snippet and a second version source code snippet; obtaining data indicative of a change intent that was stored in memory in association with the change when the second version source code snippet was committed to a code base; applying the data indicative of the change as input across a machine learning model to generate a new code change embedding in a latent space; determining a distance in the latent space between the new code change embedding and a previous code change embedding in the latent space associated with the same change intent; and training the machine learning model based at least in part on the distance.
In various implementations, the distance may include a first distance, and the method may further include: determining a second distance in the latent space between the new code change embedding and another previous code change embedding in the latent space associated with a different change intent; and computing, using a loss function, an error based on the first distance and the second distance; wherein the training is based on the error.
In addition, some implementations include one or more processors of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the aforementioned methods. Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the aforementioned methods.
It should be appreciated that all combinations of the foregoing concepts and additional concepts described in greater detail herein are contemplated as being part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the subject matter disclosed herein.
Code knowledge system 102 may be configured to perform selected aspects of the present disclosure in order to help one or more clients 1101-P to generate and/or utilize code change intents associated with the update (or "migration") of one or more corresponding legacy code bases 1121-P. Each client 110 may be, for example, an entity or organization such as a business (e.g., financial institute, bank, etc.), non-profit, club, university, government agency, or any other organization that operates one or more software systems. For example, a bank may operate one or more software systems to manage the money under its control, including tracking deposits and withdrawals, tracking loans, tracking investments, and so forth. An airline may operate one or more software systems for booking/canceling/rebooking flight reservations, managing delays or cancelations of flights, managing people associated with flights, such as passengers, air crews, and ground crews, managing airport gates, and so forth.
Many of these entities' code bases 112 may be highly complex, requiring teams of programmers and/or software engineers to perform code base migrations, maintenance, and/or updates. Many of these personnel may be under considerable pressure, and may place low priority on composing descriptive and/or helpful code change intents, in embedded comments or as part of change list entries. Accordingly, code knowledge system 102 may be configured to leverage knowledge of past code base migration, update, or maintenance events, and/or past code change intents composed in association with these events, in order to automate composition and/or summarization of code change intents. Code change intents may be embodied in various forms, such as in change list entries that are sometimes required when an updated piece of code (referred to herein as a "source code snippet") is committed (e.g., installed, stored, incorporated) into a code base, in comments (e.g., delimited with symbols such as "//" or "#") embedded in the source code, in change logs, or anywhere else where human-composed language indicating an intent behind a source code change might be found.
In various implementations, code knowledge system 102 may include a machine learning ("ML" in FIG. 1) database 104 that includes one or more trained machine learning models 1061-N.
In some implementations, code knowledge system 102 may also have access to one or more up-to-date code bases 1081-M. In some implementations, these up-to-date code bases 1081-M may be used, for instance, to train one or more of the machine learning models 1061-N. In some such implementations, and as will be described in further detail below, the up-to-date code bases 1081-M may be used in combination with other data to train machine learning models 1061-N, such as non-up-to-date code bases (not depicted) that were updated to yield up-to-date code bases 1081-M. “Up-to-date” as used herein is not meant to require that all the source code in the code base be the absolute latest version. Rather, “up-to-date” may refer to a desired state of a code base, whether that desired state is the most recent version code base, the most recent version of the code base that is considered “stable,” the most recent version of the code base that meets some other criterion (e.g., dependent on a particular library, satisfies some security protocol or standard), etc.
In various implementations, a client 110 that wishes to take advantage of techniques described herein for generating and/or utilizing code change intents when migrating, updating, or even maintaining its legacy code base 112 may establish a relationship with an entity (not depicted in FIG. 1) that hosts code knowledge system 102.
Beginning at the top left, a codebase 216 may include one or more source code snippets 2181-Q of one or more types. For example, in some cases a first source code snippet 2181 may be written in Python, another source code snippet 2182 may be written in Java, another 2183 in C/C++, and so forth. Additionally or alternatively, each of elements 2181-Q may represent one or more source code snippets from a particular library, entity, and/or application programming interface (“API”). Each source code snippet 218 may comprise a subset of a source code file or an entire source code file, depending on the circumstances. For example, a particularly large source code file may be broken up into smaller snippets (e.g., delineated into functions, objects, etc.), whereas a relatively short source code file may be kept intact throughout processing.
At least some of the source code snippets 2181-Q of code base 112 may be converted into an alternative form, such as a graph or tree form, in order for them to be subjected to additional processing. For example, in FIG. 2, source code snippets 2181-Q are converted into ASTs 2221-R.
A dataset builder 224, which may be implemented using any combination of hardware and machine-readable instructions, may receive the ASTs 2221-R as input and generate, as output, various different types of data that may be used for various purposes in downstream processing. For example, in FIG. 2, dataset builder 224 generates "delta" data 226, which may include one or more change graphs 228 that represent changes made between versions of the source code snippets.
Code change intents 232 may be assigned to change graphs 228 for training purposes. Each code change intent 232 may include text that conveys the intent of the software engineer or programmer when they changed/edited the source code snippet underlying the change graph under consideration. For example, each of change graphs 228 may be labeled with a respective code change intent of code change intents 232. The respective code change intents may be used to map the changes conveyed by the change graphs 228 to respective regions in a latent space. For example, a code change intent “migrate from language_A to language_B” may be applied to one change of a source code snippet, another code change intent, “link to more secure encryption library,” may be applied to another change of another source code snippet, and so on.
An AST2VEC component 234 may be configured to generate, from delta data 226, one or more feature vectors, i.e. “latent space” embeddings 244. For example, AST2VEC component 234 may apply change graphs 228 as input across one or more machine learning models to generate respective latent space embeddings 244. The machine learning models may take various forms as described previously, such as a GNN 252, a sequence-to-sequence model 254 (e.g., an encoder-decoder), etc.
During training, a training module 250 may train a machine learning model such as GNN 252 or sequence-to-sequence model 254 to generate embeddings 244 based directly or indirectly on source code snippets 2181-Q. These embeddings 244 may capture semantic and/or syntactic properties of the source code snippets 2181-Q, as well as a context in which those snippets are deployed. In some implementations, as multiple change graphs 228 are input across the machine learning model (particularly GNN 252), the code change intents 232 assigned to them may be used as part of a loss function that determines whether comparable changes are clustering together properly in the latent space.
Suppose an embedding generated from a source code change motivated by a particular code change intent (e.g., “link to more secure encryption library”) is not sufficiently proximate to other embeddings having the same or similar code change intent (e.g., is closer to embeddings of other code change intents). GNN 252 may be trained, e.g., using techniques such as gradient descent and back propagation (e.g., as part of a triplet loss training procedure). This training process may be repeated over numerous training examples until GNN 252 is able to accurately map change graphs, and more generally, data structures representing source code changes, to regions in the latent space near other, syntactically/semantically similar data structures.
With GNN 252 in particular, the constituent ASTs of delta data 226, which, recall, were generated from the source code snippets and may include change graphs in the form of ASTs, may be operated on as follows. Features (which may be manually selected or learned during training) may be extracted for each node of the AST to generate a feature vector for each node. Recall that nodes of the AST may represent a variable, object, or other programming construct. Accordingly, features of the feature vectors generated for the nodes may include features like variable type (e.g., int, float, string, pointer, etc.), name, operator(s) that act upon the variable as operands, etc. A feature vector for a node at any given point in time may be deemed that node's "state."
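For instance, a toy per-node feature extractor over a Python AST might resemble the sketch below; the three features and their encoding are illustrative assumptions, and, as noted above, such features may instead be learned during training.

```python
# A toy feature extractor; the chosen features are illustrative only.
import ast

def node_state(node):
    """Initial state for one AST node: [is_function, is_variable, name length]."""
    name = getattr(node, "name", getattr(node, "id", ""))
    return [
        float(isinstance(node, ast.FunctionDef)),
        float(isinstance(node, ast.Name)),
        float(len(name)),
    ]

tree = ast.parse("def area(r):\n    return 3.14159 * r * r\n")
states = {id(node): node_state(node) for node in ast.walk(tree)}
```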
Meanwhile, each edge of the AST may be assigned a machine learning model, e.g., a particular type of machine learning model or a particular machine learning model that is trained on particular data. For example, edges representing “if” statements may each be assigned a first neural network. Edges representing “else” statements also may each be assigned the first neural network. Edges representing conditions may each be assigned a second neural network. And so on.
Then, for each time step of a series of time steps, feature vectors, or states, of each node may be propagated to their neighbor nodes along the edges/machine learning models, e.g., as projections into latent space. In some implementations, incoming node states to a given node at each time step may be summed (which is order-invariant), e.g., with each other and the current state of the given node. As more time steps elapse, a radius of neighbor nodes that impact a given node of the AST increases.
Intuitively, knowledge about neighbor nodes is incrementally “baked into” each node's state, with more knowledge about increasingly remote neighbors being accumulated in a given node's state as the machine learning model is iterated more and more. In some implementations, the “final” states for all the nodes of the AST may be reached after some desired number of iterations is performed. This number of iterations may be a hyper-parameter of GNN 252. In some such implementations, these final states may be summed to yield an overall state or embedding (e.g., 244) of the AST.
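The propagation loop described in the preceding paragraphs might be sketched as follows, with fixed random matrices standing in for the learned per-edge-type networks; the toy graph, dimensions, and iteration count are all illustrative.

```python
# A toy propagation loop; random matrices stand in for learned edge networks.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
edge_nets = {
    "if": rng.normal(size=(DIM, DIM)),    # one transform per edge type
    "cond": rng.normal(size=(DIM, DIM)),
}

states = {n: rng.normal(size=DIM) for n in ("a", "b", "c")}  # initial node states
edges = [("a", "b", "if"), ("a", "c", "cond")]               # (src, dst, edge type)

for _ in range(3):  # number of iterations is a hyper-parameter
    incoming = {n: np.zeros(DIM) for n in states}
    for src, dst, etype in edges:
        incoming[dst] += edge_nets[etype] @ states[src]  # project along the edge
    states = {n: states[n] + incoming[n] for n in states}  # order-invariant sum

graph_embedding = sum(states.values())  # readout: sum of final node states
```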
In some implementations, for change graphs 228, edges and/or nodes that form part of the change may be weighted more heavily during processing using GNN 252 than other edges/nodes that remain constant across versions of the underlying source code snippet. Consequently, the change(s) between the versions of the underlying source code snippet may have greater influence on the resultant state or embedding representing the whole of the change graph 228. This may facilitate clustering of embeddings generated from similar code changes in the latent space, even if some of the contexts surrounding these embeddings differ somewhat.
For sequence-to-sequence model 254, training may be implemented using implicit labels that are manifested in a sequence of changes to the underlying source code. Rather than training on source and target ASTs, it is possible to train using the entire change path from a first version of a source code snippet to a second version of the source code snippet. For example, sequence-to-sequence model 254 may be trained to predict, based on a sequence of source code elements (e.g., tokens, operators, etc.), an “updated” sequence of source code elements that represent the updated source code snippet. In some implementations, both GNN 252 and sequence-to-sequence model 254 may be employed, separately and/or simultaneously.
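As a rough illustration of how such token-level training pairs might be derived, consider the following sketch; the tokenizer choice and the pair format are assumptions, as the disclosure leaves the exact sequence representation open.

```python
# A sketch of deriving a token-level training pair from two versions.
import io
import tokenize

def tokens(source):
    return [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.string.strip()
    ]

before = "total = a + b"
after = "grand_total = a + b"
training_example = (tokens(before), tokens(after))  # (input seq, target seq)
print(training_example)
# (['total', '=', 'a', '+', 'b'], ['grand_total', '=', 'a', '+', 'b'])
```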
Once the machine learning models (e.g., 252-254) are adequately trained, they may be used during an inference phase to help generate code change intents and/or to map code changes to code change intents, e.g., for change list summarization purposes. During inference, many of the operations of FIG. 2 may be performed in the same or a similar fashion.
The source code snippets 2181-Q are once again used to generate ASTs 2221-R, which are processed by dataset builder 224 to generate change graphs 228. These change graphs 228 are applied by AST2VEC component 234 as input across one or more of the trained machine learning models (e.g., 252, 254) to generate new source code change embeddings 244 in latent space. Then, one or more reference source code change embeddings in the latent space may be identified, e.g., by a change list (“CL”) generator 246, based on respective distances between the one or more reference source code change embeddings and the new source code change embedding in the latent space.
Based on the identified one or more reference source code change embeddings, CL generator 246 may identify (e.g., predict) one or more code change intents, e.g., which may be associated with the reference source code change embeddings themselves and/or associated with a region of latent space containing a cluster of similar reference source code change embeddings. These identified code change intents may be output at block 248. In some implementations, if a code change intent is identified with a sufficient measure of confidence, the code change intent may be automatically associated with the updated source code snippet 218, e.g., as a change list entry. A code change intent with a lesser measure of confidence may be presented to the user for editing and/or approval before it is associated with the updated source code snippet.
ASTs 364, 364′ may be compared, e.g., by dataset builder 224, to generate a change graph 228 that reflects this change. Change graph 228 may then be processed, e.g., by AST2VEC 234 using a machine learning model such as GNN 252 and/or sequence-to-sequence model 254, to generate a latent space embedding as shown by the arrow. In this example, the latent space embedding falls within a region 3541 of latent space 352 in which other reference embeddings (represented in FIG. 3) generated from semantically similar source code changes are clustered.
As part of training the machine learning model, in some implementations, data indicative of a change between a first version source code snippet and a second version source code snippet, e.g., change graph 228, may be labeled (with 232) with a code change intent, which may be obtained from a change list entry, embedded comment, etc. Change graph 228 may then be applied, e.g., by AST2VEC component 234, as input across a machine learning model (e.g., 252) to generate a new source code change embedding in latent space 352. Next, a distance in the latent space between the new source code change embedding and a previous (e.g., reference) source code change embedding in the latent space associated with the same code change intent may be determined and used to train the machine learning model. For example, if the distance is too great—e.g., greater than a distance between the new source code change embedding and a reference source code change embedding of a different code change intent—then techniques such as back propagation and gradient descent may be applied to alter weight(s) and/or parameters of the machine learning model. This training technique may be referred to as “triplet loss.” Eventually after enough training, reference embeddings having the same or similar underlying code change intents will cluster together in latent space 352.
For example, a first code change intent, “Link code to improved encryption library,” is designated as the code change intent that motivated edits made to three source code files, “RECONCILE_DEBIT_ACCOUNTS.CC,” “RECONCILE_CREDIT_ACCOUNTS.CC,” and “DISPLAY_AR_GRAPHICAL.JAVA.” When these three source code files were processed as described above, their respective source code change embeddings may have been clustered together near each other and/or other reference source code change embeddings that were all associated with the intent by programmers/software engineers of linking code to an improved encryption library.
A second code change intent, "Update function arguments to comply with new standard," is designated as the code change intent that motivated edits made to four source code files, "CUSTOMER_TUTORIAL.C," "MERGE_ENTITY_FIELDS.PL," "ACQUISITION_ROUNDUP.PHP," and "ERROR_DECISION_TREE.PL." When these four source code files were processed as described above, their respective source code change embeddings may have been clustered together near each other and/or other reference source code change embeddings that were all associated with the intent by programmers/software engineers of updating function arguments to comply with the new standard.
In FIG. 5, an example method 500 for practicing selected aspects of the present disclosure is depicted.
At block 502, the system may apply data indicative of a change made to a source code snippet as input across a machine learning model (e.g., GNN 252) to generate a new source code change embedding in a latent space. An example of this occurring was depicted in FIG. 3.
At block 504, the system may identify one or more reference source code change embeddings in the latent space based on one or more distances between the one or more reference source code change embeddings and the new source code change embedding in the latent space. Each of the one or more reference source code change embeddings may have been generated previously by applying data indicative of a change, made to a reference first version source code snippet to yield a reference second version source code snippet, as input across the machine learning model. For example, in some implementations, the system may identify reference source code change embeddings that are within some threshold distance from the new source code change embedding in latent space. These distances may be determined using techniques such as the dot product, cosine similarity, etc.
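A vectorized sketch of this threshold query follows; the embeddings are random stand-ins and the 0.5 similarity threshold is an illustrative assumption.

```python
# A vectorized sketch of the threshold query; all values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
references = rng.normal(size=(100, 16))  # reference source code change embeddings
new = rng.normal(size=16)                # new source code change embedding

refs_n = references / np.linalg.norm(references, axis=1, keepdims=True)
sims = refs_n @ (new / np.linalg.norm(new))  # cosine similarity via dot product
neighbor_ids = np.flatnonzero(sims >= 0.5)   # references within the threshold
```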
Based on the one or more reference embeddings identified at block 504, at block 506, the system may identify one or more code change intents. For example, the reference source code change embeddings may be associated, e.g., in a lookup table or database, with code change intents. In some implementations, a region of latent space may be associated with (e.g., assigned) a code change intent, and any source code change embedding that is located within that region may be considered to have that code change intent.
At block 508, the system may create an association between the source code snippet and the one or more code change intents. In some implementations, creating this association may include automatically generating a change list entry based on one or more of the code change intents, e.g., as depicted in FIG. 4.
At block 602, the system may obtain data indicative of a change between a first version source code snippet and a second version source code snippet. For example, a change graph 228 may be generated, e.g., by dataset builder 224, based on “before” and “after” versions of the source code snippet. At block 604, the system, e.g., by way of dataset builder 224, may obtain data indicative of a change intent that was stored in memory in association with the change when the second version source code snippet was committed to a code base. This code change intent may be found, for instance, within the source code as an embedded comment or in a change list entry.
At block 606, the system may apply the data indicative of the change (e.g., change graph 228) as input across a machine learning model, e.g., GNN 252, to generate a new embedding in a latent space. At block 608, the system may determine distance(s) in the latent space between the new embedding and previous embedding(s) in the latent space associated with the same and/or different change types. These distances may be computed using techniques such as cosine similarity, dot product, etc.
At block 610, the system may compute an error using a loss function and the distance(s) determined at block 608. For example, if a new source code change embedding having a code change intent "upgrade to 5G library" is closer to previous source code change embedding(s) of the type "link to new template library" than it is to previous embeddings of the type "upgrade to 5G library," that may signify that the machine learning model that generated the new embedding needs to be updated, or trained. Accordingly, at block 612, the system may train the machine learning model based at least in part on the error computed at block 610. The training of block 612 may involve techniques such as gradient descent and/or back propagation. Additionally or alternatively, in various implementations, other types of labels and/or training techniques may be used to train the machine learning model, such as weak supervision or triplet loss, which may include the use of labels such as similar/dissimilar or close/not close.
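The error computation of blocks 608-612 might be sketched as a triplet-style hinge over the two distances, as follows; the margin value is an illustrative assumption.

```python
# A sketch of the hinge-style error over the two distances (margin assumed).
import numpy as np

def triplet_error(new, same_intent_ref, different_intent_ref, margin=1.0):
    d_same = np.linalg.norm(new - same_intent_ref)       # first distance (block 608)
    d_diff = np.linalg.norm(new - different_intent_ref)  # second distance
    return max(d_same - d_diff + margin, 0.0)  # zero once same-intent is closer by margin

new = np.array([0.2, 0.9])
error = triplet_error(new, np.array([0.1, 1.0]), np.array([1.0, 0.0]))
```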
Referring to both FIGS. 7 and 8, at block 702, the system may group a plurality of change lists corresponding to a plurality of source code changes into a plurality of clusters, e.g., based on their underlying code change intents as described previously.
At block 704, the system may generate a plurality of change graphs 828 associated with the plurality of source code changes. Each change graph 828 may reflect a corresponding source code change. In FIG. 8, change graphs 828 are depicted organized into clusters 874 that correspond to these intent-based groupings.
At block 706, the system, e.g., by way of AST2VEC component 234, may sample change graphs 828 from different clusters 874 to learn code change embeddings 844 representing the plurality of source code changes. In some implementations, these learned embeddings may be incorporated into a GNN, such as GNN 252 described previously, or may be incorporated into a feed-forward neural network. In some implementations, the sampling of block 706 may be performed using techniques such as triplet loss, and may include sampling an anchor input and a positive input from a first cluster of the plurality of clusters, and sampling a negative input from a second cluster of the plurality of clusters. Respective distances between the anchor input and the positive and negative inputs may then be used to train, for instance, GNN 252 and/or another machine learning model, such as a feed-forward neural network.
At block 708, the system, e.g., by way of CL generator 246 or another similar component, may use the learned code change embeddings to train a natural language processing model 882 to predict code change intents (output 848). In FIG. 8, for instance, code change embeddings 844 are applied as input across natural language processing model 882 to generate predicted code change intents as output 848.
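As a stand-in illustration of this final stage, the sketch below fits a simple scikit-learn classifier from synthetic code change embeddings to intent labels; the disclosure contemplates generative models (e.g., LSTMs or transformers), so this simplification is for clarity only.

```python
# A stand-in for model 882: a classifier from embeddings to intent labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0, 1, (20, 16)),   # cluster for one intent
                        rng.normal(3, 1, (20, 16))])  # cluster for another intent
intents = ["change variable name"] * 20 + ["link new encryption library"] * 20

model = LogisticRegression(max_iter=1000).fit(embeddings, intents)
print(model.predict(rng.normal(3, 1, (1, 16))))  # -> ['link new encryption library']
```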
User interface input devices 922 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 910 or onto a communication network.
User interface output devices 920 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 910 to the user or to another machine or computing device.
Storage subsystem 924 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 924 may include the logic to perform selected aspects of the methods of FIGS. 5, 6, and 7.
These software modules are generally executed by processor 914 alone or in combination with other processors. Memory 925 used in the storage subsystem 924 can include a number of memories including a main random access memory (RAM) 930 for storage of instructions and data during program execution and a read only memory (ROM) 932 in which fixed instructions are stored. A file storage subsystem 926 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 926 in the storage subsystem 924, or in other machines accessible by the processor(s) 914.
Bus subsystem 912 provides a mechanism for letting the various components and subsystems of computing device 910 communicate with each other as intended. Although bus subsystem 912 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
Computing device 910 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 910 depicted in FIG. 9 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 910 are possible having more or fewer components than the computing device depicted in FIG. 9.
While several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary, and the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.
Number | Date | Country
62949746 | Dec 2019 | US