This U.S. patent application claims priority under 35 U.S.C. § 119 to: Indian Patent Application No. 202321029623, filed on Apr. 24, 2023. The entire contents of the aforementioned application are incorporated herein by reference.
The disclosure herein generally relates to source code verification, and, more particularly, to a method and system for source code verification using machine learning based strategy prediction.
Machine learning approaches are now being used by the software engineering community for a number of software engineering tasks, including software testing, source code representation, and source code analysis such as vulnerability identification. Source code verification is crucial for the secure operation of software applications. Defect identification, for instance, must be carried out in all types of software applications. However, it is considered a challenging task as no single known technique succeeds on all source codes. Slicing can be used as a maintenance or reuse technique for tasks such as program understanding, debugging, regression testing, and function extraction from existing source code.
As an interactive tool, program slicing is a software analysis technique that identifies a subset of program statements that may affect the computation at a specific program point. With the growth of machine learning and software engineering, faster analysis is necessary in agile development environments for scalable and accurate source code verification. Existing techniques do not utilize machine learning to verify errors in a source code based on verification strategies with heuristically defined time constraints. Moreover, maintenance of the verification strategies is a challenge because each strategy caters to verifying only certain properties of the source code, which limits the ability to prove safety of the source code within loops.
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a system for source code verification using machine learning based strategy prediction is provided. The system receives a source code comprising a plurality of function properties to be verified. The system further slices, by using at least one sequence slicer among a plurality of sequence slicers, the source code into a plurality of slices, wherein the sequence slicer includes at least one of a slice analyzer and a shallow slicer. Further, a feature vector generator extracts a plurality of feature vectors from each slice among the plurality of slices based on a true boolean value, a false boolean value and a plurality of boolean features, wherein each of the plurality of feature vectors indicates syntax and semantics of the source code. Further, a neural network generates a plurality of likelihood of success values by applying a plurality of verification techniques over each feature vector among the plurality of feature vectors, wherein an input node in the neural network represents each feature vector, and an output node represents a likelihood of success value generated for each verification technique corresponding to the input node. Then, a verification strategy is predicted by the neural network by sorting the plurality of likelihood of success values of each verification technique corresponding to each feature vector in decreasing order, and the verification strategy is applied over each slice to be verified. Finally, a verification result is displayed as one of a verification successful (S) when each slice is successfully verified by the verification strategy, a verification failure (F) when each slice is verified by the verification strategy and at least one of the plurality of function properties fails in a slice, and an unknown (U) when a slice is not verified because the verification strategy runs out of time or memory.
In another aspect, a method for source code verification using machine learning based strategy prediction is provided. The method includes receiving a source code comprising a plurality of function properties to be verified. The method further includes slicing, by using at least one sequence slicer among a plurality of sequence slicers, the source code into a plurality of slices, wherein the sequence slicer includes at least one of a slice analyzer and a shallow slicer. Further, a feature vector generator extracts a plurality of feature vectors from each slice among the plurality of slices based on a true boolean value, a false boolean value and a plurality of boolean features, wherein each of the plurality of feature vectors indicates syntax and semantics of the source code. Further, a neural network generates a plurality of likelihood of success values by applying a plurality of verification techniques over each feature vector among the plurality of feature vectors, wherein an input node in the neural network represents each feature vector, and an output node represents a likelihood of success value generated for each verification technique corresponding to the input node. Then, a verification strategy is predicted by the neural network by sorting the plurality of likelihood of success values of each verification technique corresponding to each feature vector in decreasing order, and the verification strategy is applied over each slice to be verified. Finally, a verification result is displayed as one of a verification successful (S) when each slice is successfully verified by the verification strategy, a verification failure (F) when each slice is verified by the verification strategy and at least one of the plurality of function properties fails in a slice, and an unknown (U) when a slice is not verified because the verification strategy runs out of time or memory.
In yet another aspect, a non-transitory computer readable medium storing instructions for source code verification using machine learning based strategy prediction is provided. The instructions, when executed by one or more hardware processors, cause receiving a source code comprising a plurality of function properties to be verified, and further cause slicing, by using at least one sequence slicer among a plurality of sequence slicers, the source code into a plurality of slices, wherein the sequence slicer includes at least one of a slice analyzer and a shallow slicer. Further, a feature vector generator extracts a plurality of feature vectors from each slice among the plurality of slices based on a true boolean value, a false boolean value and a plurality of boolean features, wherein each of the plurality of feature vectors indicates syntax and semantics of the source code. Further, a neural network generates a plurality of likelihood of success values by applying a plurality of verification techniques over each feature vector among the plurality of feature vectors, wherein an input node in the neural network represents each feature vector, and an output node represents a likelihood of success value generated for each verification technique corresponding to the input node. Then, a verification strategy is predicted by the neural network by sorting the plurality of likelihood of success values of each verification technique corresponding to each feature vector in decreasing order, and the verification strategy is applied over each slice to be verified. Finally, a verification result is displayed as one of a verification successful (S) when each slice is successfully verified by the verification strategy, a verification failure (F) when each slice is verified by the verification strategy and at least one of the plurality of function properties fails in a slice, and an unknown (U) when a slice is not verified because the verification strategy runs out of time or memory.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Embodiments herein provide a method and system for source code verification using machine learning based strategy prediction. The system enables providing a fast, reliable, and scalable verification strategy for source code verification. Software requires rigorous verification and analysis criteria to be satisfied to ensure the absence of errors. The method of the present disclosure is a machine learning based verification strategy predictor that predicts a sequence of verification techniques for verifying source code. There are multiple techniques for source code verification, and each technique performs differently for different types of software programs or source codes. Herein, ‘verification’ may refer to finding error paths as well as proving validity of the input properties of the source code(s). The effectiveness of a verification technique is program and property dependent. Hence, without knowledge of the verification techniques that may be suitable for a given source code, the process of verification may be inefficient, cumbersome and time consuming. Experimental data suggests that each verification technique succeeds only on a specific class of source code, which motivates sequencing various verification techniques. When a new class is encountered, experts define a new verification strategy and the strategy-selection algorithm of the verifier is updated, which is a tedious task. Also, the verification strategies must be re-defined in the event that a new verification technique is required to supplement the existing set of verification strategies. The method disclosed employs a machine learning model to predict a verification strategy for source code verification.
Referring now to the drawings, and more particularly to
Referring to the components of the system 100, in an embodiment, the processor(s) 104 can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 104 is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud, and the like.
The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like, and can facilitate multiple communications within a wide variety of networks (N/W) and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface(s) 106 can include one or more ports for connecting a number of devices (nodes) of the system 100 to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
In an embodiment, the memory 102 includes a plurality of modules 108, which can also include various sub-modules as depicted in
The memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. Functions of the components of the system 100, for source code verification using machine learning based strategy prediction, are explained in conjunction with
The sequence slicer unit 202 of the system 100 includes a slice analyzer 202A and a shallow slicer 202B, and receives an input source code P to be verified. The slice analyzer 202A performs a scalable slicing of the input source code P with respect to calls in the entry function main. The shallow slicer 202B slices the input source code P and allows multiple calls in the entry function main that refer to the same global variable and transitively invoke the same function. For example, the verification by abstraction tool (VeriAbsL) is a reachability verifier that verifies the source code P. Consider the following example (Example 1 in Table 1) having the source code P to be verified using a machine learning based verification strategy.
In the present example, the source code P is input to the VeriAbsL tool, which checks whether the source code can be sliced. Source code that can be sliced is divided into a plurality of slices (P1, P2, P3 . . . Pj).
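As a non-limiting illustration, the following Python sketch shows one way a sequence slicer may dispatch the input source code P to the slice analyzer 202A and the shallow slicer 202B and fall back to verifying the whole program when slicing is not possible; the class names, the slice( ) methods and the fallback order are hypothetical and are not taken from the VeriAbsL tool.

```python
# Hypothetical sketch of a sequence slicer; SliceAnalyzer, ShallowSlicer and
# sequence_slice are illustrative names, not the VeriAbsL implementation.
from typing import List


class SliceAnalyzer:
    """Scalable slicing with respect to the calls in the entry function main."""

    def slice(self, program: str) -> List[str]:
        # A real slicer would compute dependence-based slices per call in main;
        # the placeholder below simply returns the whole program as one slice.
        return [program]


class ShallowSlicer:
    """Slicing that allows multiple calls in main sharing globals and callees."""

    def slice(self, program: str) -> List[str]:
        return [program]  # placeholder


def sequence_slice(program: str) -> List[str]:
    """Try each slicer in sequence; return the slices of the first that succeeds."""
    for slicer in (SliceAnalyzer(), ShallowSlicer()):
        slices = slicer.slice(program)
        if slices:
            return slices
    return [program]  # programs that cannot be sliced are verified as a whole
```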
The feature vector generator 204 of the system 100 extracts a plurality of feature vectors {right arrow over (f)} comprising features (f1, f2, f3 . . . fn) from each slice among the plurality of slices based on a true boolean value, a false boolean value and a plurality of boolean features. Each feature vector indicates the syntax and semantics of the input slice Pj. For example, a true boolean feature fi∈ {right arrow over (f)} indicates the presence of arrays in the slice Pj, and a false boolean feature indicates that no arrays are used in the corresponding slice.
The neural network 206 of the system 100 receives each feature vector {right arrow over (f)} of each slice Pj to generate a plurality of likelihood of success values (l1, l2, l3 . . . li) by applying a plurality of verification techniques. The VeriAbsL tool uses a three layered neural network with multi-class classification, for example 10 classes for 10 verification techniques in a verification portfolio 210. In the neural network 206, an input node represents the feature vector, and an output node represents a likelihood of success value generated for the verification technique corresponding to that output node.
The verification strategy predictor 208 of the system 100 predicts a verification strategy by sorting the plurality of likelihood of success values li of each verification technique in decreasing order.
The verification portfolio 210 of the system 100 applies the verification strategy over each slice to be verified. The plurality of verification techniques may include, for example, a k-Induction, a loop shrinking, an array pruning, and the like.
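As a minimal sketch (not the VeriAbsL implementation), the following Python code illustrates how a predicted verification strategy may be applied over a single slice by trying each technique of the verification portfolio 210 in order under a time budget; the run_* stubs, the apply_strategy helper and the budget value are hypothetical, and the verdict codes "S", "F" and "U" follow the result codes described later in this disclosure.

```python
# Hypothetical sketch of applying a verification strategy over one slice.
import time
from typing import Callable, Dict, List


def run_k_induction(slice_code: str) -> str:
    return "U"  # placeholder: a real technique returns "S", "F" or "U"


def run_loop_shrinking(slice_code: str) -> str:
    return "U"  # placeholder


def run_array_pruning(slice_code: str) -> str:
    return "U"  # placeholder


PORTFOLIO: Dict[str, Callable[[str], str]] = {
    "k_induction": run_k_induction,
    "loop_shrinking": run_loop_shrinking,
    "array_pruning": run_array_pruning,
}


def apply_strategy(slice_code: str, strategy: List[str], budget_s: float = 900.0) -> str:
    """Invoke the techniques of the predicted strategy in order until one concludes."""
    deadline = time.monotonic() + budget_s
    for name in strategy:
        if time.monotonic() >= deadline:
            break  # the strategy ran out of time -> unknown
        verdict = PORTFOLIO[name](slice_code)
        if verdict in ("S", "F"):
            return verdict  # conclusive result for this slice
    return "U"
```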
Referring to the steps of the method 300, at step 302, receiving, via the one or more hardware processors 104, a source code comprising a plurality of function properties to be verified. For example, the VeriAbsL tool receives the input source code P to be verified.
Referring to the steps of the method 300, at step 304, slicing, by using at least one sequence slicer among a plurality of sequence slicers, the source code into a plurality of slices. The sequence slicer unit 202 includes at least one of the slice analyzer 202A and the shallow slicer 202B, and divides the source code P into a plurality of slices (P1, P2, P3 . . . Pj).
Referring to the steps of the method 300, at step 306, extracting, by the feature vector generator 204, a plurality of feature vectors from each slice among the plurality of slices based on a true boolean value, a false boolean value and a plurality of boolean features, wherein each of the plurality of feature vectors indicates syntax and semantics of the source code P.
The plurality of boolean features comprises a goto label in program, a goto label in loop, unstructured forward-backward edges in loop, if array modified in loop, a constant reset in loop, a linear acceleration in loop, a linear acceleration intensive loop, a self-recurrence in loop, a mutual-recurrence in loop, a too-many if-else in the source code, a non-linear operation in loop, a non-linear operations intensive loop, non-linear operations in assertion, a no-loop, a small-known bound loop, a known-max bound loop, an unknown-bound loop, an infinite-loop, a multiple loop, a nested loop, and a recursive program.
The true boolean value is updated in each feature vector for each boolean feature among the plurality of boolean features that is present in the corresponding slice. The false boolean value is updated in each feature vector for each boolean feature that is absent in the corresponding slice.
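The following Python sketch illustrates, under stated assumptions, how such a boolean feature vector may be assembled; the FEATURE_CHECKS detectors are hypothetical placeholders for the static checks a real feature vector generator 204 would run over each slice, and only the true/false encoding described above is the point of the example.

```python
# Hypothetical sketch of boolean feature extraction over a slice.
from typing import Callable, Dict, List

FEATURE_CHECKS: Dict[str, Callable[[str], bool]] = {
    "goto-label-in-program": lambda s: "goto" in s,  # toy placeholder check
    "array-mod-in-loop": lambda s: False,            # placeholder static check
    "linear-acceleration-in-loop": lambda s: False,  # placeholder static check
    "known-max-bound-loop": lambda s: False,         # placeholder static check
    # ... remaining boolean features listed above would be added here
}


def extract_feature_vector(slice_code: str) -> List[bool]:
    """Encode each boolean feature as True if present in the slice, else False."""
    return [check(slice_code) for check in FEATURE_CHECKS.values()]
```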
For the above example, the plurality of feature vectors extracted from the source code is represented in Table 2 (in Table 2, GL indicates goto labels, and UFB indicates unstructured forward-backward edges).
With the above Table 2, for the given example, the extracted features identified as True may be, for example, "if array-mod-in-loop", "linear-acceleration-in-loop" and "known-max-bound-loop", and each feature vector is further given to the neural network 206 for the verification strategy prediction.
Referring to the steps of the method 300, at step 308, the neural network 206 generates a plurality of likelihood of success values by applying a plurality of verification techniques over each feature vector among the plurality of feature vectors. In the neural network 206, an input node represents each feature vector, and an output node represents a likelihood of success value generated for each verification technique corresponding to the input node.
The neural network 206 of the system 100 includes three layers, wherein a first layer comprises 22 neurons, a second layer comprises 17 neurons and a third layer comprises 10 neurons, and was trained using a rectified linear unit (ReLU) activation function and, for the output layer, a softmax activation function, with a mean-squared-error loss function. The neural network 206 translates the input feature vector {right arrow over (f)} representing the source code P into the plurality of likelihood of success values li, 1≤i≤10, of the corresponding verification techniques Ti in the verification portfolio 210 on the source code P. Each output node ni of the neural network 206 represents one verification technique Ti, and the corresponding likelihood of success value li generated by the neural network 206 at that node ni is the likelihood of success of the corresponding technique Ti.
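A plausible reconstruction of such a network in Keras is sketched below, reading the 22-neuron first layer as the input layer for the feature vector, applying ReLU on the hidden layer and softmax on the output layer, and using the mean-squared-error loss; the optimizer choice and the training pipeline are assumptions and are not taken from this disclosure.

```python
# Plausible Keras reconstruction of the 22-17-10 network described above.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(22,)),                      # 22 boolean features per slice
    tf.keras.layers.Dense(17, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one node per verification technique
])
model.compile(optimizer="adam", loss="mean_squared_error")  # optimizer is an assumption

# After training, each output node n_i gives the likelihood of success l_i of
# the corresponding technique T_i for the given feature vector.
feature_vector = np.zeros((1, 22), dtype=np.float32)  # placeholder input
likelihoods = model.predict(feature_vector)[0]        # shape (10,)
```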
For the above example, the neural network 206 of the system 100 predicts probabilities for each verification technique, and the VeriAbsL tool sorts the plurality of likelihood of success values in descending order. For example, the following likelihood of success values are generated for the corresponding verification techniques present in the verification portfolio 210: li=0.43807328, 0.28551188, 0.13186349, 0.057359967, 0.049513955, 0.019339733, 0.010554153, 0.004746521, 0.0015238631, 0.00055219093. The probability of the corresponding verification technique "Array Pruning" is maximum, which is 0.44, for the input source code P, and the result shows that this verification technique is able to verify the program in 24 secs.
Referring to the steps of the method 300, at step 310, predicting a verification strategy by the neural network by sorting the plurality of likelihood of success values of each verification technique corresponding to each feature vector in decreasing order, and applying the verification strategy over each slice to be verified. The verification strategy predictor 208 sorts the plurality of likelihood of success values li at all output nodes of the neural network 206 in decreasing order and invokes the corresponding verification techniques Ti in that order. For example, the verification strategy is represented as (Tk1, Tk2, . . . Tk10), 1≤kr≤10, 1≤r≤10, that is, the verification strategy orders the verification techniques Ti by their likelihoods of success li in decreasing order.
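The sorting of likelihood of success values into a verification strategy can be sketched as follows, using the example likelihood values quoted above; the mapping of output positions to technique names is hypothetical and is shown only so that the maximum value (about 0.44) corresponds to the array pruning technique, as described above.

```python
# Sketch of step 310: ranking techniques by likelihood of success.
import numpy as np

# Hypothetical position-to-technique mapping; only the first three names come
# from the portfolio examples in this disclosure.
techniques = ["array_pruning", "k_induction", "loop_shrinking"] + [
    f"technique_{i}" for i in range(4, 11)
]
likelihoods = np.array([0.43807328, 0.28551188, 0.13186349, 0.057359967, 0.049513955,
                        0.019339733, 0.010554153, 0.004746521, 0.0015238631, 0.00055219093])

order = np.argsort(likelihoods)[::-1]          # indices in decreasing likelihood
strategy = [techniques[i] for i in order]      # (T_k1, T_k2, . . . T_k10)
print(strategy[0])                             # the most likely technique is tried first
```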
Referring to the steps of the method 300, at step 312, displaying a verification result as one of a verification successful (S) when each slice is successfully verified by the verification strategy, a verification failure (F) when each slice is verified by the verification strategy and at least one of the plurality of function properties fails in at least one slice among the plurality of slices, and an unknown (U) when a slice is not verified because the verification strategy runs out of time or memory.
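A minimal sketch of composing the per-slice verdicts into the displayed verification result is given below; the composition rule shown (unknown if any slice is unverified, failure if a property fails in a verified slice, success otherwise) is an assumption consistent with the description above, not an assertion about the exact rule used by the VeriAbsL tool.

```python
# Hypothetical sketch of step 312: composing per-slice verdicts into one result.
from typing import Iterable


def compose_results(slice_verdicts: Iterable[str]) -> str:
    verdicts = list(slice_verdicts)
    if any(v == "U" for v in verdicts):  # some slice ran out of time or memory
        return "U"
    if any(v == "F" for v in verdicts):  # a function property fails in some slice
        return "F"
    return "S"                           # every slice verified successfully
```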
The machine learning based strategy prediction verifies 475, or 7.5%, more Reach-Safety benchmarks than the VeriAbs tool, which uses the same verification technique portfolio as VeriAbsL. This is because, for those benchmarks, the VeriAbsL tool predicted successful verification techniques early in its strategies, whereas the VeriAbs tool selected an unsuitable strategy and ran out of time. Further, the VeriAbsL tool took 4.4% less time compared to VeriAbs, and verified 1047 and 543 more Reach-Safety benchmarks compared to other machine learning based strategy prediction tools, for example Graves and PeSCo respectively.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined herein and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the present disclosure if they have similar elements that do not differ from the literal language of the present disclosure or if they include equivalent elements with insubstantial differences from the literal language of the embodiments described herein.
The embodiments of the present disclosure herein address the unresolved problem of source code verification. The embodiments thus provide a method and system for source code verification using machine learning based strategy prediction. Moreover, the embodiments herein further provide a reachability verifier that performs verification in three stages. The method of the present disclosure firstly divides the input source code using a combination of two slicers, secondly verifies each slice using predicted verification strategies, and lastly composes the results of verifying the individual slices. The feature vector generator generates a boolean feature vector to describe the input source code and passes it to the neural network to forecast the verification strategy. The neural network predicts a verification strategy for the input source code P. The verification strategy is a sequence of verification techniques from the verification portfolio in the order of likelihood of success.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.