In today's software code bases, there are hundreds or thousands of entities (e.g., modules, classes, functions, or methods). A typical software system may run for months or years in production. In terms of basis paths through the code, not all paths are executed all the time. Based on any number of variable factors, such as conditions, interlocks, dependencies, or rules, the system chooses the execution flow at runtime and does so dynamically. Over a period of time, it can be generalized that some parts of the code base are executed or utilized more frequently than other parts in terms of basis paths. It can also be generalized that the importance of a source code entity depends on its linkages with other source code entities and project work items.
Currently, there are no tools that can categorize and rank source code entities, such as classes, functions, or methods, based on their actual runtime usage and business importance. Code optimization efforts are distributed equally over the entire code base, which is not efficient. As a result, a greater than desired amount of time may be spent optimizing portions of code that are not executed very often. Test case writing efforts may likewise be diluted by distributing the effort to write and maintain test cases equally across the code base. The same may happen for code coverage and logging efforts.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described herein in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
One aspect may provide a method for dynamic categorization and ranking of source code entities and relationships. The method includes scanning source code of an application, extracting source code entities from the application, and generating a hierarchical source code entity graph model from extracted source code entities. The method also includes scanning a project management artifact repository, extracting project management artifacts from the project management artifact repository, and generating a hierarchical project management artifact graph model from extracted project management artifacts. The method further includes traversing the hierarchical source code entity graph model and the hierarchical project management artifact graph model, identifying linking relationships between the source code entities and the project management artifacts, defining a policy for software quality control from the linking relationships, monitoring source code changes to the application, and modifying source code development for the application when a violation of the policy is identified in response to the monitoring.
Another aspect may provide a system for dynamic categorization and ranking of source code entities and relationships. The system includes a memory having computer-executable instructions. The system also includes a processor operated by a storage system. The processor executes the computer-executable instructions. When executed by the processor, the computer-executable instructions cause the processor to perform operations. The operations include scanning source code of an application, extracting source code entities from the application, and generating a hierarchical source code entity graph model from extracted source code entities. The operations also include scanning a project management artifact repository, extracting project management artifacts from the project management artifact repository, and generating a hierarchical project management artifact graph model from extracted project management artifacts. The operations further include traversing the hierarchical source code entity graph model and the hierarchical project management artifact graph model, identifying linking relationships between the source code entities and the project management artifacts, defining a policy for software quality control from the linking relationships, monitoring source code changes to the application, and modifying source code development for the application when a violation of the policy is identified in response to the monitoring.
Another aspect may provide a computer program product for dynamic categorization and ranking of source code entities and relationships. The computer program product is embodied on a non-transitory computer readable medium and includes instructions that, when executed by a computer at a storage system, cause the computer to perform operations. The operations include scanning source code of an application, extracting source code entities from the application, and generating a hierarchical source code entity graph model from extracted source code entities. The operations also include scanning a project management artifact repository, extracting project management artifacts from the project management artifact repository, and generating a hierarchical project management artifact graph model from extracted project management artifacts. The operations further include traversing the hierarchical source code entity graph model and the hierarchical project management artifact graph model, identifying linking relationships between the source code entities and the project management artifacts, defining a policy for software quality control from the linking relationships, monitoring source code changes to the application, and modifying source code development for the application when a violation of the policy is identified in response to the monitoring.
Objects, aspects, features, and advantages of embodiments disclosed herein will become more fully apparent from the following detailed description, the appended claims, and the accompanying drawings in which like reference numerals identify similar or identical elements. Reference numerals that are introduced in the specification in association with a drawing figure may be repeated in one or more subsequent figures without additional description in the specification in order to provide context for other features. For clarity, not every element may be labeled in every figure. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments, principles, and concepts. The drawings are not meant to limit the scope of the claims included herewith.
Embodiments described herein provide an intelligent automated system that categorizes application source code entities in a ranking hierarchy based on their importance and usage over the duration of application execution. The system correlates the source code entities, such as classes, functions, methods, and test cases, with project management work items, such as themes, features, stories, tasks, and defects.
The importance determination above may be measured in terms of graph-based in/out referencing to other source code entities and graph-based in/out referencing to project work items. The usage may be measured in terms of basis paths and code execution hits at application runtime. The system of the embodiments herein is described as being based on Basis Paths, Page Rank, Graph Theory Principles, and Finite Undirected Cyclic Graphs.
Turning now to FIG. 1, a system 100 for dynamic categorization and ranking of source code entities and relationships in accordance with embodiments will now be described.

The system 100 of FIG. 1 includes a project development team 102, a project management team 104, a source code repository 106, a static code analyzer module 108, a project artifacts repository 110, a project artifacts analyzer module 112, an intermediate graph representation (IGR) repository 114, and a data transformation and persister module 116.
In an embodiment, the static code analyzer module 108 integrates with the source code base in version control, such as Team Foundation Server (TFS) by MICROSOFT, to scan/probe application modules. The static code analyzer module 108 may leverage C#.NET and .NET Reflection APIs. In embodiments, the static code analyzer module 108 performs code analysis and parsing. In particular, the module 108 extracts types (e.g., classes), functions, and methods defined in application modules (e.g., EXE/DLL). The analyzer module 108 also builds Intermediate Graph Representation (IGR) models, e.g., using a tree data structure, and persists the IGR models in a database (e.g., a Graph or No-SQL DB). In particular, the IGR models are stored in an IGR repository 114.
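By way of non-limiting illustration only, the following minimal Python sketch shows an analogous form of entity extraction for a Python code base using the standard ast module; the embodiment itself describes C#.NET and .NET Reflection APIs, so this is merely an analogous example, and the file name used is hypothetical:

import ast

# Parse a (hypothetical) source file and collect the classes and functions/methods it defines.
source = open("example_module.py").read()
tree = ast.parse(source)

entities = {"classes": [], "functions": []}
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        entities["classes"].append(node.name)
    elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        entities["functions"].append(node.name)

print(entities)  # e.g., {'classes': ['C1'], 'functions': ['F1', 'M1']}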
In an embodiment, the project artifacts analyzer module 112 integrates with source control using, e.g., VSTS Team Foundation REST APIs and Azure DevOps Services to analyze and extract management work items such as Themes, Features, Stories, Defects, Tasks, Changesets and their inter-linkages. The project artifacts analyzer module 112 scans and extracts all TFS Themes, Features, Stories, Tasks and Defects in a defined scope based on, e.g., Organization, Team, Project or Collections. The project artifacts analyzer module 112 also extracts all the changes made to a source code base in terms of check-ins and Changesets. It then finds the classes and functions that have been modified in those Changesets. The analyzer module 112 further identifies relationships between modified Classes/Functions tagged to Tasks and Defects, which in turn are tagged to respective Features and Stories. Further, the project artifacts analyzer module 112 builds intermediate graph representation (IGR) models from the project artifacts based on, e.g., a tree data structure, and persists the IGR models in a database (e.g., a Graph or No-SQL DB). As shown in FIG. 1, these IGR models are likewise stored in the IGR repository 114.
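As a minimal, non-limiting sketch (in Python, using hypothetical work item records rather than actual TFS/Azure DevOps REST calls, and with assumed field names), an intermediate tree of project artifacts might be assembled as follows:

# Hypothetical work item records as they might be returned by a project
# management system; the field names are assumptions for illustration only.
work_items = [
    {"id": 1, "type": "Feature", "parent": None},
    {"id": 2, "type": "Story", "parent": 1},
    {"id": 3, "type": "Task", "parent": 2},
    {"id": 4, "type": "Changeset", "parent": 3, "files": ["OrderService.cs"]},
]

# Build a simple parent -> children mapping (an intermediate tree representation).
tree = {}
for item in work_items:
    tree.setdefault(item["parent"], []).append(item["id"])

print(tree)  # {None: [1], 1: [2], 2: [3], 3: [4]}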
In an embodiment, a project development team 102 makes changes to source code entities via the repository 106. The static code analyzer module 108 scans source code in the repository 106 and extracts entities, such as classes and functions. The module 108 builds the IGR models from the extracted information, which may be stored in the IGR repository 114 shown in FIG. 1.
A project management team 104 defines, creates, and tracks changes to project entities, which are stored in the project artifacts repository 110. The project artifacts analyzer module 112 scans the repository 110 and extracts entities, such as features, defects, stories, and tasks, and builds the IGR from the extracted information, which is stored in repository 114.
Also shown in FIG. 1 is a data transformation and persister module 116.
The data transformation and persister module 116 reads the IGR models generated as described above and transforms them into Undirected Cyclic Graph (UCG) models (e.g., as shown and described in FIGS. 2A and 2B).
The UCG models are accessed and scanned, and all Changesets are extracted; each Changeset provides the list of Classes, Functions and Methods modified as part of that Changeset. On the other side, the Changeset is linked to Tasks->Stories->Features->Theme. Using a reverse graph traversal, the system can relate a Class to a Feature (e.g., as shown in the graph models 300 of FIG. 3).
In particular, graph model 200A represents a hierarchical graph model of source code artifacts (also referred to as source code entities graph model) generated by the static code analyzer module 108 shown in UCG format. Objects 202 are functions or methods found in the source code base, while all other objects (e.g., C1-Cn and C5C1, C5C2, C5C21, C7C1, C7C2, and C9C1) depicted in 200A represent classes or types. The UCG can be expressed as an n-node graph G=(N,E) with n Nodes and n−1 Edges where:
1≤|N|≤n and 0≤|E|≤n−1.
G is an ordered pair and is an undirected cyclic graph.
N={n1, n2, n3, n4, n5, n6, n7, n8, n9, . . . , nn}
E={{n1,n2},{n2,n3},{n3,n4},{n4,n5},{n5,n6},{n6,n7},{n7,n8},{n8,n9}, . . . ,{nn-1,nn}}
Thus, a node (e.g., n1) may correlate to node C1 in the graph model 200A of FIG. 2A.
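By way of non-limiting illustration only, such an undirected cyclic graph might be represented with the Python NetworkX package (which the embodiments reference elsewhere for page ranking); the node names below are hypothetical:

import networkx as nx

# Hypothetical source code entity graph: classes C1-C3 and functions F1/F2 as
# nodes, with membership/reference relationships as undirected edges.
G = nx.Graph()
G.add_nodes_from(["C1", "C2", "C3", "F1", "F2"])
G.add_edges_from([("C1", "C2"), ("C2", "C3"), ("C1", "F1"), ("C3", "F2")])

print(G.number_of_nodes(), G.number_of_edges())  # 5 4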
Graph model 200B of FIG. 2B represents a hierarchical graph model of project management artifacts (also referred to as a project artifacts graph model) generated by the project artifacts analyzer module 112, likewise shown in UCG format.
As shown in FIG. 3, graph models 300 illustrate linking relationships between the source code entities of graph model 200A and the project management artifacts of graph model 200B, e.g., via Changeset nodes.
Turning now to FIG. 4, a flow diagram of a process 400 for dynamic categorization and ranking of source code entities and relationships will now be described. In block 402, a source code entity model is built by scanning the source code base and extracting entities, such as classes, functions, and methods. In particular, a source code entities IGR model is built, e.g., by the static code analyzer module 108, from the source code scanned/extracted from the repository 106.
In block 404, a management entity model is built by scanning source controls to analyze and extract management work items (themes, features, stories, defects, tasks, changesets, and their interlinkages). In particular, a project management artifacts IGR model is built, e.g., by the project artifacts analyzer module 112 from scanned/extracted artifacts in the project artifacts repository 110. In an embodiment, the project artifacts analyzer module 112 scans and extracts all TFS themes, features, stories, tasks and defects in a defined scope based on Organization, Team, Project, or Collections. It extracts all the changes made to the source code base in terms of check-ins and changesets. A check-in refers to the act of making changes in software code and saving them to the source code base. A changeset refers to the set of files modified in one check-in and saved to the code base. A changeset may contain one or multiple source code files that have been modified. The project artifacts analyzer module 112 next finds classes and functions that have been modified in these changesets. The module 112 finds relationships between modified classes/functions tagged to tasks and defects, which in turn are tagged to respective features and stories. The module 112 then builds an IGR graph model based on a tree data structure.
In block 406, the IGR models generated in blocks 402 and 404 are read and transformed into UCG models (e.g., the respective graph models 200A and 200B of FIGS. 2A and 2B), e.g., by the data transformation and persister module 116.
In block 408, the UCG models are accessed and scanned by the data transformation and persister module 116 to extract all the changesets, which provide the list of classes, functions, and methods modified as part of each changeset. On the other side, each changeset is linked to tasks, stories, features, and a theme. Using reverse graph traversal, the system relates a class to a feature. In particular, the data transformation and persister module 116 transforms the IGR models to UCG models and extracts the changesets, which link the modified source code entities on one side to the project work items on the other side. By way of example, element group 304 shown in FIG. 3 illustrates such a linkage.
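As a minimal, non-limiting sketch of the reverse traversal described above (in Python, with hypothetical node names), a class can be related to a feature by following the changeset, task, and story nodes that connect them:

import networkx as nx

# Hypothetical linkage graph: a class connects to a changeset, which is linked
# through a task and a story to a feature, mirroring the UCG structure above.
G = nx.Graph()
G.add_edges_from([
    ("Class:C5", "Changeset:CS-101"),
    ("Changeset:CS-101", "Task:T-7"),
    ("Task:T-7", "Story:S-3"),
    ("Story:S-3", "Feature:F-1"),
])

# Traverse from the class and keep only the feature nodes that are reachable.
reachable = nx.node_connected_component(G, "Class:C5")
features = [n for n in reachable if n.startswith("Feature:")]
print(features)  # ['Feature:F-1']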
As indicated above, the embodiments provide an intelligent automated system that categorizes and builds a ranking hierarchy of application source code entities based on their usage over the duration of application execution and their importance based on their linkages to other source code entities and project work item entities. The system correlates the source code entities, such as classes, functions, methods, and test cases, with project management work items, such as themes, features, stories, tasks, and defects. In embodiments, the system identifies the parts of the code that are used more frequently by analyzing basis path hits at runtime and assigns a usage score to each source code entity based on this criterion. The system further identifies how each source code entity is related to other source code entities and project work item entities and assigns a linkage score to each source code entity based on those criteria.
These scores are monitored and used by the system to determine a final score, which is used to determine a category and relative rank for each class, function, and method.
Technology leadership and project management teams can define business rules based on the above rankings to enforce organization-wide software quality control and governance policies.
Non-limiting examples of business quality rules and policies include:
Every change in a top 100 Ranking Class/Function/Method has to be code reviewed by the Tech Lead (mandatorily);
Every change in a top 100 Ranking Class/Function/Method has to be unit tested with 100% code coverage before it's checked into source control;
Every change in a top 100 Ranking Class/Function/Method should have 0 compiler warnings;
Every change in a top 100 Ranking Class/Function/Method should have logging enabled at all levels (Error, Debug, Warning and Info);
Every change in a top 300 Ranking Class/Function/Method has to be code reviewed by the Tech Lead (optionally) or peer (mandatorily);
Every change in a top 300 Ranking Class/Function/Method has to be unit tested with at least 70% code coverage before it's checked into source control;
Every change in a top 300 Ranking Class/Function/Method can have warnings ignored;
Every change in a top 300 Ranking Class/Function/Method should have logging enabled at least at Error and Warning; and
Remaining Classes/Functions/Methods might have less severe or stringent policies configured based on the business scenario and organizational requirements.
Turning now to FIG. 5, a system 500 for dynamic categorization and ranking of source code entities and relationships in accordance with further embodiments will now be described.
In addition to the components described above, the system 500 of FIG. 5 includes a graph analyzer module 508, a basis path analyzer module 510, a business rules and policies engine 512, a code ranking engine, and a repository 516 of application logs and events.
The basis path analyzer module 510 analyzes application logs and events from the repository 516 (or one or more additional repositories, e.g., txt, XML, or JSON files, databases, or Splunk) and builds/updates basis path metrics. In particular, the module 510 builds basis path metrics (i.e., the portions of the code that were hit while an application was executed and the frequency of those hits). In an embodiment, the scope of measurement is at the Class and Function/Method level. The basis path analyzer module 510 may factor in logs generated over a defined interval of time and run as a daily job. The module 510 updates the metrics with new information over time. The metrics may be consolidated over a long period of time to ensure confidence in the metrics to predict and differentiate which portions of the code are considered important (e.g., which portions are executed most frequently in comparison to other portions of the code). The basis path analyzer module 510 also assigns a score to each and every source code entity found in those basis paths, as will now be described.
Let E={E1, E2, E3, E4, E5, E6, E7, . . . , En} be the set of source code entities;
Let M={M1, M2, M3, M4, M5, M6, M7, . . . , Mn} be the set of the source code entities' occurrences in basis paths; then
ni=(Mi−min(M))/(max(M)−min(M)),
wherein ni is the ith normalized data and ¬(max(M)−min(M)==0); and
ni=0.01 (default normalized score),
where (max(M)−min(M)==0) and (Mi−min(M)==0).
Thus, sample data and a score calculation using the above variables may be: min(O)=25, max(O)=100, and max(O)−min(O)=75, where O denotes the occurrence counts (i.e., the set M above).
As shown in FIG. 6, sample basis path metrics and the resulting normalized scores may be tabulated for a set of source code entities in this manner.
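By way of non-limiting illustration only, the following Python sketch counts basis path hits per source code entity from hypothetical parsed log records and applies the normalization above (the default score is applied whenever the range or the numerator is zero, which is one interpretation of the default case):

# Hypothetical parsed log records; each record names the class and method hit at runtime.
hits = [
    ("OrderService", "PlaceOrder"), ("OrderService", "PlaceOrder"),
    ("OrderService", "CancelOrder"), ("BillingService", "Invoice"),
]

# Count occurrences per entity at the Class.Method level of granularity.
occurrences = {}
for cls, method in hits:
    key = f"{cls}.{method}"
    occurrences[key] = occurrences.get(key, 0) + 1

low, high = min(occurrences.values()), max(occurrences.values())
scores = {}
for entity, m in occurrences.items():
    if high == low or m == low:
        scores[entity] = 0.01                      # default normalized score
    else:
        scores[entity] = (m - low) / (high - low)  # min-max normalization

print(scores)  # {'OrderService.PlaceOrder': 1.0, 'OrderService.CancelOrder': 0.01, 'BillingService.Invoice': 0.01}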
Returning to FIG. 5, the graph analyzer module 508 analyzes the source code entities graph model and generates metrics based on the linkages among source code entities, e.g., how many other Classes, Functions, or Methods a given source code entity is related to by graph edges (the more it is referred to in the network, the higher its score).
The graph analyzer module 508 also generates metrics using the project management artifacts. The metrics, by way of non-limiting examples, include: how many other project artifacts, such as Features, Stories, Tasks, or Defects, a Class is related to by graph edges via Changeset nodes (the more it is referred to in the network, the higher its score); and how many project work items, such as Features, Stories, Tasks, or Defects, a Function or Method is related to by graph edges via Changeset nodes (the more it is referred to in the network, the higher its score). Additional details are described further herein.
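As a non-limiting sketch (in Python, with hypothetical node names), the number of project work items that a class is related to via changeset nodes may be counted directly from such a graph:

import networkx as nx

# Hypothetical graph in which classes connect to work items only through changeset nodes.
G = nx.Graph()
G.add_edges_from([
    ("Class:C1", "Changeset:CS1"), ("Changeset:CS1", "Task:T1"),
    ("Changeset:CS1", "Defect:D1"), ("Class:C1", "Changeset:CS2"),
    ("Changeset:CS2", "Story:S1"), ("Class:C2", "Changeset:CS2"),
])

def related_work_items(graph, class_node):
    # Collect work items reachable from a class through its changeset neighbors.
    items = set()
    for changeset in graph.neighbors(class_node):
        for neighbor in graph.neighbors(changeset):
            if not neighbor.startswith("Class:"):
                items.add(neighbor)
    return items

print(len(related_work_items(G, "Class:C1")))  # 3 related work items -> higher score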
A page ranking algorithm (e.g., the PageRank implementation in the Python NetworkX package) may be used to compute these metrics by passing to it the source code and project work item graph models described above. The page ranking algorithm computes a ranking of the nodes in the graph based on the structure of the incoming links and works with both directed and undirected graphs.
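By way of non-limiting illustration only, the NetworkX PageRank computation referenced above may be invoked as follows on a small hypothetical graph:

import networkx as nx

# Hypothetical undirected graph of source code entities and project work items.
G = nx.Graph()
G.add_edges_from([
    ("C1", "C2"), ("C1", "C3"), ("C2", "Changeset1"),
    ("Changeset1", "Task1"), ("Task1", "Story1"), ("Story1", "Feature1"),
])

# PageRank handles undirected graphs by treating each edge as bidirectional.
ranks = nx.pagerank(G, alpha=0.85)
for node, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(node, round(score, 3))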
The system takes a weighted average of the scores from the graph analyzer module and calculates an aggregate score, as will now be described.
The system calculates scores based on the source code entities' linkages and the project work items' linkages. Sample scores for the source code entities' linkages are shown in a table 700 of FIG. 7.
Turning now to the score calculation based on source code entities' linkages:
Let E={E1, E2, E3, E4, E5, E6, E7, . . . , En} be the set of source code entities;
Let Msce={Msce1, Msce2, Msce3, Msce4, Msce5, Msce6, Msce7, . . . , Mscen} be the set of counts of other source code entities that these entities are linked to via inward edges in the source code graph; then
ni=(Mscei−min(Msce))/(max(Msce)−min(Msce)),
wherein ni is the ith normalized data and ¬(max(Msce)−min(Msce)==0); and
ni=0.01 (default normalized score),
where (max(Msce)−min(Msce)==0) and (Mscei−min(Msce)==0).
As shown in FIG. 7, sample scores based on the source code entities' linkages may be calculated in this manner.
Turning now to the score calculation based on project work items' linkages:
Let E={E1, E2, E3, E4, E5, E6, E7, . . . , En} be the set of source code entities;
Let Mpwi={Mpwi1, Mpwi2, Mpwi3, Mpwi4, Mpwi5, Mpwi6, Mpwi7, . . . , Mpwin} be the set of counts of project work items (e.g., Features, Stories, Tasks, and Defects) that these entities are linked to, via Changeset nodes and inward edges, in the project work items graph; then
ni=(Mpwii−min(Mpwi))/(max(Mpwi)−min(Mpwi)),
wherein ni is the ith normalized data and ¬(max(Mpwi)−min(Mpwi)==0); and
ni=0.01 (default normalized score),
where (max(Mpwi)−min(Mpwi)==0) and (Mpwii−min(Mpwi)==0).
As shown in FIG. 8, sample scores based on the project work items' linkages may be calculated in this manner.
Returning to FIG. 5, the code ranking engine takes a weighted average of the scores produced by the basis path analyzer module 510 and the graph analyzer module 508 and allocates a final score to each source code entity.
The final scoring function in the Code Ranking Engine module is as defined below:
ƒs(e)=Σst=1 to m(wst׃(st, sc))
where:
ƒs is a function to calculate final score of source code entity e;
m represents the number of stages where sub scores were calculated;
w represents the individual weight of the score; and
ƒ(st, sc) is a function to calculate the sub-score sc for stage st.
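As a minimal, non-limiting sketch of such a weighted final score (in Python, with hypothetical sub-scores and assumed weights, normalized by the sum of the weights to form a weighted average):

# Hypothetical sub-scores per entity from the three scoring stages:
# basis paths, source code linkages, and project work item linkages.
sub_scores = {
    "E1": {"basis_path": 0.9, "code_links": 0.6, "work_item_links": 0.8},
    "E2": {"basis_path": 0.2, "code_links": 0.4, "work_item_links": 0.1},
}
weights = {"basis_path": 0.5, "code_links": 0.25, "work_item_links": 0.25}  # assumed weights

def final_score(scores, stage_weights):
    # Weighted average: sum(w_st * sub_score_st) / sum(w_st)
    total_weight = sum(stage_weights.values())
    return sum(stage_weights[st] * sc for st, sc in scores.items()) / total_weight

for entity, scores in sub_scores.items():
    print(entity, round(final_score(scores, weights), 3))  # E1 0.8, E2 0.225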
The business rules and policies engine 512 is responsible for defining business rules and policies related to source code quality control and governance. Based on source code categorization from the code ranking engine, project management teams can define rules, such as:
Every change in a CAT1 Class/Function/Method has to be code reviewed by the Tech Lead (mandatorily);
Every change in a CAT1 Class/Function/Method has to be unit tested with 100% code coverage before it's checked into source control;
Every change in a CAT1 Class/Function/Method should have 0 compiler warnings;
Every change in a CAT1 Class/Function/Method should have logging enabled at all levels (Error, Debug, Warning and Info);
Every change in a CAT2 Class/Function/Method has to be code reviewed by the Tech Lead (optionally) or peer (mandatorily);
Every change in a CAT2 Class/Function/Method has to be unit tested with at least 70% code coverage before it's checked into source control;
Every change in a CAT2 Class/Function/Method can have warnings ignored;
Every change in a CAT2 Class/Function/Method should have logging enabled at least at Error and Warning; and
CAT3 changes might have less severe or stringent policies configured based on the business scenario and organizational requirements.
A development team's code changes, modifications, and other development activities may be verified by the business rules and policies engine 512; if a change violates any of the business or quality compliance rules, that change may be flagged as a possible violation. The technical lead, manager, or quality control team can then pull a report of all possible violations flagged daily, weekly, bi-weekly, or monthly for possible rectification actions.
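As a non-limiting sketch (in Python, with hypothetical rule definitions and change records whose thresholds merely mirror the example rules above), such a policy check might be implemented as follows:

# Hypothetical policies keyed by category; the thresholds mirror the example rules above.
policies = {
    "CAT1": {"min_coverage": 1.00, "review": "tech_lead"},
    "CAT2": {"min_coverage": 0.70, "review": "peer"},
    "CAT3": {"min_coverage": 0.00, "review": None},
}

def check_change(change, category):
    # Return a list of possible policy violations for a code change.
    policy = policies[category]
    violations = []
    if change["coverage"] < policy["min_coverage"]:
        violations.append("insufficient unit test coverage")
    if policy["review"] and not change.get("reviewed_by"):
        violations.append("missing " + policy["review"] + " code review")
    return violations

# Hypothetical change to a CAT1 method.
change = {"entity": "OrderService.PlaceOrder", "coverage": 0.85, "reviewed_by": None}
print(check_change(change, "CAT1"))  # ['insufficient unit test coverage', 'missing tech_lead code review']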
Turning now to FIG. 9, a flow diagram of a process 900 for ranking and categorizing source code entities in accordance with embodiments will now be described.
In block 902, basis path metrics are calculated, e.g., by the basis path analyzer module 510, and provided to the code ranking engine.
In block 904, source code entities (Class, Function, Method) are ranked based on their linkages with other source code entities.
In block 906, source code entities (Class, Function, Method) are ranked based on their linkages with other management entities (such as features, stories, tasks, defects, and changesets). It will be understood that blocks 902-906 can be performed simultaneously or in succession. The system accesses the source code artifacts graph and project artifacts graph models to analyze and determine a ranking score for each class, function, and method. In embodiments, this is implemented by computing, e.g., how many other entities a given class is related to by graph edges. The more it is referred to, the higher the ranking score.
In block 908, a weighted average of the ranking scores (from blocks 902-906) is calculated by the code ranking engine, and a final score is allocated to each source code entity. In particular, the system takes a weighted average of the scores from blocks 902-906 allotted to a class, function, or method and calculates the final ranking score. A table 1100 of FIG. 11 illustrates sample final ranking scores.
In block 910, the scores are used to classify and categorize each class/function/method into three categories (e.g., CAT1, CAT2, and CAT3). A table 1000 of FIG. 10 illustrates a sample categorization.
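By way of non-limiting illustration only, and assuming rank-based cutoffs (the specific thresholds below are hypothetical and are not taken from the embodiments), the categorization of block 910 might be sketched in Python as:

# Hypothetical final scores per source code entity.
final_scores = {"C1.F1": 0.92, "C2.F3": 0.55, "C7.M2": 0.12, "C9.M1": 0.74}

# Rank the entities by final score (highest first) and assign categories by rank.
ranked = sorted(final_scores, key=final_scores.get, reverse=True)
categories = {}
for rank, entity in enumerate(ranked, start=1):
    if rank <= 1:        # e.g., top-ranked entities -> CAT1 (cutoff is an assumption)
        categories[entity] = "CAT1"
    elif rank <= 3:      # next tier -> CAT2 (cutoff is an assumption)
        categories[entity] = "CAT2"
    else:
        categories[entity] = "CAT3"

print(categories)  # {'C1.F1': 'CAT1', 'C9.M1': 'CAT2', 'C2.F3': 'CAT2', 'C7.M2': 'CAT3'}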
In block 912, the source code ranking and categorization are used by business entities to define rules and policies related to source code quality control and governance. Technology and project management teams can define business rules based on the above rankings to enforce organization wide software quality control and governance policies.
Processing may be implemented in hardware, software, or a combination of the two. Processing may be implemented in computer programs executed on programmable computers/machines that each includes a processor, a storage medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform processing and to generate output information.
The system can perform processing, at least in part, via a computer program product, (e.g., in a machine-readable storage device), for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where upon execution, instructions in the computer program cause the computer to operate.
Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
Having described exemplary embodiments of the invention, it will now become apparent to one of ordinary skill in the art that other embodiments incorporating their concepts may also be used. The embodiments contained herein should not be limited to the disclosed embodiments but rather should be limited only by the spirit and scope of the appended claims. All publications and references cited herein are expressly incorporated herein by reference in their entirety.
Elements of different embodiments described herein may be combined to form other embodiments not specifically set forth above. Various elements, which are described in the context of a single embodiment, may also be provided separately or in any suitable subcombination. Other embodiments not specifically described herein are also within the scope of the following claims.
While illustrative embodiments have been described with respect to processes of circuits, described embodiments may be implemented as a single integrated circuit, a multi-chip module, a single card, or a multi-card circuit pack. Further, as would be apparent to one skilled in the art, various functions of circuit elements may also be implemented as processing blocks in a software program. Such software may be employed in, for example, a digital signal processor, micro-controller, or general-purpose computer. Thus, described embodiments may be implemented in hardware, a combination of hardware and software, software, or software in execution by one or more processors.
Some embodiments may be implemented in the form of methods and apparatuses for practicing those methods. Described embodiments may also be implemented in the form of program code, for example, stored in a storage medium, loaded into and/or executed by a machine, or transmitted over some transmission medium or carrier, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation. A non-transitory machine-readable medium may include but is not limited to tangible media, such as magnetic recording media including hard drives, floppy diskettes, and magnetic tape media, optical recording media including compact discs (CDs) and digital versatile discs (DVDs), solid state memory such as flash memory, hybrid magnetic and solid state memory, non-volatile memory, volatile memory, and so forth, but does not include a transitory signal per se. When embodied in a non-transitory machine-readable medium and the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the method.
When implemented on a processing device, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits. Such processing devices may include, for example, a general purpose microprocessor, a digital signal processor (DSP), a reduced instruction set computer (RISC), a complex instruction set computer (CISC), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), a microcontroller, an embedded controller, a multi-core processor, and/or others, including combinations of the above. Described embodiments may also be implemented in the form of a bitstream or other sequence of signal values electrically or optically transmitted through a medium, stored magnetic-field variations in a magnetic recording medium, etc., generated using a method and/or an apparatus as recited in the claims.
In the above-described flow charts of FIGS. 4 and 9, the blocks represent operations that, unless otherwise indicated, may be performed in the order shown, in a different order, or concurrently.
When implemented on one or more processing devices, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits. Such processing devices may include, for example, a general purpose microprocessor, a digital signal processor (DSP), a reduced instruction set computer (RISC), a complex instruction set computer (CISC), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), a microcontroller, an embedded controller, a multi-core processor, and/or others, including combinations of one or more of the above. Described embodiments may also be implemented in the form of a bitstream or other sequence of signal values electrically or optically transmitted through a medium, stored magnetic-field variations in a magnetic recording medium, etc., generated using a method and/or an apparatus as recited in the claims.
In some embodiments, a storage medium may be a physical or logical device. In some embodiments, a storage medium may consist of physical or logical devices. In some embodiments, a storage medium may be mapped across multiple physical and/or logical devices. In some embodiments, storage medium may exist in a virtualized environment. In some embodiments, a processor may be a virtual or physical embodiment. In some embodiments, logic may be executed across one or more physical or virtual processors.
For purposes of illustrating the present embodiment, the disclosed embodiments are described as embodied in a specific configuration and using special logical arrangements, but one skilled in the art will appreciate that the device is not limited to the specific configuration but rather only by the claims included with this specification. In addition, it is expected that during the life of a patent maturing from this application, many relevant technologies will be developed, and the scopes of the corresponding terms are intended to include all such new technologies a priori.
The terms “comprises,” “comprising”, “includes”, “including”, “having” and their conjugates at least mean “including but not limited to”. As used herein, the singular form “a,” “an” and “the” includes plural references unless the context clearly dictates otherwise. Various elements, which are described in the context of a single embodiment, may also be provided separately or in any suitable subcombination. It will be further understood that various changes in the details, materials, and arrangements of the parts that have been described and illustrated herein may be made by those skilled in the art without departing from the scope of the following claims.