Methods and systems to classify software components based on multiple information sources

Information

  • Patent Grant
  • 12164915
  • Patent Number
    12,164,915
  • Date Filed
    Friday, February 25, 2022
  • Date Issued
    Tuesday, December 10, 2024
Abstract
Systems and methods for classifying software components based on multiple information sources are provided. An exemplary method includes retrieving a number of sources including a project documentation file, source code, and dependent project list associated with a software component, extracting a number of entities from the number of sources, processing the number of entities based on a machine learning model, mapping the number of entities to a set of rules, generating a number of categorizations based on the mapping of the number of entities to the set of rules, and ranking the number of categorizations based on the set of rules.
Description
TECHNICAL FIELD

The present disclosure generally relates to methods and systems for classifying software components into different groups based on their relevance and the specific functions they serve, using the information available about them. These category functions can serve different purposes, including but not limited to business functions, technology-stack functions, and others.


BACKGROUND

There are more than 25 million open-source libraries, and the number of cloud APIs is rapidly growing, presenting a huge number of components for building today's applications. Understanding the spread of capabilities presented within these large numbers of components requires representing them under a manageable number of categories. Abstracting these components under different categories helps to organize their listings in a more structured way for different purposes, such as analysis, searching, and browsing, to name a few.


Most software library components do not have a standardized categorization. Some are labelled by their authors, but these labels are not consistent with a reliable taxonomy nomenclature. Grouping the software components in a standard and consistent way will help in understanding the choices of available components in specific categories. Visually representing these listings of software components will also give users a consistent expectation of the broad capability of the components.


When considering some of the systems and methods in the prior art, the above discussed drawbacks are evident. For example, United States Patent Application Publication Number 2010/0174670A1 discloses a pattern-based classification process that can use only very short patterns for classification and does not require a minimum support threshold. The training phase allows each training instance to “vote” for top-k, size-2 patterns, such as in a way that provides an effective balance between local, class, and global significance of patterns. Unlike certain approaches, the process need not make Boolean decisions on patterns that are shared across classes. Instead, these patterns can be concurrently added to all applicable classes and a power law-based weighing scheme can be applied to adjust their weights with respect to each class. However, the '670 publication describes data classification and hierarchical clustering based on patterns, but is silent on type and domain of data, data processing methods, scoring mechanism and ML techniques.


United States Patent Application Publication Number 2014/0163959A1 discloses an arrangement and corresponding method for multi-domain natural language processing. Multiple parallel domain pipelines are used for processing a natural language input. Each domain pipeline represents a different specific subject domain of related concepts. Each domain pipeline includes a mention module that processes the natural language input using natural language understanding (NLU) to determine a corresponding list of mentions, and an interpretation generator that receives the list of mentions and produces a rank-ordered domain output set of sentence-level interpretation candidates. A global evidence ranker receives the domain output sets from the domain pipelines and produces an overall rank-ordered final output set of sentence-level interpretations. However, the '959 publication describes multi-domain natural language processing for sentence-level interpretation but is silent about classification of software documents by hierarchically applying different techniques.


United States Patent Application Publication Number 2015/0127567A1 discloses a data mining system that extracts job opening information and derives, for a given job, relevant competencies, and derives, for a given candidate, relevant competencies for that candidate. In some embodiments, the data mining performs authentication of relevant competencies before performing matching. The matching outputs can be used to provide data to a candidate indicating possible future competencies to obtain, to provide data to a teaching organization indicating possible future competencies to cover in their coursework, and to provide data to employers related to what those teaching organizations are covering. However, the '567 publication discloses processing of natural language for skilling and recruitment of human resources but is silent on classification and hierarchical representation of software documents.


U.S. Pat. No. 7,774,288 discloses records including category data that is clustered by representing the data as a plurality of clusters, and generating a hierarchy of clusters based on the clusters. Records including category data are classified into folders according to a predetermined entropic similarity condition. However, the '288 patent describes data classification and hierarchical clustering but is silent in terms of type and domain of data, data processing methods, scoring mechanism and ML techniques.


U.S. Pat. No. 8,838,606 discloses systems and methods for classifying electronic information or documents into a number of classes and subclasses through an active learning algorithm. In certain embodiments, seed sets may be eliminated by merging relevance feedback and machine learning phases. Such document classification systems are easily scalable for large document collections, require less manpower, and can be employed on a single computer, thus requiring fewer resources. Furthermore, the classification systems and methods can be used for any pattern recognition or classification effort in a wide variety of fields, including electronic discovery in legal proceedings. However, the '606 patent is silent on type and domain of data, data processing methods, scoring mechanism and ML techniques.


U.S. Pat. No. 9,471,559 discloses creating training data for a natural language processing system that includes obtaining natural language input, the natural language input annotated with one or more important phrases; and generating training instances including a syntactic parse tree of nodes representing elements of the natural language input augmented with the annotated important phrases. In another aspect, a classifier may be trained based on the generated training instances. The classifier may be used to predict one or more potential important phrases in a query. However, the '559 patent describes automatic generation of phrases to use in training of question answering system based on annotation but is silent on taxonomy of software documents.


In view of the above examples and the drawbacks described in each, there is a need for a method and a system that classifies software components into a defined taxonomy structure, using the different information available in the software component documentation and source code to understand the semantic context of the purpose of the software component.


SUMMARY

The following presents a simplified summary of the subject matter in order to provide a basic understanding of some of the aspects of subject matter embodiments. This summary is not an extensive overview of the subject matter. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the subject matter. Its sole purpose is to present some concepts of the subject matter in a simplified form as a prelude to the more detailed description that is presented later.


The present disclosure provides an automated and consistent way of organizing these software components into different category dimensions, giving users a structured and easier way to browse them.


The present disclosure is a new software system which classifies software components into a defined taxonomy structure. The system uses the different information available in the software component documentation and source code to understand the semantic context of the purpose of the software component. The system also understands the area of application of the software component in different dimensions: for example, whether a component serves a database-interaction function, or whether it can be further abstracted to the level of serving a finance business function such as a payment gateway.


Therefore, a system for classifying software components is disclosed herein, including at least one processor that operates under control of a stored program including a sequence of program instructions to control one or more components. The components include a Component Categories Portal, a Categorizer, a Classification ML Model builder, Natural Language Processing (NLP) Cleanup Services, an NLP Extractor, a Clustering Service, a Classification Service, a Dictionary Rules Service, and a Category Ranking Service. The Component Categories Portal views the software components with their classification details, and the Categorizer is in communication with the Component Categories Portal to create an overall ranked classification for the software components. The Classification ML Model builder is in communication with the Categorizer to create machine learning models based on the software components with training for prediction tasks, and the NLP Cleanup Services is in communication with the Classification ML Model builder to extract needed sections of information associated with the software components and clean them up. The NLP Extractor is in communication with the NLP Cleanup Services to extract key software entities based on the software components for classification, and the Clustering Service is in communication with the NLP Extractor to group similar software components together. The Classification Service is in communication with the Clustering Service to classify the software components, and the Dictionary Rules Service is in communication with the Classification Service to provide software dictionary terms and rules based on the software components for classification ranking. Finally, the Category Ranking Service is in communication with the Dictionary Rules Service to compute the final top-ranked classifications for the software components.


In an embodiment, the Component Categories Portal is configured to view different categories for the software components and view the software components under a category. In an embodiment, the Categorizer is configured to invoke the Classification services to classify the software components, classify the software components based on the different information collected and techniques applied, and apply techniques including document classification, clustering, and entity mapping to classify the software components. In an embodiment, the Classification ML Model builder is configured to create the machine learning models for classifying the software components based on different information sources and to train a plurality of models with data extracted from the documentation and code; model services are provided to classify based on entity extraction, clustering, and document classification techniques.


In an embodiment, the NLP Cleanup Services is configured to provide natural language processing services for removing unnecessary information from the extracted data. In an embodiment, the NLP Extractor is configured to provide natural language processing services to extract key software entities from the information collected based on the software components and to train the plurality of models based on the software dictionary and the software component information. In an embodiment, the Clustering Service is configured to collect all the software component information, group it into clusters of similar software components, and extract key definition terms from the cluster documentation associated with the clusters and from other information collected earlier.


In an embodiment, the Classification Service is configured to predict a category of the software components classification based on the cluster documentation extracted from the software components and apply a threshold-based mechanism to report top categories of software components classification with corresponding confidence scores. In an embodiment, the Dictionary Rules Service is configured to provide the rules for ranking the classifications based on different parameters of project source code metrics and documentation maturity level and provide key software dictionary terms for the categories.


In an embodiment, the Category Ranking Service is configured to fetch the different classifications for the software components produced by the different services, evaluate classification scores based on the project source code metrics, evaluate the classification scores based on the documentation maturity level, and apply rules to and normalize the classification scores in order to rank them. In an embodiment, the Repo Services is configured to provide integration services for connecting to the Project Repository, enable retrieval of the source code of the software component and the software component documentation, and save the software component information, the retrieved source code, and the software component documentation to a database and file store.


A method associated with a system to classify software components into different categories is disclosed. The method is performed by at least one processor that operates under control of a stored program including a sequence of program instructions. A first instruction step includes collecting different sources of information about the software components. A second instruction step includes extracting required sections of the software component information from a software component documentation and a source code associated with the software component. A third instruction step includes pre-processing the extracted software component information using natural language processing techniques. A fourth instruction step includes fetching a dictionary for classification and rules associated with the classification. A fifth instruction step includes running a categorization process on the software components. A sixth instruction step includes ranking the different categorizations identified for the software components.


One implementation of the present disclosure is a system for classifying software components based on multiple information sources. The system includes one or more processors and memory storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. The operations include retrieving a number of sources including a project documentation file, source code, and dependent project list associated with a software component, extracting a number of entities from the number of sources, processing the number of entities based on a machine learning model, mapping the number of entities to a set of rules, generating a number of categorizations based on the mapping of the number of entities to the set of rules, and ranking the number of categorizations based on the set of rules.


In some embodiments, the machine learning model uses natural language processing techniques to remove unnecessary information including hyperlinks, stopwords, and version information.


In some embodiments, the machine learning model is generated based on training data extracted from a number of project documentation files associated with the dependent project list.


In some embodiments, generating the number of categorizations includes providing a first categorization associated with a direct match between one or more entities of the number of entities with the set of rules and providing a second categorization associated with an indirect match between the one or more entities of the number of entities and the set of rules, the indirect match associated with a similarity score, the similarity score identified as equal to or greater than a threshold score.
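The direct and indirect matching described above can be sketched as follows. This is purely an illustrative sketch, not the claimed implementation: the rule table, the example entities, and the use of token-level Jaccard overlap as the similarity score are all assumptions made here for illustration.

```python
# Illustrative sketch of direct vs. similarity-based (indirect) entity-to-rule
# matching. The rule table, entities, and Jaccard similarity are assumptions
# for illustration only; the disclosure does not prescribe them.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def categorize(entities, rules, threshold=0.5):
    """Return (category, match_type, score) tuples for each matched rule."""
    results = []
    for entity in entities:
        for term, category in rules.items():
            if entity.lower() == term.lower():
                results.append((category, "direct", 1.0))   # first categorization
            else:
                score = jaccard(entity, term)
                if score >= threshold:                      # second categorization
                    results.append((category, "indirect", score))
    return results

rules = {"payment gateway": "Finance", "database driver": "Data Access"}
entities = ["payment gateway", "gateway for payment processing"]
matches = categorize(entities, rules)
```

Here the exact entity yields a direct match with score 1.0, while the paraphrased entity clears the 0.5 similarity threshold and yields an indirect match.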


In some embodiments, one or more of the entities are identified as at least one of a short description, a full description, features, code comments, project tags, and dependent libraries.


In some embodiments, ranking the number of categorizations based on the set of rules includes determining whether a categorization matches a name of the project documentation file.


In some embodiments, the operations further include presenting a user with the ranked categorizations.


Another implementation of the present disclosure is a method for classifying software components based on multiple information sources. The method includes retrieving a number of sources including a project documentation file, source code, and dependent project list associated with a software component, extracting a number of entities from the number of sources, processing the number of entities based on a machine learning model, mapping the number of entities to a set of rules, generating a number of categorizations based on the mapping of the number of entities to the set of rules, and ranking the number of categorizations based on the set of rules.


In some embodiments, the method includes presenting a user with the ranked categorizations.


Another implementation of the present disclosure is one or more non-transitory computer-readable media for classifying software components based on multiple information sources. The non-transitory computer-readable media store instructions thereon. The instructions, when executed by one or more processors, cause the one or more processors to retrieve a number of sources including a project documentation file, source code, and dependent project list associated with a software component, extract a number of entities from the number of sources, process the number of entities based on a machine learning model, map the number of entities to a set of rules, generate a number of categorizations based on the mapping of the number of entities to the set of rules, and rank the number of categorizations based on the set of rules.





BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings are illustrative of particular examples for enabling systems and methods of the present disclosure, are descriptive of some of the methods and mechanism, and are not intended to limit the scope of the present disclosure. The drawings are not to scale (unless so stated) and are intended for use in conjunction with the explanations in the following detailed description.



FIG. 1 shows a system architecture that does the classification of software components, according to some embodiments.



FIG. 2 shows an example computer system implementation for classifying the software projects, according to some embodiments.



FIG. 3 shows the overall process flow to classify the software components into different categories, according to some embodiments.



FIG. 4 shows the overall method flow associated with a computer programmed product to classify the software components into different categories, as described in FIGS. 1-3, according to some embodiments.



FIG. 5 shows the Pre-Process using NLP techniques step, where the topics of documents are standardised in the Standardise tags step, according to some embodiments.



FIG. 6 shows Fetch Dictionary and Rules step, where the transformed document from the Pre-process using NLP techniques step are fed to Predict topics, which leverages supervised Machine learning model, according to some embodiments.



FIG. 7 shows how predicted topics are evaluated against the name of the software document and a weight is assigned accordingly, according to some embodiments.





Like reference numbers and designations in the various drawings indicate like elements.


Persons skilled in the art will appreciate that elements in the figures are illustrated for simplicity and clarity and may represent both hardware and software components of the system. Further, the dimensions of some of the elements in the figure may be exaggerated relative to other elements to help to improve understanding of various exemplary embodiments of the present disclosure. Throughout the drawings, it should be noted that like reference numbers are used to depict the same or similar elements, features, and structures.


DETAILED DESCRIPTION

Exemplary embodiments now will be described. The disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey its scope to those skilled in the art. The terminology used in the detailed description of the particular exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting. In the drawings, like numbers refer to like elements.



FIG. 1 shows a System 100 or a high-level architecture that does classification of software components. Briefly, and as described in further detail below, the System 100 includes an API Hub 102, Messaging Bus 103, Categorizer 104, Classification ML Model builder 105, Natural Language Processing (NLP) Cleanup Services 106, NLP Extractor 107, Clustering Service 108, Classification Service 109, Dictionary Rules Service 110, Category Ranking Service 111, Project Repo Services 112, File Storage 114, Database 115 and Component Categories Portal 101, which are a unique set of software components to perform the task of classifying the software.


In some embodiments, the primary functional block of the present disclosure includes the Component Categories Portal 101 which has a User Interface for the user to view the different categories available for the software components with their classification details. The user can explore the software components under the categories as a list of software components for that category.


In some embodiments, the Categorizer 104 is in communication with the Component Categories Portal 101 and is responsible for the overall ranked classification of the software components. The Categorizer 104 invokes the services to classify the software components on multiple dimensions of information sources and techniques used. The Categorizer 104 classifies the software components based on the different information collected and techniques applied. The techniques used include, but are not limited to, document classification, clustering, and entity mapping.


In some embodiments, the Classification ML Model builder 105 is responsible for creating the machine learning models for classifying the software components based on the different information sources with training for prediction tasks. The Classification ML Model builder 105 trains multiple models with data extracted from the documentation, code, etc., and provides for multiple Model services to classify based on entity extraction, clustering and document classification techniques.


In some embodiments, the Natural Language Processing (NLP) Cleanup Services 106 is in communication with the Classification ML Model builder 105 to extract needed and clean up sections of information associated with the software component, or in other words, provides natural language processing services for removing the unnecessary information from the extracted content. Key technology related words are retained by using a special dictionary of software technology terms during the cleanup process.


In some embodiments, the NLP Extractor 107 is in communication with the NLP Cleanup Services 106 to extract key software entities based on the software components for classification. Hence, the NLP Extractor 107 uses natural language processing services to extract key software entities from the collected software component information. The trained models, which are based on the software dictionary and the collected software component information, are used with natural language processing to extract the information.


In some embodiments, the Clustering Service 108 is in communication with the NLP Extractor 107 to group similar software components together. The Clustering Service 108 takes all the software component information and groups them into clusters having similar software components using the machine learning services and trained models. The Clustering Service 108 then extracts key definition terms from the cluster documentation associated with the clusters and other information collected earlier aided by the software technology terms dictionary.
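The grouping performed by the Clustering Service 108 can be sketched as follows. The disclosure uses trained machine learning models for clustering; the greedy set-overlap approach below, with its made-up documentation snippets and threshold, is only an illustrative stand-in using the standard library.

```python
# A minimal, stdlib-only sketch of grouping similar components: each
# component's documentation is reduced to a term set, and components are
# greedily assigned to the first cluster whose representative term set is
# similar enough. The documents and threshold are illustrative assumptions.

def terms(doc: str) -> set:
    return set(doc.lower().split())

def cluster(components: dict, threshold: float = 0.3):
    """components: name -> documentation text. Returns clusters as name lists."""
    clusters = []  # each entry: (representative term set, [member names])
    for name, doc in components.items():
        t = terms(doc)
        for rep, members in clusters:
            overlap = len(t & rep) / len(t | rep)  # Jaccard overlap with cluster rep
            if overlap >= threshold:
                members.append(name)
                break
        else:
            clusters.append((t, [name]))
    return [members for _, members in clusters]

docs = {
    "spring-boot": "java spring framework web application",
    "micronaut": "java framework web application microservices",
    "numpy": "python numerical array computing",
}
groups = cluster(docs)
```

The two Java web frameworks share enough terms to land in one cluster, while the numerical library forms its own.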


In some embodiments, the Classification Service 109 is in communication with the Clustering Service 108 to classify the software components. The Classification Service 109 is used to predict the category of the software components based on the cluster documentation extracted from the software components. The Classification Service 109 applies a threshold-based mechanism to report the top categories of software component classification with their confidence scores, using models trained on existing content labelled with the categories.
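The threshold-based reporting mechanism can be sketched as follows; the per-category confidence values and the cutoff are invented for illustration and would in practice come from the trained classifier.

```python
# Sketch of threshold-based reporting of top categories: given per-category
# confidence scores from a classifier (values below are made up), keep only
# categories at or above a cutoff, sorted best first.

def top_categories(scores: dict, threshold: float = 0.2):
    """Return (category, confidence) pairs meeting the threshold, best first."""
    kept = [(c, s) for c, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda cs: cs[1], reverse=True)

predicted = {"Web Framework": 0.71, "Build Tool": 0.05, "Dependency Injection": 0.24}
report = top_categories(predicted)
```

Only the categories clearing the 0.2 threshold are reported, with their confidence scores attached.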


In some embodiments, the Dictionary Rules Service 110 is in communication with the Classification Service 109 and provides software dictionary terms and rules based on the software components for classification ranking. The Dictionary Rules Service 110 provides the different rules for ranking the classifications based on different parameters of project source code metrics and documentation maturity level. The Dictionary Rules Service 110 also provides the key software dictionary terms for the categories.


In some embodiments, the Category Ranking Service 111 is in communication with the Dictionary Rules Service 110 to compute the final top-ranked classifications for the software components. The Category Ranking Service 111 fetches the different classifications for the software components produced by the different services. The Category Ranking Service 111 then evaluates the classification score based on the project source code metrics, and evaluates the classification score based on the documentation maturity level. Finally, the Category Ranking Service 111 applies rules to and normalizes all the classification scores in order to rank them.
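The score evaluation and normalization can be sketched as follows. The weights for code metrics and documentation maturity, the name-match boost (cf. the weight assignment of FIG. 7), and the candidate scores are all assumptions for illustration; the disclosure does not fix particular values.

```python
# Illustrative ranking sketch: combine per-category scores from code metrics
# and documentation maturity, boost a category that matches the project name,
# then normalize so the final scores sum to 1. All weights are assumptions.

def rank_categories(candidates, project_name, w_code=0.4, w_doc=0.4, name_boost=0.2):
    """candidates: dicts with 'category', 'code_metric_score',
    'doc_maturity_score' in [0, 1]. Returns (category, score) pairs, best first."""
    totals = []
    for c in candidates:
        score = w_code * c["code_metric_score"] + w_doc * c["doc_maturity_score"]
        if c["category"].lower() in project_name.lower():
            score += name_boost  # rule: category matches the project name
        totals.append((c["category"], score))
    norm = sum(s for _, s in totals) or 1.0
    ranked = [(cat, s / norm) for cat, s in totals]
    return sorted(ranked, key=lambda cs: cs[1], reverse=True)

ranked = rank_categories(
    [{"category": "spring", "code_metric_score": 0.6, "doc_maturity_score": 0.8},
     {"category": "security", "code_metric_score": 0.9, "doc_maturity_score": 0.3}],
    project_name="spring-projects/spring-boot",
)
```

With these illustrative weights, the name-match boost lifts the "spring" category above "security" despite the latter's higher code-metric score.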


In some embodiments, the Repo Services 112 provides integration services for connecting to the Project Repository 116. This enables the System 100 to retrieve the source code of the software component and the software component documentation. After fetching this information, the Repo Services 112 saves the software component information, source code, and documentation to the database and file store.



FIG. 2 shows a block view of a System 200 configured for classifying software components, according to some embodiments. The System 200 may include a Processor 201, Memory 202, Display 203, Network Bus 204, and other input/output devices such as a microphone, speaker, wireless card, etc. The Software Component Classification System Modules 100, file storage 114, and database 115 are stored in a Partition 205 of the Memory 202, which provides the necessary machine instructions to the Processor 201 to perform the executions for classifying the software components. In some embodiments, the Processor 201 controls the overall operation of the System 100 and manages the communication between the software components through the Network Bus 204. The Memory 202 holds the software component classification system code, data, and instructions of the System 100 and may include different types of non-volatile and volatile memory. The software component categories portal 101 interacts with the Network Bus 204.



FIG. 3 shows the end-to-end process 300 for classifying the software components, according to some embodiments. In step 301, the System 100 collects different sources of information about the software component, including Project Documentation 307, Source Code 308, and Dependent Projects 309. The System 100 connects to the source code repository (e.g., Project Repository 116) and the documentation site, and downloads the source code and documentation. The System 100 then collects the details of the dependent projects.


In some embodiments, in step 302, the Source Code 308 is parsed and code comments are collected. Then the documentation is analysed to extract the contextual sections for the software component. The information sections include but are not limited to Short Description 310, Full description 311, Features 312, Code Comments 313, Project Tags 314, Dependent Libraries 315, and Release notes (not shown). Below is a sample output from step 302:














{
  "fullName": "spring-projects/spring-boot",
  "description": "Spring Boot",
  "topics": ["java", "spring-boot", "spring", "framework"],
  "readme": "Spring Boot helps you to create Spring-powered, production-grade
    applications and services with absolute minimum fuss. It takes an opinionated
    view of the Spring platform so that new and existing users can quickly get to
    the bits they need. You can use Spring Boot to create stand-alone Java
    applications that can be started using java -jar or more traditional WAR
    deployments. We also provide a command line tool that runs spring scripts.
    Provide a radically faster and widely accessible getting started experience
    for all Spring development Be opinionated out of the box, but get out of the
    way quickly as requirements start to diverge from the defaults Provide a
    range of non-functional features that are common to large classes of projects
    (e.g. embedded servers, security, metrics, health checks, externalized
    configuration) Absolutely no code generation and no requirement for XML
    configuration"
}









In some embodiments, Process 300 further includes steps 303, 304, 305, and 306, described in further detail below in regards to FIGS. 4-7.



FIG. 4 illustrates a process 400 associated with a computer programmed product to classify the software components into different categories, as described in FIGS. 1-3, according to some embodiments. Process 400 can be performed by the components of System 100 or System 200 in some embodiments. Here, at least one processor that operates under control of a stored program includes a sequence of program instructions. Step 401 includes collecting different sources of information about the software components. Step 402 includes extracting required sections of the software component information from a software component documentation and a source code associated with the software component. Step 403 includes pre-processing the extracted software component information using natural language processing techniques. Step 404 includes fetching dictionary for classification and rules associated with the classification. Step 405 includes running a categorization process on the software components. Step 406 includes ranking the different categorization identified for the software components.
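The six instruction steps of process 400 can be sketched as a single pipeline function. Only the step ordering comes from the disclosure; every service object and helper name below (`repo`, `nlp`, `dictionary`, `categorizer`, `ranker` and their methods) is a hypothetical placeholder standing in for the components of FIG. 1, and the tiny stand-in implementations exist only to exercise the pipeline end to end.

```python
from types import SimpleNamespace

# Sketch of process 400 as a pipeline. All service objects are hypothetical
# placeholders for the components of FIG. 1; only the step order is from the
# disclosure.

def classify_component(component_id, repo, nlp, dictionary, categorizer, ranker):
    sources = repo.collect(component_id)                 # step 401: gather sources
    sections = repo.extract_sections(sources)            # step 402: docs + source code
    cleaned = nlp.preprocess(sections)                   # step 403: NLP pre-processing
    rules = dictionary.fetch_rules()                     # step 404: dictionary + rules
    categories = categorizer.categorize(cleaned, rules)  # step 405: categorization
    return ranker.rank(categories, rules)                # step 406: ranked categories

# Minimal stand-in services to exercise the pipeline:
repo = SimpleNamespace(
    collect=lambda cid: {"id": cid, "readme": "spring boot web framework"},
    extract_sections=lambda s: s["readme"],
)
nlp = SimpleNamespace(preprocess=lambda text: text.split())
dictionary = SimpleNamespace(fetch_rules=lambda: {"framework": "Web Framework"})
categorizer = SimpleNamespace(
    categorize=lambda toks, rules: [rules[t] for t in toks if t in rules]
)
ranker = SimpleNamespace(rank=lambda cats, rules: sorted(set(cats)))

result = classify_component("spring-boot", repo, nlp, dictionary, categorizer, ranker)
```

Each real service of FIG. 1 would replace the corresponding stand-in without changing the pipeline's shape.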


Referring to FIG. 5, step 303 introduced above in regards to FIG. 3 is described in further detail, according to some embodiments. FIG. 5 and step 303 relate to pre-processing the documents using NLP techniques. The topics of documents are standardised in step 501. In step 502, learnt stopwords in the software context are removed. In step 503, the document is further filtered by applying natural language processing techniques for removing, as shown in block 504, unnecessary information, for example, hyperlinks, regular stopwords, and version information, and for skipping words that are too short or too long in the context. In step 505, the words in the document are lemmatised, as shown in block 506, and features of a software document, like description and tags, are combined. Following steps 501, 502, 503, and 505, the section headings and the section content are verified using the NLP techniques, as shown in regards to step 303. Any sections without headings are mapped to a heading using NLP techniques trained with earlier provided documentation and section headings. The NLP techniques are used to clean up the section content by removing unnecessary information. A sample output of step 303 is shown below:














{
  'repo_name': 'spring-projects/spring-boot',
  'combined_text': 'spring boot spring boot help spring powered production grade application service absolute minimum fuss take opinionated view spring exist user get bit need spring boot stand alone java application start use java jar traditional war deployment command line run spring script radically fast widely accessible get start experience spring opinionate box get requirement start diverge default range non functional feature common large class project embed server security metric health check externalize configuration absolutely code generation requirement xml configuration spring framework spring boot',
  'label': 'spring frameworks spring-boot',
  'readme_length': 140,
  'preprocessed_text_length': 82,
  'number_of_tags': 3,
  'topics': 'spring frameworks spring-boot'
}
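The cleanup of steps 501-506 may be sketched as follows, for illustration only. The stopword list is a toy stand-in for the learnt software-context stopwords of step 502, and the suffix stripping only approximates the lemmatisation of block 506:

```python
import re

# Toy stand-in for the learnt software-context stopwords (step 502).
SOFTWARE_STOPWORDS = {"the", "a", "an", "for", "see", "use", "using"}

def preprocess(text, min_len=3, max_len=20):
    """Simplified sketch of the document cleanup in steps 501-506."""
    text = re.sub(r"https?://\S+", " ", text)        # drop hyperlinks (block 504)
    text = re.sub(r"\bv?\d+(\.\d+)+\b", " ", text)   # drop version strings (block 504)
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens
              if min_len <= len(t) <= max_len        # skip too-short/too-long words
              and t not in SOFTWARE_STOPWORDS]       # remove stopwords (step 502)
    # Crude lemmatisation: strip common inflectional suffixes (block 506).
    lemmas = []
    for t in tokens:
        for suffix in ("ed", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= min_len:
                t = t[: -len(suffix)]
                break
        lemmas.append(t)
    return " ".join(lemmas)
```

Applied to a raw README sentence, this yields the kind of compacted token stream shown in the combined_text field of the sample output above.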









Referring to FIG. 6, step 304 introduced above in regards to FIG. 3 is described in further detail, according to some embodiments. FIG. 6 and step 304 relate to fetching a dictionary and rules. In step 601, the output of step 303 is received as transformed text. In step 602, a supervised machine learning model (shown in block 603) trained specifically on software-related documents is leveraged to predict a number of topics, which defaults to 10 unless specified otherwise. In step 604, a rule book, generated as a result of another training process and representing technical words in a multi-dimensional vector space, is fetched. Following steps 601, 602, and 604, the software dictionary terms and priority rules associated with the software components are loaded as described in regards to step 304. The rules for prioritizing the classification results based on the source of information, project source code metrics, and documentation quality are loaded; in other words, the rules are loaded to be used while classifying and ranking the software components. A sample output of step 304 is shown below:


'predicted_topics': {'spring-boot', 'spring', 'spring-mvc'}
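A minimal sketch of what step 304 produces is shown below, assuming a frequency-based stand-in for the supervised topic model of block 603 and a hypothetical rule-book layout of (index, subcategory, category, term) rows; neither assumption is the claimed implementation:

```python
from collections import Counter

def predict_topics(preprocessed_text, num_topics=10):
    """Stand-in for block 603. The patent uses a supervised model trained
    on software documents; plain term frequency is used here only to
    illustrate the interface, with the same default of 10 topics."""
    counts = Counter(preprocessed_text.split())
    return [term for term, _ in counts.most_common(num_topics)]

# Hypothetical rule-book rows as fetched in step 604:
# (index, subcategory, category, technical term).
RULE_BOOK = [
    ("1317", "application-framework", "server", "spring-boot"),
    ("787", "application-framework", "server", "spring"),
    ("197", "mvc", "architecture", "spring-mvc"),
]
```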


Referring to FIG. 7, step 305 introduced above in regards to FIG. 3 is described in further detail, according to some embodiments. FIG. 7 and step 305 relate to categorization. Generally, in step 305, the System 100 applies machine learning and natural language tasks to the various information extracted and cleaned in the earlier steps. The System 100 then applies classifying techniques on the extracted and consolidated information to predict the category for the software components. Accordingly, step 305 may include steps shown as Classification 305, Entity Extraction 317, and Clustering 318. In step 701, the output of step 304 is received as predicted topics and a fetched rule book. In step 702, the predicted topics are evaluated against the name of the software document, and a weight is assigned if they are equal. In step 703, the predicted topics are looked up in the rule book by applying clustering techniques to find whether any direct match exists in the rule book. If a direct match is found, it is stored in memory for computation in further steps. If a direct match is not found, in step 704 the rule book is again looked up for any indirect match from the result of the similarity engine, and accordingly another weight is assigned for score computation. In step 705, the outputs of steps 703 and 704 are aggregated as an input for step 706. In step 706, the classification results are grouped hierarchically. For every repetitive hierarchical group, the scoring is continually boosted to find the hierarchical pair that has scored the maximum. A sample output of step 305 is shown below:














{
  0: [['', 'application-framework', 'server', 'spring-boot', 'direct_match', True],
      ['1317', 'application-framework', 'server', 'spring-boot', 'direct_match', True]],
  1: [['', 'application-framework', 'server', 'spring', 'direct_match', False],
      ['787', 'application-framework', 'server', 'spring', 'direct_match', False]],
  2: ['197', 'mvc', 'architecture', '', 'direct_match', False]
}









In regards to the sample output of step 305, for every predicted topic, the respective finds from the rule book are displayed above. The first value of each array contains the index from the rule book, the second value is a subcategory of the software document under test, the third value is a category of the software document under test, the fourth value is the term that will be considered as a technical label of the software document under test, the fifth value indicates whether the match from the rule book is a direct match or not, and the sixth value indicates whether the name of the software document under test matches the predicted topic or not.
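The matching of steps 701-706 may be sketched as follows, for illustration only. The weights and the substring-based similarity test are placeholder assumptions standing in for the similarity engine and the rules fetched in step 304:

```python
def categorize(predicted_topics, component_name, rule_book,
               direct_weight=1.0, indirect_weight=0.5, name_bonus=0.25):
    """Sketch of steps 701-706 with illustrative placeholder weights."""
    results = []
    for topic in predicted_topics:
        name_match = (topic == component_name)      # step 702: name comparison
        for index, subcategory, category, term in rule_book:
            if term == topic:                       # step 703: direct match
                score = direct_weight
            elif topic in term or term in topic:    # step 704: toy indirect match
                score = indirect_weight
            else:
                continue
            if name_match:                          # step 702: extra weight
                score += name_bonus
            results.append((category, subcategory, topic, score))
    # Step 706: boost repeated (category, subcategory) pairs by accumulation.
    totals = {}
    for category, subcategory, _, score in results:
        totals[(category, subcategory)] = totals.get((category, subcategory), 0.0) + score
    return results, totals
```

In the sample output above, the repeated (server, application-framework) pair would accumulate the highest total and thus win the hierarchical boosting.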


Referring again to FIG. 3, in the Rank Categorizations 306 step, the different categorizations identified for the software component are ranked with normalized scores, according to some embodiments. The prediction scores of the different categorizations made for the software component are evaluated against the rules for boosting the prediction score based on the state of maturity of the software component information. The rules for normalizing the scores across the different categorizations are then applied depending on the extracted information metrics and project source code metrics. The boosted and normalized prediction scores are then used to rank the categories for the software component. A sample output of step 306 is shown below:

















{
  'technologyCategoryDetails': [
    {
      'category': 'architecture',
      'subCategoryCode': 'mvc',
      'categoryCode': 'architecture',
      'subCategory': 'mvc',
      'rank': 1,
      'createdDate': '2022-02-08T16:10:30Z',
      'probability': 0.1038
    }
  ],
  'techLabels': ['spring-boot', 'spring'],
  'primaryTechCategory': 'architecture',
  'primaryTechSubcategory': 'mvc',
  'primaryTechProbability': 0.10375863313674927
}











In regards to the sample output of step 306, the key technologyCategoryDetails contains all the possible categories in an array view. The example here contains only one category along with its computed rank. The concluded labels, the category, the subcategory, and the probability of classification are found against the keys techLabels, primaryTechCategory, primaryTechSubcategory, and primaryTechProbability, respectively.
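The boosting and normalization of step 306 may be sketched as follows, for illustration only. The maturity boost factor is a placeholder assumption; in the described embodiments these values come from the rules fetched in step 304 rather than being hard-coded:

```python
def rank_categories(category_scores, maturity_boost=1.2):
    """Sketch of step 306: boost, normalize, and rank category scores.

    category_scores: mapping of (category, subcategory) -> raw score.
    maturity_boost:  illustrative stand-in for the documentation-maturity rules.
    """
    boosted = {pair: score * maturity_boost
               for pair, score in category_scores.items()}
    total = sum(boosted.values()) or 1.0
    # Normalize so the scores behave like probabilities across categories.
    normalised = {pair: score / total for pair, score in boosted.items()}
    # The highest normalised probability receives rank 1.
    ranked = sorted(normalised.items(), key=lambda kv: kv[1], reverse=True)
    return [{"category": cat, "subCategory": sub,
             "rank": i + 1, "probability": round(prob, 4)}
            for i, ((cat, sub), prob) in enumerate(ranked)]
```

The returned records mirror the shape of the technologyCategoryDetails entries in the sample output above.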


As will be appreciated by one of skill in the art, the present disclosure may be embodied as a method and system. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, a software embodiment or an embodiment combining software and hardware aspects. It will be understood that the functions of any of the units as described above can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts performed by any of the units as described above.


Instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function/act performed by any of the units as described above.


Instructions may also be loaded onto a computer or other programmable data processing apparatus like a scanner/check scanner to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts performed by any of the units as described above.


In the specification, there have been disclosed exemplary embodiments of the present disclosure. Although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation of the scope of the present disclosure.

Claims
  • 1. A method for classifying software components based on multiple information sources, the method comprising: retrieving a plurality of sources comprising a project documentation file, source code, and dependent project list associated with a software component; extracting contextual information for the software component from the plurality of sources; pre-processing the contextual information using natural language processing; fetching, based on the pre-processed contextual information, a set of rules for prioritizing classification results; generating a plurality of categorizations for the software component using the pre-processed contextual information; and ranking the plurality of categorizations for the software component based on the set of rules.
  • 2. The method of claim 1, wherein pre-processing the contextual information using natural language processing comprises removing hyperlinks, stopwords, and version information.
  • 3. The method of claim 1, wherein the natural language processing uses a machine learning model, the method comprising training the machine learning model using training data extracted from a plurality of project documentation files associated with the dependent project list.
  • 4. The method of claim 1, wherein ranking the plurality of categorizations based on the set of rules comprises determining whether a categorization matches a name of the project documentation file.
  • 5. The method of claim 1, further comprising presenting a user with the ranked categorizations.
  • 6. A system for classifying software components based on multiple information sources, the system comprising: one or more processors and memory storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: retrieving a plurality of sources comprising a project documentation file, source code, and dependent project list associated with a software component; extracting contextual information for the software component from the plurality of sources; pre-processing the contextual information using natural language processing; fetching, based on the pre-processed contextual information, a set of rules for prioritizing classification results; generating a plurality of categorizations for the software component using the pre-processed contextual information; and ranking the plurality of categorizations for the software component based on the set of rules; wherein: generating the plurality of categorizations comprises providing a first categorization associated with a direct match between contextual information and the set of rules and providing a second categorization associated with an indirect match between the contextual information and the set of rules, the indirect match associated with a similarity score, the similarity score identified as equal to or greater than a threshold score; and the contextual information comprises a short description, a full description, features, code comments, project tags, and dependent libraries.
  • 7. The system of claim 6, wherein pre-processing the contextual information using the natural language processing comprises removing unnecessary information comprising hyperlinks, stopwords, and version information.
  • 8. The system of claim 6, wherein the natural language processing uses a machine learning model generated based on training data extracted from a plurality of project documentation files associated with the dependent project list.
  • 9. The system of claim 6, wherein ranking the plurality of categorizations based on the set of rules comprises determining whether a categorization matches a name of the project documentation file.
  • 10. The system of claim 6, the operations further comprising presenting a user with the ranked categorizations.
  • 11. One or more non-transitory computer-readable media for classifying software components based on multiple information sources, the non-transitory computer-readable media storing instructions thereon, wherein the instructions when executed by one or more processors cause the one or more processors to: retrieve a plurality of sources comprising a project documentation file, source code, and dependent project list associated with a software component; extract contextual information for the software component from the plurality of sources, wherein the contextual information comprises a short description, a full description, features, code comments, project tags, and dependent libraries; pre-process the contextual information using natural language processing; fetch, based on the pre-processed contextual information, a set of rules for prioritizing classification results; generate a plurality of categorizations for the software component using the pre-processed contextual information by providing a first categorization associated with a direct match between the contextual information and the set of rules and providing a second categorization associated with an indirect match between the contextual information and the set of rules, the indirect match associated with a similarity score, the similarity score identified as equal to or greater than a threshold score; and rank the plurality of categorizations for the software component based on the set of rules.
  • 12. The non-transitory computer-readable media of claim 11, wherein pre-processing the contextual information using the natural language processing comprises removing hyperlinks.
  • 13. The non-transitory computer-readable media of claim 11, wherein the natural language processing uses a machine learning model generated based on training data extracted from a plurality of project documentation files associated with the dependent project list.
  • 14. The non-transitory computer-readable media of claim 11, wherein ranking the plurality of categorizations based on the set of rules comprises determining whether a categorization matches a name of the project documentation file.
CROSS-REFERENCE TO RELATED PATENT APPLICATION

This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/154,381 filed Feb. 26, 2021, the entire disclosure of which is incorporated by reference herein.

US Referenced Citations (158)
Number Name Date Kind
5953526 Day et al. Sep 1999 A
7322024 Carlson et al. Jan 2008 B2
7703070 Bisceglia Apr 2010 B2
7774288 Acharya et al. Aug 2010 B2
7958493 Lindsey et al. Jun 2011 B2
8010539 Blair-Goldensohn et al. Aug 2011 B2
8051332 Zakonov et al. Nov 2011 B2
8112738 Pohl et al. Feb 2012 B2
8112744 Geisinger Feb 2012 B2
8219557 Grefenstette et al. Jul 2012 B2
8296311 Rapp et al. Oct 2012 B2
8412813 Carlson et al. Apr 2013 B2
8417713 Blair-Goldensohn et al. Apr 2013 B1
8452742 Hashimoto et al. May 2013 B2
8463595 Rehling et al. Jun 2013 B1
8498974 Kim et al. Jul 2013 B1
8627270 Fox et al. Jan 2014 B2
8677320 Wilson et al. Mar 2014 B2
8688676 Rush et al. Apr 2014 B2
8838606 Cormack et al. Sep 2014 B1
8838633 Dhillon et al. Sep 2014 B2
8935192 Ventilla et al. Jan 2015 B1
8943039 Grieselhuber et al. Jan 2015 B1
9015730 Allen et al. Apr 2015 B1
9043753 Fox et al. May 2015 B2
9047283 Zhang et al. Jun 2015 B1
9135665 England et al. Sep 2015 B2
9176729 Mockus et al. Nov 2015 B2
9201931 Lightner et al. Dec 2015 B2
9268805 Crossley et al. Feb 2016 B2
9330174 Zhang May 2016 B1
9361294 Smith Jun 2016 B2
9390268 Martini et al. Jul 2016 B1
9471559 Castelli et al. Oct 2016 B2
9558098 Alshayeb et al. Jan 2017 B1
9589250 Palanisamy et al. Mar 2017 B2
9626164 Fuchs Apr 2017 B1
9672554 Dumon et al. Jun 2017 B2
9977656 Mannopantar et al. May 2018 B1
10305758 Bhide et al. May 2019 B1
10474509 Dube et al. Nov 2019 B1
10484429 Fawcett et al. Nov 2019 B1
10761839 Migoya et al. Sep 2020 B1
10922740 Gupta et al. Feb 2021 B2
11023210 Li et al. Jun 2021 B2
11238027 Frost et al. Feb 2022 B2
11256484 Nikumb et al. Feb 2022 B2
11288167 Vaughan Mar 2022 B2
11294984 Kittur et al. Apr 2022 B2
11295375 Chitrapura et al. Apr 2022 B1
11301631 Atallah et al. Apr 2022 B1
11334351 Pandurangarao May 2022 B1
11461093 Edminster et al. Oct 2022 B1
11474817 Sousa et al. Oct 2022 B2
11704406 Lee et al. Jul 2023 B2
11893117 Segal et al. Feb 2024 B2
11966446 Socher et al. Apr 2024 B2
12034754 O'Hearn et al. Jul 2024 B2
20010054054 Olson Dec 2001 A1
20020059204 Harris May 2002 A1
20020099694 Diamond et al. Jul 2002 A1
20020150966 Muraca Oct 2002 A1
20020194578 Irie et al. Dec 2002 A1
20040243568 Wang et al. Dec 2004 A1
20060090077 Little et al. Apr 2006 A1
20060104515 King et al. May 2006 A1
20060200741 Demesa et al. Sep 2006 A1
20060265232 Katariya et al. Nov 2006 A1
20070050343 Siddaramappa et al. Mar 2007 A1
20070168946 Drissi et al. Jul 2007 A1
20070185860 Lissack Aug 2007 A1
20070234291 Ronen et al. Oct 2007 A1
20070299825 Rush et al. Dec 2007 A1
20090043612 Szela et al. Feb 2009 A1
20090319342 Shilman et al. Dec 2009 A1
20100106705 Rush et al. Apr 2010 A1
20100121857 Elmore et al. May 2010 A1
20100122233 Rath May 2010 A1
20100174670 Malik et al. Jul 2010 A1
20100205198 Mishne et al. Aug 2010 A1
20100205663 Ward et al. Aug 2010 A1
20100262454 Sommer et al. Oct 2010 A1
20110231817 Hadar et al. Sep 2011 A1
20120143879 Stoitsev Jun 2012 A1
20120259882 Thakur et al. Oct 2012 A1
20120278064 Leary et al. Nov 2012 A1
20130103662 Epstein Apr 2013 A1
20130117254 Manuel-Devadoss et al. May 2013 A1
20130254744 Sahoo Sep 2013 A1
20130326469 Fox et al. Dec 2013 A1
20140040238 Scott et al. Feb 2014 A1
20140075414 Fox et al. Mar 2014 A1
20140122182 Cherusseri et al. May 2014 A1
20140149894 Watanabe et al. May 2014 A1
20140163959 Hebert et al. Jun 2014 A1
20140188746 Li Jul 2014 A1
20140297476 Wang et al. Oct 2014 A1
20140331200 Wadhwani et al. Nov 2014 A1
20140337355 Heinze Nov 2014 A1
20150127567 Menon et al. May 2015 A1
20150220608 Crestani Campos et al. Aug 2015 A1
20150331866 Shen et al. Nov 2015 A1
20150378692 Dang et al. Dec 2015 A1
20160253688 Nielsen et al. Sep 2016 A1
20160350105 Kumar et al. Dec 2016 A1
20160378618 Cmielowski et al. Dec 2016 A1
20170034023 Nickolov et al. Feb 2017 A1
20170063776 Nigul Mar 2017 A1
20170154543 King Jun 2017 A1
20170177318 Mark et al. Jun 2017 A1
20170220633 Porath et al. Aug 2017 A1
20170242892 Ali et al. Aug 2017 A1
20170286541 Mosley et al. Oct 2017 A1
20170286548 De et al. Oct 2017 A1
20180046609 Agarwal et al. Feb 2018 A1
20180067836 Apkon et al. Mar 2018 A1
20180107983 Tian et al. Apr 2018 A1
20180114000 Taylor Apr 2018 A1
20180189055 Dasgupta Jul 2018 A1
20180191599 Balasubramanian et al. Jul 2018 A1
20180329883 Leidner et al. Nov 2018 A1
20180349388 Skiles Dec 2018 A1
20190026106 Burton et al. Jan 2019 A1
20190229998 Cattoni Jul 2019 A1
20190278933 Bendory et al. Sep 2019 A1
20190286683 Kittur et al. Sep 2019 A1
20190294703 Bolin et al. Sep 2019 A1
20190303141 Li et al. Oct 2019 A1
20190311044 Xu et al. Oct 2019 A1
20190324981 Counts et al. Oct 2019 A1
20200097261 Smith et al. Mar 2020 A1
20200110839 Wang et al. Apr 2020 A1
20200125482 Smith et al. Apr 2020 A1
20200133830 Sharma et al. Apr 2020 A1
20200293354 Song et al. Sep 2020 A1
20200301672 Li et al. Sep 2020 A1
20200301908 Frost et al. Sep 2020 A1
20200348929 Sousa et al. Nov 2020 A1
20200356363 Dewitt et al. Nov 2020 A1
20210049091 Hikawa et al. Feb 2021 A1
20210065045 Kummamuru Mar 2021 A1
20210073293 Fenton et al. Mar 2021 A1
20210081189 Nucci et al. Mar 2021 A1
20210081418 Silveira et al. Mar 2021 A1
20210141863 Wu et al. May 2021 A1
20210149658 Cannon et al. May 2021 A1
20210149668 Gupta et al. May 2021 A1
20210303989 Bird et al. Sep 2021 A1
20210349801 Rafey Nov 2021 A1
20210357210 Clement et al. Nov 2021 A1
20210382712 Richman et al. Dec 2021 A1
20210397418 Nikumb et al. Dec 2021 A1
20210397546 Cser et al. Dec 2021 A1
20220012297 Basu et al. Jan 2022 A1
20220083577 Yoshida et al. Mar 2022 A1
20220107802 Rao et al. Apr 2022 A1
20220215068 Kittur et al. Jul 2022 A1
20230308700 Perez Sep 2023 A1
Foreign Referenced Citations (4)
Number Date Country
108052442 May 2018 CN
10-2020-0062917 Jun 2020 KR
WO-2007013418 Feb 2007 WO
WO-2020086773 Apr 2020 WO
Non-Patent Literature Citations (15)
Entry
Chung-Yang et al. "Toward Single-Source Of Software Project Documented Contents: A Preliminary Study", [Online], [Retrieved from Internet on Sep. 28, 2024], <https://www.proquest.com/openview/c15dc8b34c7da061fd3ea39f1875d8e9/1?pq-origsite=gscholar&cbl=237699> (Year: 2011).
Lampropoulos et al., “REACT—A Process for Improving Open-Source Software Reuse”, IEEE, pp. 251-254 (Year: 2018).
Leclair et al., “A Neural Model for Generating Natural Language Summaries of Program Subroutines,” Collin McMillan, Dept. of Computer Science and Engineering, University of Notre Dame Notre Dame, IN, USA, Feb. 5, 2019.
Schweik et al, Proceedings of the OSS 2011 Doctoral Consortium, Oct. 5, 2011, Salvador, Brazil, pp. 1-100, http://works.bepress.com/charles_schweik/20 (Year: 2011).
Stanciulescu et al, “Forked and Integrated Variants in an Open-Source Firmware Project”, IEEE, pp. 151-160 (Year: 2015).
Zaimi et al, "An Empirical Study on the Reuse of Third-Party Libraries in Open-Source Software Development", ACM, pp. 1-8 (Year: 2015).
Iderli Souza, An Analysis of Automated Code Inspection Tools for PHP Available on Github Marketplace, Sep. 2021, pp. 10-17 (Year: 2021).
Khatri et al, “Validation of Patient Headache Care Education System (PHCES) Using a Software Reuse Reference Model”, Journal of System Architecture, pp. 157-162 (Year: 2001).
Lotter et al, “Code Reuse in Stack Overflow and Popular Open Source Java Projects”, IEEE, pp. 141-150 (Year: 2018).
Rothenberger et al, “Strategies for Software Reuse: A Principal Component Analysis of Reuse Practices”, IEEE, pp. 825-837 (Year:2003).
Tung et al, “A Framework of Code Reuse in Open Source Software”, ACM, pp. 1-6 (Year: 2014).
M. Squire, “Should We Move to Stack Overflow?” Measuring the Utility of Social Media for Developer Support, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Florence, Italy, 2015, pp. 219-228, doi: 10.1109/ICSE.2015.150. (Year: 2015).
S. Bayati, D. Parson, T. Sujsnjak and M. Heidary, “Big data analytics on large-scale socio-technical software engineering archives,” 2015 3rd International Conference on Information and Communication Technology (ICoICT), Nusa Dua, Bali, Indonesia, 2015, pp. 65-69, doi: 10.1109/ICoICT.2015.7231398. (Year: 2015).
Andreas DAutovic, “Automatic Assessment of Software Documentation Quality”, published by IEEE, ASE 2011, Lawrence, KS, USA, pp. 665-669, (Year: 2011).
Related Publications (1)
Number Date Country
20220291921 A1 Sep 2022 US
Provisional Applications (1)
Number Date Country
63154381 Feb 2021 US