The present invention relates to the analysis and use of software artifacts. More particularly, the present invention provides for extracting flowchart information from digital images.
Software engineering is the implementation of processes for development, maintenance and operation of software used in any application. An important aspect of software engineering is reusing existing software for efficient operation of a software system. Software reuse also helps accelerate the software development lifecycle.
One of the features of software reuse currently implemented in the industry is the reuse of information available in the form of software artifacts. A software artifact is a portion of a software development process containing useful information. Generally, software artifacts contain useful knowledge related to the features of a software system. Examples of software artifacts include use-cases, flowcharts, wireframe diagrams, activity diagrams, UML diagrams and the like.
A flowchart is a schematic representation of a process or an algorithm that illustrates the sequence of operations to be performed to solve a problem. Nowadays, business organizations widely use software systems for implementing business processes. A majority of artifacts of a software system of a business organization may exist in the form of flowcharts. Flowcharts may be used to represent essential functions of an organizational process. Examples of the essential functions represented by a flowchart may include movement of materials through machinery in a manufacturing process, flow of applicant information through a hiring process in a human resources department, etc.
In light of the above, there exists a need for extracting data from artifacts of a software system such as flowcharts, and storing the data in a format such that the data can be efficiently reused.
A system and method for extracting flowchart information from digital images is provided. The digital flowchart image includes text data, geometric components and connecting flow lines. The method includes binarizing the digital flowchart image. Text data is then extracted from the binarized image using rectangular region growing segmentation technique. The method then includes extracting and masking flow lines connecting geometric components within the digital flowchart image. After the extraction and masking of flow lines, geometric components are extracted and classified into one or more categories. Classifying the geometric components may include recognizing the components and arranging them into one or more shape categories. Flow line relationship information between the geometric components is also extracted. Thereafter, the extracted text data, flow line relationship information and geometric component information is stored in a database. In various embodiments of the present invention, the digital flowchart image may be a binary image, a color image, a grayscale image, a multispectral image or a thematic image.
In an embodiment of the present invention, prior to binarizing the digital flowchart image, the image is converted into a grayscale image.
In an embodiment of the present invention, one or more regions including text data are masked prior to extracting and masking flow lines connecting geometric components. Masking of the one or more regions includes converting pixels within the one or more bounded regions into background color of the digital flowchart image.
In an embodiment of the present invention, extracting text data using rectangular region growing segmentation technique includes marking rectangular boundaries around one or more regions bounded by clusters of connected pixels of text data. An iterative algorithm is then executed for extracting one or more segment blocks enclosing individual characters from the one or more regions. In an embodiment of the present invention, a heuristic algorithm is implemented for separating closely connected individual characters prior to executing the iterative algorithm. Characters are recognized in each of the one or more segmented blocks using a neural network based Optical Character Recognition algorithm. Thereafter, the characters are translated using a character encoding scheme.
In an embodiment of the present invention, recognition of geometric components is implemented using back-propagation neural network technique. In another embodiment of the present invention, recognition of geometric components is implemented by comparing the geometric components with standard geometric shapes stored in a database. The comparison of geometric components is performed using Dynamic Time Warping algorithm.
In an embodiment of the present invention, the standard geometric shapes are stored by representing the shapes using boundary-based shape representation. Angular directions of pixel points along the boundary of a geometric shape are used for describing the shape, and the slope of a line traced along the boundary, within a threshold limit, is used to define and form shape vectors.
In an embodiment of the present invention, the extracted text data is stored along with its location information. The location information indicates location of bounded geometric components within which text data is stored.
In an embodiment of the present invention, the extracted geometric component information is stored along with location, height and width information.
In an embodiment of the present invention, the extracted text data, flow line relationship information and geometric component information is stored in XML format. In another embodiment of the present invention, the extracted text data, flow line relationship information and geometric component information is stored in Graph Exchange Language format.
The present invention is described by way of embodiments illustrated in the accompanying drawings wherein:
A system, method and computer program product for extracting information from software artifacts is provided. The present invention is more specifically directed towards extracting flowchart information from digital images. An exemplary scenario in which the present invention may be implemented is a software system in which information about the processes and functions of the system is stored in flowchart image files. In order to enable an efficient reuse of this information, data in flowchart images is to be extracted and stored in a format that is widely used.
In an embodiment of the present invention, the disclosed system, method and computer program product provide for extracting data from flowchart image files. Data extracted from flowchart images includes text data, data describing geometric flowchart components and flow lines connecting the geometric components. Text data is text located in the flowchart image. Text data may be enclosed within geometric flowchart components representing steps of a process flow, or it may be located outside the flowchart components.
In various embodiments of the present invention, the disclosed system, method and computer program product provide a technique for extracting text data from a flowchart image. The method includes converting the flowchart image into a grayscale image. Further, the method includes binarizing the image and extracting character segment blocks from the image using region growing segmentation. Thereafter, individual characters are recognized using neural network based Optical Character Recognition (OCR).
In an embodiment of the present invention, the disclosed system, method and computer program product provide for extracting and classifying flowchart components from the flowchart image. Prior to extracting flowchart components, text data as well as flow lines connecting the flowchart components are masked. Then, flowchart components are extracted using a region growing segmentation technique and the components are recognized using a back-propagation neural network. The neural network utilized for recognizing the geometric components is a network trained in recognizing geometric shapes. In various embodiments of the present invention, a Dynamic Time Warping (DTW) approach is used to recognize flowchart component shapes.
In yet another embodiment of the present invention, the disclosed system, method and computer program product provide for storing the extracted text data, data describing geometric components and flow lines in an Extensible Markup Language (XML) format.
Hence, the present invention enables an efficient reuse of information stored in flowcharts. The present invention also enables a proficient manner of exporting and using data across various software systems due to the data being stored in XML format.
The disclosure is provided in order to enable a person having ordinary skill in the art to practice the invention. Exemplary embodiments herein are provided only for illustrative purposes and various modifications will be readily apparent to persons skilled in the art. The general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. The terminology and phraseology used herein is for the purpose of describing exemplary embodiments and should not be considered limiting. Thus, the present invention is to be accorded the widest scope encompassing numerous alternatives, modifications and equivalents consistent with the principles and features disclosed herein. For purpose of clarity, details relating to technical material that is known in the technical fields related to the invention have been briefly described or omitted so as not to unnecessarily obscure the present invention.
The present invention will now be discussed in the context of embodiments as illustrated in the accompanying drawings.
At step 304, the grayscale image is binarized. Image binarization comprises converting the grayscale image into a black-and-white image. Binarization simplifies the grayscale image so that it can be processed for information extraction, such as extraction of text and geometric component information. In various embodiments of the present invention, thresholding techniques are used to binarize the grayscale image. A thresholding technique comprises choosing a threshold value and classifying all pixels in the grayscale image with values above the threshold value as white and all pixels with values below the threshold value as black. The thresholding technique can be applied by choosing two different threshold values: one threshold value results in an image with dark text on a lighter background and the other threshold value results in an image with light text on a dark background. Variations of the thresholding technique include choosing an optimal threshold value for each area of the grayscale image and then classifying the pixels accordingly. The resultant image is a binary image in the form of a cluster of black and white pixels.
At step 306, text is extracted from the binarized image. In an embodiment of the present invention, the resultant binary image is processed for text extraction by using a rectangular region growing segmentation technique. The rectangular region growing segmentation technique is a block segmentation technique in which a rectangular boundary around the cluster of connected pixels of text is marked for detecting characters. In this technique, a region is allowed to grow in the forward, backward, upward and downward directions for marking the rectangular boundary. An algorithm checks from left to right and top to bottom for the left, top, right and bottom boundaries of the cluster of connected pixels. While going from left to right, the first black pixel is the left boundary and the last black pixel is the right boundary of the cluster of pixels. Similarly, for top to bottom, the first black pixel is the top boundary and the last black pixel is the bottom boundary of the connected pixels. In an embodiment of the present invention, the procedure implemented by the algorithm includes marking a rectangular boundary that is one pixel larger than the region bounded by the cluster of connected pixels. The implementation of the algorithm yields a block segmented region with respect to a certain number of pixels and characters along with their position and size information.
The algorithm is further implemented iteratively on the block segmented region in order to extract the smallest possible segment block enclosing an individual character. The algorithm is iteratively implemented by imposing geometrical constraints in order to sort out individual character blocks. Examples of geometrical constraints include imposing a threshold limit for a width to height ratio corresponding to an individual character segment block. In certain examples, 4 to 5 iterations are sufficient to extract individual character segment blocks when the characters are well separated from adjacent characters. In other embodiments, characters in text may not be well separated, such as in digital images stored as compressed bmp/jpg/gif files. The compression may cause merging of closest adjacent characters. In these cases, if a block width to height ratio is greater than an average character segmented block ratio, a heuristic algorithm is used to separate the closely connected characters at the point where the fewest pixels join them.
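By way of non-limiting illustration only, the thresholding of step 304 and the marking of rectangular boundaries of step 306 may be sketched in Python as follows. The function names `binarize` and `bounding_boxes`, the default threshold of 128, the use of 4-connectivity and the breadth-first growth order are assumptions made for this example and are not taken from the disclosure.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Return a binary image in which True marks a black (foreground) pixel.

    Pixels with values below the threshold become black, all others white.
    """
    return [[value < threshold for value in row] for row in gray]


def bounding_boxes(binary, margin=1):
    """Grow a region over each cluster of connected black pixels.

    Returns one (left, top, right, bottom) rectangle per cluster, enlarged by
    `margin` pixels and clipped to the image. 4-connectivity is assumed.
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            left = right = x
            top = bottom = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                # Grow forward, backward, downward and upward.
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = cx + dx, cy + dy
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((max(left - margin, 0), max(top - margin, 0),
                          min(right + margin, width - 1),
                          min(bottom + margin, height - 1)))
    return boxes


# Hypothetical 5 x 7 grayscale image holding two dark clusters.
demo_gray = [
    [255, 255, 255, 255, 255, 255, 255],
    [255, 0, 0, 255, 255, 10, 255],
    [255, 0, 0, 255, 255, 10, 255],
    [255, 255, 255, 255, 255, 10, 255],
    [255, 255, 255, 255, 255, 255, 255],
]
demo_boxes = bounding_boxes(binarize(demo_gray))
```

In this sketch each rectangle could then be re-examined iteratively, for instance by checking its width to height ratio, to isolate individual character blocks.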
The block segmented region is then processed through a character recognition phase for recognizing the character images and translating them into a standard encoding scheme such as ASCII or Unicode. The characters in the block segmented region are recognized by a neural network based Optical Character Recognition (OCR) algorithm. A neural network is an adaptive software system of interconnected mathematical processing elements that provides an optimal solution to a problem based on a learning phase and a solution phase. In an embodiment of the present invention, the neural network is a back-propagation neural network. A back-propagation neural network is described in conjunction with the description of
At step 308, the text region is masked. As described earlier, while using the block segmentation technique for text extraction, rectangular boundaries around groups of connected pixels of text are marked for text identification. Since text boundaries are already known, all pixels within boundary areas are converted into the background color of the image in order to mask text regions. In an embodiment of the present invention, wherein the background color of an image is white as a result of image binarization, all pixels within text boundary areas are converted into white. The resultant image obtained includes geometric components and connecting flow lines which are illustrated by black-colored pixels.
Thereafter, at step 310, the flow lines connecting the geometric components are extracted and masked. In an embodiment of the present invention, flow line masking is done by processing pixels corresponding to the flow lines with the simple heuristic that the flow line pixels have a certain line width, and are oriented in either the horizontal or the vertical direction. During the masking of flow lines, the lines are labeled and information about their extreme points is stored in a database. A resultant image after binarization and flow line masking shows distinct geometrical components with connected arrow head components.
At step 312, the geometric components are extracted and classified. In an embodiment of the present invention, the geometric components are extracted by identifying clusters of connected pixels representing geometric shapes. The geometric components extracted from a digital image are then recognized and classified into categories. The classification of geometric components includes arranging the components into particular categories of shapes such as oval, square, hexagon, diamond and the like. In an embodiment of the present invention, for the purpose of recognizing geometric component shapes, a back-propagation neural network technique is used.
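By way of non-limiting illustration only, the masking of text regions at step 308 and a simplified form of the horizontal and vertical line heuristic of step 310 may be sketched in Python as follows. The function names, the representation of the image as rows of booleans (True for a black pixel) and the `min_length` parameter are assumptions made for this example. The sketch only finds straight runs of black pixels; a practical implementation would additionally have to check line width and distinguish flow lines from the straight edges of geometric components.

```python
def mask_boxes(binary, boxes):
    """Return a copy of the image with every pixel inside each
    (left, top, right, bottom) box set to background (False)."""
    masked = [row[:] for row in binary]
    for left, top, right, bottom in boxes:
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                masked[y][x] = False
    return masked


def horizontal_runs(binary, min_length):
    """Find horizontal runs of black pixels of at least `min_length` pixels.

    Each run is reported as (row, first_column, last_column), which gives the
    extreme points of a candidate horizontal line.
    """
    runs = []
    for y, row in enumerate(binary):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= min_length:
                    runs.append((y, start, x - 1))
            else:
                x += 1
    return runs


def vertical_runs(binary, min_length):
    """Find vertical runs by transposing the image.

    Each run is reported as (column, first_row, last_row).
    """
    columns = [list(column) for column in zip(*binary)]
    return horizontal_runs(columns, min_length)


# Hypothetical 3 x 6 image containing one horizontal stroke.
image = [
    [False, False, False, False, False, False],
    [False, True, True, True, True, False],
    [False, False, False, False, False, False],
]
runs_before = horizontal_runs(image, 3)
masked = mask_boxes(image, [(1, 1, 2, 1)])
runs_after = horizontal_runs(masked, 3)
```

Masking returns a copy so that the unmasked image remains available to later steps.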
In another embodiment of the present invention, for the purpose of recognizing geometric component shapes, a DTW algorithm is used, wherein the extracted component is compared with standard geometric shapes stored in database in order to determine a best match for recognition. As will be described further with reference to
At step 314, flow line relationships between geometric components are extracted. Extraction of flow line relationships is performed by tracing the flow lines based on the simple heuristic of detecting all flow lines and arrow heads connected to the geometric components. Additional information for identification includes pixels representing arrow heads connected to the components. In various embodiments of the present invention, a simple region growing segmentation technique is used to mark and label segment blocks, with bounded box information, that represent geometrical shapes. The arrow head components are separated while segmenting the geometric components. In various embodiments of the present invention, the criterion for separating an arrow head component from a geometric component is to split the two regions where the number of pixels linking them is minimal. The filtration of arrow head components is done by comparing the geometric components and the arrow head components based on a threshold. In an embodiment of the present invention, the tracing is done by starting with the bounded box of the topmost geometric component, expanding the box boundary by one pixel and tracking the co-ordinates of any lines or arrow heads intersecting with that component. The co-ordinates of the connected line are then used to trace the line to find an arrow head component connected to the other geometric component. The tracing of flow lines is performed for all the geometric components to trace all flow line relationships between the components.
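By way of non-limiting illustration only, the tracing of flow line relationships may be sketched in Python as follows. The sketch assumes that each flow line has already been reduced to a pair of extreme points ordered as (tail, arrow head end), and that each geometric component is known by a label and a bounded box; these data structures and the function names `touches` and `trace_relationships` are assumptions made for this example.

```python
def touches(box, point, grow=1):
    """Return True if `point` lies inside `box` after the box boundary
    has been expanded by `grow` pixels on every side."""
    left, top, right, bottom = box
    x, y = point
    return (left - grow <= x <= right + grow
            and top - grow <= y <= bottom + grow)


def trace_relationships(components, flow_lines):
    """Pair the component touched by each line's tail with the component
    touched by its arrow head end.

    `components` maps a label to a (left, top, right, bottom) box.
    `flow_lines` is a list of ((tail_x, tail_y), (head_x, head_y)) pairs.
    Lines that do not touch a component at both ends are skipped.
    """
    edges = []
    for tail, head in flow_lines:
        source = next((name for name, box in components.items()
                       if touches(box, tail)), None)
        target = next((name for name, box in components.items()
                       if touches(box, head)), None)
        if source is not None and target is not None:
            edges.append((source, target))
    return edges


# Hypothetical layout: two boxes joined by one vertical flow line.
components = {
    "start": (10, 0, 50, 20),
    "process": (10, 40, 50, 60),
}
flow_lines = [((30, 21), (30, 39))]
edges = trace_relationships(components, flow_lines)
```

Expanding each box by one pixel lets a line that stops just outside the component outline still register as connected to it.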
Finally, at step 316, the extracted text data, data describing geometric component shapes and flow line relationships are stored in a database. In various embodiments of the present invention, the extracted text data, the geometric components and the flow lines are stored in an Extensible Markup Language (XML) format. XML is a markup language that provides a software and hardware independent manner of storing data so that the data can be shared across disparate software systems. In an embodiment of the present invention, the text data is stored along with its location information. The location information indicates the location of the bounded geometric component within which the characters are enclosed. The geometric components are stored with their location, width and height information. In other embodiments of the present invention, the extracted text data, the geometric components and the flow lines are stored in a Graph Exchange Language (GXL) format. The GXL format is an XML-based language which is a standard for exchanging graphs between graph-based tools.
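By way of non-limiting illustration only, storage of the extracted information in an XML format may be sketched in Python using the standard `xml.etree.ElementTree` module. The element names `flowchart`, `component`, `text` and `flowline`, the attribute names and the dictionary layout of the input are hypothetical; the disclosure does not prescribe a particular schema.

```python
import xml.etree.ElementTree as ET


def to_xml(components, edges):
    """Serialize components and flow line relationships to an XML string.

    `components` is a list of dicts with the keys id, shape, x, y, width,
    height and text. `edges` is a list of (source_id, target_id) pairs.
    """
    root = ET.Element("flowchart")
    for comp in components:
        node = ET.SubElement(
            root, "component",
            id=comp["id"], shape=comp["shape"],
            x=str(comp["x"]), y=str(comp["y"]),
            width=str(comp["width"]), height=str(comp["height"]),
        )
        # The text is nested inside the component that encloses it.
        text = ET.SubElement(node, "text")
        text.text = comp["text"]
    for source, target in edges:
        ET.SubElement(root, "flowline", source=source, target=target)
    return ET.tostring(root, encoding="unicode")


# Hypothetical extraction result for a two-step flowchart.
components = [
    {"id": "c1", "shape": "oval", "x": 10, "y": 0,
     "width": 40, "height": 20, "text": "Start"},
    {"id": "c2", "shape": "square", "x": 10, "y": 40,
     "width": 40, "height": 20, "text": "Read input"},
]
document = to_xml(components, [("c1", "c2")])
```

Nesting the text element inside its component is one way to record the location of the text relative to the bounded geometric component.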
In various embodiments of the present invention, the tracing is performed by a software algorithm. The vectors of the geometric shape descriptor are given a standard length by selecting an average component segment size. A new component segment image is re-scaled to the standard segment image size before it is processed to create a shape descriptor. The lengths of vectors in a geometric shape descriptor are made equal by re-sampling the vectors, when required.
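By way of non-limiting illustration only, re-sampling a shape vector to a standard length may be sketched in Python as follows. Linear interpolation is an assumption made for this example, since the disclosure does not state which re-sampling method is used, and the function name `resample` is hypothetical.

```python
def resample(vector, size):
    """Re-sample `vector` to exactly `size` values by linear interpolation.

    The first and last values are preserved; intermediate values are
    interpolated between the two nearest original samples.
    """
    if size < 1 or not vector:
        raise ValueError("size must be positive and vector non-empty")
    if len(vector) == 1 or size == 1:
        return [vector[0]] * size
    step = (len(vector) - 1) / (size - 1)
    result = []
    for i in range(size):
        position = i * step
        low = min(int(position), len(vector) - 1)
        high = min(low + 1, len(vector) - 1)
        fraction = position - low
        result.append(vector[low] * (1 - fraction) + vector[high] * fraction)
    return result


# Stretch a two-point vector to five points.
stretched = resample([0.0, 10.0], 5)
# Shrink a 400-point vector to the standard size of 350 used in the text.
shrunk = resample([float(v) for v in range(400)], 350)
```

The same function serves both up-sampling and down-sampling, so vectors of any length can be brought to a common size before comparison.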
In various embodiments of the present invention, a Dynamic Time Warping (DTW) approach is used to detect an optimal alignment between two flowchart components. DTW is an algorithm that detects similarity between two sequences that are separated either in speed or time. A classic DTW algorithm is explained as follows:
Considering two time series
Q=(q1, q2, q3, . . . qi, . . . , qn) (A)
and
C=(c1, c2, c3, . . . cj, . . . , cm) (B)
of length n and m respectively. In order to align the two sequences using DTW we construct an n x m matrix where the (ith, jth) element of the matrix contains the distance d(qi, cj) between the two points qi and cj. In an example, the distance between the two points qi and cj is the Euclidean distance function:
d(qi, cj)=(qi−cj)² (C)
Each matrix element corresponds to the alignment between the points qi and cj. A warping path is defined as a contiguous set of matrix elements that defines a mapping between Q and C.
W=(w1, w2, . . . , wk, . . . wK), max(m, n)≦K<m+n−1 (D)
The warping path is subject to several constraints such as boundary conditions, continuity and monotonicity. In various embodiments of the present invention, the constraints can be:
The length K of the warping path is bounded such that max(m, n)≦K<m+n−1. We have used the global constraints on the warping path.
In an embodiment of the present invention, the DTW algorithm is implemented to find the best match for a flowchart component in the database having standard flowchart component shapes. The implementation is done as follows:
The standard flowchart component shapes in the database are scaled to 160×80 pixels, signatures are derived from all points on the shape boundaries, and a shape vector is generated and sampled to 350 points. A new shape vector with a different number of points is re-sampled to a vector size of 350. The length K of the warping path is bounded such that max(m, n)≦K<m+n−1. Since all the shape vectors are re-sampled to a standard vector size of 350, we have m=n, and m≦K<2m−1.
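By way of non-limiting illustration only, the classic DTW matrix computation with the distance of equation (C), followed by a best-match search over stored shape vectors, may be sketched in Python as follows. The function names `dtw_distance` and `best_match` and the dictionary of templates are assumptions made for this example. The sketch applies no global constraint on the warping path, and it uses short vectors in place of 350-point vectors for readability.

```python
def dtw_distance(q, c):
    """Return the cumulative cost of the optimal warping path between q and c.

    cost[i][j] holds the cheapest alignment of the first i points of q with
    the first j points of c; each step extends the path by one matrix element
    to the right, downward or diagonally.
    """
    n, m = len(q), len(c)
    infinity = float("inf")
    cost = [[infinity] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = (q[i - 1] - c[j - 1]) ** 2  # equation (C)
            cost[i][j] = distance + min(cost[i - 1][j],
                                        cost[i][j - 1],
                                        cost[i - 1][j - 1])
    return cost[n][m]


def best_match(vector, templates):
    """Return the label of the stored shape vector closest to `vector`."""
    return min(templates,
               key=lambda label: dtw_distance(vector, templates[label]))


# Hypothetical stored shape vectors and a new vector to classify.
templates = {
    "flat": [0.0, 0.0, 0.0, 0.0],
    "ramp": [0.0, 1.0, 2.0, 3.0],
}
label = best_match([0.0, 1.0, 1.0, 2.0, 3.0], templates)
```

Because DTW allows one point to align with several points of the other sequence, the repeated value in the new vector does not penalize its match with the ramp template.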
W is defined as the amount of warping implied by an algorithm:
If the algorithm discovers no warping between the sequences, W=0. The more the warping discovered, the larger will be the value of W. (The maximum value of W=1).
As an example for illustrating the implementation of the DTW algorithm, a set of geometric shape vectors was compared with the standard flowchart component shapes stored in the database. The sequence of each geometric shape vector was compared to each sequence of the standard flowchart component shapes and the average value of W was calculated. The results signifying the amount of warping between standard component shapes are:
In an embodiment of the present invention, if a new geometric shape has a vector length smaller than the vector length of a stored geometric shape, the vector length of the stored geometric shape can be down-sampled to the length of the new geometric shape.
In various embodiments of the present invention, a back-propagation neural network 700 is used for recognizing flowchart components that have been extracted from a flowchart image. A back-propagation neural network is a multi-layer neural network implementing a back-propagation algorithm, where each layer comprises neurons having specific functions. The basic layers of a multi-layer neural network are an input layer, a hidden layer and an output layer. The back-propagation neural network 700 comprises a first set of neurons 702 in the input layer that are configured to receive inputs. The first set of neurons 702 are connected to a second set of neurons 704 in the hidden layer. Thus, the input signals fed into the first set of neurons 702 are propagated through the second set of neurons 704 to a third set of neurons 706 at the output. Any connection between two neurons in the back-propagation neural network 700 has a unique weight value. In the learning phase, sample input signals, for which the correct output values are known, are applied to the first set of neurons 702. The input signals are mathematically processed by the first set of neurons 702, transmitted through the hidden layer and the output is obtained after processing at the third set of neurons 706. The output obtained is dependent upon the individual weight values of the neuron connections. The difference between the output obtained and the correct output is an error value that is fed back to the network. Based on the error value, the individual weights of the neuron connections are slightly altered and the output value from the third set of neurons 706 is calculated again, followed by the calculation of a new error value. Such calculations are repeated over a number of iterations until the neural network 700 “learns” the weight values to be applied to the neuron connections across the layers such that the error value is less than a threshold limit.
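By way of non-limiting illustration only, the learning phase described above may be sketched in Python as a network with one hidden layer. The class name `BackPropNetwork`, the sigmoid activation, the learning rate of 0.5, the threshold of 0.05, the weight initialization range and the toy training data (the logical OR of two inputs rather than shape vectors) are all assumptions made for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BackPropNetwork:
    """Input layer, one hidden layer and an output layer.

    Each weight row belongs to one neuron; its last entry is a bias weight.
    """

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, inputs):
        x = list(inputs) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)))
                  for row in self.w_hidden]
        h = hidden + [1.0]
        output = [sigmoid(sum(w * v for w, v in zip(row, h)))
                  for row in self.w_out]
        return hidden, output

    def train_step(self, inputs, target, rate=0.5):
        """Feed the output error back and slightly alter every weight."""
        hidden, output = self.forward(inputs)
        x = list(inputs) + [1.0]
        h = hidden + [1.0]
        delta_out = [(t - o) * o * (1 - o) for t, o in zip(target, output)]
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [
            hj * (1 - hj) * sum(d * self.w_out[k][j]
                                for k, d in enumerate(delta_out))
            for j, hj in enumerate(hidden)
        ]
        for k, d in enumerate(delta_out):
            for j, value in enumerate(h):
                self.w_out[k][j] += rate * d * value
        for j, d in enumerate(delta_hidden):
            for i, value in enumerate(x):
                self.w_hidden[j][i] += rate * d * value


def rms_error(net, samples):
    total, count = 0.0, 0
    for inputs, target in samples:
        _, output = net.forward(inputs)
        total += sum((t - o) ** 2 for t, o in zip(target, output))
        count += len(target)
    return math.sqrt(total / count)


# Toy training set with known correct outputs (logical OR).
samples = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (1,))]
net = BackPropNetwork(n_in=2, n_hidden=3, n_out=1)
error_before = rms_error(net, samples)
for _ in range(2000):
    for inputs, target in samples:
        net.train_step(inputs, target)
    if rms_error(net, samples) < 0.05:  # threshold limit on the error
        break
error_after = rms_error(net, samples)
```

For shape recognition, the inputs would instead be the sampled shape vectors and each output neuron could stand for one shape category.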
As mentioned earlier, the back-propagation neural network 700 is used for recognizing flowchart components extracted from a flowchart image by training the network first in the learning phase. In an embodiment, shape vectors for standard flowchart shape components are generated for training the neural network 700. For example, a standard bounded size for the standard component shapes is determined to be 160×80 pixels and a standard number of sampled points for describing the shape vector is considered to be 350. A new shape vector with a different number of points is re-sampled to a vector size of 350. A back-propagation algorithm for training the neural network 700 inputs a test data set containing the shape vectors and the correct known output vectors to the neural network 700. Additionally, the shape vector data is perturbed with a Gaussian noise of ±3 standard deviation of pixels and with zero mean in order to train the neural network. This ensures that the network is able to adapt itself to numerous variations in shapes. In various embodiments of the present invention, the back-propagation training algorithm implements various modes for training the neural network 700 such as varying the number of network layers, the number of neurons in the hidden layer, the activation function, the learning rate and the threshold error limit. The training algorithm was implemented to minimize a Root Mean Square (RMS) error value between a correct known output vector and the output vector processed by the neural network 700. Experimental values for RMS error values determined by implementing the training algorithm in various modes are illustrated in the description of
In an embodiment of the present invention, theoretical RMS error values were calculated for the three topologies of the neural network 700 by increasing the number of iterations performed for each neural network configuration. As illustrated in
In another embodiment of the present invention, the performance of the three neural network configurations was experimentally tested by training the three configurations using a database having 100 different geometrical shape vectors. The following table illustrates the RMS errors for the three configurations based on the experimental tests.
The present invention may be implemented in numerous ways including as a system, a method, or a computer readable medium such as a computer readable storage medium or a computer network wherein programming instructions are communicated from a remote location.
While the exemplary embodiments of the present invention are described and illustrated herein, it will be appreciated that they are merely illustrative. It will be understood by those skilled in the art that various modifications in form and detail may be made therein without departing from or offending the spirit and scope of the invention as defined by the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
452/CHE/2011 | Feb 2011 | IN | national |