The presently disclosed subject matter relates to computerized systems and methods of database management.
Organizations of all sorts use ever-growing databases for storing and analyzing data. These organizations face various challenges in managing and curating huge amounts of data across multiple data sources. These challenges include those related to data discovery and data governance, such as those prescribed by the General Data Protection Regulation (GDPR). As part of the management and usage of databases, string pattern matching with the help of classifiers is often used. Classifiers define specific rules for pattern searching, where the pattern comprises a desired sequence of characters and/or character types, and provide a tool which facilitates searches and screening in databases. Classifiers can be implemented in various ways, each giving rise to a specific type of classifier. One example of a classifier is a regular expression (referred to herein in short as a “RegEx”, plural “RegExes”), which is an ordered sequence of characters, each character in the sequence representing a specific character or character type of a desired search pattern. A simple example of a regular expression in the Java programming language is [abc], which matches any string comprising the letter a, b or c, whereas the regular expression [a-d1-7] matches any single character that is a letter between a and d or a digit from 1 to 7, but does not match the two-character string d1.
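The behavior of these two expressions can be illustrated as follows (shown here with Python's re module rather than Java; the character-class syntax is the same in both languages):

```python
import re

# [abc] matches any single occurrence of the letter a, b or c
assert re.search(r"[abc]", "bar")                # "bar" contains 'b'
assert re.search(r"[abc]", "xyz") is None        # no a, b or c present

# [a-d1-7] matches one character: a letter in a-d or a digit in 1-7
assert re.fullmatch(r"[a-d1-7]", "c")            # letter in range a-d
assert re.fullmatch(r"[a-d1-7]", "5")            # digit in range 1-7
assert re.fullmatch(r"[a-d1-7]", "d1") is None   # "d1" is two characters
```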
Another example of a classifier is a code-based classifier. One implementation of code-based classifiers includes software snippets defining the rules for pattern searching. Another implementation includes machine-learning-based classifiers, where a machine learning model is trained to identify desired patterns. A further example of a classifier is a heuristic classifier.
General Description
While classifiers provide a valuable tool when using databases, classifiers are deficient in various ways. For example, on the one hand, a classifier which is too strict would miss many valid results and thus produce many false negatives, while on the other hand, a classifier which is too permissive would produce many false positives. Therefore, classifiers demand time-consuming fine-tuning work to obtain the desired accuracy levels in the screening output.
The presently disclosed subject matter includes a computerized method and system directed at coping with the challenges related to classifiers in a scalable manner. The disclosed method and system provide the ability to train and execute a unique machine learning (ML) model specifically configured to enhance the classifier output by identifying and removing false positive results from that output.
As described below in more detail, the classifier output comprises a collection of data-subsets (e.g., columns in a relational database) of one or more structured or semi-structured data sources (e.g., tables of a relational database), including both data-subsets which have been identified as having passed the pattern matching screening of the classifier (referred to herein below as “matching data-subsets”) and data-subsets which have been identified as having failed the pattern matching screening of the classifier (referred to herein below as “non-matching data-subsets”). The classifier output is transformed to be represented by a plurality of numerical vectors. The numerical vectors are used during a training phase (as well as during the execution phase) for training a machine learning model to enhance the classifier output and reduce false positives. Each numerical vector represents certain numerical characteristics of the data-values from which it is constructed.
According to some examples, the data-values in each data-subset are divided into subgroups, and each subgroup is represented by a respective char_position vector. Division of the values in a data-subset into subgroups may include dividing the values into a first subgroup comprising all matching data-values (i.e., values which have been identified as having passed the screening of the classifier) and a second subgroup comprising all non-matching values (i.e., values which have been identified as having failed the screening of the classifier). A first numerical vector is thus generated for all matching values in the data-subset, and a second numerical vector is generated for all non-matching values in the data-subset.
One example of a numerical vector is a features vector. A features vector (referred to herein also as “char_position vector”) is a numerical vector which stores features which represent the aggregated character position frequencies in data-values in a respective data-subset. The features vectors provide an accurate and easily maintainable representation of the pattern of the data-values in each data-subset.
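By way of a non-limiting illustrative sketch (the function name and the dictionary-of-counts representation are hypothetical; the precise feature scheme is detailed further below), aggregated character-position frequencies can be computed as follows:

```python
from collections import Counter

def char_position_features(values):
    """Aggregate (character, position) frequencies over the data-values
    of one subgroup, yielding a sparse char_position representation."""
    counts = Counter()
    for value in values:
        for pos, ch in enumerate(value):
            counts[(ch, pos)] += 1
    return counts

# Three values sharing the pattern digit-digit-dash-digit
features = char_position_features(["12-3", "45-6", "78-9"])
assert features[("-", 2)] == 3   # '-' appears at position 2 in all three values
assert features[("1", 0)] == 1   # '1' appears at position 0 in one value
```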
The numerical vectors (e.g., features vectors) representing matching and non-matching values are used as input to the machine learning model, which is trained, using supervised machine learning methods, to correlate the patterns of matching and non-matching values with the respective classifier. During execution, this correlation is applied by the trained machine learning model to remove false positives from the classifier output, thereby significantly improving the precision of the final output.
According to further examples, the precision of the final output is further improved by adding to the machine learning input one or more additional numerical vectors generated for each data-subset (e.g., columns) e.g., a features vector generated based on its name (referred to herein also as “name-features-vector”).
Enhancement of the accuracy of the classifier output enables the use of more permissive classifiers, which require significantly less implementation effort. Using more permissive classifiers also facilitates the usage of computer programs for automatically generating classifiers, which do not require human intervention (or at least considerably reduce the need for human intervention). It may also remove the need for a support term (a secondary classifier that helps to further enhance the classifier output, e.g., matching 16-digit numbers located in a column whose name is a string similar to “Credit Card”).
According to one aspect of the presently disclosed subject matter there is provided a computerized method of pattern (string) matching in data-subsets of one or more structured (e.g., relational databases) or semi-structured data sources, each data-subset comprising data-values; the method comprising using a processing circuitry for:
obtaining data indicative of matching data-subsets determined based on classifier data output; wherein the classifier data output is a product of applying on one or more data-subsets (from the one or more structured or semi-structured data sources) a classifier (e.g., regular expression) dedicated for identifying data-values that match a certain pattern (sequence of characters);
for each matching data-subset:
generating at least two respective numerical vectors (e.g., features vectors), comprising:
dividing the data-subset into a first subgroup comprising the matching data-values and a second subgroup comprising the non-matching data-values;
generating a first numerical vector (e.g., first features vector) from the data-values in the first subgroup and a second numerical vector (e.g., second features vector) from the data-values in the second subgroup;
executing a ML model, wherein the at least two numerical vectors are used as input to the ML model, wherein the ML model is configured for processing the at least two numerical vectors and identifying one or more data-subsets from the matching data-subsets, which represent false positive matching data-subsets, thereby enhancing the classifier output.
In addition to the above features, the method according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (i) to (xxi) below, in any technically possible combination or permutation:
I. The computerized method further comprising applying the classifier on the one or more data-subsets to thereby obtain the classifier data output;
II. wherein the classifier is a regular expression (RegEx);
III. wherein the classifier is a machine learning based classifier;
IV. wherein the first numerical vector is a first features vector and the second numerical vector is a second features vector; wherein the first features vector comprises a first collection of features, each feature being indicative of an aggregated frequency of a respective n-gram-position pair in the first subgroup of data-values, and the second features vector comprises a second collection of features, each feature being indicative of an aggregated frequency of a respective n-gram-position pair in the second subgroup.
V. The computerized method further comprises classifying the data-subset as either matching or non-matching based on the relative portion of matching and non-matching data-values;
VI. wherein the n-gram-position pairs in the first collection and the second collection include unigram-position pairs and/or bigram-position pairs.
VII. wherein the generation of the first features vector and the second features vector comprises:
VIII. The computerized method further comprising calculating a hash value for each feature (n-gram-position pair and its respective frequency) in the first features vector and the second features vector.
IX. The computerized method further comprising applying a hashing trick on the first features vector and the second features vector.
X. The computerized method further comprising applying a term frequency-inverse document frequency (TF-IDF) on the first features vector and the second features vector, thereby obtaining a respective TF-IDF score for each feature in the first features vector and the second features vector.
XI. The computerized method further comprising:
XII. The computerized method, wherein the one or more additional numerical vectors are generated as word vectors representing the respective data.
XIII. The computerized method further comprising:
XIV. wherein the data-subsets are columns in a relational database.
XV. The computerized method further comprises implementing ML training, comprising:
XVI. wherein the first numerical vector is a first features vector and the second numerical vector is a second features vector; wherein the first features vector comprises a first collection of features, each feature being indicative of an aggregated frequency of a respective n-gram-position pair in the first subgroup of data-values, and the second features vector comprises a second collection of features, each feature being indicative of an aggregated frequency of a respective n-gram-position pair in the second subgroup;
XVII. wherein the implementing of the training phase further comprises: calculating for each feature a respective hash value.
XVIII. wherein the n-gram-position pairs in the first collection and the second collection generated during the training phase include unigram-position pairs and/or bigram-position pairs.
XIX. wherein the implementing of the training phase further comprises applying a hashing trick on the first features vector and the second features vector.
XX. wherein the implementing of the training phase further comprises generating for each data-subset in the sample dataset at least one additional numerical vector.
XXI. wherein the implementing of the training phase further comprises generating for each data-subset in the sample dataset at least two additional features vectors comprising a first name-features-vectors and a second name-features-vector, wherein the first name-features-vector comprises a collection of features, each feature being indicative of an aggregated frequency of a respective n-gram-position pair in a name of the data-subset and the second name-features-vector comprises a collection of features, each feature being indicative of an aggregated frequency of a respective n-gram-position pair in a processed version of the name of the data-subset.
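The hashing trick referenced in features (ix) and (xix) above can be sketched as follows (a minimal illustration; the bucket count and the use of Python's built-in hash are arbitrary choices for the sketch — note that Python salts string hashing per process, so bucket indices are not stable across runs):

```python
def hash_features(feature_counts, n_buckets=1024):
    """Fold a sparse {feature: frequency} mapping into a fixed-length vector
    by hashing each feature to a bucket index (the 'hashing trick')."""
    vector = [0] * n_buckets
    for feature, frequency in feature_counts.items():
        vector[hash(feature) % n_buckets] += frequency
    return vector

hashed = hash_features({("a", 0): 3, ("b", 1): 2})
assert len(hashed) == 1024      # fixed dimensionality regardless of feature count
assert sum(hashed) == 5         # all frequencies preserved (up to collisions)
```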
According to another aspect of the presently disclosed subject matter there is provided a computerized system of pattern matching in data-subsets of one or more structured or semi-structured data sources, each data-subset comprising data-values; the system comprises a processing circuitry configured to:
obtain data indicative of matching data-subsets determined based on classifier data output; wherein the classifier data output is a product of applying on one or more data-subsets (from the one or more structured or semi-structured data sources) a classifier (e.g., regular expression) dedicated for identifying data-values that match a certain pattern;
for each matching data-subset:
generate at least two respective numerical vectors (e.g., features vectors), comprising:
divide the data-subset into a first subgroup comprising the matching data-values and a second subgroup comprising the non-matching data-values;
generate a first numerical vector (e.g., features vector) from the data-values in the first subgroup and a second numerical vector (e.g., features vector) from the data-values in the second subgroup;
execute a ML model, wherein the at least two numerical vectors are used as input to the ML model, wherein the ML model is configured for processing the at least two numerical vectors and identifying one or more data-subsets from the matching data-subsets, which represent false positive matching data-subsets, thereby enhancing classifier output.
According to another aspect of the presently disclosed subject matter there is provided a non-transitory program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform a method of pattern (string) matching in data-subsets of one or more structured (e.g., relational databases) or semi-structured data sources, each data-subset comprising data-values, the method comprising:
obtaining data indicative of matching data-subsets determined based on classifier data output; wherein the classifier data output is a product of applying on one or more data-subsets (from the one or more structured or semi-structured data sources) a classifier (e.g., regular expression) dedicated for identifying data-values that match a certain pattern (sequence of characters);
for each matching data-subset:
generating at least two respective numerical vectors (e.g., features vectors), comprising:
dividing the data-subset into a first subgroup comprising the matching data-values and a second subgroup comprising the non-matching data-values;
generating a first numerical vector (e.g., features vector) from the data-values in the first subgroup and a second numerical vector (e.g., features vector) from the data-values in the second subgroup;
executing a ML model, wherein the at least two numerical vectors are used as input to the ML model, wherein the ML model is configured for processing the at least two numerical vectors and identifying one or more data-subsets from the matching data-subsets, which represent false positive matching data-subsets, thereby enhancing classifier output.
The system and non-transitory program storage device, disclosed in accordance with the presently disclosed subject matter, can optionally comprise one or more of features (i) to (xxi) listed above, mutatis mutandis, in any technically possible combination or permutation.
According to a further aspect of the presently disclosed subject matter there is provided a computerized method of training a machine learning (ML) model dedicated for detecting false detections determined based on a classifier applied on data-subsets of one or more structured or semi-structured data sources, each data-subset comprising data-values; the method comprising using a processing circuitry for:
generating a sample dataset comprising multiple matching data-subsets and multiple non-matching data-subsets, wherein classification of a data-subset as matching or non-matching is according to a relative portion of matching and non-matching data-values in each respective data-subset; wherein data-values in each data-subset are classified as matching or non-matching by applying the classifier on the data-values;
for each data-subset in the sample dataset:
dividing the data-subset into a first subgroup comprising the matching data-values and a second subgroup comprising the non-matching data-values;
generating a first numerical vector (e.g., first features vector) from the data-values in the first subgroup and a second numerical vector (e.g., second features vector) from the data-values in the second subgroup;
receiving user input indicative of whether the classification of each data-subset in the sample dataset, to matching or non-matching, is true or false, thereby obtaining an annotated sample dataset;
using the first and second numerical vectors of each one of the data-subsets in the sample dataset as a training set for training the machine learning model for identifying false detections made by the classifier.
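The training-set construction above can be sketched as follows (hypothetical helper names; the vectorization is deliberately simplified, and the annotations stand in for the user input described above):

```python
def make_training_rows(sample_dataset, classifier_match, vectorize, annotations):
    """Build (feature_row, label) training pairs from an annotated sample dataset.

    sample_dataset: {subset_name: list of data-values}
    annotations:    {subset_name: 1 if the classifier's call was true, else 0}
    """
    rows = []
    for name, values in sample_dataset.items():
        matched = [v for v in values if classifier_match(v)]
        unmatched = [v for v in values if not classifier_match(v)]
        rows.append((vectorize(matched) + vectorize(unmatched), annotations[name]))
    return rows

rows = make_training_rows(
    {"ids": ["1", "2", "x"], "words": ["aa", "bb"]},
    classifier_match=str.isdigit,
    vectorize=lambda vals: [len(vals)],   # toy vectorization: subgroup size
    annotations={"ids": 1, "words": 0})
assert rows == [([2, 1], 1), ([0, 2], 0)]
```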
In addition to the above features, the method according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (i) to (x) below, in any technically possible combination or permutation:
I. wherein the classifier is a regular expression (RegEx);
II. wherein the first numerical vector is a first features vector and the second numerical vector is a second features vector; wherein the first features vector comprises a first collection of features, each feature being indicative of an aggregated frequency of a respective n-gram-position pair in the first subgroup of data-values, and the second features vector comprises a second collection of features, each feature being indicative of an aggregated frequency of a respective n-gram-position pair in the second subgroup.
III. wherein generating the first and second numerical vectors further comprises calculating for each feature a respective hash value.
IV. wherein the n-gram-position pairs in the first collection and the second collection include unigram-position pairs and/or bigram-position pairs.
V. The computerized method further comprising:
VI. The computerized method wherein the one or more additional numerical vectors include at least one word vector representing the respective data.
VII. The computerized method further comprising:
VIII. The computerized method further comprising calculating a hash value for each n-gram-position pair and its respective frequency in the first numerical vector and the second numerical vector.
IX. The computerized method further comprising applying a term frequency-inverse document frequency (TF-IDF) transformation on the first numerical vector and the second numerical vector, thereby obtaining a respective TF-IDF score for each feature in the first numerical vector and the second numerical vector.
X. The computerized method, wherein the ML model is of any one of the following types: Stochastic Gradient Descent model; Decision Tree; Logistic Regression; and Random Forest Classifier.
According to another aspect of the presently disclosed subject matter there is provided a computerized system configured for training a machine learning (ML) model dedicated for detecting false-positive detections determined based on a classifier applied on data-subsets of one or more structured or semi-structured data sources, each data-subset comprising data-values; the system comprising a processing circuitry configured to:
generate a sample dataset comprising multiple matching data-subsets and multiple non-matching data-subsets, wherein classification of a data-subset as matching or non-matching is according to a relative portion of matching and non-matching data-values in each respective data-subset; wherein data-values in each data-subset are classified as matching or non-matching by applying the classifier on the data-values;
for each data-subset in the sample dataset:
divide the data-subset into a first subgroup comprising the matching data-values and a second subgroup comprising the non-matching data-values;
generate a first numerical vector (e.g., a first features vector) from the data-values in the first subgroup and a second numerical vector (e.g., a second features vector) from the data-values in the second subgroup;
receive user input indicative of whether the classification of each data-subset in the sample dataset, to matching or non-matching, is true or false, thereby obtaining an annotated sample dataset;
use the first and second numerical vectors of each one of the data-subsets in the sample dataset as a training set for training the machine learning model for identifying false detections made by the classifier.
According to another aspect of the presently disclosed subject matter there is provided a non-transitory program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform a method of training a machine learning (ML) model dedicated for detecting false detections determined based on a classifier applied on data-subsets of one or more structured or semi-structured data sources, each data-subset comprising data-values; the method comprising using a processing circuitry for:
generating a sample dataset comprising multiple matching data-subsets and multiple non-matching data-subsets, wherein classification of a data-subset as matching or non-matching is according to a relative portion of matching and non-matching data-values in each respective data-subset; wherein data-values in each data-subset are classified as matching or non-matching by applying the classifier on the data-values;
for each data-subset in the sample dataset:
dividing the data-subset into a first subgroup comprising the matching data-values and a second subgroup comprising the non-matching data-values;
generating a first numerical vector from the data-values in the first subgroup and a second numerical vector from the data-values in the second subgroup;
receiving user input indicative of whether the classification of each data-subset in the sample dataset, to matching or non-matching, is true or false, thereby obtaining an annotated sample dataset;
using the first and second numerical vectors of each one of the data-subsets in the sample dataset as a training set for training the machine learning model for identifying false detections made by the classifier.
The system and non-transitory program storage device, disclosed in accordance with the presently disclosed subject matter, can optionally comprise one or more of features (i) to (x) listed above, mutatis mutandis, in any technically possible combination or permutation.
In order to understand the presently disclosed subject matter and to see how it may be carried out in practice, the subject matter will now be described, by way of non-limiting examples only, with reference to the accompanying drawings, in which:
In the drawings and descriptions set forth, identical reference numerals indicate those components that are common to different embodiments or configurations. Elements in the drawings are not necessarily drawn to scale.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that, throughout the specification, discussions utilizing terms such as “generating”, “obtaining”, “dividing”, “executing”, “classifying”, “assigning” or the like, include an action and/or processes of a computer that manipulate and/or transform data into other data, said data represented as physical quantities, e.g. such as electronic quantities, and/or said data representing the physical objects.
The terms “computer”, “computer device”, “computerized device” or the like, should be expansively construed to include any kind of hardware-based electronic device with a data processing circuitry (e.g., digital signal processor (DSP), a GPU, a TPU, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), microcontroller, microprocessor etc.). The processing circuitry can comprise, for example, one or more processors operatively connected to computer memory, loaded with executable instructions for executing operations as further described below. For example, control unit 105 described below with reference to
The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes, or by a general-purpose computer specially configured for the desired purpose by a computer program stored in a computer readable storage medium.
As used herein, the phrase “for example,” “such as”, “for instance” and variants thereof, describe non-limiting embodiments of the presently disclosed subject matter. Reference in the specification to “one example”, “some examples”, “other examples”, or variants thereof, means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of the phrase “one example”, “some examples”, “other examples” or variants thereof does not necessarily refer to the same embodiment(s).
It is appreciated that certain features of the presently disclosed subject matter, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are, for brevity, described in the context of a single embodiment, may also be provided separately, or in any suitable sub-combination.
In embodiments of the presently disclosed subject matter, fewer, more and/or different stages than those shown in
A structured data source is one whose elements are addressable for effective analysis. This includes, for example, relational databases organized in tables consisting of rows and columns (e.g., SQL databases). A semi-structured data source is one that is not stored in a relational database but that has some organizational properties that make it easier to analyze. This includes, for example, non-SQL databases, including but not limited to MongoDB. As used herein, and unless specifically indicated otherwise, the term “data-source” is used to include both a structured and a semi-structured data source, and the term “data-subset” is used to include a sub-component of a structured or semi-structured data source. Examples of data-subsets in structured data include columns and rows, and an example of a data-subset in a semi-structured data source is a field retrieved from a plurality of documents in a MongoDB database, where the retrieved values collectively represent a data-subset. It is noted that while the following description predominantly refers to relational databases and their components, such as tables and columns, this is done by way of example only and should not be construed to limit the scope to relational databases alone. The presently disclosed subject matter can likewise be applied to other types of data sources, including semi-structured data sources (e.g., MongoDB).
As explained above, a classifier can be implemented in various ways, each giving rise to a specific type of classifier. In this regard it is noted that while the following description predominantly uses the term regular expression (RegEx), this is done by way of example only and should not be construed to limit the scope in any way. The presently disclosed subject matter contemplates and can be likewise applied with any other type of classifier. Bearing the above in mind, attention is drawn to
DBS 100 can further comprise a database management layer 110 comprising one or more control units (CU 105-1 to 105-n), which are computerized devices operatively connected to the physical storage space and to one or more hosts (101-1 to 101-n), and configured to control and execute various operations in the DBS. These operations include, for example, data retrieval requests received from users interacting with the DBS via the hosts, or from other applications interacting with the DBS. Other examples of operations which may be performed by the control units are described in more detail below. A host includes any computer device which communicates with the database management layer 110, e.g., a PC, a workstation, a Smartphone, a cloud host (where at least part of the processing is executed by remote computing services accessible via the cloud), or the like.
The process disclosed with reference to
During the ML pre-processing phase, the training dataset is generated. During the training phase, the ML model is trained using the training dataset. During the ML execution phase, the trained ML model is applied on classifier output to identify and remove false positive results and thereby enhance the output.
More specifically, at block 301 a classifier (e.g., RegEx) is applied on a collection of data-subsets (e.g., with the help of classifier engine 202). During the application, a plurality of data-subsets (e.g., columns) from one or more data sources (e.g., relational tables) are scanned with the classifier. During the scanning, data-values stored in each column are processed using the RegEx to determine whether the data-values match the RegEx. Metadata can be added with respect to each scanned data-value, indicating whether the string pattern of the data-value matches the RegEx requirements. The scanning output can include, for example, for each column, data indicating which of the data-values in the column are matching, and which of the data-values in the column are non-matching. Furthermore, in some examples, each one of the scanned columns is classified (and marked by appropriate metadata) as a “matching column” in case the portion of matching data-values in the column is greater than a certain predefined threshold (also referred to as the “matching criterion”), and as a “non-matching column” otherwise. For example, a matching column can be defined as a column in which at least 10% (or 15%, or 20%, or 50%, etc.) of the data-values match the RegEx. Thus, the RegEx output data includes a collection of data-subsets (e.g., columns), where each data-subset is classified as matching or non-matching the respective RegEx, and the data-values in each data-subset are classified individually as matching or non-matching the respective RegEx. The RegEx output is stored in a transitory and/or non-transitory computer data-storage.
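The per-column screening described above can be sketched as follows (the 16-digit RegEx and the 10% matching criterion are example choices only):

```python
import re

CREDIT_CARD = re.compile(r"\d{16}")   # example classifier: 16-digit numbers

def screen_column(values, pattern=CREDIT_CARD, threshold=0.10):
    """Flag each data-value as matching/non-matching, then classify the
    whole column by the portion of matching values (the matching criterion)."""
    flags = [bool(pattern.fullmatch(v)) for v in values]
    is_matching_column = sum(flags) / len(flags) >= threshold
    return flags, is_matching_column

flags, matched = screen_column(["4111111111111111", "n/a", "4012888888881881"])
assert flags == [True, False, True] and matched
```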
As the purpose is to create a training dataset for the machine learning model, in some examples only a sample of the data, which is sufficient for the training dataset, is processed. Thus, while tables in a relational database may include many columns (e.g., thousands and more) and columns may include many rows (e.g., ranging up to thousands, hundreds of thousands and more), a sample dataset which includes only a limited number of data-subsets, each comprising a limited number of data-values, can be used. According to some examples, a sample dataset is generated by randomly selecting a number ‘n’ of columns from a larger collection of columns (e.g., 10≥n≥5). In further examples, a certain number ‘m’ of data-values are randomly selected from each column in the sample to create a smaller sample dataset (e.g., 500≥m≥100). The columns are selected so that they comprise both matching and non-matching columns. In one non-limiting example, an equal number of matching and non-matching columns are used. Considering that a minimum of n matching columns and n non-matching columns is required for the training dataset, in case the RegEx is applied before the selection of the sample dataset, at least n matching columns and at least n non-matching columns are randomly selected from the RegEx output to thereby obtain the required number of matching and non-matching columns for the sample dataset. Otherwise, if the sample dataset is generated before the application of the RegEx, it is ensured that the RegEx output comprises an adequate number of matching and non-matching columns. For example, assuming n≥5, the number of columns initially selected may include 50 (or 40, or 30, etc.) randomly selected columns, which are processed by the RegEx; 5 or more matching and 5 or more non-matching columns are then randomly selected from the RegEx output to thereby obtain the required number of matching and non-matching columns.
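The sampling step can be sketched as follows (a hypothetical helper; the balancing of matching and non-matching columns described above is omitted for brevity, and the fixed seed is only for reproducibility of the sketch):

```python
import random

def sample_dataset(columns, n_columns=5, m_values=100, seed=0):
    """Randomly pick n columns from a larger collection, then up to
    m data-values from each picked column."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    chosen = rng.sample(sorted(columns), min(n_columns, len(columns)))
    return {name: rng.sample(columns[name], min(m_values, len(columns[name])))
            for name in chosen}

cols = {f"col{i}": [str(j) for j in range(1000)] for i in range(20)}
sample = sample_dataset(cols, n_columns=5, m_values=100)
assert len(sample) == 5 and all(len(v) == 100 for v in sample.values())
```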
According to some examples, processing circuitry 200 includes a sample generator 204 configured to process the RegEx output and generate the sample dataset.
At block 303 the RegEx output of the sample dataset is presented to a user to enable the user to review and annotate the RegEx output. To this end, processing circuitry 200 can further comprise a user interface which comprises a user interface application 206 being operatively connected to user interface devices (including, for example, a display screen and input devices such as a keyboard, touchscreen, computer mouse, etc.). The user interface is configured to generate and display on a display device a graphical representation of the sample dataset. The displayed data includes the sampled data-values of each sampled column and, optionally, data indicating, with respect to each sampled data-value, whether it is a matching or a non-matching data-value according to the RegEx output.
Reverting to
At block 307 two or more numerical vectors are generated for each one of the columns in the sample dataset (e.g., by numerical vectors generation module 208). As mentioned above, a numerical vector represents certain numerical characteristics of the data-values from which it is constructed. An example of numerical vectors are features vectors (otherwise referred to as “char_position vectors”), which are vectors that store data indicative of the aggregated frequencies of different character-position in data-values in a respective data-subset (e.g., column). In the following descriptions of various aspects of the presently disclosed subject matter, features vectors are used as an example of numerical vectors, however this is done by way of example only for ease of understanding and should not be construed as limiting.
Recall that each data-value in the sample dataset has been classified by the RegEx engine as either matching or non-matching. Each column in the sample is divided into a first subgroup comprising matching data-values, and a second subgroup comprising non-matching data-values, where a first features vector is generated using all matching data-values in the column, and a second features vector is generated using all non-matching data-values in the column, giving rise to two features vectors generated for each column in the sample dataset.
At block 53, each n-gram (e.g., of the type unigram and/or the type bigram) is assigned with a respective value indicating its position within the respective string, giving rise to a char-position pair or n-gram-position pair (e.g., unigram-position pair and bigram-position pair). At block 55 the occurrences of each n-gram-position pair in the first subgroup are aggregated, giving rise to a first vector (also referred to herein as a first “char position vector” or “first features vector”) that holds the number of occurrences of each n-gram-position pair. Likewise, the occurrences of each n-gram-position pair in the second subgroup are aggregated, giving rise to a second vector (also referred to herein as a second “char position vector” or “second features vector”) that holds the number of occurrences of each n-gram-position pair. In case more than one type of n-gram is used (e.g., unigrams and bigrams), the different types of n-gram-position pairs (e.g., unigram-position pairs and bigram-position pairs) are aggregated together in the same vector.
A simple example which demonstrates the principles of creating the features vectors is provided herein. Assuming a column ‘country’ consists of the values ‘USA’, ‘UK’ and ‘Uworld’ (the latter being negative to a ‘country’ RegEx), the aggregated vectors of unigrams and bigrams would comprise the following features (before applying the hashing trick):
First features vector (matching): [(‘U_0’, 2), (‘S_1’, 1), (‘A_2’, 1), (‘US_0’, 1), (‘SA_1’, 1), (‘K_1’, 1), (‘UK_0’, 1)]; and
Second features vector (non-matching): [(‘U_0’, 1), (‘w_1’, 1), (‘o_2’, 1), (‘r_3’, 1), (‘l_4’, 1), (‘d_5’, 1), (‘Uw_0’, 1), (‘wo_1’, 1), (‘or_2’, 1), (‘rl_3’, 1), (‘ld_4’, 1)].
Each one of the vectors above is generated by aggregating features denoting one or more (in the current example, two) types of n-gram-position pairs and their respective frequency. The annotation above indicates the frequency of occurrence (indicated by the digit at the right side of each feature) of each unigram-position pair and bigram-position pair (indicated by the n-gram-position pair at the left side of each feature) in the first vector generated for the matching subgroup of data-values, and the frequency of occurrence of each unigram-position pair and bigram-position pair in the second vector generated for the non-matching subgroup of data-values.
For example, in the first vector (matching), the feature (‘U_0’, 2) indicates that the character (unigram) “U” appears in two data-values at the first position, and the feature (‘SA_1’, 1) indicates that the bigram “SA” appears in one data-value at the second position. In the second vector (non-matching), the feature (‘w_1’, 1) indicates that the unigram “w” appears once in the second position, and the feature (‘Uw_0’, 1) indicates that the bigram “Uw” appears once in the first position.
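The aggregation of unigram-position and bigram-position pairs demonstrated in the ‘country’ example above can be sketched as follows (the function name is illustrative; the sketch reproduces the two features vectors listed above):

```python
from collections import Counter

def char_position_features(values):
    """Aggregate unigram-position and bigram-position pair counts over a
    subgroup of data-values (before applying the hashing trick)."""
    counts = Counter()
    for value in values:
        for i, ch in enumerate(value):       # unigram-position pairs, e.g. 'U_0'
            counts[f"{ch}_{i}"] += 1
        for i in range(len(value) - 1):      # bigram-position pairs, e.g. 'US_0'
            counts[f"{value[i:i+2]}_{i}"] += 1
    return counts

matching = char_position_features(["USA", "UK"])       # first features vector
non_matching = char_position_features(["Uworld"])      # second features vector
# matching["U_0"] is 2 ("U" at position 0 in both "USA" and "UK").
```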
At this point two features vectors are provided for each column, including the matching vector of the matching subgroup and the non-matching vector of the non-matching subgroup of each column. In some examples, one or more additional numerical vectors are generated. The additional numerical vectors can be generated based on various data related to the columns and/or the table. According to one example, a first additional numerical vector is a first name-features-vector which can be generated based on the raw column name, which is processed and transformed into a char-position vector (generated for example from its unigram and bigram-position pairs as described above). In some examples, a second numerical vector is added, where the name is first processed to clean the text and split into separated words (if it comprises more than one word), and a respective second name-features-vector and char-position vector is generated using the processing output. Thus, in some examples four numerical vectors (e.g., features vectors) are generated for each column of the sample dataset. According to another example, in addition to or instead of the above name-features-vector, other additional features vectors can be generated using other data, including for example: the name of the table holding the data-subset (e.g., column), the names of one or more columns adjacent to the column, and any other metadata associated with the column and/or the table.
As explained above, the additional numerical vectors can be generated using the actual value (e.g., the actual column name as explained above), in which case a respective features vector and char-position vector are generated (e.g., name-features-vector). Alternatively, the additional numerical vectors can be generated using word vectors representing the respective data. For example, where the data is the name of the column, a word vector of the column name can be generated. As is well known in the art of natural language processing (NLP), a word vector is used for representing a word for text analysis, where the word vector encodes the meaning of the word, such that words that are closer in the vector space are expected to be similar in meaning. Using word vectors can help, for example, to associate different columns which have different names with similar meanings. See for example word embedding in Wikipedia: (https://en.wikipedia.org/wiki/Word_embedding).
At block 57 a hash function is applied on each feature in each vector, thereby transforming the features vector to a vector of hashes. In some examples, to avoid the need to go over all the columns once to get all the char-position combinations that exist in all the data-values across all columns, the hashing trick is used. In short, a vectorizer that uses the hashing trick applies a hash function h to the elements in the vector (i.e., char-position pairs), and then uses the hash values directly as feature indices. In other words, instead of assigning unique indices to each char-position pair, the hashing algorithm applies a hash function on the char-position string and uses the numerical output as the index of the specific char-position pair. For example, instead of assigning the unique index 0 to the ‘U_0’ char-position pair, the hashing trick algorithm applies a hash function to the string ‘U_0’ and uses the output of the hash function (e.g., 47) as the index of that specific char-position pair. To build a vector of a predefined length, the algorithm applies the modulo operation on the resulting hash values. Even if some of the hashed values collide, it usually does not significantly affect downstream computations.
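The hashing trick described above can be sketched as follows. The specific hash function is not prescribed herein; `zlib.crc32` is used in the sketch merely as a deterministic stand-in (Python's built-in `hash()` is salted per process and is therefore unsuitable for a reproducible example), and the vector length is illustrative:

```python
import zlib

def hash_features(counts, dim=1024):
    """Map each char-position feature to an index via a hash function
    modulo the predefined vector length; colliding features simply add
    their counts together (the hashing trick)."""
    vec = [0] * dim
    for feature, count in counts.items():
        idx = zlib.crc32(feature.encode("utf-8")) % dim  # hash -> index
        vec[idx] += count
    return vec

# Hash a small illustrative features vector into a 16-slot vector.
vec = hash_features({"U_0": 2, "S_1": 1}, dim=16)
```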
After performing the hashing trick, two vectors are obtained (and possibly more if additional numerical vectors are generated). Note the shared hash (101), corresponding to the char-position pair “U_0”, which is found in both vectors:
First vector of hashes (matching): [(101, 2), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1)]; and
Second vector of hashes (non-matching): [(101, 1), (108, 1), (109, 1), (110, 1), (111, 1), (112, 1), (113, 1), (114, 1), (115, 1), (116, 1), (117, 1)]
Notably, for simplicity, serial hashes are used in the example, however, in reality, the generated hashes are randomly scattered over a sufficiently large range of integers.
The operations described above with reference to
As mentioned above, some char-position pairs may appear in more than one of the features vectors. According to some examples, a term frequency-inverse document frequency (TF-IDF) transformation is applied on the features vectors. As is well known in the art, TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. In the current case TF-IDF is applied on the char-position pairs to obtain information indicative of the char_position pairs which are most relevant for each features vector. This means that if, for example, the char_position pair “a_0” appears in many values, in both the matching and non-matching vectors, this indicates that it would have little contribution to the ML model for differentiating between matching and non-matching values. On the other hand, if “a_0” is found only in the matching values, this indicates that it would provide a strong indication to the ML model for differentiating between matching and non-matching values.
In some examples, the output of the TF-IDF transformation is a vector similar to the features vectors described above, where for each hash (generated from a char_position pair), a TF-IDF score is assigned in place of the number of occurrences.
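A minimal sketch of the TF-IDF scoring described above is provided below, treating each features vector as a “document” and each char-position pair as a “term”. The plain log(N/df) IDF variant is assumed here; library implementations commonly apply additional smoothing. Note how a pair appearing in both vectors (e.g., ‘a_0’ below) receives a zero score, reflecting its small contribution to differentiating between matching and non-matching values:

```python
import math
from collections import Counter

def tfidf(vectors):
    """Replace raw counts with TF-IDF scores.  Each element of `vectors`
    is a dict mapping a char-position pair to its occurrence count."""
    n = len(vectors)
    df = Counter()                       # document frequency of each pair
    for vec in vectors:
        df.update(vec.keys())
    scored = []
    for vec in vectors:
        total = sum(vec.values())
        scored.append({term: (count / total) * math.log(n / df[term])
                       for term, count in vec.items()})
    return scored

# 'a_0' appears in both vectors, so its score is 0; 'b_1' and 'c_1' do not.
scored = tfidf([{"a_0": 3, "b_1": 1}, {"a_0": 2, "c_1": 2}])
```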
Reverting to the training phase (32, block 309) in
It is noted that the classifier enhancing ML model can be repeatedly trained using different sample datasets, thereby continuously improving the accuracy of the ML model output. Since in each training iteration different data is processed and user input is provided with respect to the processed data, the “knowledge base” of the ML model is continuously expanded, thus improving its accuracy.
Proceeding to
At block 601 a “tested dataset” of one or more data-subsets is processed using one or more RegExes. The tested dataset can be, for example, one or more tables of a relational database, each comprising a plurality of columns, or, in another example, a collection of columns specifically selected from one or more tables.
The RegExes are applied on the data-subsets in the tested dataset to thereby obtain a respective output of each RegEx. Given a certain RegEx, the output includes data indicating, for each data-value in a data-subset, whether it is a matching or a non-matching data-value. Based on the RegEx data output, data-subsets (e.g., columns) can be classified as matching data-subsets or non-matching data-subsets. As mentioned above, this is determined based on the relative portion of matching and non-matching data-values in the data-subset. Data-subsets in which the portion of matching data-values is greater than a certain threshold (e.g., >10% or >20% or >40% or >50%, etc.) are classified as matching data-subsets.
At block 603 features vectors are generated for each of the matching data-subsets, based on the matching and non-matching data-values in the matching data-subset. As mentioned above, according to some examples, one or more additional features vectors are generated based on the data-subset name. For example, a first name-features-vector can be generated based on the raw column name, which is processed and transformed into a char-position vector (generated from its unigram and bigram-position pairs). In some examples, a second name-features-vector is added, where the name is first processed to clean the text and split into separated words (if it comprises more than one word), and a respective char-position vector is generated using the processing output. Thus, in some examples four features vectors are generated for each column of the tested dataset. Notably, if more than one RegEx is applied on a given column, more than one collection of features vectors can be generated for the given column, each collection being generated based on the respective RegEx output. The process of generating the features vectors is explained above with respect to block 307 in
At block 605 a classifier enhancing ML model is executed using the collection of features vectors as input. As mentioned above, in case additional features vectors are generated for each data-subset (e.g., two additional name-features-vectors), the additional features vectors and the TF-IDF-transformed matching and non-matching features vectors are stacked together and fed into the classifier enhancing ML model. In some examples, the plurality of features vectors generated for the same data-subset are concatenated into a single vector which is used as the ML input. In case a word vector is used for representing the additional data rather than features vectors, the corresponding word vector can be concatenated together with the other features vectors.
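The stacking of the per-column vectors into a single ML input can be sketched as follows. The vector contents below are hypothetical placeholders standing in for the TF-IDF-transformed matching and non-matching vectors and two name-features-vectors:

```python
def stack_features(*vectors):
    """Concatenate the features vectors generated for one data-subset
    (e.g., matching, non-matching, and name-features-vectors) into a
    single flat vector used as input to the ML model."""
    stacked = []
    for vec in vectors:
        stacked.extend(vec)
    return stacked

# Four illustrative per-column vectors concatenated into one ML input vector.
ml_input = stack_features([0.1, 0.0], [0.0, 0.3], [1, 0], [0, 1])
# ml_input is [0.1, 0.0, 0.0, 0.3, 1, 0, 0, 1]
```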
The ML model determines whether any of the matching data-subsets are false positives, i.e., are erroneously marked as matching based on the RegEx output. If false positives are identified, they can be marked and/or removed from the pool of matching data-subsets provided by the RegEx. For each specific RegEx, a respective classifier enhancing ML model is used for processing and enhancing the output of the specific RegEx.
It will also be understood that the system according to the presently disclosed subject matter may be a suitably programmed computer. Likewise, the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the method of the presently disclosed subject matter. The presently disclosed subject matter further contemplates a machine-readable non-transitory memory tangibly embodying a program of instructions executable by the machine for executing the method of the presently disclosed subject matter.
It is to be understood that the presently disclosed subject matter is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The presently disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter.