Aspects of the present disclosure relate to techniques for automatically assigning labels to dataset fields by analyzing the names of the dataset fields using natural language processing (NLP) and, optionally, by analyzing sample data from the dataset fields. The techniques determine labels for the dataset fields that provide a natural language description of data stored in the dataset fields.
Modern data processing systems manage vast amounts of data (e.g., millions, billions, or trillions of data records). An institution (e.g., a multinational bank, a global technology company, an e-commerce company, etc.) may have vast amounts (e.g., hundreds or thousands of terabytes) of data that is used for its operations. For example, the data may include transaction records, documents, tables, files, and/or other types of data. A data processing system may store data in thousands or millions of different datasets. Each of the datasets may include multiple fields in which data is stored. For example, a dataset may be a table having multiple columns (or rows), where each column (or row) represents a dataset field storing one or multiple field values.
Some embodiments provide a method for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The method comprises using at least one computer hardware processor to perform: for each particular field in one or more fields in the set of the dataset's fields: determining whether any field label in a field label glossary matches the particular field; when it is determined that a field label in the field label glossary matches the particular field, associating the particular field with the field label; and when it is determined that no field label in the field label glossary matches the particular field: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
Some embodiments provide a system for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The system comprises: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: for each particular field in one or more fields in the set of the dataset's fields: determining whether any field label in a field label glossary matches the particular field; when it is determined that a field label in the field label glossary matches the particular field, associating the particular field with the field label; and when it is determined that no field label in the field label glossary matches the particular field: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
Some embodiments provide a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The method comprises: for each particular field in one or more fields in the set of the dataset's fields: determining whether any field label in a field label glossary matches the particular field; when it is determined that a field label in the field label glossary matches the particular field, associating the particular field with the field label; and when it is determined that no field label in the field label glossary matches the particular field: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
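The glossary-match-then-generate flow described above can be illustrated with a minimal sketch. The abbreviation dictionary, glossary, and scoring here are invented placeholders, not the actual implementation:

```python
# Illustrative sketch of the labeling flow: try to match a field against the
# field label glossary; if no label matches, expand abbreviations in the
# field name into scored candidate words and build a new label from them.

ABBREVIATIONS = {           # hypothetical abbreviation-to-candidate-word map
    "acct": [("account", 0.9)],
    "nbr": [("number", 0.8)],
}

def label_field(field_name, glossary):
    """Return a glossary label if one matches, else build a new label."""
    # Step 1: try to match an existing label from the field label glossary.
    normalized = field_name.replace("_", " ").lower()
    for label in glossary:
        if label.lower() == normalized:
            return label

    # Step 2: expand each abbreviation in the name into candidate words
    # with scores, keep the best candidate per abbreviation, and join the
    # resulting word sequence into a new field label.
    words = []
    for token in field_name.lower().split("_"):
        candidates = ABBREVIATIONS.get(token, [(token, 1.0)])
        best_word, _score = max(candidates, key=lambda c: c[1])
        words.append(best_word)
    return " ".join(words).title()

glossary = ["Customer Name"]
print(label_field("acct_nbr", glossary))       # "Account Number" (new label)
print(label_field("customer_name", glossary))  # "Customer Name" (matched)
```

A real system would use fuzzier glossary matching and keep all scored candidates rather than greedily taking the single best word per abbreviation.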
Some embodiments provide a method for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The method comprises using at least one computer hardware processor to perform: for each particular field in one or more fields in the set of the dataset's fields: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
Some embodiments provide a system for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The system comprises: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: for each particular field in one or more fields in the set of the dataset's fields: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
Some embodiments provide a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The method comprises: for each particular field in one or more fields in the set of the dataset's fields: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
Some embodiments provide a method for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The method comprises using at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), a first set of candidate field labels for the particular field and field name analysis scores for the first set of candidate field labels; determining, using a subset of data from the particular field and tests associated with respective field labels in the field label glossary, a second set of candidate field labels and field data analysis scores for the second set of candidate field labels; determining merged candidate field labels and corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores; and assigning one of the merged candidate field labels to the particular field using the corresponding scores.
Some embodiments provide a system for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset. The system comprises: at least one computer hardware processor; and at least one non-transitory computer-readable medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), a first set of candidate field labels for the particular field and field name analysis scores for the first set of candidate field labels; determining, using a subset of data stored in the particular field and tests associated with respective field labels in the field label glossary, a second set of candidate field labels and field data analysis scores for the second set of candidate field labels; determining merged candidate field labels and corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores; and assigning one of the merged candidate field labels to the particular field using the corresponding scores.
Some embodiments provide a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The method comprises: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), a first set of candidate field labels for the particular field and field name analysis scores for the first set of candidate field labels; determining, using a subset of data stored in the particular field and tests associated with respective field labels in the field label glossary, a second set of candidate field labels and field data analysis scores for the second set of candidate field labels; determining merged candidate field labels and corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores; and assigning one of the merged candidate field labels to the particular field using the corresponding scores.
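The merge of name-analysis and data-analysis candidates described above can be sketched as a weighted combination of the two score sets; the weighting and example scores here are hypothetical:

```python
# Combine candidate labels proposed by field-name analysis with candidates
# proposed by field-data tests, ranking merged candidates by a weighted sum
# of the two scores (a missing score is treated as zero).

def merge_candidates(name_scores, data_scores, name_weight=0.5):
    """name_scores/data_scores: dicts mapping candidate label -> score."""
    merged = {}
    for label in set(name_scores) | set(data_scores):
        merged[label] = (name_weight * name_scores.get(label, 0.0)
                         + (1 - name_weight) * data_scores.get(label, 0.0))
    return merged

def assign_label(name_scores, data_scores):
    """Assign the merged candidate with the highest combined score."""
    merged = merge_candidates(name_scores, data_scores)
    return max(merged, key=merged.get)

name_scores = {"Account Number": 0.8, "Account Name": 0.6}
data_scores = {"Account Number": 0.9, "Routing Number": 0.7}
print(assign_label(name_scores, data_scores))  # "Account Number"
```

A label supported by both the field name and the sampled data naturally outranks one supported by only a single signal.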
Some embodiments provide a method for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The method comprises using at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining comprising: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation thereby obtaining sets of candidate words; generating, using the sets of candidate words identified for the abbreviations and an n-gram model indicating a plurality of word collections that appear within field labels of the field label glossary, at least one word sequence describing data stored in the particular field; and determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
Some embodiments provide a system for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The system comprises: at least one computer hardware processor; and at least one non-transitory computer-readable medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining comprising: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation thereby obtaining sets of candidate words; generating, using the sets of candidate words identified for the abbreviations and an n-gram model indicating a plurality of word collections that appear within field labels of the field label glossary, at least one word sequence describing data stored in the particular field; and determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
Some embodiments provide a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The method comprises: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining comprising: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation thereby obtaining sets of candidate words; generating, using the sets of candidate words identified for the abbreviations and an n-gram model indicating a plurality of word collections that appear within field labels of the field label glossary, at least one word sequence describing data stored in the particular field; and determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
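The n-gram-model step described above can be sketched with a bigram model built from glossary field labels: each abbreviation contributes a set of candidate words, and each combination is scored by how often its adjacent word pairs occur in the glossary. The glossary labels and candidate sets below are invented for illustration:

```python
# Minimal bigram sketch of the word-sequence generation step: score each
# combination of candidate words by counting how many of its adjacent word
# pairs appear within field labels of the glossary.
from itertools import product

GLOSSARY_LABELS = ["account number", "routing number", "customer name"]

# Build bigram counts from the glossary labels.
bigrams = {}
for label in GLOSSARY_LABELS:
    tokens = label.split()
    for pair in zip(tokens, tokens[1:]):
        bigrams[pair] = bigrams.get(pair, 0) + 1

def best_sequence(candidate_sets):
    """Pick the word combination whose adjacent pairs are most glossary-like."""
    def score(seq):
        return sum(bigrams.get(pair, 0) for pair in zip(seq, seq[1:]))
    return max(product(*candidate_sets), key=score)

candidates = [{"account", "accrued"}, {"number", "name"}]
print(" ".join(best_sequence(candidates)))  # "account number"
```

Higher-order n-grams would simply extend the window beyond adjacent pairs; the principle of preferring word sequences attested in the glossary is the same.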
Some embodiments provide a method for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The method comprises using at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining comprising: identifying a set of abbreviations in the name of the particular field; determining, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of similarity scores thereby obtaining sets of candidate words and corresponding sets of similarity scores, the determining comprising: determining a measure of similarity between the particular abbreviation and each of a plurality of words in a glossary to obtain a plurality of similarity scores for the plurality of words, the measure of similarity between an abbreviation and a word being based on characters in the abbreviation, characters in the word, order of the characters in the abbreviation, and order of the characters in the word; and selecting, using the plurality of similarity scores, the set of candidate words from the plurality of words in the glossary to obtain the set of candidate words for the particular abbreviation and the corresponding set of similarity scores; determining, using the sets of candidate words and the corresponding sets of similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
Some embodiments provide a system for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The system comprises: at least one computer hardware processor; and at least one non-transitory computer-readable medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining comprising: identifying a set of abbreviations in the name of the particular field; determining, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of similarity scores thereby obtaining sets of candidate words and corresponding sets of similarity scores, the determining comprising: determining a measure of similarity between the particular abbreviation and each of a plurality of words in a glossary to obtain a plurality of similarity scores for the plurality of words, the measure of similarity between an abbreviation and a word being based on characters in the abbreviation, characters in the word, order of the characters in the abbreviation, and order of the characters in the word; and selecting, using the plurality of similarity scores, the set of candidate words from the plurality of words in the glossary to obtain the set of candidate words for the particular abbreviation and the corresponding set of similarity scores; determining, using the sets of candidate words and the corresponding sets of similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
Some embodiments provide a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The method comprises: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining comprising: identifying a set of abbreviations in the name of the particular field; determining, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of similarity scores thereby obtaining sets of candidate words and corresponding sets of similarity scores, the determining comprising: determining a measure of similarity between the particular abbreviation and each of a plurality of words in a glossary to obtain a plurality of similarity scores for the plurality of words, the measure of similarity between an abbreviation and a word being based on characters in the abbreviation, characters in the word, order of the characters in the abbreviation, and order of the characters in the word; and selecting, using the plurality of similarity scores, the set of candidate words from the plurality of words in the glossary to obtain the set of candidate words for the particular abbreviation and the corresponding set of similarity scores; determining, using the sets of candidate words and the corresponding sets of similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
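A measure of similarity based on the characters of the abbreviation and the word, and on their order, can be sketched as an in-order subsequence match. This is one plausible choice consistent with the description above, not necessarily the measure used by the embodiments:

```python
# A simple character-order similarity: an abbreviation scores highly against
# a word when its characters appear in the word in the same order. The score
# is the fraction of abbreviation characters matched in order.

def similarity(abbr, word):
    """Fraction of abbreviation characters matched, in order, in the word."""
    matched, pos = 0, 0
    for ch in abbr.lower():
        found = word.lower().find(ch, pos)
        if found != -1:
            matched += 1
            pos = found + 1
    return matched / max(len(abbr), 1)

def candidate_words(abbr, glossary_words, top_k=2):
    """Return the top-k glossary words with their similarity scores."""
    scored = [(w, similarity(abbr, w)) for w in glossary_words]
    scored.sort(key=lambda s: s[1], reverse=True)
    return scored[:top_k]

words = ["account", "accrued", "action", "name"]
print(candidate_words("acct", words))  # "account" ranks first with score 1.0
```

Selecting the top-scoring words per abbreviation yields exactly the sets of candidate words and corresponding similarity scores that the subsequent label-scoring step consumes.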
Some embodiments provide a method for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The method comprises using at least one computer hardware processor to perform: for each particular field in the set of the dataset's fields: determining, using a name of the particular field and a subset of data from the particular field, whether any field label from a field label glossary matches the particular field, the determining comprising: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of scores thereby obtaining a plurality of sets of candidate words and a corresponding plurality of sets of scores; determining, using the plurality of sets of candidate words and the corresponding plurality of sets of scores, whether any field label from the field label glossary matches the particular field; when it is determined that one or more field labels from the field label glossary match the particular field, assigning a field label of the one or more field labels to the particular field; when it is determined that no field label from the field label glossary matches the particular field: generating, using the plurality of sets of candidate words and the corresponding plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label that is not in the field label glossary; and assigning the new field label to the particular field.
Some embodiments provide a system for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The system comprises: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions. The instructions, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: for each particular field in the set of the dataset's fields: determining, using a name of the particular field and a subset of data from the particular field, whether any field label from a field label glossary matches the particular field, the determining comprising: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of scores thereby obtaining a plurality of sets of candidate words and a corresponding plurality of sets of scores; determining, using the plurality of sets of candidate words and the corresponding plurality of sets of scores, whether any field label from the field label glossary matches the particular field; when it is determined that one or more field labels from the field label glossary match the particular field, assigning a field label of the one or more field labels to the particular field; when it is determined that no field label from the field label glossary matches the particular field: generating, using the plurality of sets of candidate words and the corresponding plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label that is not in the field label glossary; and assigning the new field label to the particular field.
Some embodiments provide a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The method comprises: for each particular field in the set of the dataset's fields: determining, using a name of the particular field and a subset of data from the particular field, whether any field label from a field label glossary matches the particular field, the determining comprising: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of scores thereby obtaining a plurality of sets of candidate words and a corresponding plurality of sets of scores; determining, using the plurality of sets of candidate words and the corresponding plurality of sets of scores, whether any field label from the field label glossary matches the particular field; when it is determined that one or more field labels from the field label glossary match the particular field, assigning a field label of the one or more field labels to the particular field; when it is determined that no field label from the field label glossary matches the particular field: generating, using the plurality of sets of candidate words and the corresponding plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label that is not in the field label glossary; and assigning the new field label to the particular field.
Some embodiments provide a method for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The method comprises using at least one computer hardware processor to perform: for each particular field in the set of the dataset's fields: determining whether any field label from a field label glossary matches the particular field; when it is determined that one or more field labels from the field label glossary match the particular field, assigning a field label of the one or more field labels to the particular field; when it is determined that no field label from the field label glossary matches the particular field: accessing, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field, the generating comprising: accessing a language model indicating a plurality of word collections that appear in a set of text and, for each of the word collections, a relative position of each word in the word collection; generating, using the plurality of sets of candidate words, a plurality of candidate word collections for the particular field; determining, using the language model and the plurality of sets of scores corresponding to the plurality of sets of candidate words, a score for each of the plurality of candidate word collections; selecting a word collection from the plurality of candidate word collections; and generating the word sequence using the word collection selected from the plurality of candidate word collections; and generating, using the word sequence, a new field label that is not in the field label glossary; and assigning the new field label to the particular field.
Some embodiments provide a system for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The system comprises: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions. The instructions, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: for each particular field in the set of the dataset's fields: determining whether any field label from a field label glossary matches the particular field; when it is determined that one or more field labels from the field label glossary match the particular field, assigning a field label of the one or more field labels to the particular field; when it is determined that no field label from the field label glossary matches the particular field: accessing, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field, the generating comprising: accessing a language model indicating a plurality of word collections that appear in a set of text and, for each of the word collections, a relative position of each word in the word collection; generating, using the plurality of sets of candidate words, a plurality of candidate word collections for the particular field; determining, using the language model and the plurality of sets of scores corresponding to the plurality of sets of candidate words, a score for each of the plurality of candidate word collections; selecting a word collection from the plurality of candidate word collections; and generating the word sequence using the word collection selected from the plurality of candidate word collections; and generating, using the word sequence, a new field label that is not in the field label glossary; and assigning the new field label to the particular field.
Some embodiments provide a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset comprising data stored in fields to determine field labels for a set of the dataset's fields. The method comprises: for each particular field in the set of the dataset's fields: determining whether any field label from a field label glossary matches the particular field; when it is determined that one or more field labels from the field label glossary match the particular field, assigning a field label of the one or more field labels to the particular field; when it is determined that no field label from the field label glossary matches the particular field: accessing, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field, the generating comprising: accessing a language model indicating a plurality of word collections that appear in a set of text and, for each of the word collections, a relative position of each word in the word collection; generating, using the plurality of sets of candidate words, a plurality of candidate word collections for the particular field; determining, using the language model and the plurality of sets of scores corresponding to the plurality of sets of candidate words, a score for each of the plurality of candidate word collections; selecting a word collection from the plurality of candidate word collections; and generating the word sequence using the word collection selected from the plurality of candidate word collections; and generating, using the word sequence, a new field label that is not in 
the field label glossary; and assigning the new field label to the particular field.
The foregoing summary is non-limiting.
Various aspects and embodiments will be described with reference to the following figures. It should be appreciated that the figures are not necessarily drawn to scale. Items appearing in multiple figures are indicated by the same or a similar reference number in all the figures in which they appear.
The inventors have developed techniques for automatically processing a dataset to assign field labels to respective fields in the dataset. A field label assigned to a dataset field represents metadata about the field. In turn, metadata about the field may be used to identify further processing to be performed on data stored in the field. In this way, field labels provide a way for a data processing system to use metadata to automatically identify processing to be performed on one or more fields of a dataset and to trigger performance of such processing. Processing on a dataset field or fields that is identified and/or triggered based on metadata associated with the field(s) may be termed “metadata-driven” processing or logic. This disclosure describes a variety of ways in which field labels may be assigned to dataset fields. One way is that an existing label (e.g., a label in an existing set of field labels) may be assigned to a given field. Another way is that a new field label is generated and assigned to a given dataset field instead of an existing label being assigned.
For example, in some cases, a data processing system may match a dataset field to one of an existing set of field labels (which may be referred to as a “field label glossary”). In some embodiments, the system may identify, from the field label glossary, candidate field label(s) for a field and corresponding score(s) by: (1) analyzing the name of the field using natural language processing (NLP); (2) analyzing at least some of the data stored in the field; and (3) merging the results of the field name analysis and the field data analysis to identify the candidate field label(s) and corresponding scores. In turn, the field label may be selected from among the candidate field label(s), for example, based on the scores. In other embodiments, the system may identify, from the field label glossary, candidate field label(s) and corresponding score(s) for a field by analyzing the name of the field using NLP without analyzing any of the data stored in the field. In yet other embodiments, the system may identify, from the field label glossary, candidate field label(s) and corresponding score(s) for a field by analyzing the data stored in the field without analyzing the name of the field. The system may select one of the identified candidate field label(s) as the label assigned to the field.
As another example, in some cases, a data processing system may generate a new field label for a dataset field instead of assigning an existing field label to that field. This may occur in a variety of situations, for example, when none of the existing field labels in a field label glossary provide a sufficiently good match to the field or when no field label glossary exists in the first place. In such situations, the data processing system may generate a new field label for assignment to the field by: (1) generating, using the field's name, a word sequence describing data stored in the field; and (2) generating the new field label from the word sequence. The system can then assign the newly generated field label to the field.
A field label indicates metadata about dataset field(s) to which the field label is assigned. A field label may include a text string that describes data stored in the field (e.g., “telephone number”, “email address”, “first name”, “last name”, etc.). In some embodiments, a field label may have one or more attributes which take on value(s) indicating information about a dataset field to which the field label is assigned. Attributes may include, for example, a user responsible for the data (e.g., a steward), whether the data is PII, a format of the data, data standard(s) applicable to the data, a definition of the data, a set of possible values that a field value may take on, relationship(s) with other field label(s), and/or other attributes. Accordingly, a field label's text string and attribute value(s) are examples of metadata that can be used to identify, invoke, and perform metadata-driven processing on the dataset field.
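A field label with such attributes can be modeled as a simple record type. The sketch below is a minimal illustration; all names (`FieldLabel`, `steward`, `is_pii`, etc.) are assumptions for exposition, not identifiers from this disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldLabel:
    """Hypothetical field label: a descriptive string plus metadata attributes."""
    text: str                           # descriptive string, e.g. "telephone number"
    steward: Optional[str] = None       # user responsible for the data
    is_pii: bool = False                # whether labeled data is PII
    data_format: Optional[str] = None   # expected format of the data
    standards: List[str] = field(default_factory=list)       # applicable data standards
    definition: Optional[str] = None    # natural language definition of the data
    allowed_values: List[str] = field(default_factory=list)  # possible field values
    related_labels: List[str] = field(default_factory=list)  # relationships to other labels


label = FieldLabel(
    text="telephone number",
    is_pii=True,
    data_format=r"^\+?\d{10,15}$",
    standards=["phone-number-standard"],
)
```

Any system component that receives a field assigned this label can consult the attribute values (e.g., `is_pii`) to decide which metadata-driven processing to trigger.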
A data processing system may manage data for an organization such as a multinational corporation (e.g., a logistics company, a financial institution, a utility company, an automotive company, an e-commerce company, etc.) and/or any other organization or entity. The organization may have vast amounts of data (e.g., hundreds or thousands of terabytes of data) managed by the data processing system. The data may include thousands or millions of datasets, which may store data in multiple records that include data values. For example, a dataset may store multiple records in a table. Each record may have multiple fields each storing one or more respective values. The table may store multiple records in its rows such that field values of a record are stored in different columns (or vice versa, such that the table stores multiple records in columns and field values of a record are stored in different rows). However, it should be appreciated that a dataset is not limited to storing multiple records using a tabular structure and may store the multiple records in any other suitable way (e.g., using attribute-value pairs, mark-up language such as XML, JSON, using an object-oriented approach, etc.), as aspects of the technology described herein are not limited in this respect. A data processing system may manage thousands, millions, or more of such datasets.
Datasets processed by a data processing system may be very large (e.g., with hundreds of fields) and may be obtained from many different data sources, some of which may be legacy systems. As a result, data in fields of different datasets may have different formats, naming conventions, languages, and/or may change over time relative to data from other data sources. A data processing system needs an efficient way to recognize data in a field such that the system can apply the appropriate processing to data from the field. For example, the data processing system may be configured to execute several software application programs that perform operations using data from fields of datasets. Execution of the software application programs and their logic may be dictated at least in part by the type of information stored in the fields. For example, execution of certain software application programs may need to be triggered based on recognizing information stored in a field (e.g., execution of an anonymizing software application program may be triggered based on recognizing that data stored in a given field is personally identifiable information (PII) or a data reformatting software application may be triggered based on recognizing that data stored in a field is a phone number). As another example, a software application program may need to be reconfigured based on recognizing information stored in a field (e.g., to prevent the software application program from failing to execute or from malfunctioning). As another example, a software application program may have functionality that is triggered based on recognizing that data from one field is related to data stored in another field. Field labels provide an efficient way for a data processing system (and software application programs thereof) to recognize data in fields of datasets obtained from different sources.
When a dataset field is assigned a field label, the data processing system may use metadata about the dataset field indicated by the field label for various applications. For example, the data processing system may use the metadata to identify relationships between the dataset field and other dataset fields. As another example, the data processing system may use metadata about a dataset field to automatically generate lineage information about the dataset field that indicates how the dataset field was obtained, how the dataset field may change over time, and/or how the dataset field may be used by one or more processes over time. Lineage information for a dataset field may include upstream lineage information indicating how the dataset field was obtained (e.g., by identifying data source(s) and/or data processing operation(s) that have been applied to the dataset field). Lineage information for a dataset field may additionally or alternatively indicate downstream lineage information indicating one or more other dataset fields and/or processes that depend on and/or use the dataset field. As another example, a field label may be associated with a data standard. The data processing system may apply the data standard associated with a field label to all fields to which the field label is assigned. The data standard may indicate data quality requirements that must be met by dataset fields to which the data standard is applied in order to comply with the data standard. When a dataset field fails to meet the data quality requirements, the dataset field may, for example, be updated or flagged for further review. Accordingly, field label assignments may be used by the data processing system to ensure consistent data quality across all the data managed by the data processing system.
Metadata about a dataset field indicated by a field label assigned to the dataset field may further be used by software application(s) in processing data from the dataset field. For example, a software application may need to recognize that data stored in a dataset field includes personally identifiable information (PII) (e.g., social security numbers, bank account numbers, government ID numbers, and/or other PII) to trigger functionality that is appropriate for such PII data (e.g., anonymizing data stored in the dataset field, restricting access to such data, de-identifying such data, masking such data, etc.). As another example, a software application may need to recognize a category and/or format of data (e.g., a phone number, social security number, name, address, and/or other categories of data) stored in a dataset field to determine how the data is and/or should be formatted (e.g., to meet a data standard). Field labels may further be used to control the functioning of a computer along a path leading to the desired data processing (e.g., masking of PII in data) and/or to perform such data processing in a computationally efficient manner. Controlling the computer along the path may involve identifying processing (e.g., anonymizing data from a field, restricting access to such data, de-identifying data, and/or other processing) to be performed and triggering the processing.
Recognizing data from fields of datasets becomes particularly important when the data includes PII which may need to be masked for hundreds, thousands, or even millions of data records (e.g., of financial transactions) every day. A human would be unable to manually perform annotation or masking of PII of these records. Because the names of fields in a dataset may not be descriptive of the information stored in the fields, it may not be apparent to a human user that data records include PII. Also, the location of fields within a dataset may change over time, making it impossible for a human to follow the fields containing PII. Even if a human could discern that PII is present within a particular field, the determination would require permitting that person access to unmasked data records. Such access to the records to evaluate the records for PII would compromise data security and data privacy.
Aspects of this disclosure enable reliable and computationally efficient identification of PII and its masking. In particular, rapid and computationally efficient labeling of fields to discover the meaning of the data and process the data by masking PII is enabled. Field labels indicate the meaning of data in fields of data sets. For example, field labels may specify whether a detected date in a field represents a date of birth, a date of expiration of a driver's license, or some other particular kind of date. If the type of date is PII, the system can automatically trigger masking functionality that masks the data. An example further illustrates the power of this functionality, as follows: datasets with a date of birth field are received by a data processing system. This date of birth field has a meaningless alphanumeric field name of “rwr8342.” Such alphanumeric field names often exist in raw data. Aspects of this disclosure enable an automatic association by a processing system of the field name of “rwr8342” with a semantic meaning of “date of birth.” Moreover, the field name may change from “rwr8342” to “sfs3432”, and the system may start to receive data with this new field name. Even though the new field “sfs3432” is a date of birth field, conventional systems would not be able to determine this, much less at scale. As such, data in the “sfs3432” field goes unmasked. However, a system using aspects of this disclosure can easily and efficiently recognize a semantic meaning of date of birth for the “sfs3432” field and assign a field label indicating that semantic meaning. Once the “sfs3432” field has been associated with the field label, a data processing system may automatically apply masking rules that are applicable to date of birth fields.
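The date-of-birth scenario above can be sketched as follows. The rule table, handler, and label assignments below are hypothetical illustrations of how masking might be keyed to a label's semantic meaning rather than to a physical field name.

```python
# Illustrative sketch: masking rules are keyed by a label's semantic meaning,
# not by the physical field name. All names (MASKING_RULES, apply_masking, the
# example labels) are hypothetical, not identifiers from this disclosure.
MASKING_RULES = {
    "date of birth": lambda v: "****-**-**",  # mask date-of-birth values entirely
}


def apply_masking(field_label, values):
    """Apply the masking rule for a label, if any; otherwise pass values through."""
    rule = MASKING_RULES.get(field_label)
    return [rule(v) for v in values] if rule else values


# The physical field name may change ("rwr8342" -> "sfs3432"); as long as the
# labeling process assigns both the "date of birth" label, masking still triggers.
labels = {"rwr8342": "date of birth", "sfs3432": "date of birth"}
masked = apply_masking(labels["sfs3432"], ["1990-04-12", "1985-11-30"])
# masked == ["****-**-**", "****-**-**"]
```

The point of the sketch is that no rule references “rwr8342” or “sfs3432” directly; a new physical field name is handled automatically once a label is assigned to it.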
Improved field label accuracy leads to overall improved performance of a data processing system and applications that utilize the field labels. This is particularly the case for the metadata-driven processing described herein.
Techniques described herein may also be used by a data processing system to better ascertain metadata stored about data managed by the data processing system. By more accurately assigning field labels to dataset fields, the techniques improve the accuracy of metadata about the dataset fields stored by the data processing system. By allowing the data processing system to better ascertain metadata about dataset fields, the techniques further improve the data processing system's ability to generate data lineage information for dataset fields. For example, when the same field label is assigned to two dataset fields, the data processing system may recognize that the two dataset fields are related (e.g., one dataset field may be derived from the other dataset field, or one dataset field may depend on the other dataset field). The data processing system may generate data lineage information indicating the relationship between the dataset fields. Thus, more accurate metadata about dataset fields provided by improved field label assignments may allow the data processing system to capture data lineage more accurately and completely. The improved lineage information may improve processes that utilize the lineage information such as: (1) identifying, tracing, and resolving errors in data processing; (2) identifying and resolving data security risks (e.g., risk of granting improper access to data); and (3) determining how changes in a dataset field affect downstream data and operations. A data lineage may represent relationships among physical datasets accessible/used by a software application of a data processing system, and may be generated by analyzing source code of a computer program and analyzing information obtained during runtime of the computer program. The generation of a data lineage involves identifying physical datasets accessed/used by a computer program and transformations applied to inputs of the computer program.
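The shared-label lineage inference described above might be sketched as follows; the function name and the pair-set representation of candidate lineage edges are assumptions for illustration.

```python
from itertools import combinations


def candidate_lineage_edges(assignments):
    """Illustrative sketch: fields that share a field label may be related
    (e.g., one derived from the other), so shared labels can seed candidate
    lineage edges for further analysis.

    assignments maps (dataset, field) -> label; returns unordered pairs of
    fields that share a label.
    """
    by_label = {}
    for f, label in assignments.items():
        by_label.setdefault(label, []).append(f)
    edges = set()
    for fields in by_label.values():
        edges.update(frozenset(p) for p in combinations(sorted(fields), 2))
    return edges


edges = candidate_lineage_edges({
    ("ds1", "cust_tel"): "telephone number",
    ("ds2", "phone"): "telephone number",
    ("ds1", "email"): "email address",
})
# one candidate edge, between the two telephone-number fields
```

A real system would treat such edges only as candidates to be confirmed by source-code and runtime analysis, as the passage above notes.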
This emphasizes the relationship and effect that the physical datasets and the generated data lineage have with respect to the computer program and the data processing performed by the data processing system (inputs, outputs, and transformations used by the computer program).
Furthermore, the techniques improve the data processing system's ability to govern data quality in data managed by the data processing system. More accurate field label assignments to dataset fields result in more accurate application of data standards to the dataset fields (as the data standards may be mapped to the dataset fields through their assigned field labels). Thus, the techniques allow the data processing system to apply data standards more accurately across dataset fields.
The improved field label accuracy leads to improved performance of software applications that utilize the field labels. For example, more accurate field labels lead to improved security of data by triggering data protection functionality (e.g., masking of PII) in a software application. As another example, more accurate field labels may better inform the format of data stored in the dataset fields, which may allow software applications to process data from the dataset fields more efficiently (e.g., by configuring processing according to the format). To illustrate, some data processing applications cannot be initiated if the input data is not in a suitable format. The improved field label accuracy may provide field labels that better indicate whether input data is not in a suitable format, thus avoiding use of computing resources and time to execute a data processing application using the input data. The field labels may mitigate a failure in the execution of a software application or a malfunction in its execution. The data from the field may be reformatted or the application may be reconfigured based on the field labels to improve the efficiency of the data processing system. As another example, more accurate field labels may provide better descriptions of data in a software application development environment to make development easier and more efficient.
Although example embodiments described herein use dataset profiles, some embodiments may use techniques described herein without using dataset profiles. In such embodiments, the techniques may access field names and field values from a dataset instead of accessing them from a dataset profile. For example, the field labeling system 110 may assign labels to fields of the datasets 112A, 112B, 112C by accessing field names and field values directly from the datasets 112A, 112B, 112C (e.g., by accessing the datasets in datastores 109A, 109B, 109C or in storage of the data processing system 100) as opposed to from dataset profiles. Accordingly, field data analysis and field name analysis techniques described herein may be performed without using dataset profiles, in some embodiments.
The data processing system 100 may ascertain information about data (information about data may be referred to as “metadata”) stored in one or more fields of datasets 112A, 112B, 112C to perform various functions and/or for certain applications. The data processing system 100 may be configured to drive processing based on metadata about field(s) of the datasets 112A, 112B, 112C. Metadata-driven processing may be identified and/or triggered based on metadata associated with the field(s). Metadata-driven processing may also use metadata about the field(s) to perform various functions. Accordingly, the data processing system 100 needs an efficient way to ascertain metadata about data stored in field(s) of the datasets 112A, 112B, 112C.
The data processing system 100 may need to apply the metadata-driven processes 180 to various fields of the datasets 112A, 112B, 112C (e.g., to ensure that PII is protected, data quality requirement(s) are met, and/or to update datasets). However, given that the data processing system 100 is frequently processing data, the data processing system 100 may be unaware that new datasets have been introduced and that metadata-driven processes may be applicable to fields of those datasets.
To allow the data processing system 100 to ascertain metadata-driven processes applicable to data from a dataset field, the dataset field may be assigned a field label (e.g., one of a pre-defined set of field labels in a field label glossary assigned by data processing system 100, or a field label generated by the data processing system 1100 as described herein).
The field labeling system 110 has assigned the field labels 126D, 126E, 126F to fields of the datasets 112D, 112E, 112F using the dataset profiles 116D, 116E, 116F.
Furthermore, assigning a field label to a particular dataset field maps metadata associated with the field label to the particular dataset field. For example, each field label may be associated with a data entity definition specifying a set of attributes for capturing metadata about a dataset field. In a data entity instance generated from the data entity definition, each of the attributes may take on a value such as a number, a string, or a reference to another data entity instance. Assigning a field label to a dataset field may associate the data entity definition with the dataset field. As another example, each field label may be associated with a data standard. A data standard associated with a given field label is applied to all dataset fields that are assigned the field label. Thus, a field label allows a data processing system to apply a data standard to multiple dataset fields that need to be governed by the data standard without requiring the data standard to be mapped directly to each of the multiple dataset fields, which would be computationally expensive to do given the large number of dataset fields. Moreover, a data standard associated with a field label may automatically be associated with new or updated dataset fields that are assigned the field label. The field label thus allows mappings between the data standard and dataset fields to dynamically update in response to addition or modification of data (e.g., that causes a new dataset field to be assigned the label).
As an illustrative example, a data standard for phone numbers may be applicable to any dataset field storing phone numbers. The data standard may, for example, have data quality requirements such as a format in which the phone numbers are stored, a length of the phone numbers, an indication of whether a value of the field needs to be populated, and/or other data quality requirements. The data standard may be associated with a field label such as one named “telephone number.” Techniques described herein may be used to assign the field label to one or more fields. By assigning the field label to a given field, the data standard for phone numbers may automatically be applied to the field. Thus, the field may be required to meet the data standard (e.g., by meeting data quality requirements specified by the data standard). Additional fields may subsequently be assigned the field label (e.g., as a result of labeling performed after addition of datasets and/or fields to a dataset). Each field that is assigned the field label may automatically be associated with the data standard for telephone numbers without requiring analysis of data in the field outside of assignment of a field label.
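The phone-number data standard above can be sketched as a validation routine applied to any field assigned the “telephone number” label. The specific pattern and requirement names below are illustrative assumptions, not requirements stated in the disclosure.

```python
import re

# Hypothetical data standard for phone numbers, applied to every field that is
# assigned the "telephone number" field label. Requirement names are invented.
PHONE_STANDARD = {
    "pattern": re.compile(r"^\d{3}-\d{3}-\d{4}$"),  # required storage format
    "required": True,                                # value must be populated
}


def check_standard(values, standard):
    """Return indices of field values that fail the standard's quality requirements."""
    failures = []
    for i, v in enumerate(values):
        if v is None or v == "":
            if standard["required"]:
                failures.append(i)
        elif not standard["pattern"].match(v):
            failures.append(i)
    return failures


bad = check_standard(["617-555-0100", "5550100", None], PHONE_STANDARD)
# bad == [1, 2]: wrong format at index 1, missing value at index 2
```

Because the standard is attached to the label rather than to individual fields, every field that is later assigned the “telephone number” label is checked by the same routine without any per-field configuration.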
Given the large number (e.g., millions) of dataset fields in data managed by a data processing system that need to be assigned a field label and the frequency of updates to the datasets (e.g., by addition of fields and/or modification of data within the dataset fields), it is impractical or impossible for the dataset fields to be manually assigned a field label from a field label glossary. Thus, the dataset fields need to be assigned a field label from the field label glossary automatically.
In some embodiments, a field label glossary may comprise a set of strings or information used to identify a string (e.g., a reference to a string). Each string in the set of strings may be a field label. For example, each string may be a field label describing data that is assigned the field label. In some embodiments, each field label in the field label glossary may indicate metadata beyond a descriptive string. Examples of metadata beyond a descriptive string include data standard(s) applicable to data, a data steward, a data owner, a location where data is stored, a functional area associated with the data, a data domain, a PII classification, a security access level of the data, a geographic location the data is associated with, and/or other metadata.
Conventional techniques for automatically assigning field labels to dataset fields involve analyzing data stored in the dataset fields to identify field labels. A subset of data from a given field may be analyzed to identify candidate field label(s) for the dataset field in a field label glossary. The subset of data may be used to determine the score(s) for the candidate field label(s), where each of the score(s) indicates how well the subset of data matches a particular field label. A field label may be assigned to the dataset field based on the scores (e.g., by selecting the highest (or one of the top) scoring candidate field labels or presenting a set of the highest scoring labels to a user for selection as a label).
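This conventional, data-only scoring can be sketched as follows, with simplified regular-expression tests standing in for whatever per-label tests a real glossary would define; the label names and patterns are illustrative assumptions.

```python
import re

# Sketch of the conventional approach: score each glossary label by the
# fraction of sampled field values that pass that label's test. The tests
# here are simplified stand-ins, not an actual glossary's tests.
LABEL_TESTS = {
    "telephone number": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "postal code": re.compile(r"^\d{5}$"),
}


def score_labels(sample_values):
    """Return, per label, the fraction of sampled values matching its test."""
    scores = {}
    for label, pattern in LABEL_TESTS.items():
        hits = sum(1 for v in sample_values if pattern.match(v))
        scores[label] = hits / len(sample_values)
    return scores


scores = score_labels(["617-555-0100", "212-555-0199", "02139"])
best = max(scores, key=scores.get)  # highest-scoring candidate may be assigned
# best == "telephone number" with score 2/3
```

As the passage notes, a system may assign the top-scoring label automatically or present the highest-scoring candidates to a user for selection.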
Although the name of a dataset field should describe the field and be useful to facilitate automated labeling, incorporating field names into the labeling process is difficult. There are various reasons that make field names difficult to accurately map to field labels in a field label glossary. One reason is that field names often include multiple abbreviations each of which may indicate multiple possible words, each of which may map to a different field label in the field label glossary. For example, the abbreviation “CTE” in the field name “CTE_NUM” may indicate the word “client”, “cite”, “customer”, or another word. Another reason is that a field name may use language that is idiosyncratic to an organization (e.g., based on a vocabulary internal to the organization). Thus, field labels of a field label glossary may not accurately map to a particular field name. For example, the field name “CTE_NUM” may refer to a customer's social security number for one organization and a customer's telephone number for another organization. As another example, “CTE_LOC” may refer to a customer's address for one organization and refer to a customer's zip code for another organization. Yet another reason is that field names may not consistently adhere to a particular format. Different field names may use different characters (e.g., a space, tab, underscore, another character, or no character) to separate words or abbreviations in the field names. For example, one field name (e.g., “TEL_NUM”) may use an underscore to separate portions of the field name while another field name (e.g., “CREDITCARDNUM”) does not use any character to separate portions of the field name.
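The separator inconsistency described above can be illustrated with a tokenizer sketch. The dictionary-based greedy split for unseparated names is a simplification assumed for exposition; the token dictionary below is invented.

```python
import re

# Illustrative token dictionary; a real system would derive this from its
# abbreviation data, not hard-code it.
KNOWN_TOKENS = {"TEL", "NUM", "CREDIT", "CARD", "CTE", "LOC"}


def tokenize(field_name):
    """Split a field name into abbreviation tokens, handling both separated
    names ("TEL_NUM") and unseparated names ("CREDITCARDNUM")."""
    name = field_name.upper()
    if re.search(r"[_\s\-]", name):
        # explicit separators: underscore, whitespace, or hyphen
        return [t for t in re.split(r"[_\s\-]+", name) if t]
    # no separators: greedily match known abbreviations left to right
    tokens, i = [], 0
    while i < len(name):
        for j in range(len(name), i, -1):
            if name[i:j] in KNOWN_TOKENS:
                tokens.append(name[i:j])
                i = j
                break
        else:
            tokens.append(name[i:])  # unknown remainder kept as one token
            break
    return tokens


# tokenize("TEL_NUM") == ["TEL", "NUM"]
# tokenize("CREDITCARDNUM") == ["CREDIT", "CARD", "NUM"]
```

Each recovered token (e.g., “CTE”) may still be ambiguous, which is the separate abbreviation-resolution problem the passage describes.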
Incorporating field names into the labeling process is also difficult because of the large number (e.g., thousands or millions) of datasets managed by the data processing system. Different datasets may use different naming conventions for field names. Thus, a labeling process that utilizes the field name needs to handle the different naming conventions in order to assign field labels from a field label glossary to the dataset fields. Further, field names in different datasets may also be in different languages (e.g., because an organization may store some or all of its data in multiple languages for different geographic regions) relative to field labels of a field label glossary. A labeling process needs to be robust enough to handle different language grammars and conventions to identify, from a field label glossary, a field label for a dataset field using its field name.
For these reasons, it is difficult to map a field name to a field label in a field label glossary. Thus, conventional techniques of assigning a field label to a dataset field do not consider the field name, as doing so would result in poor accuracy of field labels assigned to dataset fields. This in turn would lead to downstream issues such as inaccurate metadata indicated about dataset fields and improper functioning of software applications that rely on the metadata. Accordingly, conventional techniques analyze data stored in a dataset field to assign a field label to the field without considering its field name.
Accordingly, the inventors have developed field labeling techniques that perform field name analysis to assign labels, from a field label glossary, to dataset fields. The field name analysis uses natural language processing (NLP) to map a given field name to one or more candidate field labels in a field label glossary (e.g., in which each field label indicates respective metadata about a dataset field that is assigned the field label). The NLP involves deriving a word sequence from a field name (e.g., by dividing the field name into component abbreviations and resolving the abbreviations) and using the word sequence to determine candidate field label(s) and corresponding score(s) for the dataset field. Optionally, the candidate field label(s) and score(s) obtained from performing the field name analysis may be merged with candidate field label(s) and score(s) obtained from analyzing data from the dataset fields. The merged candidate field label(s) and score(s) may be used to assign a field label to the dataset field. By incorporating field name analysis, techniques described herein improve over conventional techniques of mapping dataset fields to field labels in a field label glossary.
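The name-analysis pipeline just described can be sketched end to end: expand abbreviations into a word sequence, then score glossary labels against that sequence. The expansion table and the Jaccard-overlap scoring below are illustrative assumptions, not the disclosure's actual NLP models.

```python
# Hypothetical abbreviation expansions and glossary; both are invented for
# illustration.
ABBREVIATIONS = {"TEL": "telephone", "NUM": "number", "ADDR": "address"}
GLOSSARY = ["telephone number", "email address", "first name"]


def word_sequence(field_name):
    """Derive a word sequence from a field name by expanding abbreviations."""
    tokens = field_name.lower().split("_")
    return " ".join(ABBREVIATIONS.get(t.upper(), t) for t in tokens)


def candidate_labels(field_name):
    """Score glossary labels against the derived word sequence (Jaccard overlap,
    an assumed stand-in for the real matching), highest score first."""
    seq_words = set(word_sequence(field_name).split())
    scored = []
    for label in GLOSSARY:
        label_words = set(label.split())
        overlap = len(seq_words & label_words) / len(seq_words | label_words)
        if overlap > 0:
            scored.append((label, overlap))
    return sorted(scored, key=lambda p: p[1], reverse=True)


# candidate_labels("TEL_NUM") ranks "telephone number" first
```

These name-derived scores are exactly what would be merged with data-analysis scores in the optional merging step described above.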
The techniques for assigning field labels, from a field label glossary, to dataset fields described herein may be implemented in any of numerous ways, as the techniques are not limited to any particular manner of implementation. Examples of details of implementation are provided herein solely for illustrative purposes. Furthermore, the techniques of assigning field labels to dataset fields disclosed herein may be used individually or in any suitable combination, as aspects of the technology described herein are not limited to the use of any particular technique or combination of techniques.
Some embodiments provide a system for processing a dataset comprising data stored in fields to identify, from a field label glossary (e.g., field label glossary 159 described herein with reference to
In some embodiments, the system may be configured to determine, using the name of the particular field and the NLP, the first set of candidate field labels for the particular field and the field name analysis scores for the first set of candidate field labels by: (1) identifying a set of abbreviations in the name of the particular field (e.g., abbreviations 404A, 404B, 404C in field name 402 of
In some embodiments, the system may be configured to determine, using the subset of data from the particular field and the tests associated with respective field labels in the field label glossary, the second set of candidate field labels and the field data analysis scores for the second set of candidate field labels by: (1) applying the tests associated with the respective field labels (e.g., field label 1 test and field label 2 test of
In some embodiments, the system may be configured to determine the merged candidate field labels and the corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores by: (1) identifying a first field label associated with a first one of the field name analysis scores and a first one of the field data analysis scores, the first field label being in the first set of candidate field labels and the second set of candidate field labels; and (2) determining a first merged score for the first field label by adjusting the first field name analysis score using the first field data analysis score to obtain the first merged score (e.g., by adjusting the first field name analysis score based on a bias value determined from a ratio between the first field name analysis score and the first field data analysis score).
In some embodiments, the system may be configured to determine the merged candidate field labels and the corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores by: (1) identifying a first field label from the first set of candidate field labels associated with a first one of the field name analysis scores; (2) determining that none of the subset of data passes a test associated with the first field label; and (3) determining a first merged score for the first field label by reducing the first field name analysis score.
Some embodiments provide a system for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields. The system may be configured to, for each particular field in the set of fields, determine, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels (e.g., candidate field labels and scores 416 of
In some embodiments, the system may be configured to generate, using the sets of candidate words identified for the abbreviations and the n-gram model indicating the plurality of word collections that appear within the field labels of the field label glossary, the at least one word sequence describing data stored in the particular field by: (1) combining words from the sets of candidate words to obtain a plurality of word sequences; and (2) filtering, using the n-gram model, the plurality of word sequences to obtain the at least one word sequence (e.g., by determining, using the n-gram model, for each of the plurality of word sequences, a likelihood that words of the word sequence are collocated and filtering, using likelihoods determined for the plurality of word sequences, the plurality of word sequences to obtain the at least one word sequence).
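By way of illustration, the combining and filtering described above may be sketched as follows. This is an illustrative Python sketch only; the bigram-based likelihood function and the example glossary bigrams are hypothetical stand-ins for an n-gram model built from a field label glossary, and the threshold value is an assumption.

```python
from itertools import product

def generate_sequences(candidate_word_sets, ngram_likelihood, threshold=0.6):
    """Combine one word from each candidate set into word sequences, then keep
    only sequences whose words the n-gram model indicates are likely collocated."""
    kept = []
    for words in product(*candidate_word_sets):
        seq = list(words)
        score = ngram_likelihood(seq)  # likelihood that the words are collocated
        if score >= threshold:
            kept.append((seq, score))
    return kept

# Hypothetical likelihood: the fraction of adjacent word pairs in the sequence
# that appear as bigrams within the field labels of an example glossary.
GLOSSARY_BIGRAMS = {("customer", "account"), ("account", "number"), ("customer", "name")}

def bigram_likelihood(seq):
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 1.0
    return sum(p in GLOSSARY_BIGRAMS for p in pairs) / len(pairs)

# Example: candidate word sets resolved from the abbreviations "cust", "acct", "num".
candidates = [{"customer", "custom"}, {"account"}, {"number", "numeral"}]
result = generate_sequences(candidates, bigram_likelihood)
```

Of the four possible combinations, only "customer account number" has every adjacent pair appearing in the example glossary bigrams, so only that sequence survives the filter.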
In some embodiments, the system may be configured to determine, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels by: (1) determining that one of the words in the at least one word sequence specifies a particular category of data; (2) determining a target position of the word in the at least one word sequence (e.g., based on a target language for a field label); and (3) determining the field name analysis scores for the candidate field labels based on the target position of the word in the at least one word sequence.
In some embodiments, the system may be configured to determine, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels by: (1) accessing a sequence position model (e.g., sequence position model 158B shown in
In some embodiments, the system may be configured to identify, for each particular abbreviation in the set of abbreviations, the set of candidate words indicated by the particular abbreviation by determining a similarity score between each candidate word in the set of candidate words and the particular abbreviation, thereby obtaining sets of similarity scores corresponding to the sets of candidate words (e.g., sets of candidate words and corresponding scores 406A, 406B, 406C in
Some embodiments provide a system for processing a dataset comprising data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset. The field labels describe data stored in the set of fields (e.g., by associating the fields with metadata indicated by the field labels). The system may be configured to determine, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels. The system may be configured to determine the candidate field labels and the field name analysis scores by identifying a set of abbreviations (e.g., abbreviations 404A, 404B, 404C of
In some embodiments, the measure of similarity comprises multiple component measures of similarity (e.g., cosine similarity, Jaro-Winkler similarity, Jaro-Winkler similarity modified to scale based on a shared suffix, a loss value based on positions of shared letters, and/or a combination thereof) and determining the measure of similarity between the particular abbreviation and each of the plurality of words in the glossary to obtain the plurality of similarity scores comprises: (1) determining, for the particular abbreviation and the word, the component measures of similarity to obtain values of the component measures of similarity; and (2) determining the measure of similarity between the particular abbreviation and the word using the values of the component measures of similarity (e.g., as a maximum of the component similarities).
In some embodiments, the system may be configured to determine the measure of similarity between the particular abbreviation and a first word of the plurality of words in the glossary to obtain a first one of the plurality of similarity scores by determining one of the multiple component measures of similarity based on a degree to which a prefix and/or a suffix of the first word matches a prefix and/or a suffix of the particular abbreviation.
In some embodiments, the system may be configured to determine the measure of similarity between the particular abbreviation and a first word of the plurality of words in the glossary to obtain a first one of the plurality of similarity scores by: (1) removing vowels from the first word to obtain a vowelless word; and (2) determining one of the multiple component measures of similarity using the vowelless word.
In some embodiments, the system may be configured to determine the measure of similarity between the particular abbreviation and a first word of the plurality of words in the glossary to obtain a first one of the plurality of similarity scores by: (1) stemming the first word to obtain a word stem; and (2) determining one of the multiple component measures of similarity using the word stem.
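The component measures of similarity in the preceding paragraphs may be combined as in the following sketch. This is illustrative Python only: the crude suffix-stripping stemmer, the use of a simple edit-similarity ratio in place of Jaro-Winkler similarity, and the prefix-based component are assumptions made for a self-contained example.

```python
import os
from difflib import SequenceMatcher

def _ratio(a, b):
    # Simple edit similarity in [0, 1]; a production system might instead use
    # Jaro-Winkler similarity, as noted above.
    return SequenceMatcher(None, a, b).ratio()

def strip_vowels(word):
    return "".join(c for c in word if c not in "aeiou")

def crude_stem(word):
    # Illustrative stemmer only: drops a few common English suffixes.
    for suffix in ("ification", "ation", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def similarity(abbrev, word):
    """Combine component similarity measures as a maximum, per the example above."""
    abbrev, word = abbrev.lower(), word.lower()
    components = [
        _ratio(abbrev, word),                # compare against the word itself
        _ratio(abbrev, strip_vowels(word)),  # compare against the vowelless word
        _ratio(abbrev, crude_stem(word)),    # compare against the word stem
    ]
    # Component rewarding a shared prefix, since abbreviations often keep leading letters.
    prefix_len = len(os.path.commonprefix([abbrev, word]))
    components.append(prefix_len / max(len(abbrev), len(word)))
    return max(components)
```

For example, `similarity("cust", "customer")` scores highly because the abbreviation closely matches the stem "custom", while `similarity("cust", "banana")` scores near zero under every component.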
In some embodiments, the system may be configured to select, using the plurality of similarity scores, the set of candidate words from the plurality of words in the glossary to obtain the set of candidate words for the particular abbreviation and the corresponding set of similarity scores by: (1) identifying a subset of the plurality of similarity scores that meet a threshold similarity score, the subset of similarity scores associated with a subset of the plurality of words; and (2) selecting the subset of the plurality of words as the set of candidate words for the particular abbreviation.
In some embodiments, the candidate field labels comprise words from the sets of candidate words and the system may be configured to determine, using the sets of candidate words and the corresponding sets of similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels by: (1) determining, using the sets of candidate words, at least one word sequence (e.g., candidate sequences 408A) describing data stored in the particular field; and (2) determining, using the sets of similarity scores and the at least one word sequence, the field name analysis scores for the candidate field labels (e.g., candidate field labels and scores 416).
In some embodiments, the system may be configured to determine, using the sets of similarity scores and the at least one word sequence, the field name analysis scores for the candidate field labels by performing, for each of the candidate field labels: (1) identifying words from the sets of candidate words included in the candidate field label; (2) obtaining similarity scores corresponding to the identified words; and (3) determining a field name analysis score for the candidate field label using the similarity scores corresponding to the identified words.
As shown in
The data recognition module 102 may be configured to process dataset fields to identify candidate field labels and scores 136 for the dataset fields from a field label glossary 104 (shown in
In some embodiments, the data processing system 100 may be configured to generate the field label glossary 104. For example, the data processing system 100 may generate the field label glossary 104 from a set of text (e.g., provided to the data processing system 100). The data processing system 100 may identify strings in the set of text and store the identified strings in a data structure (e.g., an array) as the field label glossary 104. In some embodiments, the field label glossary 104 may be loaded into the data processing system 100. For example, the field label glossary 104 may be loaded as a file into the memory of the data processing system 100. In some embodiments, the field label glossary 104 may be updated to add additional field labels that can be assigned to dataset fields. For example, an initial field label glossary may be stored in the data processing system 100. The initial field label glossary may be subsequently updated to obtain the field label glossary 104.
The field label glossary 104 may be stored in memory of the data processing system 100 (e.g., in datastore 158 as shown in
In some embodiments, the data recognition module 102 may be configured to determine candidate field labels and scores 136 for dataset fields. The data recognition module 102 may be configured to determine one or more candidate field labels for a particular dataset field by performing two separate analyses to obtain two sets of candidate field labels and corresponding scores: (1) a field name analysis in which the data recognition module 102 determines, using a name of the particular field, a first set of candidate field labels (“field name analysis candidate field labels”) and corresponding scores (“field name analysis scores”); and (2) a field data analysis in which the data recognition module 102 determines, using a subset of data stored in the particular field, a second set of candidate field labels (“field data analysis candidate field labels”) and corresponding scores (“field data analysis scores”). The data recognition module 102 may be configured to merge the two sets of candidate field labels and corresponding scores to obtain merged candidate field labels and corresponding scores. The data recognition module 102 may be configured to provide the merged candidate field labels and corresponding scores 136 to the field label assignment module 106 for the assignment of one of the merged candidate field labels to the particular field (e.g., automatically based on the scores and/or based on user input).
In some embodiments, the data recognition module 102 may be configured to determine a first set of candidate field labels for a dataset field using its field name by processing the field name using natural language processing (NLP). The data recognition module 102 may be configured to use NLP to: (1) determine a semantic meaning indicated by the field name; and (2) identify candidate field label(s) corresponding to the semantic meaning in field label glossary 104. Examples of NLP that may be performed by the data recognition module 102 are described herein.
Once candidate field label(s) are determined for a dataset field by the data recognition module 102, one of the candidate field label(s) needs to be assigned as the field label to the dataset field. The field label assignment module 106 may be configured to select field labels 126 to assign to dataset fields from candidate field labels determined for the dataset fields (by the data recognition module 102). The field label assignment module 106 may be configured to receive candidate field labels and corresponding scores 136 from the data recognition module 102. For example, the field label assignment module 106 may obtain merged candidate field labels and corresponding scores determined by the data recognition module 102.
In some embodiments, the field label assignment module 106 may be configured to automatically assign one of the candidate field labels associated with a dataset field based on scores associated with the candidate field labels. For example, when the field label assignment module 106 receives a single merged candidate field label for a dataset field with a corresponding score that meets a first threshold, the field label assignment module 106 may automatically assign the candidate field label to the dataset field. In some embodiments, the first threshold score may be any suitable value in the range of 0.8 to 1. For example, the first threshold score may be 0.9. In some embodiments, the first threshold score may be a configurable parameter. For example, the first threshold may be configured by the field label assignment module 106 based on user input received through a GUI.
In some embodiments, the field label assignment module 106 may be configured to request user input to assign one of the candidate field labels associated with a dataset field based on scores associated with the candidate field labels. For example, when the field label assignment module 106 receives a merged candidate field label for a dataset field with a corresponding score that meets a second threshold lower than the first threshold, the field label assignment module 106 may request user input confirming the field label. In some embodiments, the second threshold score may be any suitable value in the range 0.6-0.9. For example, the field label assignment module 106 may request user input to assign one of the candidate field labels to the dataset field when the greatest one of the scores associated with the candidate field labels meets a second threshold score of 0.75 but is less than a first threshold score of 0.9. In some embodiments, the second threshold score may be a configurable parameter. For example, the second threshold score may be configured by the field label assignment module 106 based on user input received through a GUI.
In some embodiments, the field label assignment module 106 may be configured to request user input when multiple candidate field labels have associated scores that meet a third threshold score. In some embodiments, the third threshold score may be any suitable value in the range 0.6-1.0. For example, the third threshold score may be 0.75. In this example, when multiple candidate field labels have scores of at least 0.75, the field label assignment module 106 may request user input selecting one of the multiple candidate field labels to assign to the particular field. In some embodiments, the third threshold score may be a configurable parameter. For example, the third threshold score may be configured by the field label assignment module 106 based on user input received through a GUI.
In some embodiments, the field label assignment module 106 may be configured to determine that there is no matching candidate field label when none of the associated scores meet a fourth threshold score. In some embodiments, the fourth threshold score may be a value in the range 0.1-0.2, 0.2-0.3, 0.3-0.4, 0.4-0.5, 0.5-0.6, 0.6-0.7, or 0.7-0.8. For example, the fourth threshold score may be 0.75. In this example, when none of the candidate field labels have a score of at least 0.75, the field label assignment module 106 may determine that none of the candidate field labels are to be assigned to the particular field.
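The threshold-based assignment logic of the preceding four paragraphs may be sketched as follows. This is an illustrative Python sketch; the function name, the return conventions, and the default threshold values (taken from the example values above, with the third and fourth thresholds both set to 0.75) are assumptions, and in practice each threshold would be a configurable parameter.

```python
def assign_field_label(candidates, t_auto=0.9, t_multi=0.75, t_reject=0.75):
    """Decide how to act on merged candidate field labels and their scores.

    candidates: dict mapping candidate field label -> merged score.
    Returns an (action, payload) pair describing the assignment decision.
    """
    if not candidates:
        return ("no_match", None)
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    if best_score < t_reject:
        return ("no_match", None)                # fourth threshold: no label qualifies
    contenders = [label for label, score in ranked if score >= t_multi]
    if len(contenders) > 1:
        return ("ask_user_select", contenders)   # third threshold: several plausible labels
    if best_score >= t_auto:
        return ("auto_assign", best_label)       # first threshold: assign automatically
    return ("ask_user_confirm", best_label)      # second threshold: confirm with user
```

For instance, a single candidate scoring 0.95 would be assigned automatically, a single candidate scoring 0.80 would trigger a confirmation request, and two candidates both scoring above 0.75 would trigger a selection request.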
As shown in
In some embodiments, the datastore 108 may be configured to store metadata about data stored in the datastores 109A, 109B, 109C. In some embodiments, the metadata may be stored in data entity instances which each store metadata about a particular dataset field. In some embodiments, the data entity instances for a dataset field may be defined based on the field label assigned to the dataset field. For example, the data entity instance for the dataset field may be instantiated from a data entity definition associated with the field label assigned to the dataset field. The data entity definition may define metadata to be stored about the dataset field in the instantiated data entity instance. In the example of
In some embodiments, assigned field labels may be used by the data processing system 100 for various processes. For example, metadata attribute values in data entity instances associated with the dataset fields F11, F12, F13 may be used by the data processing system 100 to identify relationships between the dataset fields F11, F12, F13, to generate lineage information about the dataset fields F11, F12, F13, to determine whether to anonymize PII in the dataset fields F11, F12, F13, to identify a format of data stored in the dataset fields F11, F12, F13, to govern access to the dataset fields F11, F12, F13, to determine a data standard(s) to apply to the dataset fields F11, F12, F13, and/or for other purposes. As another example, the data processing system 100 may use the metadata attribute values to identify which of dataset fields F11, F12, F13 in dataset 112A is the key or index field of the dataset 112A. As another example, the data processing system 100 may use the metadata attribute values to automatically identify different dataset fields from different datasets that store the same type of information. The system may identify the different dataset fields by determining that they are all assigned a common field label from field label glossary 104. The system may further generate a visual map associating the field label to the different fields and their corresponding source datasets (e.g., to illustrate that a software application is accessing the same type of information from multiple datasets). To illustrate, the system may generate a visual map showing that two different data entity instances (e.g., “member identifier” and “member identification code”) that each represents a respective set of dataset field(s) both indicate the same field label assignment (e.g., “member identifier”) for their respective sets of dataset field(s). This shows that the dataset field(s) associated with the two data entity instances store the same information (e.g., a member identifier).
In some embodiments, the datastore 108 may comprise any suitable storage hardware configured to store field label assignments. For example, the datastore 108 may comprise one or more hard drives (e.g., solid state drive(s) (SSD(s)), disk(s), and/or other hard drive(s)). In some embodiments, the datastore 108 may comprise local storage. In some embodiments, the datastore 108 may comprise distributed storage. The distributed storage may be connected through a network. For example, the datastore 108 may include a distributed database. In some embodiments, the datastore 108 may comprise virtual storage. In some embodiments, the datastore 108 may comprise a database managed by a database management system. For example, the database may be a relational database managed by a relational database management system (RDBMS).
As shown in
The field name analysis module 102A may be configured to process the field names 120 to determine, for each dataset field, a set of one or more candidate field labels and corresponding field name analysis score(s). In some embodiments, the field name analysis module 102A may be configured to determine a set of candidate field label(s) for a given field name using NLP. The field name analysis module 102A may be configured to determine one or more candidate word sequences using the field name. The field name analysis module 102A may be configured to determine the candidate word sequence(s) using the field name by: (1) segmenting the field name into portions (e.g., abbreviations); (2) identifying candidate sets of words represented by the portions; and (3) generating, using the candidate sets of words, the candidate word sequence(s). In some embodiments, the candidate word sequence(s) may indicate a recognized type of data in the dataset field. The candidate word sequence(s) may each provide a description of data stored in the dataset field.
Candidate word sequence(s) identified for a dataset field are then used to identify candidate field label(s) from the field label glossary 104 for the dataset field. The field name analysis module 102A may be configured to use a candidate word sequence generated for a dataset field to identify one or more candidate field labels for the dataset field in the field label glossary 104 by determining a degree to which each field label in the field label glossary 104 matches the word sequence. The field name analysis module 102A may be configured to select field label(s) that match the word sequence as the candidate field label(s). For example, the field name analysis module 102A may score each field label in the field label glossary 104 based on how closely it matches the word sequence and select field label(s) that meet a threshold score as the candidate field label(s).
The data recognition module 102 further performs the field data analysis using field values 122. In some embodiments, the field data analysis may be performed independently of the field name analysis. The field data analysis module 102B may be configured to process values from a given field to identify candidate field label(s) for the dataset field in the field label glossary 104. In some embodiments, the field data analysis module 102B may be configured to determine candidate field label(s) for a dataset field using values accessed from the dataset field by: (1) accessing tests associated with field labels of the field label glossary 104; (2) applying the tests associated with the field labels to the values accessed from the dataset field to obtain scores corresponding to the field labels; and (3) determining the candidate field label(s) based on the scores (e.g., by identifying field label(s) that meet a threshold score as the candidate field label(s) for the dataset field). In some embodiments, a test may quantify how well the values meet an expected pattern of data described by a field label corresponding to the test. Example tests associated with field labels are described herein with reference to
After obtaining the results of the field name analysis and the field data analysis, the data recognition module 102 merges the results of the two analysis paths. The score merging module 102C may be configured to merge a field name analysis score corresponding to a candidate field label with a field data analysis score corresponding to the same candidate field label. In some embodiments, the score merging module 102C may be configured to merge the field name analysis score with the field data analysis score by adjusting the field name analysis score using the field data analysis score. For example, the score merging module 102C may: (1) penalize a candidate field label obtained from the field name analysis that the field data analysis indicates is poor; and (2) reward a candidate field label obtained from the field name analysis that the field data analysis indicates is good. In some embodiments, the score merging module 102C may apply a loss function (e.g., cross entropy) to penalize a poor candidate field label and reward a good candidate field label. For example, the score merging module 102C may determine a bias value based on a comparison (e.g., a ratio) between the field name analysis score and the field data analysis score and adjust the field name analysis score using the bias to obtain a merged score (e.g., by adding the calculated bias value to, or subtracting it from, the field name analysis score). Example techniques of merging a field name analysis score and a field data analysis score are described herein.
In some embodiments, the score merging module 102C may be configured to obtain, from the field name analysis module 102A for a dataset field, a candidate field label that was not identified as a candidate field label by the field data analysis module 102B (e.g., because values obtained from the dataset field failed to score sufficiently when a test associated with the candidate field label was applied to the values). The score merging module 102C may be configured to determine a merged score for such a candidate field label. In some embodiments, the score merging module 102C may be configured to determine the merged score by reducing the field name analysis score. For example, the score merging module 102C may reduce the field name analysis score by a pre-determined penalty percentage (e.g., 1%, 5%, 10%, 15%, 20%, 25%, or a percentage between any of the aforementioned percentages). This may reflect the determination that the dataset field values failed to match a pattern expected for the candidate field label identified by the field name analysis module.
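The two merging cases above may be sketched as follows. This is an illustrative Python sketch only: the particular ratio-derived bias formula (which here amounts to moving the field name analysis score halfway toward the field data analysis score) and the 10% default penalty are assumptions chosen from within the example penalty range above, not a definitive implementation.

```python
def merge_scores(name_scores, data_scores, no_data_penalty=0.10):
    """Merge field name analysis scores with field data analysis scores.

    When a label also scored under the field data tests, bias the name score
    using a ratio-derived bias value; when the data tests found no match for
    the label, reduce the name score by a pre-determined penalty percentage.
    """
    merged = {}
    for label, name_score in name_scores.items():
        if label in data_scores:
            data_score = data_scores[label]
            # Bias derived from the ratio between the two scores: positive when
            # the data analysis agrees more strongly, negative when it disagrees.
            bias = (data_score / name_score - 1.0) * 0.5 * name_score if name_score else 0.0
            merged[label] = max(0.0, min(1.0, name_score + bias))
        else:
            # Field values failed the label's test: apply a flat penalty.
            merged[label] = name_score * (1.0 - no_data_penalty)
    return merged

# Example: "credit limit" is supported by both analyses; "card number" was
# identified only by the field name analysis and is therefore penalized.
merged = merge_scores({"credit limit": 0.8, "card number": 0.6}, {"credit limit": 0.9})
```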
Once the merged candidate field labels and corresponding scores 136 are determined, the field label assignment module 106 uses the merged candidate field labels and corresponding scores 136 to assign field labels 126A, 126B, 126C to the dataset fields F11, F12, F13. As shown in
As shown in
In the example of
The results from the two paths are merged by the score merging module 102C. As shown in
In some embodiments, each of the analysis paths illustrated in
As shown in
In some embodiments, the field name segmentation module 152 may be configured to segment a field name into multiple portions. For example, the field name segmentation module 152 may segment a field name into abbreviations present in the field name. In some embodiments, the field name segmentation module 152 may be configured to identify, for each field name portion, a candidate set of words that may be indicated by the field name portion. For example, for each abbreviation in the field name, the field name segmentation module 152 may identify a candidate set of words that may be indicated by the abbreviation. The field name segmentation module 152 may be configured to determine a similarity score between each word in a candidate set of words and the corresponding field name portion (e.g., abbreviation). The similarity scores may be used to identify and score candidate field labels by the field label identification module 154.
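The segmentation step above may be sketched as follows. This is an illustrative Python sketch; the specific delimiters and camelCase handling are assumptions about how field names are commonly formed, not a definitive implementation of the field name segmentation module 152.

```python
import re

def segment_field_name(field_name):
    """Segment a field name into its component abbreviations.

    Handles delimiter-separated names (e.g., "cust_acct_num") as well as
    camelCase / PascalCase names (e.g., "custAcctNum")."""
    parts = re.split(r"[_\-\s]+", field_name)
    abbrevs = []
    for part in parts:
        # Further split camelCase runs: uppercase runs, capitalized words,
        # lowercase runs, and digit runs each become a separate abbreviation.
        abbrevs.extend(re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part))
    return [a.lower() for a in abbrevs if a]
```

Each resulting abbreviation (e.g., "cust", "acct", "num") would then be matched against a glossary of words to identify its candidate set of words and corresponding similarity scores.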
In some embodiments, the field label identification module 154 may be configured to determine candidate field labels and corresponding scores using candidate sets of words identified by the field name segmentation module 152. The field label identification module 154 may be configured to determine a candidate set of labels for a dataset field and corresponding scores by: (1) using the candidate sets of words to generate one or more word sequences; and (2) identifying candidate field labels in the field label glossary 104 using the word sequence(s). The field label identification module 154 may be configured to use a word sequence to configure sequence position model 158B and determine scores for field labels in the field label glossary 104 using sequence position model 158B.
In some embodiments, the field label identification module 154 may be configured to identify candidate labels using the n-gram model 158A. In some embodiments, the NLP module 156 may be configured to generate an n-gram model 158A which may be used by the field label identification module 154 to identify candidate field labels. The NLP module 156 may be configured to generate the n-gram model 158A using the field label glossary 104. The n-gram model 158A may thus provide a language model representing the field labels. In some embodiments, the NLP module 156 may be configured to generate the n-gram model 158A by: (1) identifying all word sequences that appear in the field label glossary 104; (2) storing an indication of the identified sequences in a data structure (e.g., a table); and (3) storing the data structure in memory of the data processing system 100 as the n-gram model 158A. For example, the NLP module 156 may store each identified sequence in a row of a table in which one column stores the last word in the sequence and the other column stores one or more words that precede the last word. The last word in the sequence may be referred to as a “target” that completes the sequence. Example generation of the n-gram model 158A is described herein with reference to
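The table-building procedure described above may be sketched as follows. This is an illustrative Python sketch; representing the table as a dictionary keyed by preceding-word tuples, with sets of "target" words as values, is an assumption about one convenient data structure.

```python
from collections import defaultdict

def build_ngram_model(field_label_glossary):
    """Build an n-gram table from a field label glossary: for every word
    sequence appearing within a field label, record the last word (the
    'target' that completes the sequence) keyed by the preceding word(s)."""
    model = defaultdict(set)
    for label in field_label_glossary:
        words = label.lower().split()
        for end in range(1, len(words)):
            for start in range(end):
                context = tuple(words[start:end])  # one or more preceding words
                model[context].add(words[end])     # target completing the sequence
    return dict(model)

# Example glossary with two field labels.
model = build_ngram_model(["customer account number", "customer name"])
```

In this example, the context ("customer",) has two possible targets ("account" and "name"), while ("customer", "account") is completed only by "number".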
The field label identification module 154 may be configured to score field labels using the sequence position model 158B. In some embodiments, the NLP module 156 may be configured to generate the sequence position model 158B. The NLP module 156 may be configured to generate the sequence position model 158B using a word sequence (e.g., generated from sets of candidate words determined by the field name segmentation module 152). The NLP module 156 may be configured to set parameters of the sequence position model 158B based on the word sequence. For example, the NLP module 156 may map points in the sequence position model 158B to respective words in the word sequence. In some embodiments, the sequence position model 158B may be used to generate an output score for a field label that is based on: (1) whether words in the field label appear in the word sequence; and (2) the degree to which the order of words in the field label match the order of words in the word sequence. An example sequence position model 158B is described herein with reference to
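The two scoring criteria above (word membership and word order) may be sketched as follows. This is an illustrative Python sketch; the particular displacement-based decay and the normalization are assumptions made for a self-contained example, not the disclosed sequence position model 158B itself.

```python
def sequence_position_score(word_sequence, field_label):
    """Score a glossary field label against a derived word sequence, crediting
    each label word that appears in the sequence and weighting the credit by
    how closely its position in the label matches its position in the sequence."""
    positions = {word: i for i, word in enumerate(word_sequence)}
    label_words = field_label.lower().split()
    if not label_words:
        return 0.0
    total = 0.0
    for j, word in enumerate(label_words):
        if word in positions:
            # Full credit at the expected position, decaying with displacement.
            displacement = abs(positions[word] - j)
            total += 1.0 / (1.0 + displacement)
    return total / max(len(label_words), len(word_sequence))

# Example: an exact-order match outscores a partial, shifted match.
exact = sequence_position_score(["customer", "account", "number"], "customer account number")
partial = sequence_position_score(["customer", "account", "number"], "account number")
```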
In some embodiments, the field name segmentation module 152 may be configured to identify candidate sets of words that may be indicated by field name portions in a glossary of words. In some embodiments, the NLP module 156 may be configured to generate the glossary of words. In some embodiments, the NLP module 156 may be configured to generate the word collection by: (1) loading words from a dictionary; (2) filtering the words to obtain a filtered set of words; and (3) including the filtered set of words in the word collection. For example, the NLP module 156 may filter single character words out of the words loaded from the dictionary. In some embodiments, the NLP module 156 may be configured to identify synonyms and antonyms of the filtered set of words and include the identified synonyms and antonyms in the glossary. The NLP module 156 may configure the glossary to indicate synonyms and antonyms of each word. For example, the NLP module 156 may identify synonyms and antonyms of words using the WordNet lexical database. The NLP module 156 may be configured to store the glossary in the datastore 158.
As shown in
In some embodiments, the field value selection module 162 may be configured to select values from a set of field values (e.g., stored in a dataset profile or from a dataset). The field value selection module 162 may be configured to select field values using one or more criteria. For example, the field value selection module 162 may select a number (e.g., 10-20, 20-30, 30-40, 40-50, 50-100, 100-150, 150-200, 200-300, 300-400, 400-500, 500-1000, or another number of values) of the most frequently occurring field values. As another example, the field value selection module 162 may randomly select a number of field values. As another example, the field value selection module 162 may select a number of most recently added field values in the field.
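The first selection criterion above (most frequently occurring values) may be sketched as follows; the count `n` is illustrative.

```python
from collections import Counter

def select_field_values(values, n=3):
    """Select the n most frequently occurring field values (one of the
    selection criteria described above; n is illustrative)."""
    return [v for v, _ in Counter(values).most_common(n)]

vals = ["NY", "CA", "NY", "TX", "NY", "CA"]
print(select_field_values(vals))  # ['NY', 'CA', 'TX']
```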
In some embodiments, the match testing module 164 may be configured to determine a score for each of one or more of the field labels in the field label glossary 104 using the selected field values. In some embodiments, the match testing module 164 may be configured to determine the score for a given field label using various techniques. The match testing module 164 may be configured to determine the score(s) for the field label(s) by applying test(s) associated with the field label(s) to the selected field values. For example, the match testing module 164 may execute a first test associated with a first field label on the selected values to obtain a first field data analysis score for the first field label and may execute a second test associated with a second field label on the selected values to obtain a second field data analysis score for the second field label.
In some embodiments, the match testing module 164 may be configured to apply tests associated with field labels (e.g., of field label glossary 104) to selected field values. The match testing module 164 may be configured to apply a test associated with a given field label to the selected field values to determine a score associated with the field label. The match testing module 164 may be configured to determine a score for multiple field labels (e.g., of field label glossary 104). In some embodiments, a test, when executed by the match testing module 164, may indicate a number of selected field values that match an expected pattern for a field label. The match testing module 164 may use the number to determine a score for the field label (e.g., by determining a ratio of matching values to total selected values, or a ratio of matching values to values that do not match the pattern). The field data analysis module 102B may be configured to identify a candidate set of labels based on scores determined for the multiple field labels. For example, the field data analysis module 102B may identify field labels with scores that meet a threshold score as the candidate set of labels. As another example, the field data analysis module 102B may identify a set of the highest scoring field labels as the candidate set of labels.
As shown in
In some embodiments, the abbreviation recognition module 152A may be configured to divide the field name 200 into its abbreviations 200A, 200B, 200C. In some embodiments, the abbreviation recognition module 152A may be configured to segment the field name 200 by determining a set of locations at which to segment the field name 200. The abbreviation recognition module 152A may be configured to determine the set of locations by: (1) identifying different segmentations of the field name; (2) determining a score for each segmentation; and (3) selecting one of the segmentations based on the scores determined for the segmentations (e.g., by selecting the highest scoring segmentation). The abbreviation recognition module 152A may be configured to score each segmentation by: (1) determining a score for each field name portion obtained from the segmentation; and (2) determining the score for the segmentation using the scores determined for the field name portions. As an illustrative example, the field name segmentation module 152 may segment the field name “numtelcel” at a first set of candidate locations into “num”, “tel”, and “cel”. The abbreviation recognition module 152A may determine a score for each of “num”, “tel”, and “cel” based on whether it is a valid abbreviation for a word (e.g., by determining whether the abbreviation exists in a set of abbreviations and corresponding words).
In some embodiments, the abbreviation recognition module 152A may be configured to determine a score for a segmentation of a field name using a glossary of words. The glossary of words may include words, lemmatizations of words, stemmed words, and/or words with vowels removed. A lemmatization of a word may refer to a root or lemma of the word. For example, a lemmatization of the word “running” would be “run”. The abbreviation recognition module 152A may determine a score for a segmentation of the field name using the glossary of words by: (1) identifying, for each field name portion of the segmentation, one or more words in the glossary that match the field name portion; and (2) determining a measure of similarity between each field name portion and its matched word(s) to obtain a set of similarity score(s) for the field name portion. The abbreviation recognition module 152A may be configured to determine the score for the segmentation using the sets of similarity score(s) obtained for the field name portions of the segmentation. For example, the abbreviation recognition module 152A may identify the maximum similarity score determined for each field name portion in the segmentation. When there are multiple field name portions, the abbreviation recognition module 152A may determine an average of the similarity scores determined for the field name portions as the score for the segmentation.
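The max-then-average segmentation scoring described above may be sketched as follows. `SequenceMatcher` is a stand-in for the similarity measure; the measure actually used may differ.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in similarity measure (the measure actually used may differ).
    return SequenceMatcher(None, a, b).ratio()

def segmentation_score(portions, glossary_words):
    """Score a segmentation: take, for each field name portion, the maximum
    similarity to any glossary word, then average those maxima."""
    maxima = [max(similarity(p, w) for w in glossary_words) for p in portions]
    return sum(maxima) / len(maxima)

# Comparing two candidate segmentations of "numtel" against a glossary:
glossary = ["number", "telephone", "code"]
good = segmentation_score(["num", "tel"], glossary)
bad = segmentation_score(["nu", "mtel"], glossary)
# The segmentation whose portions better match glossary words scores higher.
```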
As an illustrative example, the abbreviation recognition module 152A may determine a segmentation of the field name "dispcd" into field name portions "disp" and "cd". For the field name portion "disp", the abbreviation recognition module 152A may determine that "disp" matches the word "disposition", the lemmatization of disposition "dispose", the stemmed word "dispos", and the vowelless word "dspstn" from a glossary of words. The abbreviation recognition module 152A may determine a measure of similarity between "disp" and each of the words "disposition", "dispose", "dispos", and "dspstn" to obtain a first set of similarity scores [0.9, 0.92, 0.94, 0.85]. For the field name portion "cd", the abbreviation recognition module 152A may determine that "cd" matches the word "code" and the vowelless word "cd" in the glossary. The abbreviation recognition module 152A may determine a measure of similarity between "cd" and each of the words "code" and "cd" to obtain a second set of similarity scores [0.88, 0.9]. The abbreviation recognition module 152A may determine a score for the segmentation ["disp", "cd"] by: (1) identifying the maximum similarity score in the first set of similarity scores determined for the field name portion "disp" (i.e., 0.94) and the maximum similarity score in the second set of similarity scores determined for the field name portion "cd" (i.e., 0.9); and (2) averaging the maximum similarity scores (i.e., 0.94 and 0.9) to obtain the score of 0.92 for the segmentation.
After identification of the abbreviations 200A, 200B, 200C in the field name 200, the abbreviation resolution module 152B determines candidate word sets for the abbreviations 200A, 200B, 200C. In some embodiments, the abbreviation resolution module 152B may be configured to determine, for each field name portion, a candidate set of words that may be indicated by the field name portion. For example, the abbreviation resolution module 152B may determine, for each of a set of abbreviations into which a field name is segmented, a candidate set of words that may be indicated by the abbreviation. In some embodiments, the abbreviation resolution module 152B may be configured to identify a candidate set of words that may be represented by a field name portion (e.g., an abbreviation) from a word collection (e.g., a glossary generated by the NLP module 156). In some embodiments, the field name segmentation module 152 may be configured to identify a candidate set of words for a field name portion by: (1) determining a measure of similarity between the field name portion and each of the words in the word collection; and (2) identifying a subset of the words to be the candidate set of words. The field name segmentation module 152 may be configured to identify the subset of words that meet a threshold similarity score to be the candidate set of words for the field name portion. Example measures of similarity that may be used are described herein. Accordingly, the abbreviation resolution module 152B may be configured to generate, for each field name portion, a set of candidate words and corresponding similarity scores. The candidate sets of words 202 for the field name portions and the similarity scores may be used by the field label identification module 154 to identify and score candidate field labels for the dataset field.
As shown in
In some embodiments, the candidate sequence generation module 154A may be configured to generate one or more word sequences that indicate a semantic meaning for the dataset field based on its field name 200. The candidate sequence generation module 154A may be configured to use the candidate sets of words and similarity scores to generate the word sequence(s). The candidate sequence generation module 154A may be configured to generate the word sequence(s) using the n-gram model 158A. For example, the candidate sequence generation module 154A may use the n-gram model 158A to generate word sequences by combining words taken from each of the candidate sets of words and filtering the word sequences using the n-gram model 158A to obtain the word sequence(s) that are used for determining candidate field label(s). The candidate sequence generation module 154A may filter the word sequences by removing word sequences composed of words that do not co-occur in any sequence indicated by the n-gram model 158A.
In some embodiments, the sequence rectification module 154B may be configured to use word sequence(s) generated by the candidate sequence generation module 154A to: (1) identify candidate field label(s); and (2) determine score(s) for the candidate field label(s). The sequence rectification module 154B may be configured to identify field label(s) in the field label glossary 104 using the word sequence(s). The sequence rectification module 154B may be configured to score the identified label(s) (e.g., using the sequence position model 158B). The sequence rectification module 154B may be configured to determine the candidate field labels and scores 210 for the field name 200.
As shown in
Next, the abbreviation resolution module 152B generates, for each of the abbreviations 200A, 200B, 200C, a candidate set of words that may be indicated by the abbreviation. The abbreviation resolution module 152B generates a candidate word set 202A for the abbreviation 200A, a candidate word set 202B for the abbreviation 200B, and a candidate word set 202C for the abbreviation 200C. In some embodiments, the abbreviation resolution module 152B may be configured to identify the candidate word set for each of the abbreviations 200A, 200B, 200C by: (1) determining a measure of similarity between the abbreviation and each word in a glossary of words (e.g., generated by the NLP module 156); and (2) identifying the candidate set of words based on the similarity scores obtained for the glossary of words. For example, the abbreviation resolution module 152B may identify words from the glossary that meet a threshold similarity score as the candidate set of words. In some embodiments, the similarity scores may be values between 0 and 1. The threshold similarity score may be a value between 0.5 and 0.6, 0.6 and 0.7, 0.7 and 0.8, 0.8 and 0.9, or 0.9 and 1. For example, the threshold similarity score may be 0.8.
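The threshold-based candidate selection above may be sketched as follows, using the example threshold of 0.8. `SequenceMatcher` is a stand-in for the actual similarity measure.

```python
from difflib import SequenceMatcher

def candidate_words(abbreviation, glossary_words, threshold=0.8):
    """Return (word, similarity) pairs for glossary words whose similarity
    to the abbreviation meets the threshold (0.8, per the example above).
    SequenceMatcher stands in for the actual similarity measure."""
    scored = ((w, SequenceMatcher(None, abbreviation, w).ratio())
              for w in glossary_words)
    return [(w, s) for w, s in scored if s >= threshold]

# Only glossary words similar enough to the abbreviation survive.
print(candidate_words("tel", ["tel", "telephone", "code"]))
```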
Next, the candidate sequence generation module 154A uses the candidate word sets 202A, 202B, 202C to generate candidate sequences 204 for the field name 200. As shown in
In some embodiments, the word co-locator 154A-1 may be configured to determine a score for each word collection. For example, the score associated with a word collection may indicate a probability of the words in the word collection co-occurring. The word co-locator 154A-1 may be configured to determine a score for each word collection using similarity scores associated with words in the word collection. For example, the word co-locator 154A-1 may be configured to determine a score for a word collection by determining a mean similarity score of words in the word collection if they co-occur in any sequence indicated by the n-gram model 158A. If words in a word collection do not co-occur in any sequence indicated by the n-gram model 158A, then the word co-locator may determine a score of 0 for the word collection.
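The word-collection scoring rule above (mean similarity score when the words co-occur in an n-gram sequence, zero otherwise) may be sketched as follows; the n-gram model is represented here simply as a list of known word sequences, which is an assumption for illustration.

```python
def collection_score(word_scores, ngram_sequences):
    """Score a word collection: the mean similarity score of its words when
    they co-occur in some sequence known to the n-gram model, else 0."""
    words = {w for w, _ in word_scores}
    if not any(words <= set(seq) for seq in ngram_sequences):
        return 0.0
    return sum(s for _, s in word_scores) / len(word_scores)

ngrams = [["client", "phone", "number"]]
print(collection_score([("client", 1.0), ("phone", 0.5)], ngrams))  # 0.75
print(collection_score([("client", 1.0), ("cell", 0.5)], ngrams))   # 0.0
```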
In some embodiments, sequence generator 154A-2 of the candidate sequence generation module 154A may be configured to generate the candidate word sequences 204. The sequence generator 154A-2 may be configured to generate candidate sequences using one or more of the word collections generated by the word co-locator 154A-1. In some embodiments, the sequence generator 154A-2 may be configured to generate candidate sequences using word collection(s) that meet a threshold score (e.g., probability value). The threshold score may be a value between 0 and 1. For example, the threshold score value may be 0.5. Thus, the sequence generator 154A-2 may generate candidate sequences using only collections of words generated from the candidate word sets 202A, 202B, 202C that co-occur in a sequence indicated by the n-gram model 158A.
In some embodiments, the sequence generator 154A-2 may be configured to modify an order of words in a generated word sequence. For example, the sequence generator 154A-2 may modify the order of words in a word sequence based on language convention. In some embodiments, the sequence generator 154A-2 may be configured to modify an order of words in a word sequence by: (1) identifying a classword in the word sequence; and (2) changing a position of the classword in the sequence. A classword may be a word that identifies a category (e.g., name, amount, code, number, and/or other category) of data. The sequence generator 154A-2 may determine a position of an identified classword (e.g., based on language convention). For example, for an English word sequence, the sequence generator 154A-2 may move a classword identified in the word sequence to the end of the word sequence. To illustrate, in the word sequence “number client phone”, the sequence generator 154A-2 may identify the word “number” as a classword and move it to the end of the word sequence to obtain the updated sequence “client phone number”.
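The classword re-ordering convention above may be sketched as follows; the set of classwords shown is taken from the examples in this description and is not exhaustive.

```python
# Example classwords from this description; not an exhaustive set.
CLASSWORDS = {"name", "amount", "code", "number"}

def apply_classword_convention(words):
    """Move any classword to the end of the sequence, per the English
    convention described above ("number client phone" -> "client phone number")."""
    rest = [w for w in words if w not in CLASSWORDS]
    classwords = [w for w in words if w in CLASSWORDS]
    return rest + classwords

print(apply_classword_convention(["number", "client", "phone"]))
# ['client', 'phone', 'number']
```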
As shown in
In some embodiments, the sequence rectification module 154B may be configured to identify, in the field label glossary 104, one or more field labels for each of the candidate sequences 204. For each candidate sequence, the sequence rectification module 154B may select one or more field labels from the field label glossary 104. In some embodiments, the sequence rectification module 154B may be configured to select a field label for a candidate sequence if the field label shares at least one word with the candidate sequence. In some embodiments, the sequence rectification module 154B may be configured to select a field label for a candidate sequence if all the words of the field label are included in the candidate sequence. As an illustrative example, one of the candidate sequences 204 may be “client phone number”. The sequence rectification module 154B may identify the field labels “phone number” and “client number” from the field label glossary 104 for the candidate sequence “client phone number”.
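The second selection criterion above (all words of the field label appear in the candidate sequence) may be sketched as follows, reproducing the "client phone number" example.

```python
def labels_for_sequence(sequence, glossary_labels):
    """Select glossary field labels all of whose words appear in the
    candidate sequence (one of the selection criteria described above)."""
    seq = set(sequence)
    return [lbl for lbl in glossary_labels if set(lbl.split()) <= seq]

labels = labels_for_sequence(["client", "phone", "number"],
                             ["phone number", "client number", "zip code"])
print(labels)  # ['phone number', 'client number']
```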
In some embodiments, the sequence rectification module 154B may be configured to determine a score for each identified field label. The sequence rectification module 154B may be configured to use the sequence position model 158B to determine a score for each identified field label. In some embodiments, the sequence rectification module 154B may be configured to use the sequence position model 158B to determine a score for a field label based on the degree to which the order of words in the field label matches the order of words in the one of the candidate sequences 204 from which the field label was identified. For example, for the field labels “phone number” and “client number” determined from the candidate sequence “client phone number”, the sequence position model 158B may indicate a higher score for “phone number” than “client number” because it more closely matches the order of words in “client phone number”.
In some embodiments, the sequence rectification module 154B may be configured to use similarity scores associated with words in each of the candidate sequences 204 to determine scores for field label(s) identified for the candidate sequence. For example, the similarity score for each word in the candidate sequence may indicate a relative position in the sequence position model 158B associated with the word. An example sequence position model 158B and use thereof to determine a candidate field label score is described herein with reference to
As shown in
In the example of
As shown in
In the example of
After obtaining the candidate word sets and corresponding scores 406A, 406B, 406C for the abbreviations 404A, 404B, 404C, the candidate word sets and corresponding scores 406A, 406B, 406C are used to generate word collections 410.
In some embodiments, the sequence generator 154A-2 may be configured to re-order words in a generated sequence. The sequence generator 154A-2 may be configured to re-order a candidate sequence by identifying a classword within the candidate sequence. A classword may indicate a category of data. For example, the word “number” in “client number phone” may indicate that the data is numerical. As another example, the word “name” in “customer name” may indicate that the data is a name. The sequence generator 154A-2 may be configured to change the position of a classword in the sequence.
As shown in
In the example of
Accordingly, the sequence rectification module 154B may be configured to determine a score for each field label using the sequence position model 158B that is based on words shared between a candidate sequence and the field label, and on how closely the order of words in the candidate sequence matches the order of words in the field label. In the above example of “Phone Number” and “Client Number”, the field label “Phone Number” more closely matches the order of terms in “Client Phone Number” than does the field label “Client Number”. Accordingly, as illustrated in
As shown in
As shown in
In some embodiments, the test definition module 164A may be configured to define tests for field labels in the field label glossary 104. In some embodiments, the test definition module 164A may be used to create a test based on user input. The user input may indicate one or more measures of how well a set of field values match the field label associated with the test and information to use in computing each of the measure(s). In some embodiments, the test definition module 164A may be configured to store defined tests (e.g., in datastore 164C). In some embodiments, the test definition module 164A may be configured to translate a defined test into an executable software application (e.g., that can be executed by the test execution module 164B to apply the test to the selected values 602).
In some embodiments, the test definition module 164A may be configured to define a test for a field label by specifying a regular expression indicating an expected pattern for values stored in a dataset field assigned the field label. For example, the test definition module 164A may define a regular expression based on user input indicating the regular expression (e.g., received through a test definition GUI). As another example, the test definition module 164A may automatically define the regular expression. The test definition module 164A may automatically define the regular expression by analyzing a set of target data values to generate a regular expression that matches all the target data values.
In some embodiments, the test definition module 164A may be configured to define a test for a field label by specifying a set of reference values that can be stored in a dataset field assigned the field label. For example, the test definition module 164A may specify a set of integer values from 1-12 (i.e., each representing a month of the year) as the set of reference values for the field label “month”. As another example, the test definition module 164A may specify a set of names of states in the United States of America and/or abbreviations thereof as the set of reference values for a field label “State”. In some embodiments, the test definition module 164A may be configured to specify a set of reference values as a set of values indicated by user input indicating the set of reference values.
In some embodiments, the test definition module 164A may be configured to define a test for a field label by specifying a set of values on which a distribution associated with the field label is defined. For example, the test definition module 164A may specify a set of integer values from 1-31 for a field label referring to the date of the month in a date. A distribution associated with the field label may be defined on the set of integer values from 1-31.
In some embodiments, the test definition module 164A may be configured to define a test for a field label by specifying one or more rules that must be met by a field value in order for the field value to be valid. The rule(s) may be specified as logical statements. In order for a field value to be considered correctly described by the field label, the field value may be required to meet the rule(s). For example, for a field label of “birth year” the test definition module 164A may specify a rule requiring that a field value is greater than 1910 and less than 2023. As another example, for a field label “age” the test definition module 164A may specify a rule requiring that a field value is greater than 0 and less than 200.
In some embodiments, the test definition module 164A may be configured to define a test for a field label by: (1) identifying an information type (e.g., date, month, year, social security number, credit card number, phone number, city, state, and/or other information type); (2) and defining the test based on the identified information type. For example, the test definition module 164A may define the test to include a regular expression associated with the information type. As another example, the test definition module 164A may define the test by specifying a set of valid values and/or a distribution of values associated with the information type for the test.
In some embodiments, the test execution module 164B may be configured to execute tests associated with field labels in the field label glossary 104 on the selected values 602 to obtain a corresponding score for each of the field labels. For example, the test execution module 164B may apply a test associated with a field label to the selected values 602 by: (1) accessing information (e.g., from datastore 164C) associated with the test (e.g., a regular expression, a set of reference values, a distribution of values, support of a distribution of values, and/or one or more rules); and (2) applying the test to the selected values 602 using the information to obtain a score corresponding to the field label. The test execution module 164B may apply the test to the selected values 602 using the information by determining to execute one or more component tests and determining the score using result(s) of the component test(s).
In some embodiments, the test execution module 164B may be configured to apply a test associated with a field label to the selected values 602 by: (1) accessing a regular expression (e.g., a regular expression indicating a pattern of mm/dd/yyyy for a “date” field label) specified by the test; and (2) determining a score for the field label using the regular expression (e.g., by determining a number of the selected values 602 that match the regular expression). In some embodiments, the test execution module 164B may be configured to determine the score using the regular expression by: (1) determining a percentage of the selected values 602 that match the regular expression; and (2) determining the score for the field label using the percentage of the selected values 602 that match the regular expression. For example, the test execution module 164B may determine a percentage of the selected values 602 that match a regular expression indicating an expected pattern for a social security number and determine a score for the field label “social security number” using the determined percentage.
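The regular-expression component test above may be sketched as follows. The social security number pattern shown is an assumption for illustration.

```python
import re

# Assumed pattern for a US social security number, for illustration.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

def regex_score(selected_values, pattern):
    """Score a field label as the fraction of selected values that fully
    match the test's regular expression."""
    if not selected_values:
        return 0.0
    matches = sum(1 for v in selected_values if pattern.fullmatch(v))
    return matches / len(selected_values)

values = ["123-45-6789", "987-65-4321", "hello", "12345"]
print(regex_score(values, SSN_PATTERN))  # 0.5
```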
In some embodiments, the test execution module 164B may be configured to apply a test associated with a field label to the selected values 602 by: (1) accessing a set of reference values (e.g., the integer values 1-12 for a “month” field label) specified by the test; and (2) determining a score for the field label using the set of reference values and the selected values 602 (e.g., by determining a number of the selected values 602 that are included in the set of reference values). In some embodiments, the match testing module 164 may be configured to determine the score by: (1) determining whether each of the selected field values 602 is in the set of reference values associated with the field label; and (2) determining the score based on the number of the selected field values that are included in the set of reference values. In some embodiments, the set of reference values associated with the field label may be an enumerated set of values that would be valid for a dataset field to which the field label is assigned. For example, the dataset field may store an indication of a state in the United States of America (USA) and the set of reference values may be a list of the 50 states in the USA. The test execution module 164B may determine a percentage of the selected values 602 that are in the list of 50 states and determine a score for the field label using the determined percentage.
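The reference-value component test above may be sketched as follows, using the "month" example (reference values 1-12).

```python
MONTH_REFERENCE = set(range(1, 13))  # reference values for a "month" field label

def reference_value_score(selected_values, reference_values):
    """Score a field label as the fraction of selected values that fall
    in the label's reference set."""
    if not selected_values:
        return 0.0
    hits = sum(1 for v in selected_values if v in reference_values)
    return hits / len(selected_values)

print(reference_value_score([1, 4, 12, 40], MONTH_REFERENCE))  # 0.75
```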
In some embodiments, the test execution module 164B may be configured to apply a test associated with a field label to the selected values 602 by: (1) accessing a distribution of values specified by the test (e.g., a distribution defined on the values 1-31 that is associated with a field label “day of the month”); and (2) comparing the distribution specified by the test to a distribution defined on the selected values 602. The test execution module 164B may be configured to determine the score associated with the field label based on a result of the comparison. For example, the test execution module 164B may compare the distribution specified by the test to the distribution defined on the selected values 602 using a chi-squared test, and determine the score associated with the field label using the result of the chi-squared test.
In some embodiments, the test execution module 164B may be configured to apply a test associated with a field label to the selected values 602 by: (1) accessing a set of values specified by the test; and (2) comparing a support of the set of values to a support of the selected values 602. The test execution module 164B may be configured to determine the score associated with the field label based on the comparison. For example, the test execution module 164B may determine a ratio of the support of the selected values 602 to the support of the distribution specified by the test and determine the score for the field label using the ratio. To illustrate, the support for a distribution specified by a test associated with the field label “gender” may be male, female, other, and unknown while the support for the selected values 602 may be male and other. The test execution module 164B may determine a support ratio of 0.5 for the field label and determine the score for the field label using the support ratio.
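The support-ratio component test above may be sketched as follows, reproducing the "gender" example, where the observed values cover half of the test's support.

```python
def support_ratio_score(selected_values, test_support):
    """Score a field label as the ratio of the observed support (distinct
    selected values appearing in the test's support) to the full support."""
    observed = set(selected_values) & set(test_support)
    return len(observed) / len(test_support)

support = {"male", "female", "other", "unknown"}
print(support_ratio_score(["male", "other", "male"], support))  # 0.5
```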
In some embodiments, the test execution module 164B may be configured to apply a test associated with a field label to the selected values 602 by: (1) accessing one or more rules specified by the test; and (2) determining a percentage of the selected values 602 that meet the rule(s) (e.g., by determining a percentage of the selected values 602 for which logical statement(s) indicating the rule(s) are true). The test execution module 164B may be configured to determine the score for the field label based on the percentage of the selected values 602 that meet the rule(s). For example, a rule specified by a test associated with the field label “age” may require values to be greater than 0 and less than 200. The test execution module 164B may determine a percentage of the selected values 602 that are within the range required by the rule and determine the score for the field label “age” based on the percentage of the selected values 602 that are within the range.
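The rule-based component test above may be sketched as follows, with rules represented as predicates; the "age" range rules match the example given above.

```python
def rule_score(selected_values, rules):
    """Score a field label as the fraction of selected values satisfying
    every rule (rules are modeled here as boolean predicates)."""
    if not selected_values:
        return 0.0
    ok = sum(1 for v in selected_values if all(r(v) for r in rules))
    return ok / len(selected_values)

# Rules for the "age" field label: greater than 0 and less than 200.
age_rules = [lambda v: v > 0, lambda v: v < 200]
print(rule_score([25, 42, -3, 500], age_rules))  # 0.5
```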
In some embodiments, the test execution module 164B may be configured to combine results of multiple test components (e.g., match to an expression, match to a reference set of values, match to a distribution of values) to obtain a score for a field label. For example, the test execution module 164B may be configured to determine a component score from each test component and determine the score for the field label by combining the component scores (e.g., by determining a weighted average of the component scores). A test associated with a field label may involve any one or more test components described herein and/or other test component(s). For example, the test may involve regular expression matching, comparison of the selected values 602 to a reference set of values and/or a distribution.
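The weighted-average combination of component scores above may be sketched as follows; the weights are illustrative assumptions.

```python
def combined_score(component_scores, weights):
    """Combine component-test scores into one field-label score via a
    weighted average (the weights here are illustrative)."""
    return sum(s * w for s, w in zip(component_scores, weights)) / sum(weights)

# e.g., a regex-match component score and a reference-set component score,
# weighted equally.
print(combined_score([1.0, 0.5], [1, 1]))  # 0.75
```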
In some embodiments, the datastore 164C may store tests generated by the test definition module 164A. For example, data defining a pattern matching test (e.g., a regular expression) may be stored in the datastore 164C. The test information may be accessed when identifying candidate field labels and scores 604. For example, the test execution module 164B may access the tests stored in the datastore 164C to apply them to the selected values 602.
In some embodiments, the field values may be a sampled subset of values obtained from the field in the dataset. For example, the field values may be a randomly sampled subset of values from the field. As another example, the field values may be a subset of the most frequently occurring values in the field. As another example, the field values may be a subset of values that were most recently written to the field.
In some embodiments, the statistical information about a field may include statistics about data stored in the field. For example, statistics about data stored in a field may include a minimum value, a maximum value, a range of values, a median value, a variance, a total number of values stored in the field, a number of empty (e.g., null) values, a minimum length of values in the field, a maximum length of values in the field, a most common value in the field, a least common value in the field, and/or other statistical information.
In some embodiments, the relationship information about the field may include an indication of relationships of the field with one or more other fields. For example, the relationship information may indicate a statistical correlation of the field with another field. As another example, the relationship information may indicate a dependency of the field on another field.
In some embodiments, the format information for a field may indicate that values in the field need to adhere to a particular format and/or indicate the particular format. For example, the format information may indicate a standard format for telephone numbers, addresses, social security numbers, birth dates, or other type of value to be stored in the field. As another example, the format information may indicate a number of decimal places for numbers stored in the field. As another example, the format information may indicate a minimum or maximum number of digits or characters for values stored in the field.
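Format checks of the kind described above can be expressed as simple predicates; the phone-number pattern and the decimal-place rule below are illustrative assumptions, not formats mandated by the system.

```python
# Sketch: format tests for field values. The US-style phone pattern and the
# decimal-place check are examples only.
import re

US_PHONE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def conforms_to_phone_format(value):
    """True if the value matches the assumed standard phone format."""
    return bool(US_PHONE.match(value))

def has_decimal_places(value, places):
    """True if the numeric string has exactly `places` digits after the dot."""
    _, _, frac = value.partition(".")
    return len(frac) == places

ok = conforms_to_phone_format("617-555-0199")
bad = conforms_to_phone_format("not a number")
```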
Examples of information stored in a dataset profile described herein are for illustrative purposes. In some embodiments, the dataset profile 610 may include other information related to the dataset and/or a field therein. Some embodiments may be configured to store, in a dataset profile, other information instead of or in addition to examples of information described herein.
As shown in
In some embodiments, the score merging module 102C may be configured to identify candidate field labels that exist in both field name analysis candidate field label scores 700 and field data analysis candidate field label scores 702. In the example of
In some embodiments, the score merging module 102C may be configured to merge the field name analysis score and the field data analysis score of a candidate field label by adjusting the field name analysis score using the field data analysis score. For example, the score merging module 102C may determine a ratio between the field name analysis score and the field data analysis score and adjust the field name analysis score based on the ratio. For example, the score merging module 102C may determine a proportion of a logarithm of the ratio as a bias value that can be used to adjust the field name analysis score (e.g., by adding the bias value to the field name analysis score). When the field data analysis score is greater than the field name analysis score, the score merging module 102C may determine the bias value as
where m is a configurable value. For example, the value of m may be a value between 0 and 1 (e.g., 0.25). When the field name analysis score is greater than the field data analysis score, the score merging module 102C may determine the bias value as
For example, the value of m may be a value between 0 and 1 (e.g., 0.25). In the example of
In some embodiments, the score merging module 102C may be configured to penalize the field name analysis score of a field label that was identified as a candidate field label by the field name analysis but not by the field data analysis. For example, the score merging module 102C may reduce the field name analysis score by a particular amount. To illustrate, the score merging module 102C may reduce the field name analysis score by 1-10%, 10-20%, 20-30%, or another suitable percentage. In the example of
Prior to performing process 800, the system accesses a field name of one of the dataset fields to be assigned a field label. In some embodiments, the system may be configured to read the field name from a dataset profile corresponding to the dataset. For example, the system may read the field name from a dataset profile generated for the dataset by the pre-processing module 101 described herein with reference to
Process 800 begins at block 802, where the system determines, using the field name of the dataset field and NLP, a first set of candidate field labels for the dataset field and corresponding scores. The system may be configured to perform a field name analysis to identify the first set of candidate field labels and the corresponding scores. The scores corresponding to the first set of candidate field labels may thus be field name analysis scores. For example, the system may perform the field name analysis using the field name analysis module 102A as described herein with reference to
Next, process 800 proceeds to block 804, where the system determines, using a subset of data from the dataset field (e.g., a subset of field values) and tests associated with field labels of the field label glossary, a second set of candidate field labels and corresponding scores. The system may be configured to perform a field data analysis to identify the second set of candidate field labels and corresponding scores. The scores corresponding to the second set of candidate field labels may thus be field data analysis scores. For example, the system may perform the field data analysis using the field data analysis module 102B as described herein with reference to
In some embodiments, the system may be configured to determine, using the subset of data from the particular field and the tests associated with respective field labels in the field label glossary, the second set of candidate field labels and corresponding scores by applying the tests associated with the respective field labels to the subset of data to obtain test results. For example, the subset of data may be a set of the most commonly occurring field values. The system may be configured to determine the second set of candidate field labels and the corresponding scores using the test results. The system may be configured to access tests associated with respective field labels in the field label glossary. Examples of how the system may access and apply tests to the subset of data (e.g., a set of selected field values) are described herein with reference to
After determining the first and second sets of candidate field labels and their corresponding sets of scores (e.g., field name analysis scores and field data analysis scores), process 800 proceeds to block 806, where the system determines merged candidate field labels and scores using the first and second candidate field labels and scores.
In some cases, the first and second sets of candidate field labels may have one or more candidate field labels in common. For each of such candidate field labels, the system may be configured to merge the two scores associated with the candidate field label. In other words, the system may be configured to merge a field name analysis score associated with the candidate field label with a field data analysis score associated with the candidate field label. In some embodiments, the system may be configured to merge the two scores by adjusting the field name analysis score using the field data analysis score. For example, the system may: (1) determine a ratio between the field name analysis score and the field data analysis score; and (2) adjust the field name analysis score using the ratio. In some embodiments, the system may determine the ratio as the field data analysis score divided by the field name analysis score when the field data analysis score is greater, and the ratio as the field name analysis score divided by the field data analysis score when the field name analysis score is greater.
In some embodiments, the system may determine a bias value using the ratio between the field name analysis score and the field data analysis score and adjust the field name analysis score using the bias value. For example, the system may determine the bias value as a log of the ratio between the field name analysis score and the field data analysis score and add a proportion of the bias value to the field name analysis score. The proportion of the bias value may be a value in the range 0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4, or 0.4-0.5. For example, the proportion of the bias value added to the field name analysis score may be 0.25. The bias may thus be a reward term added to the field name analysis score when the candidate field label is also identified by the field data analysis. Equations (i)-(ii) are an example merging of a field name analysis score (Sn) and a field data analysis score (Sa).
In some cases, the first set of candidate field labels may include a candidate field label that is not included in the second set of candidate field labels. In such cases, the system may be configured to determine a merged score for the candidate field label by reducing the field name analysis score associated with the candidate field label (e.g., as a penalty for not matching a candidate label of the second set of candidate field labels). For example, the system may reduce the field name analysis score by a pre-determined percentage. The pre-determined percentage may be a percentage between 0-5%, 5-10%, 10-15%, 15-20%, 20-25%, 25-30%, 30-35%, 35-40%, or other suitable percentage. For example, the pre-determined percentage may be a 10% reduction of the field name analysis score.
In some cases, the second set of candidate field labels may include a candidate field label that is not included in the first set of candidate field labels. In such cases, the system may be configured to determine a merged score for the candidate field label by reducing the field data analysis score associated with the candidate field label (e.g., as a penalty for not matching a candidate label of the first set of candidate field labels). For example, the system may reduce the field data analysis score by a pre-determined percentage. The pre-determined percentage may be a percentage between 0-5%, 5-10%, 10-15%, 15-20%, 20-25%, 25-30%, 30-35%, 35-40%, or other suitable percentage. For example, the pre-determined percentage may be a 10% reduction of the field data analysis score.
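The merging scheme of blocks 806 can be sketched as follows. Labels found by both analyses get a log-ratio bias added to the field name analysis score; labels found by only one analysis are penalized by a fixed percentage. The sign convention for the bias, the default m = 0.25, and the 10% penalty are assumptions consistent with the examples above.

```python
# Sketch: merge field name analysis scores (Sn) with field data analysis
# scores (Sa). The bias m * log(Sa / Sn) is one plausible reading of the
# log-ratio rule described in the text.
import math

def merge_scores(name_scores, data_scores, m=0.25, penalty=0.10):
    merged = {}
    for label, s_n in name_scores.items():
        if label in data_scores:
            s_d = data_scores[label]
            # Positive when the data analysis agrees more strongly,
            # negative when the name analysis score is higher.
            merged[label] = s_n + m * math.log(s_d / s_n)
        else:
            merged[label] = s_n * (1.0 - penalty)  # name-only label penalized
    for label, s_d in data_scores.items():
        if label not in name_scores:
            merged[label] = s_d * (1.0 - penalty)  # data-only label penalized
    return merged

merged = merge_scores({"phone number": 0.8, "fax": 0.6},
                      {"phone number": 0.9, "zip code": 0.5})
```

Here "phone number" is rewarded because both analyses identified it, while "fax" and "zip code" are each reduced by the 10% penalty.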
After determining the merged candidate field labels and scores at block 806, process 800 proceeds to block 808, where the system assigns one of the merged candidate field labels to the dataset field using the merged scores.
In some embodiments, the system may be configured to automatically select one of the candidate field labels using the merged scores. For example, the system may select the merged candidate field label with the highest score. In some embodiments, the system may be configured to obtain user input selecting one of the merged candidate field labels as the assigned field label. For example, the system may present one or more of the merged candidate field labels in a graphical user interface (GUI) and receive user input indicating a selection of one of the merged candidate field labels. In some embodiments, the system may present an indication of the merged score of each of the merged candidate field labels. For example, the system may rank the candidate labels in the GUI based on their merged scores.
In some embodiments, the system may be configured to assign one of the merged candidate field labels to the dataset field using the merged scores by determining whether any of the merged candidate field labels meet a first threshold merged score. The first threshold merged score may be a value in the range 0.8-0.85, 0.85-0.9, 0.9-0.95, or 0.95-1.0. For example, the first threshold merged score may be 0.95. When one of the merged candidate field labels meets the first threshold merged score, the system may be configured to automatically assign the candidate field label to the dataset field. When multiple of the merged candidate field labels meet the first threshold merged score, the system may obtain user input indicating a selection of one of the candidate field labels as the assigned field label (e.g., through a GUI). When none of the merged candidate field labels meets the first threshold score, the system may request user input selecting one of the merged candidate field labels as the assigned field label.
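The threshold-driven decision above can be sketched as a small function; the 0.95 threshold is the example given in the text, and the return-value shape is an assumption for illustration.

```python
# Sketch: decide whether a merged candidate label can be auto-assigned or
# whether user input is required, per the first-threshold rule.

def assignment_decision(merged_scores, first_threshold=0.95):
    meeting = [label for label, s in merged_scores.items()
               if s >= first_threshold]
    if len(meeting) == 1:
        return ("auto_assign", meeting[0])       # exactly one strong candidate
    if len(meeting) > 1:
        return ("ask_user", meeting)             # several strong candidates
    # No candidate meets the threshold: present all, best first.
    return ("ask_user",
            sorted(merged_scores, key=merged_scores.get, reverse=True))

decision = assignment_decision({"phone number": 0.97, "fax": 0.41})
```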
In some embodiments, the system may be configured to request user input for a given candidate field label when it has a merged score that meets a second threshold merged score that is lower than the first threshold merged score but fails to meet the first threshold score. For example, the system may determine that the candidate label is a near match and request user input confirming the match.
In some embodiments, the system may be configured to store the assigned field label in association with the dataset field in a datastore (e.g., datastore 108) of the system. For example, the system may map a data entity definition associated with the assigned field label to the dataset field (e.g., for storage of metadata about the dataset field). As another example, the system may store a metadata attribute value indicating the assigned field label of the dataset field.
After assigning one of the merged candidate field labels to the dataset field, process 800 proceeds to block 810, where the system determines whether labeling is complete for the dataset field(s) to be labeled. For example, the system may determine whether there are additional dataset field(s) to be labeled. If the system determines that there are additional dataset field(s) to be labeled, then the system may determine that labeling is not complete. In this case, the system may proceed to block 812, where the system selects another dataset field. The system may be configured to access a field name of the dataset field and a subset of data from the dataset field (e.g., a subset of field values). The system then proceeds to block 802 and repeats the steps at blocks 802-808 for another dataset field. If the system determines that there are no additional dataset field(s) to be assigned a field label, then the system may determine that labeling is complete. Thus, process 800 may end.
Prior to performing process 900, the system accesses a field name of the dataset field to be assigned a field label. In some embodiments, the system may be configured to read the field name from a dataset profile corresponding to the dataset (e.g., a dataset profile generated by the pre-processing module 101 described herein with reference to
Process 900 begins at block 902, where the system identifies a set of abbreviations in the field name of the dataset field. For example, the system may be configured to identify the set of abbreviations using the abbreviation recognition module 152A as described herein with reference to
Next, process 900 proceeds to block 904, where the system identifies, for each abbreviation, a set of candidate words that may be indicated by the abbreviation. In some embodiments, the system may be configured to identify a set of candidate words for an abbreviation by determining a measure of similarity between the abbreviation and a pre-determined set of words (e.g., a glossary of words generated by the NLP module 156) to obtain similarity scores for the set of words. Example measures of similarity described herein include cosine similarity, Jaro-Winkler similarity, Euclidean distance, and/or a combination of one or more similarity measures. In some embodiments, the system may be configured to select a subset of the pre-determined set of words as the candidate set of words. For example, the system may determine the candidate set of words for the abbreviation as those with a similarity score that meets a threshold similarity score. The similarity score may, for example, be a value between 0 and 1 and the threshold similarity score may be a value in one of the following ranges: 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9, 0.9-1. For example, the threshold similarity score may be 0.8.
Next, process 900 proceeds to block 906, where the system generates, using the sets of candidate words identified for the identified abbreviations and an n-gram model (e.g., n-gram model 158A), one or more word sequences. For example, the system may generate the word sequence(s) using the candidate sequence generation module 154A as described herein with reference to
In some embodiments, the system may be configured to identify a word in a word sequence as a classword. The classword may indicate a category of data (e.g., number, name, or other category of data). The system may determine a target position of the classword in the word sequence. For example, the system may determine a target position of the classword using a classword model (e.g., classword model 418) as described herein with reference to
Next, process 900 proceeds to block 908, where the system uses the word sequence(s) and field label glossary to identify the candidate field labels and scores. In some embodiments, the system may be configured to access a sequence position model (e.g., sequence position model 158B) that is based on the order of words in each word sequence. The system may be configured to use the sequence position model to determine scores for field labels in the field label glossary (e.g., as described herein with reference to
Next, process 900 proceeds to block 910, where the system assigns one of the candidate field labels to the dataset field using the scores. In some embodiments, the system may be configured to assign a field label in conjunction with candidate field labels and scores obtained from performing field data analysis as described at blocks 806-808 of process 800 described herein with reference to
In some embodiments, the system may be configured to assign a field label using the candidate field labels obtained at block 908 without using candidate field labels and scores obtained from performing field data analysis. In some embodiments, the system may be configured to automatically select one of the candidate field labels using the scores. For example, the system may select the candidate field label with the highest score. In some embodiments, the system may be configured to obtain user input selecting one of the candidate field labels as the assigned field label. For example, the system may present one or more of the candidate field labels in a graphical user interface (GUI) and receive user input indicating a selection of one of the candidate field labels. In some embodiments, the system may present an indication of the score of each of the candidate field labels. For example, the system may rank the candidate labels in the GUI based on their scores.
In some embodiments, the system may be configured to assign one of the candidate field labels to the dataset field using the scores by determining whether any of the candidate field labels meet a first threshold score. The first threshold score may be a value in the range 0.8-0.85, 0.85-0.9, 0.9-0.95, or 0.95-1.0. For example, the first threshold score may be 0.95. When one of the candidate field labels meets the first threshold score, the system may be configured to automatically assign the candidate field label to the dataset field. When multiple of the candidate field labels meet the first threshold score, the system may obtain user input indicating a selection of one of the candidate field labels as the assigned field label (e.g., through a GUI). When none of the candidate field labels meets the first threshold score, the system may request user input selecting one of the candidate field labels as the assigned field label.
In some embodiments, the system may be configured to request user input for a given candidate field label when it has a score that meets a second threshold score that is lower than the first threshold score but fails to meet the first threshold score. For example, the system may determine that the candidate label is a near match and request user input confirming the match.
In some embodiments, the system may be configured to store the assigned field label in association with the dataset field in a datastore (e.g., datastore 108) of the system. For example, the system may map a data entity definition associated with the assigned field label to the dataset field (e.g., for storage of metadata about the dataset field). As another example, the system may store a metadata attribute value indicating the assigned field label of the dataset field.
Prior to performing process 1000, the system accesses a field name of the dataset field to be assigned a field label. The system may be configured to read the field name from the dataset. For example, the system may query a datastore for the field name of the dataset field to be assigned a field label. In some embodiments, field name(s) of the dataset field(s) to be assigned a field label may be pre-loaded into a datastore of the system. The system may read a field name of the dataset field from the datastore of the system.
Process 1000 begins at block 1002, where the system identifies a set of abbreviations in the field name of the dataset field. The system may identify the set of abbreviations as described at block 902 of process 900 described herein with reference to
Next, process 1000 proceeds to block 1004, where the system determines, for each abbreviation, a set of candidate words indicated by the abbreviation and corresponding similarity scores. Block 1004 includes two sub-blocks, 1004A and 1004B.
At sub-block 1004A, the system determines a measure of similarity between the abbreviation and words in a glossary (e.g., generated by the NLP module 156) to obtain similarity scores for the words. In some embodiments, the measure of similarity between the abbreviation and a given word is based on characters in the abbreviation, characters in the word, order of the characters in the abbreviation, and order of the characters in the word. In some embodiments, the measure of similarity is based on a prefix (e.g., the first 1 to 4 letters) of the abbreviation and a prefix of the word. In some embodiments, the measure of similarity is based on a suffix (e.g., the last 1 to 4 letters) of the abbreviation and a suffix of the word. In some embodiments, the measure of similarity may be based on multiple component measures of similarity between the abbreviation and the word. For example, the measure of similarity may be based on a cosine similarity, Jaro-Winkler similarity between the abbreviation and the word, and/or Jaro-Winkler similarity modified to be based on suffix instead of prefix. An example measure of similarity that may be used at sub-block 1004A is described herein.
Next, at sub-block 1004B, the system selects, using the similarity scores, a set of candidate words from the glossary. In some embodiments, the system may be configured to select a subset of the pre-determined set of words as the candidate set of words. For example, the system may determine the candidate set of words for the abbreviation as those with a similarity score that meets a threshold similarity score. The similarity score may, for example, be a value between 0 and 1 and the threshold similarity score may be a value in one of the following ranges: 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9, 0.9-1. For example, the threshold similarity score may be 0.8.
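Sub-blocks 1004A-1004B can be sketched as follows, using the Jaro-Winkler similarity named above. This is a standard textbook implementation, not the disclosure's own; the glossary contents and the 0.8 threshold are the illustrative values from the text.

```python
# Sketch: score glossary words against an abbreviation with Jaro-Winkler
# similarity (1004A) and keep those meeting a threshold (1004B).

def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                     # count matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = transpositions = 0                         # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 +
            (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def candidate_words(abbreviation, glossary, threshold=0.8):
    scored = {w: jaro_winkler(abbreviation.lower(), w.lower())
              for w in glossary}
    return {w: s for w, s in scored.items() if s >= threshold}

candidates = candidate_words("tel", ["telephone", "total", "telemetry"])
```

Because Jaro-Winkler rewards a shared prefix, prefix-style abbreviations like "TEL" score well against their expansions; a suffix-based variant, as mentioned above, would instead compare the reversed strings.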
After determining sets of candidate words and corresponding sets of similarity scores at block 1004, process 1000 proceeds to block 1006. At block 1006 the system determines, using the sets of candidate words and corresponding sets of similarity scores, candidate field labels and scores for the dataset field. The system may be configured to determine the candidate field labels and scores for the dataset field as described at blocks 906-908 of process 900 described herein with reference to
Next, process 1000 proceeds to block 1008, where the system assigns one of the candidate field labels to the dataset field using the scores. The system may be configured to assign one of the candidate field labels to the dataset field using the scores as described at block 910 of process 900 described herein with reference to
Described herein are techniques for generating a new field label that can be assigned to one or more dataset fields. In some embodiments, the techniques involve generating a field label for a particular dataset field by: (1) generating, using a name of the particular dataset field, a word sequence describing data stored in the particular dataset field; and (2) generating the field label for the particular dataset field using the word sequence.
In some cases, none of the field labels in a field label glossary may match a particular dataset field (e.g., because none of the field labels in the field label glossary accurately describe data stored in the particular dataset field). Thus, none of the field labels from the field label glossary can be assigned to the particular dataset field. For example, a data processing system may determine candidate field labels from the field label glossary using techniques described herein with reference to
Accordingly, the inventors have developed techniques for generating a new field label for assignment to one or more dataset fields. In some embodiments, the techniques may be used to generate a field label for a dataset field when none of the field labels in a field label glossary can be assigned to the dataset field. For example, the system may determine that none of the field labels in the field label glossary have a sufficiently high associated score to be assigned to a particular dataset field. The system may determine to generate a new field label for assignment to the dataset field. The new field label may be assigned to one or more other dataset fields. For example, the new field label may be added to the field label glossary for identification as a candidate field label for other dataset field(s).
The techniques generate a new field label for a dataset field using the name of the dataset field (“field name”). The techniques generate a word sequence describing data stored in the field and generate the field label using the word sequence (e.g., by including some or all of the word sequence in the field label). In some embodiments, the techniques may be employed in cases when no field labels from a field label glossary are assigned to a given dataset field (e.g., because none of them indicate metadata about the field with sufficient accuracy). For example, the system may attempt to identify a candidate field label for a field from a field label glossary (e.g., by performing process 800 described with reference to
As shown in
The data recognition module 102 may be configured to process dataset fields to identify candidate field labels for dataset fields from a field label glossary (e.g., field label glossary 104 described herein with reference to
In some embodiments, the data processing system 1100 may be configured to use the field label generation module 1102 to generate one or more field labels for a dataset field (e.g., when no field labels are identified for the field from the field label glossary by data recognition module 102). The field label generation module 1102 may be configured to provide generated field label(s) to the field label assignment module 106 for assignment of one of the generated field label(s) to the dataset field (e.g., by selecting from multiple generated candidate field labels).
In some embodiments, the field label generation module 1102 may be configured to generate candidate field label(s) for a given dataset field. The field label generation module 1102 may be configured to generate the candidate field label(s) using the field name of the dataset field. The field label generation module 1102 may be configured to generate the candidate field label(s) using the field name of the dataset field by: (1) segmenting the field name into multiple segments (e.g., abbreviations); (2) identifying a candidate set of words indicated by each of the segments and corresponding scores to obtain multiple candidate word sets and corresponding sets of scores; and (3) generating the candidate field label(s) using the candidate word sets and corresponding sets of scores. For example, for the field name "CTENUMTEL", the field label generation module 1102 may: (1) segment the field name into the segments "CTE", "NUM", and "TEL"; (2) identify a candidate set of words indicated by each of the segments "CTE", "NUM", and "TEL" and a corresponding set of scores to obtain multiple candidate word sets and corresponding sets of scores; and (3) generate the candidate field label(s) for the dataset field using the multiple candidate word sets.
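The disclosure does not prescribe how a field name is split into segments; a greedy longest-match against a known list of abbreviations is one plausible sketch that reproduces the "CTENUMTEL" example. The abbreviation list here is a made-up placeholder.

```python
# Sketch: segment a field name by greedily matching the longest known
# abbreviation at each position. KNOWN_ABBREVIATIONS is illustrative only.

KNOWN_ABBREVIATIONS = {"CTE", "NUM", "TEL", "ID", "DT"}

def segment_field_name(name, known=KNOWN_ABBREVIATIONS):
    segments, i = [], 0
    longest = max(map(len, known))
    while i < len(name):
        for size in range(min(longest, len(name) - i), 0, -1):
            chunk = name[i:i + size]
            if chunk in known:                 # longest known match wins
                segments.append(chunk)
                i += size
                break
        else:
            segments.append(name[i])           # unknown char: own segment
            i += 1
    return segments

segments = segment_field_name("CTENUMTEL")
```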
In some embodiments, the field label generation module 1102 may be configured to generate candidate field label(s) for a dataset field using NLP. The data recognition module 102 may be configured to use a set of text to generate a language model, and generate the candidate field label(s) using the language model. For example, the field label generation module 1102 may access a corpus of text with vocabulary related to a particular industry and generate a language model using the corpus of text. To illustrate, the field label generation module 1102 may access a corpus of text produced by the Federal Reserve to generate a language model for use in generating field labels for a banking institution. As another example, the field label generation module 1102 may access a corpus of text from a medical encyclopedia to generate a language model for use in generating field labels for a medical institution. In some embodiments, the language model may encode word sequences identified in the set of text and relative positions of words in the word sequences. The field label generation module 1102 may be configured to use the language model to generate a word sequence that describes data of a dataset field, and generate a candidate field label from the identified word sequence.
In some embodiments, the field label generation module 1102 may be configured to use a colocation scoring model to generate candidate field label(s) for a dataset field. The field label generation module 1102 may be configured to use the colocation scoring model to determine score(s) for word collections generated using candidate word sets associated with respective segments of a field name. The field label generation module 1102 may be configured to identify a word collection with which to generate a candidate field label based on the scores determined using a colocation scoring model. In some embodiments, the field label generation module 1102 may be configured to determine a score for each generated candidate field label using the colocation scoring model.
As illustrated in
In some embodiments, the field label generation module 1102 may be configured to use values obtained from the dataset field to generate candidate field label(s) for the field. For example, the field label generation module 1102 may be configured to identify a portion of a candidate field label using field values. In some embodiments, the field label generation module 1102 may be configured to identify attribute values of a candidate field label using field values. For example, the field label generation module 1102 may apply various attribute tests to the field values to determine metadata about the field. The field label generation module 1102 may store the metadata as attribute values (e.g., in a data entity definition and/or instances thereof associated with the candidate field label).
As shown in the example of
As shown in the example of
In the example of
Although in the example of
As shown in the example of
In some embodiments, the label generator 1102A may be configured to obtain candidate word sets (associated with segments (e.g., abbreviations) of a field name) from the field name segmentation module 152 and use the candidate word sets to generate candidate field label(s) for a dataset field. The label attribute identification module 1102B may be configured to determine attributes of the candidate field label(s) generated by the label generator 1102A. As shown in
In some embodiments, the field label assignment module 106 may be configured to assign one of the candidate field label(s) generated for dataset field F114 to the dataset field F114. For example, the field label assignment module 106 may associate a data entity definition with one of the candidate field label(s). In some embodiments, the field label assignment module 106 may be configured to assign one of the candidate field label(s) to the dataset field F114 based on user input. For example, the field label assignment module 106 may present the candidate field label(s) to the user in a graphical user interface, receive input through the GUI indicating a selection of one of the candidate field label(s), and assign the selected candidate field label to the dataset field F114. In some embodiments, the field label assignment module 106 may be configured to automatically select a field label from the candidate field label(s). For example, the candidate field label(s) may have associated score(s) determined by the field label generation module 1102 and the field label assignment module 106 may assign one of the candidate field label(s) to the dataset field F114 using the associated score(s) (e.g., by selecting the candidate field label associated with the highest score). In the example of
As shown in
As shown in
In some embodiments, the language model 1159 may encode word collections of collocated words in a set of text.
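By way of illustration, one possible encoding of such a language model is a mapping from target words to their collocated words, losses, relative positions, and synonyms. The structure, names, and values below are a hypothetical sketch, not the actual language model 1159:

```python
# Hypothetical sketch of a language model that encodes collections of
# collocated words observed in a set of text. Each target word maps to:
#  - "collocated": words seen near the target, each with a loss value
#    (lower loss = more likely to appear after the target word)
#  - "positions": relative positions of collocated words (negative =
#    before the target, positive = after)
#  - "synonyms": words treated as equivalent to the target
language_model = {
    "telephone": {
        "collocated": {"number": 0.0, "cell": 0.2},
        "positions": {"cell": -1, "number": +1},
        "synonyms": ["phone"],
    },
    "customer": {
        "collocated": {"name": 0.1, "number": 0.3},
        "positions": {"name": +1, "number": +1},
        "synonyms": ["client"],
    },
}

def loss_for(target: str, following: str) -> float:
    """Look up the loss for `following` appearing after `target`,
    falling back to searching among synonyms of the target words."""
    entry = language_model.get(target)
    if entry is None:
        # search among synonyms of the target words
        for word, e in language_model.items():
            if target in e["synonyms"]:
                entry = e
                break
    if entry is None:
        return 1.0  # unknown pair: maximum loss
    return entry["collocated"].get(following, 1.0)
```

Downstream components (e.g., a colocation module or a positioning module) could then query such a structure for losses and relative positions.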
In the example of
In some embodiments, the word colocation module 1204A may be configured to identify the word collection C by: (1) generating multiple candidate word collections using the candidate word sets and corresponding scores 1302; and (2) selecting the word collection C from the candidate word collections. In some embodiments, the word colocation module 1204A may be configured to generate the candidate word collections by combining words from the candidate word sets. For example, the word colocation module 1204A may generate each word collection by obtaining a word from candidate word set 1302A, a word from candidate word set 1302B, and a word from candidate word set 1302C.
In some embodiments, the word colocation module 1204A may be configured to select the word collection C from among the candidate word collections by: (1) determining a score for each of the candidate word collections using scores associated with the words in the collection; and (2) selecting the word collection C based on scores corresponding to the candidate word collections. For example, the word colocation module 1204A may be configured to identify the word collection having the greatest associated score as the word collection C from which to generate the field label 1206.
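The combine-and-select behavior described above may be sketched as follows. The candidate word sets, similarity scores, and the simple average-based collection score are illustrative assumptions (the colocation scoring model 1204A-1 may use a different scoring function):

```python
from itertools import product

# Hypothetical candidate word sets (one per abbreviation in a field
# name such as "Cust_Tel_Num"); words and similarity scores are
# illustrative only.
candidate_word_sets = [
    {"customer": 0.9, "custom": 0.4},   # e.g., for abbreviation "Cust"
    {"telephone": 0.8, "teller": 0.3},  # e.g., for abbreviation "Tel"
    {"number": 0.9, "numeral": 0.2},    # e.g., for abbreviation "Num"
]

# (1) Generate candidate word collections: one word from each set.
candidate_collections = list(product(*candidate_word_sets))

def collection_score(collection):
    """Score a collection from the scores of its words (here, a simple
    average, as a stand-in for the colocation scoring model)."""
    scores = [s[w] for w, s in zip(collection, candidate_word_sets)]
    return sum(scores) / len(scores)

# (2) Select the collection with the greatest associated score.
best = max(candidate_collections, key=collection_score)
```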
As shown in
In some embodiments, the field label assignment module 106 may be configured to automatically assign one of the candidate field labels associated with a dataset field based on scores associated with the candidate field labels. For example, when the field label assignment module 106 receives a single candidate field label for a dataset field with a corresponding score that meets a first threshold, the field label assignment module 106 may automatically assign the candidate field label to the dataset field. In some embodiments, the first threshold score may be any suitable value in the range of 0.8 to 1. For example, the first threshold score may be 0.9. In some embodiments, the first threshold score may be a configurable parameter. For example, the first threshold may be configured by the field label assignment module 106 based on user input received through a GUI.
In some embodiments, the field label assignment module 106 may be configured to request user input to assign one of the candidate field labels associated with a dataset field based on scores associated with the candidate field labels. For example, when the field label assignment module 106 receives a merged candidate field label for a dataset field with a corresponding score that meets a second threshold lower than the first threshold, the field label assignment module 106 may request user input confirming the field label. In some embodiments, the second threshold score may be any suitable value in the range 0.6-0.9. For example, the field label assignment module 106 may request user input to assign one of the candidate field labels to the dataset field when the greatest one of the scores associated with the candidate field labels meets a second threshold score of 0.75 but is less than a first threshold score of 0.9. In some embodiments, the second threshold score may be a configurable parameter. For example, the second threshold score may be configured by the field label assignment module 106 based on user input received through a GUI.
In some embodiments, the field label assignment module 106 may be configured to request user input when multiple candidate field labels have associated scores that meet a third threshold score. In some embodiments, the third threshold score may be any suitable value in the range 0.6-1.0. For example, the third threshold score may be 0.75. In this example, when multiple candidate field labels have scores of at least 0.75, the field label assignment module 106 may request user input selecting one of the multiple candidate field labels to assign to the particular field. In some embodiments, the third threshold score may be a configurable parameter. For example, the third threshold score may be configured by the field label assignment module 106 based on user input received through a GUI.
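A minimal sketch of the threshold-based assignment logic described above, using the example threshold values of 0.9 and 0.75 from the text (the function name and returned action labels are hypothetical):

```python
FIRST_THRESHOLD = 0.9    # auto-assign a single candidate at/above this
SECOND_THRESHOLD = 0.75  # request user confirmation at/above this
THIRD_THRESHOLD = 0.75   # request user selection when several meet this

def assignment_action(scored_labels):
    """Decide how to assign a field label from {label: score} candidates."""
    meeting_third = [l for l, s in scored_labels.items()
                     if s >= THIRD_THRESHOLD]
    best_label = max(scored_labels, key=scored_labels.get)
    best_score = scored_labels[best_label]
    if len(meeting_third) > 1:
        # multiple strong candidates: request user selection of one
        return ("request_selection", meeting_third)
    if best_score >= FIRST_THRESHOLD:
        return ("auto_assign", best_label)
    if best_score >= SECOND_THRESHOLD:
        # single mid-confidence candidate: request user confirmation
        return ("request_confirmation", best_label)
    return ("no_assignment", None)
```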
As illustrated in
In some embodiments, the field label assignment module 106 may be configured to determine feedback to send to the label generator 1102A using field label assignments made by the field label assignment module 106. In some embodiments, the field label assignment module 106 may be configured to receive user input indicating a selection of a field label, from among one or more field labels generated by the label generator 1102A, to assign to a dataset field. The field label assignment module 106 may be configured to transmit an indication of the selected field label to the label generator 1102A. The label generator 1102A may be configured to use the indication of the selected field label to update scores associated with candidate words, a colocation scoring model 1204A-1, and/or the language model 1159. Example techniques for performing updates to the label generator 1102A are described herein. In some embodiments, the field label assignment module 106 may be configured to automatically assign one of one or more field labels generated by the label generator 1102A to a dataset field (e.g., using the score(s) associated with the field label(s)). The field label assignment module 106 may be configured to transmit an indication of the assigned field label to the label generator 1102A for learning (e.g., by updating parameters of the language model 1159, the colocation scoring model 1204A-1, and/or scores associated with candidate words).
In some embodiments, the label generator 1102A may be configured to learn using feedback provided by the field label assignment module 106. The label generator 1102A may be configured to update parameters used to score word collections by the label generator 1102A (e.g., to select a word collection from which to generate a field label). In some embodiments, the label generator 1102A may be configured to use feedback provided by the field label assignment module 106 by updating similarity scores associated with words in the candidate word sets and scores 1202. For example, the similarity scores may be used to determine an output of the word colocation scoring model 1204A-1 and the label generator 1102A may update the similarity scores such that the updated similarity scores are used subsequently by the model 1204A-1. As another example, the label generator 1102A may update weights of the word colocation scoring model 1204A-1 used to generate an output score for a word collection. As another example, the label generator 1102A may update a language model used for generating a field label based on the feedback.
In some embodiments, the label generator 1102A may be configured to update parameters using the feedback received from the field label assignment module 106 using any suitable method. For example, in some embodiments, the label generator 1102A may use stochastic gradient descent to update the parameters. The label generator 1102A may be configured to update the parameters by: (1) determining a field label assigned to a field; and (2) updating the parameters based on the assigned field label. For example, the assigned field label may be a modified version of the candidate field label 1206 (e.g., specified by a user). The label generator 1102A may update parameters based on the modified version of the candidate field label 1206. As another example, the assigned field label may be a field label selected from multiple candidate field labels. The label generator 1102A may update parameters based on a selection of one of the candidate field labels. Example updates that may be made by the label generator 1102A are described herein with reference to
As illustrated in
As illustrated in
Equation (iii) below is an example equation that may be used to determine an output of the word colocation scoring model 1204A-1 for a given word collection.

Hi = C × (Wi / i) × (Si + H1 + H2 + . . . + Hi−1 − Li+1), for i = 1, 2, . . . , N  (iii)

In Equation (iii), Hi is the output after the ith word of the collection, and the output of the model for the word collection is HN.
In Equation (iii) above, N is the number of words in the collection and Si is the similarity score associated with the ith word in the collection. The similarity score associated with a given word may be a similarity score between the word and its respective abbreviation determined by the field name segmentation module 152. Li+1 is the loss indicated by the language model 1159 for the word subsequent to the ith word. C is a value indicating whether the words in the collection co-occur in any word sequence indicated by the language model 1159. C may be equal to 1 if the words in the collection co-occur in any word sequence and 0 if they do not. Wi may be a weight parameter. An example determination of an output of the colocation scoring model 1204A-1 using Equation (iii) is described herein with reference to
In some embodiments, the word colocation module 1204A may be configured to identify a loss value for a word subsequent to a given word in the collection by: (1) identifying the given word among the target words indicated by the language model 1159; and (2) identifying the loss value for the subsequent word among the loss values associated with the target word. For example, for target word 1 in a word collection, the word colocation module 1204A may identify a loss associated with word 11 subsequent to target word 1 in the word collection. In some embodiments, the word colocation module 1204A may be configured to search among synonyms of target words when a given word is not among the target words indicated by the language model 1159. The word colocation module 1204A may identify a loss for a subsequent word in the collection in losses associated with a target word of which the given word is a synonym. For example, the word colocation module 1204A may determine that synonym 11 in a word collection is associated with target word 1 and determine a loss for word 11 subsequent to synonym 11 in the word collection as loss 11 indicated by the language model 1159.
In some embodiments, the word colocation module 1204A may be configured to use Equation (iii) above to determine the output of the colocation scoring model 1204A-1 for each word collection generated using the candidate word sets, and to use the output as the score associated with the word collection. The scores may be used to select one of the word sequence(s) generated from the word collection(s) to use in generating a field label (e.g., by selecting the word sequence with the greatest associated score).
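A sketch of the colocation scoring computation, assuming the recursive form suggested by the worked "client number telephone" example described herein. The co-occurrence value C is assumed to be 1, and the similarity scores below are illustrative:

```python
def colocation_score(words, similarity, losses, weights=None):
    """Score a word collection per Equation (iii): for the ith word,
    Hi = (Wi / i) * (Si + H1 + ... + H(i-1) - L(i+1)), and the score
    for the collection is HN. `losses[i]` is the language-model loss
    for the word following words[i] (0 for the last word). The
    co-occurrence indicator C is assumed to be 1 here."""
    n = len(words)
    weights = weights if weights is not None else [1.0] * n
    outputs = []  # H1 .. HN
    for i in range(n):
        s_i = similarity[words[i]]
        h = (weights[i] / (i + 1)) * (s_i + sum(outputs) - losses[i])
        outputs.append(h)
    return outputs[-1]

# Illustrative similarity scores for the collection {client, number,
# telephone}; losses of 0 mean each subsequent word co-occurs with its
# predecessor according to the language model.
score = colocation_score(
    ["client", "number", "telephone"],
    {"client": 0.81, "number": 0.81, "telephone": 0.81},
    losses=[0.0, 0.0, 0.0],
)
```

With these illustrative values, the collection scores 0.81.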
Although in the example of
In some embodiments, the word positioning module 1204B may be configured to arrange words in word collection(s) using the language model 1159. The word positioning module 1204B may be configured to identify a set of collocated words indicated by the language model 1159 that includes a given word collection and determine positions of the words in the collection based on the relative positions of the words indicated by the language model 1159. For example, for the word collection C, the word positioning module 1204B may: (1) determine that the words in the word collection C are words 21, 22, 23 included in the set of collocated words associated with target word 2 indicated by the language model 1159; (2) determine relative positions of the words indicated by the language model 1159; and (3) arrange the words in the word collection C to obtain the word sequence C′.
In some cases, the word positioning module 1204B may not identify any word in the word collection C among target words indicated by the language model 1159. In such cases, the word positioning module 1204B may be configured to search for words of the collection among synonyms of the target words indicated by the language model 1159. When a word is identified as a synonym of a particular target word, the word positioning module 1204B may be configured to use the relative position information mapped to the particular target word to arrange the words of the word collection C (e.g., by arranging the words in the word collection C relative to the synonym according to the positions of the words relative to the particular target word).
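The position-based arrangement with synonym fallback described above may be sketched as follows. The relative-position table and synonym map are hypothetical stand-ins for the information encoded by the language model 1159:

```python
# Hypothetical relative positions of collocated words with respect to
# target words indicated by a language model (negative = earlier in the
# sequence, 0 = the target word itself).
relative_positions = {
    "number": {"customer": -2, "telephone": -1, "number": 0},
    "name": {"first": -1, "last": -1, "name": 0},
}
synonyms = {"client": "customer", "phone": "telephone"}

def arrange(collection, target):
    """Order the words of `collection` by their positions relative to
    `target`, resolving words through synonyms when a word is not
    itself indicated by the language model."""
    positions = relative_positions[target]

    def pos(word):
        canonical = synonyms.get(word, word)
        return positions[canonical]

    return sorted(collection, key=pos)

sequence = arrange(["number", "client", "telephone"], target="number")
```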
In some embodiments, the word positioning module 1204B may be configured to select one of multiple word sequences (obtained from arranging respective word collections) to provide to the label promoter 1204C for generation of the field label 1206. The word positioning module 1204B may be configured to select the word sequence having the highest associated score (i.e., the score associated with the word collection from which the word sequence was generated). In some embodiments, one of the word sequences may be selected using a contextual scoring model that takes into account contextual information about neighboring fields and the name of the dataset to adjust scores associated with the word sequences. Context-adjusted scores output by the contextual scoring model may be used to select a word sequence to provide to the label promoter 1204C. An example of such a contextual scoring model and use thereof is described herein with reference to
In some embodiments, the label promoter 1204C may be configured to generate the field label 1206 from the word sequence C′. For example, the label promoter 1204C may output the word sequence C′ as the field label 1206. In some embodiments, the label promoter 1204C may be configured to identify a portion of the word sequence C′ as the field label 1206. The label promoter 1204C may be configured to categorize each word in the sequence C′ and determine the field label 1206 based on the word categorizations. For example, the label promoter 1204C may: (1) categorize each word in the sequence C′ as a classword, prime word, modifier, or entity; and (2) determine the field label to be the prime word and the classword. To illustrate, in the word sequence "client cell telephone number", "client" may be categorized as an entity, "cell" as a modifier, "telephone" as a prime word, and "number" as a classword. The label promoter 1204C may output "telephone number" as the field label 1206 generated from the sequence C′. In some embodiments, the categorization of words in the sequence C′ may be based on language conventions. The label promoter 1204C may be configured to use different categorizations based on which language the label promoter 1204C is generating a label for.
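A minimal sketch of the categorize-and-promote step, using the "client cell telephone number" example; the category lookup table is a hypothetical stand-in for the categorization logic of the label promoter 1204C:

```python
# Hypothetical word categorizations (a real implementation may derive
# these from language conventions rather than a fixed table).
word_categories = {
    "client": "entity",
    "cell": "modifier",
    "telephone": "prime",
    "number": "classword",
}

def promote(word_sequence):
    """Keep only the prime word and the classword as the field label."""
    kept = [w for w in word_sequence
            if word_categories.get(w) in ("prime", "classword")]
    return " ".join(kept)

label = promote(["client", "cell", "telephone", "number"])
```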
As shown in
As shown in
As shown in
Next, the output for "client number" is determined. The output for "client number" is the product of (1) the weight W2, which is equal to 1; and (2) the sum of the output H1 for "client", the similarity score S2 associated with "number", and the negative loss associated with "telephone". The loss indicated by the language model 1404 for the word "telephone" collocated with the target word "number" is 0. The output H2 for "client number" is thus the sum of S2 and H1 divided by 2.
Next, the output for "client number telephone" is determined. There is no word in the collection after "telephone", thus the loss term L3 is 0. The output is thus the product of (1) the weight W3, which is equal to 1; and (2) the sum of the similarity score S3 associated with "telephone" and the outputs H1 and H2, divided by the total number of words in the collection (i.e., 3). Accordingly, the score for the word collection "client number telephone" is 0.81. The word collections and their corresponding scores are provided to the word positioning module 1204B to be arranged into word sequences.
In some embodiments, the word positioning module 1204B may be configured to generate word sequences from multiple word collections provided by the word colocation module 1204A and select one of the word sequences to provide to the label promoter 1204C. The word positioning module 1204B may select a word sequence based on the scores provided by the word colocation module 1204A. As shown in
In the example of
The word category identification and field label generation illustrated by
In some embodiments, the label promoter 1204C may be configured to determine the classword using the field values 1512 by applying tests 1516A, 1516B, 1516C to the field values 1512. The tests 1516A, 1516B, 1516C are associated with respective candidate classwords 1514. The label promoter 1204C may be configured to apply a test to the field values 1512 by: (1) accessing information associated with the test (e.g., a regular expression, a set of reference values, a distribution of values, support of a distribution of values, and/or one or more rules); and (2) applying the test to the field values 1512 using the information to obtain a score for the test. The label promoter 1204C may be configured to select a classword associated with the test that yielded the highest score as a classword for the field label. In some embodiments, the label promoter 1204C may be configured to select one of the candidate classwords 1514 as the classword for the field label if at least one of the scores 1518A, 1518B, 1518C meets a threshold score. For example, the label promoter 1204C may determine to use the classword associated with the test to generate a field label when the score meets or exceeds a threshold value (e.g., a value in the range 0.7-1.0).
In some embodiments, the label promoter 1204C may be configured to apply a test to the field values 1512 by: (1) accessing a regular expression (e.g., a regular expression indicating a pattern of mm/dd/yyyy for a “date” field label) specified by the test; and (2) determining a score using the regular expression (e.g., by determining a number of the field values 1512 that match the regular expression). In some embodiments, the label promoter 1204C may be configured to determine the score using the regular expression by: (1) determining a percentage of the field values 1512 that match the regular expression; and (2) determining the score using the percentage of the field values 1512 that match the regular expression. For example, the label promoter 1204C may determine a percentage of the field values 1512 that match a regular expression indicating an expected pattern for a social security number and determine a score using the determined percentage.
In some embodiments, the label promoter 1204C may be configured to apply a test to the field values 1512 by: (1) accessing a set of reference values (e.g., the integer values 1-12 for a "month" field label) specified by the test; and (2) determining a score using the set of reference values and the field values 1512 (e.g., by determining a number of the field values 1512 that are included in the set of reference values). In some embodiments, the label promoter 1204C may be configured to determine the score by: (1) determining whether each of the field values 1512 is in the set of reference values; and (2) determining the score based on the number of the field values 1512 that are included in the set of reference values. In some embodiments, the set of reference values may be an enumerated set of values. For example, the dataset field may store an indication of a state in the United States of America (USA) and the set of reference values may be a list of the 50 codes for states in the USA. The label promoter 1204C may determine a percentage of the field values 1512 that are in the list of 50 codes and determine a score for the field label using the determined percentage. If the label promoter 1204C determines that the score meets or exceeds a threshold, the label promoter 1204C may determine to use a classword "code" associated with the test to generate a field label (e.g., by making the classword the last word in the field label).
In some embodiments, the label promoter 1204C may be configured to apply a test to the field values 1512 by: (1) accessing a distribution of values specified by the test (e.g., a distribution defined on the values 1-31 that is associated with a field label "day of the month"); and (2) comparing the distribution specified by the test to a distribution defined on the field values 1512. The label promoter 1204C may be configured to determine the score based on a result of the comparison. For example, the label promoter 1204C may compare the distribution specified by the test to the distribution defined on the field values 1512 using a chi-squared test, and determine the score associated with the field label using the result of the chi-squared test.
In some embodiments, the label promoter 1204C may be configured to apply a test to the field values 1512 by: (1) accessing a set of values specified by the test; and (2) comparing a support of the set of values to a support of the field values 1512. The label promoter 1204C may be configured to determine the score based on the comparison. For example, the label promoter 1204C may determine a ratio of the support of the field values 1512 to the support of the distribution specified by the test and determine the score using the ratio. To illustrate, the support for a distribution specified by a test for “gender” may be male, female, other, and unknown while the support for the field values 1512 may be male and other. The label promoter 1204C may determine a support ratio of 0.5 and determine the score for the field label using the support ratio.
In some embodiments, the label promoter 1204C may be configured to apply a test to the field values 1512 by: (1) accessing one or more rules specified by the test; and (2) determining a percentage of the field values 1512 that meet the rule(s) (e.g., by determining a percentage of the field values 1512 for which logical statement(s) indicating the rule(s) are true). The label promoter 1204C may be configured to determine the score for the test based on the percentage of the field values 1512 that meet the rule(s). For example, a rule specified by a test for "age" may require values to be greater than 0 and less than 200. The label promoter 1204C may determine a percentage of the field values 1512 that are within the range required by the rule and determine the score for the test based on the percentage of the field values 1512 that are within the range.
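The regular-expression, reference-set, and rule tests described above may be sketched as follows; the candidate classwords, patterns, and sample field values are illustrative assumptions:

```python
import re

# Illustrative field values sampled from a dataset field.
field_values = ["01/15/1990", "07/04/2001", "12/25/1985", "not-a-date"]

def regex_test(values, pattern):
    """Score = fraction of values matching the regular expression."""
    return sum(bool(re.fullmatch(pattern, v)) for v in values) / len(values)

def reference_set_test(values, reference):
    """Score = fraction of values found in the set of reference values."""
    return sum(v in reference for v in values) / len(values)

def rule_test(values, rule):
    """Score = fraction of values for which the rule holds."""
    return sum(bool(rule(v)) for v in values) / len(values)

# Apply one test per candidate classword and keep the best scorer.
scores = {
    "date": regex_test(field_values, r"\d{2}/\d{2}/\d{4}"),
    "month": reference_set_test(field_values, {str(m) for m in range(1, 13)}),
    "age": rule_test(field_values, lambda v: v.isdigit() and 0 < int(v) < 200),
}
best_classword = max(scores, key=scores.get)
```

A threshold check (e.g., requiring the best score to meet 0.7) could then gate whether the winning classword is used in the field label.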
As shown in
As shown in
As shown in
A word sequence for the fields of the dataset may be selected (e.g., to provide to the label promoter 1204C of the label generator 1102A) based on scores output by the output layer 1606. Accordingly, the contextual scoring model 1600 may be a dataset context-based filter for candidate word sequences identified for fields of the dataset.
The hidden layer 1604 may be configured to determine values indicating whether different combinations of word sequences are collocated according to the language model 1159. In the example of
A word sequence may be selected for each of the fields of the “Cust_Data” dataset using the scores output by the output layer 1606. For example, the word sequence having the highest score may be selected (e.g., to provide to the label promoter 1204C for generation of a field label). Thus, the selected word sequence for the field “Fst_Nm” may be {first name}, the selected word sequence for the field “Lst_Nm” may be {last name}, the selected word sequence for the field “Tel_Num” may be {telephone number}, and the selected word sequence for the dataset “Cust_Data” may be {customer data}.
Examples of attributes that may be determined from application of the tests 1702A, 1702B, 1702C to the field values 1704 include a requirement attribute indicating whether a value is required for the field, a uniqueness attribute indicating whether each value of the field is unique relative to other values of the field, a code attribute indicating a code set (e.g., U.S. state codes) from which values of the field are selected, and/or other attributes. In some embodiments, the attribute values may be metadata about a dataset field.
In some embodiments, the label attribute identification module 1102B may be configured to use the attributes 1706A, 1706B, 1706C to populate a data entity definition and/or instance thereof associated with a field label. For example, the label attribute identification module 1102B may use the attributes 1706A, 1706B, 1706C to determine attributes of a data entity definition and/or values of the attributes in instances of the data entity definition. Assigning the field label to a dataset field may automatically associate the data entity definition and/or instance thereof with the dataset field.
Process 1800 begins at block 1802 where the system performing process 1800 determines whether any field label from a field label glossary (e.g., field label glossary 104) matches the field. An example process for determining whether any field label from the field label glossary matches the field is described herein with reference to
If at block 1802 the system determines that none of the field labels in the field label glossary match the field, then process 1800 proceeds to block 1806, where the system generates one or more new candidate field labels for the field. As shown in
At block 1806A, the system generates, using candidate word sets and corresponding sets of scores (e.g., candidate word sets and scores 1402 described herein with reference to
Next, at block 1806B, the system generates, using the word sequence(s), field label(s) that are not in the field label glossary. In some embodiments, the system may be configured to make an entire word sequence a new field label. In some embodiments, the system may be configured to select a portion of a word sequence as a field label. For example, the system may select a prime word and a classword of the word sequence as the field label (e.g., as described herein with reference to
In some embodiments, the system may be configured to generate a portion of a field label using field values from the field. The system may be configured to: (1) apply tests associated with candidate words for the portion of the new field label to the field values to obtain test scores; and (2) select one of the candidate words for the portion of the new field label using the test scores. For example, as described herein with reference to
After generating the new candidate field label(s) at block 1806, process 1800 proceeds to block 1808, where the system assigns a generated field label to the field. In some embodiments, the system may be configured to assign the field label to the field based on user input. The system may be configured to: (1) present the generated candidate field label(s) to a user in a GUI; (2) receive input indicating a selection of one of the generated candidate field label(s); and (3) assign the selected generated candidate field label to the field. In some embodiments, the system may be configured to automatically assign one of the generated field label(s) to the field. For example, the system may select a generated field label based on scores associated with word collections from which the field label(s) were generated. The system may assign the generated field label with the highest associated score to the field.
In some embodiments, process 1800 may be performed for multiple fields in a dataset. In some embodiments, process 1800 may be repeated on fields of a dataset to update labeling. Labeling results may be different in different iterations due to learning of the model based on assignment feedback and/or changes in field values.
Process 1820 begins at block 1822, where the system performing process 1820 identifies a set of abbreviations in a name of a field. An example process for identifying a set of abbreviations in the name of the field is described herein with reference to block 902 of
Next, process 1820 proceeds to block 1824, where the system identifies, for each abbreviation, a candidate word set and corresponding set of scores. An example process for determining, for each abbreviation, the candidate word set and the corresponding set of scores is described herein with reference to block 1004 of
Next, process 1820 proceeds to block 1826, where the system uses the candidate word sets and the sets of scores to determine whether any field label from a field label glossary matches the field. In some embodiments, the system may be configured to determine whether any field label from the field label glossary matches the field by: (1) determining scores for the field labels in the field label glossary; and (2) determining whether any of the field labels match the field using the scores. An example process for determining scores for the field labels is described herein with reference to
In some embodiments, the system may be configured to determine whether any of the field labels from the field label glossary match the field using the corresponding scores by: (1) determining whether any of the scores meet or exceed a threshold score; and (2) determining that none of the field labels match the field when none of the field labels have corresponding scores that meet or exceed the threshold score. In some embodiments, the threshold score may be a value in the range 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9, or 0.9-1.0. For example, the threshold score may be 0.7. In some embodiments, the threshold score may be a configurable parameter that is adjustable based on user input. The system may be configured to adjust the threshold score based on the user input. For example, the system may receive, through a GUI, a user input indicating a threshold score. When the system determines that none of the field labels from the field label glossary match the field, the system may proceed to generate a new field label to assign to the field (e.g., as described herein at blocks 1806-1808 with reference to
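The glossary-matching decision may be sketched as follows, using the example threshold score of 0.7; the function name and glossary scores are hypothetical:

```python
def match_glossary(label_scores, threshold=0.7):
    """Return the best-matching glossary field label, or None when no
    label's score meets or exceeds the threshold (in which case a new
    field label would be generated for the field)."""
    if not label_scores:
        return None
    best = max(label_scores, key=label_scores.get)
    return best if label_scores[best] >= threshold else None

match = match_glossary({"telephone number": 0.85, "account number": 0.4})
no_match = match_glossary({"telephone number": 0.55})
```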
Process 1830 begins at block 1832, where the system accesses sets of candidate words indicated by abbreviations in the name of the field and corresponding sets of scores. An example technique for identifying a set of abbreviations in the name of the field is described herein with reference to block 902 of
Next, process 1830 proceeds to block 1834, where the system accesses a language model (e.g., language model 1159 described herein with reference to
Next, process 1830 proceeds to block 1836, where the system generates, using the sets of candidate words, candidate word collections for the field. In some embodiments, the system may be configured to generate the candidate word collections by combining words from the sets of candidate words to obtain the candidate word collections. The system may be configured to: (1) select a word from each of the sets of candidate words; and (2) combine the words selected from the sets of candidate words to obtain a candidate word collection. In some embodiments, the system may be configured to generate all combinations of words from the candidate word sets as the candidate word collections for the field. An example of generating candidate word collections is illustrated in
Next, process 1830 proceeds to block 1838, where the system determines, using the language model and sets of scores corresponding to the sets of candidate words, a score for each of the candidate word collections. In some embodiments, the system may be configured to use a colocation scoring model (e.g., colocation scoring model 1204A-1 and/or contextual scoring model 1600 described herein with reference to
Next, process 1830 proceeds to block 1840, where the system selects one or more word collections from the candidate word collections based on the scores. In some embodiments, the system may be configured to select a word collection with the highest associated score to use in generation of a word sequence. In some embodiments, the system may be configured to select one or more word collections that meet or exceed a threshold word collection score to use in generating the word sequence(s). The threshold word collection score may be a value in the range 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9, or 0.9-1.0. For example, the system may select word collection(s) from the candidate word collections that have an associated score of at least 0.7. In some embodiments, the system may be configured to select a particular number (e.g., 1, 2, 3, 4, 5, 6, or another number) of the highest scoring candidate word collections to use in generating the word sequence(s).
Next, process 1830 proceeds to block 1842, where the system generates a respective word sequence using each of the selected word collection(s). In some embodiments, the system may be configured to generate a word sequence using a given selected word collection by arranging words in the word collection into a particular order. The system may be configured to determine the particular order using a language model (e.g., language model 1159). The system may be configured to determine the particular order of the words by: (1) identifying collections of collocated words indicated by the language model in which words of the word collection occur; (2) determining relative positions of the words in the identified collections of collocated words indicated by the language model; and (3) arranging the words in the word collection according to the relative positions. The relative positions of the words indicated by the language model may be determined by the order of the words in a set of text from which the language model was generated. Example position information that may be indicated by the language model is described herein with reference to
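The arranging step at block 1842 reduces to sorting a word collection by the relative positions the language model indicates. The position table below is a hypothetical stand-in for position information derived from the text the language model was generated from:

```python
# Hypothetical relative positions indicated by a language model, derived
# from the order of words in the text the model was generated from.
RELATIVE_POSITION = {"customer": 0, "account": 1, "balance": 2}

def arrange_words(word_collection, positions=RELATIVE_POSITION):
    """Arrange the words of a collection according to the relative
    positions the language model indicates; unknown words sort last."""
    return sorted(word_collection, key=lambda w: positions.get(w, float("inf")))
```

Applied to the unordered collection `["balance", "customer", "account"]`, this produces the word sequence `["customer", "account", "balance"]`.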
Described herein is an example measure of similarity that may be used in some embodiments. In some embodiments, the measure of similarity may be used to determine a similarity score between two strings. For example, the measure of similarity may be used to determine a similarity score between an abbreviation and candidate words that may be indicated by the abbreviation. In some embodiments, the measure of similarity may be used by the data processing system 100 (e.g., by field name segmentation module 152 of data recognition module 102) to determine a set of candidate words that may be indicated by a field name portion (e.g., an abbreviation) and a corresponding set of similarity scores.
In some embodiments, the measure of similarity may be based on the characters of the two strings being compared and the order of the characters in the two strings. The measure of similarity may comprise multiple component measures of similarity. The measure of similarity may be determined as a combination of the component measures of similarity. For example, the measure of similarity may be a weighted sum of the component measures of similarity. Example component measures of similarity include cosine similarity, Otsuka-Ochiai coefficient, Jaro-Winkler similarity, and Jaro-Winkler similarity modified to scale based on a common suffix instead of a common prefix. In some embodiments, a measure of similarity may be determined between two strings using numerical representations of the strings. For example, each string may be represented as a numerical vector. A numerical vector representation of a string may represent characters in a string and their order.
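The weighted-sum combination described above can be sketched as follows. The character-bigram cosine similarity shown is one plausible component (a character n-gram vector captures both the characters and, partially, their order); the weights and function names are assumptions, not the disclosed equations:

```python
from collections import Counter
from math import sqrt

def char_cosine(a, b, n=2):
    """Cosine similarity between character n-gram count vectors of two
    strings (a simple vector representation capturing character order)."""
    va = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    vb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def weighted_similarity(a, b, components, weights):
    """Combine component similarity measures as a weighted sum."""
    return sum(w * f(a, b) for f, w in zip(components, weights))
```

With `components=[char_cosine]` and `weights=[1.0]`, identical strings score 1.0 and strings sharing no bigrams score 0.0; an embodiment would add further components such as the Jaro-Winkler measures.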
For two strings a and b (e.g., where a is a field name portion (e.g., an abbreviation) and b is a candidate word represented by the field name portion), let a1 be the string a with its vowels removed and let b1 be the string b with its vowels removed. Let a2 be the stemmed version of string a and let b2 be the stemmed version of string b. cosine(a, b) denotes the cosine similarity between the strings a and b. Each of equations (iv)-(vii) below is a component measure of similarity that is used in determining the measure of similarity.
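The derived strings a1/b1 (vowels removed) and a2/b2 (stemmed) can be produced as below. The vowel stripper follows the definition above; the stemmer shown is a crude suffix-stripping stand-in (a production system would likely use an established stemmer such as a Porter stemmer):

```python
import re

def strip_vowels(s):
    """a1 / b1: the string with its vowels removed."""
    return re.sub(r"[aeiou]", "", s.lower())

def naive_stem(s):
    """a2 / b2: a crude suffix-stripping stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if s.endswith(suffix) and len(s) > len(suffix) + 2:
            return s[: -len(suffix)]
    return s
```

For example, `strip_vowels("account")` yields `"ccnt"` and `naive_stem("accounting")` yields `"account"`, the forms over which the component cosine similarities would then be computed.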
The modified Jaro-Winkler similarity Js(a, b) may be defined by equation (viii) below.
The measure of similarity between two strings a and b is given by equation (ix) below. The measure of similarity is a function of multiple different component similarities. The measure of similarity Sraw is determined using a cosine similarity between the two strings, a cosine similarity between vowelless versions of the two strings, a cosine similarity between stemmed versions of the two strings, and differences in order of characters between the two strings, the vowelless versions of the two strings, and the stemmed versions of the two strings. The similarity given by equation (ix) is further based on a degree to which prefixes of the two strings match because it incorporates the Jaro-Winkler similarity measure J(a, b) and on a degree to which suffixes of the two strings match because it incorporates the modified Jaro-Winkler similarity measure Js(a, b).
The technology described herein is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the technology described herein include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The computing environment may execute computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The technology described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 1910 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 1910 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by computer 1910. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 1930 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 1931 and random access memory (RAM) 1932. A basic input/output system 1933 (BIOS), containing the basic routines that help to transfer information between elements within computer 1910, such as during start-up, is typically stored in ROM 1931. RAM 1932 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 1920. By way of example, and not limitation,
The computer 1910 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media described above and illustrated in
The computer 1910 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 1980. The remote computer 1980 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 1910, although only a memory storage device 1981 has been illustrated in
When used in a LAN networking environment, the computer 1910 is connected to the LAN 1971 through a network interface or adapter 1970. When used in a WAN networking environment, the computer 1910 typically includes a modem 1982 or other means for establishing communications over the WAN 1983, such as the Internet. The modem 1982, which may be internal or external, may be connected to the system bus 1921 via the actor input interface 1960, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 1910, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Having thus described several aspects of at least one embodiment of the technology described herein, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art.
Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the spirit and scope of the disclosure. Further, though advantages of the technology described herein are indicated, it should be appreciated that not every embodiment of the technology described herein will include every described advantage. Some embodiments may not implement any features described as advantageous herein, and in some instances one or more of the described features may be implemented to achieve further embodiments. Accordingly, the foregoing description and drawings are by way of example only.
The above-described embodiments of the technology described herein can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component, including commercially available integrated circuit components known in the art by names such as CPU chips, GPU chips, microprocessor, microcontroller, or co-processor. Alternatively, a processor may be implemented in custom circuitry, such as an ASIC, or semicustom circuitry resulting from configuring a programmable logic device. As yet a further alternative, a processor may be a portion of a larger circuit or semiconductor device, whether commercially available, semi-custom or custom. As a specific example, some commercially available microprocessors have multiple cores such that one or a subset of those cores may constitute a processor. However, a processor may be implemented using circuitry in any suitable format.
Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.
Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format.
Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
In this respect, aspects of the technology described herein may be embodied as a computer readable storage medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs (CD), optical discs, digital video disks (DVD), magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments described above. As is apparent from the foregoing examples, a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form. Such a computer readable storage medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the technology as described above. As used herein, the term “computer-readable storage medium” encompasses only a non-transitory computer-readable medium that can be considered to be a manufacture (i.e., article of manufacture) or a machine. Alternatively or additionally, aspects of the technology described herein may be embodied as a computer readable medium other than a computer-readable storage medium, such as a propagating signal.
The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the technology as described above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the technology described herein need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the technology described herein.
Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the dataset fields with locations in a computer-readable medium that conveys relationship between the dataset fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
Various aspects of the technology described herein may be used alone, in combination, or in a variety of arrangements not specifically described in the embodiments described in the foregoing, and the technology is therefore not limited in its application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.
Also, the technology described herein may be embodied as a method, of which examples are provided herein including with reference to
Further, some actions are described as taken by an “actor” or a “user”. It should be appreciated that an “actor” or a “user” need not be a single individual, and that in some embodiments, actions attributable to an “actor” or a “user” may be performed by a team of individuals and/or an individual in combination with computer-assisted tools or other mechanisms.
Use of ordinal terms such as "first," "second," "third," etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term).
Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
In some embodiments, the techniques described herein relate to a method for processing a dataset including data stored in fields to determine field labels for a set of the dataset's fields, the method including: using at least one computer hardware processor to perform: for each particular field in one or more fields in the set of the dataset's fields: determining whether any field label in a field label glossary matches the particular field; when it is determined that a field label in the field label glossary matches the particular field, assigning the field label to the particular field; and when it is determined that no field label in the field label glossary matches the particular field: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
In some embodiments, the techniques described herein relate to a method, wherein generating the word sequence describing data stored in the particular field includes: determining, using the plurality of sets of candidate words and the plurality of sets of scores, a plurality of candidate word collections and corresponding plurality of word collection scores; generating a plurality of candidate word sequences using the plurality of candidate word collections; and selecting the word sequence from the plurality of candidate word sequences using the plurality of word collection scores.
In some embodiments, the techniques described herein relate to a method, wherein determining the plurality of candidate word collections and the corresponding plurality of word collection scores includes: combining words from each of the plurality of sets of candidate words to obtain the plurality of candidate word collections; and determining the plurality of word collection scores corresponding to the plurality of candidate word collections using a colocation scoring model.
In some embodiments, the techniques described herein relate to a method, wherein determining the plurality of word collection scores corresponding to the plurality of candidate word collections using the colocation scoring model includes, for each particular candidate word collection of one or more of the plurality of candidate word collections: determining a first output of the colocation scoring model for a first word in the particular candidate word collection; determining, using the first output of the colocation scoring model for the first word, a second output of the colocation scoring model for a second word in the particular candidate word collection; and determining the score for the particular candidate word collection using the first output of the colocation scoring model and the second output of the colocation scoring model.
In some embodiments, the techniques described herein relate to a method, wherein selecting the word sequence from the plurality of candidate word sequences includes determining context-adjusted scores for the plurality of candidate word sequences using a colocation scoring model, the colocation scoring model including: a first layer including: a first plurality of nodes each associated with a respective field of the set of the dataset's fields; and a node associated with a name of the dataset; and a second layer including a second plurality of nodes each configured to output word sequence scores corresponding to candidate word sequences determined for a respective field of the set of the dataset's fields.
In some embodiments, the techniques described herein relate to a method, wherein determining the context-adjusted scores for the plurality of candidate word sequences using the colocation scoring model includes: determining outputs of the first plurality of nodes of the first layer; and determining outputs of the second plurality of nodes of the second layer, the outputs of the second plurality of nodes including outputs of a particular node associated with the particular field, the outputs of the particular node indicating the context-adjusted scores for the plurality of candidate word sequences.
In some embodiments, the techniques described herein relate to a method, wherein determining the plurality of word collection scores corresponding to the plurality of candidate word collections includes: accessing a language model indicating sets of collocated words that appear in a set of text and, for each of the sets of collocated words, a relative position of each word in the set of collocated words; and determining, using the language model and the plurality of sets of scores corresponding to the plurality of sets of candidate words, the plurality of word collection scores corresponding to the plurality of candidate word collections.
In some embodiments, the techniques described herein relate to a method, wherein generating the plurality of candidate word sequences using the plurality of candidate word collections includes, for each candidate word collection: arranging words in the candidate word collection to obtain a corresponding word sequence.
In some embodiments, the techniques described herein relate to a method, wherein generating, using the word sequence describing data stored in the particular field, the new field label includes: selecting a subset of words from the word sequence; and generating the new field label such that the new field label includes the subset of words without including words in the word sequence that are not included in the subset of words.
In some embodiments, the techniques described herein relate to a method, wherein generating the new field label includes: identifying a classword; and including the classword in the new field label.
In some embodiments, the techniques described herein relate to a method, wherein identifying the classword includes identifying the classword in the word sequence describing data stored in the particular field.
In some embodiments, the techniques described herein relate to a method, wherein identifying the classword includes: accessing data stored in the particular field of the dataset, the data including a set of alphanumeric values; and identifying the classword using the set of alphanumeric values.
In some embodiments, the techniques described herein relate to a method, wherein identifying the classword using the set of values from the particular field includes: applying a plurality of tests to the set of values, each of the plurality of tests associated with a respective classword; and selecting, using results of applying the plurality of tests to the set of values, the classword from among classwords associated with the plurality of tests.
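The test-based classword identification described above can be sketched as a set of per-classword predicates applied to sample field values. The classwords, regular expressions, and function names below are hypothetical examples, not the tests disclosed herein:

```python
import re

# Hypothetical classword tests: each test inspects sample field values
# and is associated with a respective classword.
CLASSWORD_TESTS = {
    "date":   lambda vals: all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in vals),
    "amount": lambda vals: all(re.fullmatch(r"-?\d+\.\d{2}", v) for v in vals),
    "code":   lambda vals: all(re.fullmatch(r"[A-Z]{2,5}", v) for v in vals),
}

def identify_classword(values):
    """Apply each test to the set of values and select the classword
    associated with the first test that passes."""
    for classword, test in CLASSWORD_TESTS.items():
        if test(values):
            return classword
    return None
```

For example, values such as `["2021-05-01", "2021-06-02"]` would select the classword "date"; an embodiment might instead score all passing tests and select the best match rather than the first.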
In some embodiments, the techniques described herein relate to a method, wherein assigning the new field label to the particular field includes: presenting a plurality of candidate field labels to a user in a graphical user interface (GUI), the plurality of candidate field labels including the new field label; and obtaining user input through the GUI indicating a selection of the new field label.
In some embodiments, the techniques described herein relate to a method, further including: updating the plurality of sets of scores corresponding to the plurality of sets of candidate words based on the user input indicating the selection of the new field label.
In some embodiments, the techniques described herein relate to a method, wherein updating the plurality of sets of scores corresponding to the plurality of sets of candidate words based on the user input indicating the selection of the new field label includes: increasing scores associated with words in the plurality of sets of candidate words that are included in the word sequence from which the new field label for the particular field was generated.
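The score update described above can be sketched as a small reinforcement step over the candidate-word scores; the increment `delta`, the cap, and the function name are illustrative assumptions:

```python
def reinforce_scores(word_scores, selected_words, delta=0.05, cap=1.0):
    """Increase the scores of candidate words that appear in the word
    sequence from which the user-selected field label was generated."""
    for word in selected_words:
        if word in word_scores:
            word_scores[word] = min(cap, word_scores[word] + delta)
    return word_scores
```

Candidate words that were not part of the selected label's word sequence keep their existing scores, so repeated user selections gradually bias future abbreviation expansion toward confirmed words.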
In some embodiments, the techniques described herein relate to a method, wherein generating the new field label includes: determining values of a plurality of attributes for the new field label; and generating a data entity definition for the new field label using the values of the plurality of attributes.
In some embodiments, the techniques described herein relate to a method, wherein determining the values of the plurality of attributes for the new field label includes: obtaining values from the particular field; and applying a plurality of tests, associated with the plurality of attributes, to the values from the particular field to obtain the values of the plurality of attributes for the new field label.
In some embodiments, the techniques described herein relate to a method, further including: associating a metadata-driven process with a first field of the set of the dataset's fields at least in part by associating the metadata-driven process with a first field label assigned to the first field.
In some embodiments, the techniques described herein relate to a method, further including applying the metadata-driven process to data from the first field.
In some embodiments, the techniques described herein relate to a method, wherein the metadata-driven process, when applied to the data from the first field, masks personally identifiable information (PII) stored in the first field.
In some embodiments, the techniques described herein relate to a method, wherein the metadata-driven process, when applied to the data from the first field, determines whether the data meets one or more data quality requirements.
In some embodiments, the techniques described herein relate to a method, wherein the metadata-driven process, when applied to the data from the first field, updates the data from the first field.
In some embodiments, the techniques described herein relate to a method, further including: determining, using the first field label, that the first metadata-driven process is associated with the first field label assigned to the first field; and in response to the determining that the first metadata-driven process is associated with the first field label assigned to the first field, triggering application of the first metadata-driven process to data from the first field.
In some embodiments, the techniques described herein relate to a method, wherein the first field label indicates that data from the first field contains data to be protected, such as personally identifiable information, without having to access the data stored in the first field, and wherein the first metadata-driven process is a process for protecting the data from the first field, such as anonymizing the data from the first field, restricting access to the data from the first field, and/or de-identifying the data from the first field.
In some embodiments, the techniques described herein relate to a method, wherein the process for anonymizing the data from the first field includes masking of personally identifiable information (PII).
In some embodiments, the techniques described herein relate to a method, wherein the first field label indicates, without having to access the data from the first field, that data from the first field is of a data format that makes the data not suitable as input for a data processing application, and wherein the first metadata-driven process is a process for: reformatting the data from the first field in accordance with a data format that is suitable as input for the data processing application; and providing the reformatted data as input to the data processing application for execution of the data processing application.
In some embodiments, the techniques described herein relate to a method, wherein the data format of the data from the first field is not suitable as input for the data processing application in that the data processing application would not run, or would run with a malfunction, on the data of that not suitable data format.
In some embodiments, the techniques described herein relate to a method, wherein the first field label indicates, without having to access the data from the first field, that data from the first field depends on data from a second field of the set of the dataset's fields, so that the first and second fields are related by a relationship, and wherein the first metadata-driven process is a process for generating lineage information about the relationship of the first and second fields and providing the generated lineage information to a computer for display as a lineage diagram.
In some embodiments, the techniques described herein relate to a method, wherein the relationship between the first and second fields is that data from the second field is computed using data from the first field and/or vice versa.
In some embodiments, the techniques described herein relate to a method, wherein the first field label indicates, without having to access the data stored in the first field, that data from the first field is of a data format that is incompatible as input for a particular version of a data processing application, and wherein the first metadata-driven process is a process for reconfiguring the particular version of the data processing application to obtain a reconfigured data processing application and providing the data from the first field as input to the reconfigured data processing application for execution of the reconfigured data processing application.
In some embodiments, the techniques described herein relate to a method, wherein the data format of the data from the first field causes the data processing application to fail to run or to malfunction.
In some embodiments, the techniques described herein relate to a system for processing a dataset including data stored in fields to determine field labels for a set of the dataset's fields, the system including: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one processor to perform: for each particular field in one or more fields in the set of the dataset's fields: determining whether any field label in a field label glossary matches the particular field; when it is determined that a field label in the field label glossary matches the particular field, assigning the field label to the particular field; and when it is determined that no field label in the field label glossary matches the particular field: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
In some embodiments, the techniques described herein relate to a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset including data stored in fields to determine field labels for a set of the dataset's fields, the method including: for each particular field in one or more fields in the set of the dataset's fields: determining whether any field label in a field label glossary matches the particular field; when it is determined that a field label in the field label glossary matches the particular field, assigning the field label to the particular field; and when it is determined that no field label in the field label glossary matches the particular field: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
In some embodiments, the techniques described herein relate to a method for processing a dataset including data stored in fields to determine field labels for a set of the dataset's fields, the method including: using at least one computer hardware processor to perform: for each particular field in one or more fields in the set of the dataset's fields: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
In some embodiments, the techniques described herein relate to a method, wherein generating the word sequence describing data stored in the particular field includes: determining, using the plurality of sets of candidate words and the plurality of sets of scores, a plurality of candidate word collections and a corresponding plurality of word collection scores; generating a plurality of candidate word sequences using the plurality of candidate word collections; and selecting the word sequence from the plurality of candidate word sequences using the plurality of word collection scores.
In some embodiments, the techniques described herein relate to a method, wherein determining the plurality of candidate word collections and the corresponding plurality of word collection scores includes: combining words from each of the plurality of sets of candidate words to obtain the plurality of candidate word collections; and determining the plurality of word collection scores corresponding to the plurality of candidate word collections using a colocation scoring model.
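The combining and scoring described in the two embodiments above can be sketched as follows. This is a minimal illustrative sketch only, not the claimed implementation: the bigram score table, the example field name "ACCT_NUM", and all candidate words and scores are assumptions introduced for illustration; a real colocation scoring model would be trained on a corpus of field labels or other text.

```python
from itertools import product

# Hypothetical bigram scores standing in for a colocation scoring model.
BIGRAM_SCORES = {
    ("account", "number"): 0.9,
    ("account", "name"): 0.6,
    ("accrued", "number"): 0.1,
    ("accrued", "name"): 0.1,
}

def score_collection(words):
    """Score a word collection as the product of adjacent-pair scores,
    so each word's score depends on the model output for the prior word."""
    score = 1.0
    for first, second in zip(words, words[1:]):
        score *= BIGRAM_SCORES.get((first, second), 0.01)
    return score

def candidate_collections(candidate_word_sets):
    """Combine words from each candidate set (one word per abbreviation)
    into candidate word collections, and score each collection."""
    collections = [tuple(combo) for combo in product(*candidate_word_sets)]
    return {c: score_collection(c) for c in collections}

# Field name "ACCT_NUM": "ACCT" -> {account, accrued}, "NUM" -> {number, name}
scores = candidate_collections([["account", "accrued"], ["number", "name"]])
best = max(scores, key=scores.get)  # ("account", "number")
```

The chained scoring, where the score for the second word conditions on the first, mirrors the first/second-output structure recited in the claim, here reduced to a bigram lookup for clarity.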
In some embodiments, the techniques described herein relate to a method, wherein determining the plurality of word collection scores corresponding to the plurality of candidate word collections using the colocation scoring model includes, for each particular candidate word collection of one or more of the plurality of candidate word collections: determining a first output of the colocation scoring model for a first word in the particular candidate word collection; determining, using the first output of the colocation scoring model for the first word, a second output of the colocation scoring model for a second word in the particular candidate word collection; and determining the score for the particular candidate word collection using the first output of the colocation scoring model and the second output of the colocation scoring model.
In some embodiments, the techniques described herein relate to a method, wherein selecting the word sequence from the plurality of candidate word sequences includes determining context-adjusted scores for the plurality of candidate word sequences using a colocation scoring model, the colocation scoring model including: a first layer including: a first plurality of nodes each associated with a respective field of the set of the dataset's fields; and a node associated with a name of the dataset; and a second layer including a second plurality of nodes each configured to output word sequence scores corresponding to candidate word sequences determined for a respective field of the set of the dataset's fields.
In some embodiments, the techniques described herein relate to a method, wherein determining the context-adjusted scores for the plurality of candidate word sequences using the colocation scoring model includes: determining outputs of the first plurality of nodes of the first layer; and determining outputs of the second plurality of nodes of the second layer, the outputs of the second plurality of nodes including outputs of a particular node associated with the particular field, the outputs of the particular node indicating the context-adjusted scores for the plurality of candidate word sequences.
In some embodiments, the techniques described herein relate to a method, wherein determining the plurality of word collection scores corresponding to the plurality of candidate word collections includes: accessing a language model indicating sets of collocated words that appear in a set of text and, for each of the sets of collocated words, a relative position of each word in the set of collocated words; and determining, using the language model and the plurality of sets of scores corresponding to the plurality of sets of candidate words, the plurality of word collection scores corresponding to the plurality of candidate word collections.
In some embodiments, the techniques described herein relate to a method, wherein generating the plurality of candidate word sequences using the plurality of candidate word collections includes, for each candidate word collection: arranging words in the candidate word collection to obtain a corresponding word sequence.
In some embodiments, the techniques described herein relate to a method, wherein generating, using the word sequence describing data stored in the particular field, the new field label includes: selecting a subset of words from the word sequence; and generating the new field label such that the new field label includes the subset of words without including words in the word sequence that are not included in the subset of words.
In some embodiments, the techniques described herein relate to a method, wherein generating the new field label includes: identifying a classword; and including the classword in the new field label.
In some embodiments, the techniques described herein relate to a method, wherein identifying the classword includes identifying the classword in the word sequence describing data stored in the particular field.
In some embodiments, the techniques described herein relate to a method, wherein identifying the classword includes: accessing data stored in the particular field of the dataset, the data including a set of alphanumeric values; and identifying the classword using the set of alphanumeric values.
In some embodiments, the techniques described herein relate to a method, wherein identifying the classword using the set of alphanumeric values from the particular field includes: applying a plurality of tests to the set of alphanumeric values, each of the plurality of tests associated with a respective classword; and selecting, using results of applying the plurality of tests to the set of alphanumeric values, the classword from among classwords associated with the plurality of tests.
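A minimal sketch of the classword-test embodiment above, assuming each classword is paired with a regular-expression predicate and selection is by the largest passing fraction of sample values; the classwords, patterns, and sample data are illustrative assumptions, not part of the claims:

```python
import re

# Hypothetical classword tests: each classword is paired with a predicate
# applied to sample values from the field.
CLASSWORD_TESTS = {
    "date": lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
    "amount": lambda v: re.fullmatch(r"-?\d+\.\d{2}", v) is not None,
    "code": lambda v: re.fullmatch(r"[A-Z]{2,5}", v) is not None,
}

def identify_classword(values):
    """Apply each test to the sample values and select the classword
    whose test passes on the largest fraction of values."""
    best_word, best_rate = None, 0.0
    for classword, test in CLASSWORD_TESTS.items():
        rate = sum(test(v) for v in values) / len(values)
        if rate > best_rate:
            best_word, best_rate = classword, rate
    return best_word

sample = ["2021-04-01", "2021-04-02", "not-a-date", "2021-05-17"]
result = identify_classword(sample)  # -> "date"
```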
In some embodiments, the techniques described herein relate to a method, wherein assigning the new field label to the particular field includes: presenting a plurality of candidate field labels to a user in a graphical user interface (GUI), the plurality of candidate field labels including the new field label; and obtaining user input through the GUI indicating a selection of the new field label.
In some embodiments, the techniques described herein relate to a method, further including: updating the plurality of sets of scores corresponding to the plurality of sets of candidate words based on the user input indicating the selection of the new field label.
In some embodiments, the techniques described herein relate to a method, wherein updating the plurality of sets of scores corresponding to the plurality of sets of candidate words based on the user input indicating the selection of the new field label includes: increasing scores associated with words in the plurality of sets of candidate words that are included in the word sequence from which the new field label for the particular field was generated.
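The feedback-driven score update above can be sketched as follows; the additive boost, the cap, and the example scores are illustrative assumptions (the claims do not specify how much scores are increased):

```python
def update_scores(candidate_scores, selected_sequence, boost=0.1, cap=1.0):
    """Increase the score of every candidate word that appears in the
    word sequence behind the label the user selected; leave others as-is."""
    chosen = set(selected_sequence)
    return [
        {w: min(s + boost, cap) if w in chosen else s for w, s in word_set.items()}
        for word_set in candidate_scores
    ]

# One score dict per abbreviation in the field name.
sets_of_scores = [{"account": 0.7, "accrued": 0.4}, {"number": 0.6, "name": 0.5}]
updated = update_scores(sets_of_scores, ["account", "number"])
# "account" and "number" are boosted; "accrued" and "name" are unchanged
```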
In some embodiments, the techniques described herein relate to a method, wherein generating the new field label includes: determining values of a plurality of attributes for the new field label; and generating a data entity definition for the new field label using the values of the plurality of attributes.
In some embodiments, the techniques described herein relate to a method, wherein determining the values of the plurality of attributes for the new field label includes: obtaining values from the particular field; and applying a plurality of tests, associated with the plurality of attributes, to the values from the particular field to obtain the values of the plurality of attributes for the new field label.
In some embodiments, the techniques described herein relate to a method, further including: associating a metadata-driven process with a first field of the set of the dataset's fields at least in part by associating the metadata-driven process with a first field label assigned to the first field.
In some embodiments, the techniques described herein relate to a method, further including applying the metadata-driven process to data from the first field.
In some embodiments, the techniques described herein relate to a method, wherein the metadata-driven process, when applied to the data from the first field, masks personally identifiable information (PII) stored in the first field.
In some embodiments, the techniques described herein relate to a method, wherein the metadata-driven process, when applied to the data from the first field, determines whether the data from the first field meets one or more data quality requirements.
In some embodiments, the techniques described herein relate to a method, wherein the metadata-driven process, when applied to the data from the first field, updates the data from the first field.
In some embodiments, the techniques described herein relate to a system for processing a dataset including data stored in fields to determine field labels for a set of the dataset's fields, the system including: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one processor to perform: for each particular field in one or more fields in the set of the dataset's fields: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
In some embodiments, the techniques described herein relate to a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset including data stored in fields to determine field labels for a set of the dataset's fields, the method including: for each particular field in one or more fields in the set of the dataset's fields: identifying, for a set of abbreviations in a name of the particular field, a plurality of sets of candidate words indicated by the set of abbreviations and a corresponding plurality of sets of scores; generating, using the plurality of sets of candidate words and the plurality of sets of scores, a word sequence describing data stored in the particular field; generating, using the word sequence describing data stored in the particular field, a new field label for the particular field; and assigning the new field label to the particular field.
In some embodiments, the techniques described herein relate to a method for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more fields of the dataset, the field labels describing data stored in the set of fields, the method including: using at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), a first set of candidate field labels for the particular field and field name analysis scores for the first set of candidate field labels; determining, using a subset of data from the particular field and tests associated with respective field labels in the field label glossary, a second set of candidate field labels and field data analysis scores for the second set of candidate field labels; determining merged candidate field labels and corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores; and assigning one of the merged candidate field labels to the particular field using the corresponding scores.
In some embodiments, the techniques described herein relate to a method, wherein determining, using the name of the particular field and the NLP, the first set of candidate field labels for the particular field and the field name analysis scores for the first set of candidate field labels includes: identifying a set of abbreviations in the name of the particular field; determining, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the abbreviation and a corresponding set of similarity scores to obtain sets of candidate words indicated by the abbreviations and corresponding sets of similarity scores; and determining, using the sets of candidate words indicated by the abbreviations and the corresponding sets of similarity scores, the first set of candidate field labels and the field name analysis scores.
In some embodiments, the techniques described herein relate to a method, wherein determining, using the subset of data from the particular field and the tests associated with respective field labels in the field label glossary, the second set of candidate field labels and the field data analysis scores for the second set of candidate field labels includes: applying the tests associated with the respective field labels to the subset of data from the particular field to obtain test results; and determining the second set of candidate field labels and the field data analysis scores using the test results obtained from applying the tests.
In some embodiments, the techniques described herein relate to a method, wherein applying the tests associated with the respective field labels to the subset of data from the particular field includes, for each test: accessing a regular expression associated with the test; determining a proportion of the subset of data that meets the regular expression associated with the test; and determining a test result using the proportion of the subset of data that meets the regular expression associated with the test.
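The regular-expression test recited above can be sketched as follows; the "Postal Code" label, its ZIP-style pattern, and the sample values are illustrative assumptions:

```python
import re

def field_data_score(values, pattern):
    """Score a field label's test as the proportion of sampled values
    that match the label's associated regular expression."""
    regex = re.compile(pattern)
    matches = sum(1 for v in values if regex.fullmatch(v))
    return matches / len(values)

# Hypothetical test for a "Postal Code" label (US ZIP format).
values = ["02139", "10001", "ABCDE", "94105"]
score = field_data_score(values, r"\d{5}")  # 3 of 4 values match -> 0.75
```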
In some embodiments, the techniques described herein relate to a method, wherein determining the merged candidate field labels and the corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores includes: identifying a first field label associated with a first one of the field name analysis scores and a first one of the field data analysis scores, the first field label being in the first set of candidate field labels and the second set of candidate field labels; and determining a first merged score for the first field label by adjusting the first field name analysis score using the first field data analysis score to obtain the first merged score.
In some embodiments, the techniques described herein relate to a method, wherein adjusting the first field name analysis score using the first field data analysis score includes: determining a ratio between the first field name analysis score and the first field data analysis score; and adjusting the first field name analysis score using the ratio.
In some embodiments, the techniques described herein relate to a method, wherein adjusting the first field name analysis score using the ratio includes: determining a bias value as a log of the ratio; and adjusting the first field name analysis score using the bias value.
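One plausible reading of the log-ratio bias in the two embodiments above is sketched below; the direction of the ratio, the multiplicative adjustment, and the weight are assumptions (the claims leave the exact combination open). With a weight of 0.5 this reduces to the geometric mean of the two scores, which keeps the merged score within the original score range.

```python
import math

def merged_score(name_score, data_score, weight=0.5):
    """Adjust the field name analysis score by a bias equal to the log of
    the ratio between the data analysis score and the name analysis score,
    applied multiplicatively in log space (illustrative assumption)."""
    bias = math.log(data_score / name_score)   # log-ratio bias value
    return name_score * math.exp(weight * bias)

merged = merged_score(0.6, 0.9)  # geometric mean of 0.6 and 0.9, ~0.735
```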
In some embodiments, the techniques described herein relate to a method, wherein determining the merged candidate field labels and the corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores includes: identifying a first field label from the first set of candidate field labels associated with a first one of the field name analysis scores; determining that none of the subset of data passes a test associated with the first field label; and determining a first merged score for the first field label by reducing the first field name analysis score.
In some embodiments, the techniques described herein relate to a method, wherein assigning one of the merged candidate field labels to the particular field using the corresponding scores determined for the candidate field labels includes: automatically selecting, from the merged candidate field labels, a candidate field label using the corresponding scores; and assigning the selected candidate field label to the particular field.
In some embodiments, the techniques described herein relate to a method, wherein assigning one of the merged candidate field labels to the particular field using the corresponding scores includes: presenting at least some of the merged candidate field labels in a graphical user interface (GUI); and receiving, through the GUI, user input indicating selection of a candidate field label to assign to the particular field.
In some embodiments, the techniques described herein relate to a method, further including: determining, using a first field label assigned to a first field in the set of fields, that a first metadata-driven process is associated with the first field label assigned to the first field; and in response to the determining that the first metadata-driven process is associated with the first field label assigned to the first field, triggering application of the first metadata-driven process to data from the first field.
In some embodiments, the techniques described herein relate to a method, wherein the first field label indicates that data stored in the first field contains data to be protected, such as personally identifiable information, without having to access the data stored in the first field, and wherein the first metadata-driven process is a process for protecting the data from the first field, such as anonymizing the data from the first field, restricting access to the data from the first field, and/or de-identifying the data from the first field.
In some embodiments, the techniques described herein relate to a method, wherein the process for anonymizing the data from the first field includes masking of personally identifiable information (PII).
In some embodiments, the techniques described herein relate to a method, wherein the first field label indicates, without having to access the data from the first field, that data from the first field is of a data format that makes the data not suitable as input for a data processing application, and wherein the first metadata-driven process is a process for: reformatting the data from the first field in accordance with a data format that is suitable as input for the data processing application; and providing the reformatted data as input to the data processing application for execution of the data processing application.
In some embodiments, the techniques described herein relate to a method, wherein the data format of the data from the first field is not suitable as input for the data processing application in that the data processing application would not run, or would run with a malfunction, on data of that unsuitable data format.
In some embodiments, the techniques described herein relate to a method, wherein the first field label indicates, without having to access the data from the first field, that data from the first field depends on data from a second field of the set of the dataset's fields, so that the first and second fields are related by a relationship, and wherein the first metadata-driven process is a process for generating lineage information about the relationship of the first and second fields and providing the generated lineage information to a computer for display as a lineage diagram.
In some embodiments, the techniques described herein relate to a method, wherein the relationship between the first and second fields is that data from the second field is computed using data from the first field and/or vice versa.
In some embodiments, the techniques described herein relate to a method, wherein the first field label indicates, without having to access the data stored in the first field, that data from the first field is of a data format that is incompatible as input for a particular version of a data processing application, and wherein the first metadata-driven process is a process for reconfiguring the particular version of the data processing application to obtain a reconfigured data processing application and providing the data from the first field as input to the reconfigured data processing application for execution of the reconfigured data processing application.
In some embodiments, the techniques described herein relate to a method, wherein the data format of the data from the first field causes the data processing application to fail to run or to malfunction.
In some embodiments, the techniques described herein relate to a system for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more fields of the dataset, the field labels describing data stored in the set of fields, the system including: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), a first set of candidate field labels for the particular field and field name analysis scores for the first set of candidate field labels; determining, using a subset of data from the particular field and tests associated with respective field labels in the field label glossary, a second set of candidate field labels and field data analysis scores for the second set of candidate field labels; determining merged candidate field labels and corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores; and assigning one of the merged candidate field labels to the particular field using the corresponding scores.
In some embodiments, the techniques described herein relate to a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more fields of the dataset, the field labels describing data stored in the set of fields, the method including: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), a first set of candidate field labels for the particular field and field name analysis scores for the first set of candidate field labels; determining, using a subset of data from the particular field and tests associated with respective field labels in the field label glossary, a second set of candidate field labels and field data analysis scores for the second set of candidate field labels; determining merged candidate field labels and corresponding scores using the first set of candidate field labels and the field name analysis scores, and the second set of candidate field labels and the field data analysis scores; and assigning one of the merged candidate field labels to the particular field using the corresponding scores.
In some embodiments, the techniques described herein relate to a method for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more fields of the dataset, the field labels describing data stored in the set of fields, the method including: using at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining including: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation thereby obtaining sets of candidate words; generating, using the sets of candidate words identified for the abbreviations and an n-gram model indicating a plurality of word collections that appear within field labels of the field label glossary, at least one word sequence describing data stored in the particular field; and determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
In some embodiments, the techniques described herein relate to a method, wherein generating, using the sets of candidate words identified for the abbreviations and the n-gram model indicating the plurality of word collections that appear within the field labels of the field label glossary, the at least one word sequence describing data stored in the particular field includes: combining words from the sets of candidate words to obtain a plurality of word sequences; and filtering, using the n-gram model, the plurality of word sequences to obtain the at least one word sequence.
In some embodiments, the techniques described herein relate to a method, wherein filtering, using the n-gram model, the plurality of word sequences to obtain the at least one word sequence includes: determining, using the n-gram model, for each of the plurality of word sequences, a likelihood that words of the word sequence are collocated; and filtering, using likelihoods determined for the plurality of word sequences, the plurality of word sequences to obtain the at least one word sequence.
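The n-gram filtering in the two embodiments above can be sketched as follows; the bigram set, the candidate words, and the all-pairs-must-match threshold are illustrative assumptions (a glossary-derived n-gram model could use longer n-grams and softer likelihoods):

```python
from itertools import product

# Hypothetical bigrams derived from field labels in a glossary.
GLOSSARY_BIGRAMS = {("customer", "account"), ("account", "number"),
                    ("customer", "name")}

def collocation_likelihood(sequence):
    """Fraction of adjacent word pairs that appear as glossary bigrams."""
    pairs = list(zip(sequence, sequence[1:]))
    return sum(p in GLOSSARY_BIGRAMS for p in pairs) / len(pairs)

def filter_sequences(candidate_word_sets, threshold=1.0):
    """Combine candidate words into sequences and keep those whose
    collocation likelihood meets the threshold."""
    sequences = [tuple(s) for s in product(*candidate_word_sets)]
    return [s for s in sequences if collocation_likelihood(s) >= threshold]

kept = filter_sequences([["customer", "custom"], ["account"], ["number", "name"]])
# only ("customer", "account", "number") survives the bigram filter
```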
In some embodiments, the techniques described herein relate to a method, wherein determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels includes: determining that one of the words in the at least one word sequence specifies a particular category of data; determining a target position of the word in the at least one word sequence; and determining the field name analysis scores for the candidate field labels based on the target position of the word in the at least one word sequence.
In some embodiments, the techniques described herein relate to a method, wherein determining the target position of the word in the at least one word sequence includes determining a target position of the word based on a target language for a field label to be assigned to the particular field.
In some embodiments, the techniques described herein relate to a method, wherein determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels includes: accessing a sequence position model that is based on an order of words in the at least one word sequence; determining, using the sequence position model, scores for the field labels of the field label glossary; and selecting, using the scores for the field labels of the field label glossary, the candidate field labels from the field label glossary.
In some embodiments, the techniques described herein relate to a method, wherein identifying, for each particular abbreviation in the set of abbreviations, the set of candidate words indicated by the particular abbreviation includes: determining a similarity score between each candidate word in the set of candidate words and the particular abbreviation thereby obtaining sets of similarity scores corresponding to the sets of candidate words.
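A minimal sketch of an abbreviation-to-word similarity score as recited above; this in-order subsequence heuristic and the example abbreviation "acct" are illustrative assumptions, since the claims do not fix a particular similarity measure:

```python
def subsequence_similarity(abbreviation, word):
    """Illustrative similarity: nonzero only when the abbreviation's letters
    appear in order within the candidate word, scaled by how much of the
    word the abbreviation covers."""
    it = iter(word)
    # `ch in it` consumes the iterator, enforcing in-order matching.
    if all(ch in it for ch in abbreviation):
        return len(abbreviation) / len(word)
    return 0.0

# "acct" plausibly expands to "account"; "accrued" and "act" do not match.
scores = {w: subsequence_similarity("acct", w)
          for w in ["account", "accrued", "act"]}
```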
In some embodiments, the techniques described herein relate to a method, wherein determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels includes: identifying words in the at least one word sequence that are present in the sets of candidate words and corresponding similarity scores; and determining, using the words in the at least one word sequence and the corresponding similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels.
In some embodiments, the techniques described herein relate to a method, further including generating the n-gram model using the field label glossary.
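One way to generate such an n-gram model is to count the word n-grams occurring within the glossary's field labels. The following is a minimal sketch of that idea; the glossary entries and the choice of bigrams are illustrative assumptions, not part of the disclosure:

```python
from collections import Counter

def build_ngram_model(field_labels, n=2):
    """Count the word n-grams that appear within the glossary's field labels."""
    counts = Counter()
    for label in field_labels:
        words = label.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical field label glossary
glossary = ["account balance", "account number", "customer account number"]
model = build_ngram_model(glossary, n=2)
```

A word sequence generator can then prefer expansions whose adjacent words form n-grams with high counts in this model.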
In some embodiments, the techniques described herein relate to a method, wherein assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels includes automatically assigning the candidate field label to the particular field when a score corresponding to the candidate field label meets a threshold score.
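The threshold-gated automatic assignment might be sketched as follows; the threshold value and the score mapping are illustrative assumptions:

```python
def assign_field_label(candidates, threshold=0.8):
    """Automatically assign the top-scoring candidate field label only when its
    field name analysis score meets the threshold; otherwise return None
    (e.g., so the field can be deferred for manual review)."""
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= threshold else None
```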
In some embodiments, the techniques described herein relate to a system for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields, the system including: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining including: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation thereby obtaining sets of candidate words; generating, using the sets of candidate words identified for the abbreviations and an n-gram model indicating a plurality of word collections that appear within field labels of the field label glossary, at least one word sequence describing data stored in the particular field; and determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
In some embodiments, the techniques described herein relate to a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields, the method including: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining including: identifying a set of abbreviations in the name of the particular field; identifying, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation thereby obtaining sets of candidate words; generating, using the sets of candidate words identified for the abbreviations and an n-gram model indicating a plurality of word collections that appear within field labels of the field label glossary, at least one word sequence describing data stored in the particular field; and determining, using the at least one word sequence and the field label glossary, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
In some embodiments, the techniques described herein relate to a method for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields, the method including: using at least one computer hardware processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining including: identifying a set of abbreviations in the name of the particular field; determining, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of similarity scores, thereby obtaining sets of candidate words and corresponding sets of similarity scores, the determining including: determining a measure of similarity between the particular abbreviation and each of a plurality of words in a glossary to obtain a plurality of similarity scores for the plurality of words, the measure of similarity between an abbreviation and a word being based on characters in the abbreviation, characters in the word, order of the characters in the abbreviation, and order of the characters in the word; and selecting, using the plurality of similarity scores, the set of candidate words from the plurality of words in the glossary to obtain the set of candidate words for the particular abbreviation and the corresponding set of similarity scores; determining, using the sets of candidate words and the corresponding sets of similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
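A character- and order-sensitive similarity measure, together with threshold-based candidate selection, can be sketched as below. Here `difflib.SequenceMatcher` serves as an illustrative stand-in for the measure described; the glossary words and the threshold of 0.5 are assumptions:

```python
from difflib import SequenceMatcher

def abbreviation_similarity(abbr, word):
    """Order-sensitive character similarity between an abbreviation and a word."""
    return SequenceMatcher(None, abbr.lower(), word.lower()).ratio()

def candidate_words(abbr, glossary_words, threshold=0.5):
    """Select the (word, score) pairs whose similarity meets the threshold."""
    scored = ((w, abbreviation_similarity(abbr, w)) for w in glossary_words)
    return [(w, s) for w, s in scored if s >= threshold]
```

For example, the abbreviation "acct" scores well against "account" (shared characters in the same order) but poorly against unrelated glossary words, which fall below the threshold.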
In some embodiments, the techniques described herein relate to a method, wherein the measure of similarity includes multiple component measures of similarity and determining the measure of similarity between the particular abbreviation and each of the plurality of words in the glossary to obtain the plurality of similarity scores includes: determining, for the particular abbreviation and the word, the component measures of similarity to obtain values of the component measures of similarity; and determining the measure of similarity between the particular abbreviation and the word using the values of the component measures of similarity.
In some embodiments, the techniques described herein relate to a method, wherein determining the measure of similarity between the particular abbreviation and the word using the values of the component measures of similarity includes: determining the component measures of similarity; and determining the measure of similarity as a maximum of the component measures of similarity.
In some embodiments, the techniques described herein relate to a method, wherein the multiple component measures of similarity are selected from a group consisting of cosine similarity, Jaro-Winkler similarity, and Jaro-Winkler similarity modified to scale based on suffix.
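A sketch of the Jaro-Winkler component measure, and of combining component measures by taking their maximum as recited above. The second component here, a plain longest-match ratio, is an illustrative stand-in for cosine similarity:

```python
from difflib import SequenceMatcher

def jaro(s1, s2):
    """Jaro similarity: shared characters within a sliding window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    used = [False] * len2
    m1 = []
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not used[j] and s2[j] == c:
                used[j] = True
                m1.append(c)
                break
    if not m1:
        return 0.0
    m2 = [s2[j] for j in range(len2) if used[j]]
    t = sum(a != b for a, b in zip(m1, m2)) // 2  # transpositions
    m = len(m1)
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (up to 4 characters)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def combined_similarity(abbr, word):
    """Overall measure taken as the maximum of the component measures."""
    components = (jaro_winkler(abbr, word),
                  SequenceMatcher(None, abbr, word).ratio())
    return max(components)
```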
In some embodiments, the techniques described herein relate to a method, wherein determining the measure of similarity between the particular abbreviation and a first word of the plurality of words in the glossary to obtain a first one of the plurality of similarity scores includes: determining one of the multiple component measures of similarity based on a degree to which a suffix of the first word matches a suffix of the particular abbreviation.
In some embodiments, the techniques described herein relate to a method, wherein determining the measure of similarity between the particular abbreviation and a first word of the plurality of words in the glossary to obtain a first one of the plurality of similarity scores includes: removing vowels from the first word to obtain a vowelless word; and determining one of the multiple component measures of similarity using the vowelless word.
In some embodiments, the techniques described herein relate to a method, wherein determining the measure of similarity between the particular abbreviation and a first word of the plurality of words in the glossary to obtain a first one of the plurality of similarity scores includes: stemming the first word to obtain a word stem; and determining one of the multiple component measures of similarity using the word stem.
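The vowel-removal and stemming normalizations in the preceding two paragraphs can be sketched as below; the suffix list of the crude stemmer is an illustrative assumption (a production system might instead use a Porter-style stemmer):

```python
import re

def remove_vowels(word):
    """Strip vowels, since abbreviations frequently omit them (e.g., "blnc" for "balance")."""
    return re.sub(r"[aeiou]", "", word.lower())

def simple_stem(word):
    """Crude suffix stripping so an abbreviation can be compared against a word stem."""
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Either normalized form can then be fed into a component similarity measure in place of the original glossary word.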
In some embodiments, the techniques described herein relate to a method, wherein selecting, using the plurality of similarity scores, the set of candidate words from the plurality of words in the glossary to obtain the set of candidate words for the particular abbreviation and the corresponding set of similarity scores includes, for each of the plurality of similarity scores: identifying a subset of the plurality of similarity scores that meet a threshold similarity score, the subset of similarity scores associated with a subset of the plurality of words; and selecting the subset of the plurality of words as the set of candidate words for the particular abbreviation.
In some embodiments, the techniques described herein relate to a method, wherein: the candidate field labels include words from the sets of candidate words; and determining, using the sets of candidate words and the corresponding sets of similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels includes: determining, using the sets of candidate words, at least one word sequence describing data stored in the particular field; and determining, using the sets of similarity scores and the at least one word sequence, the field name analysis scores for the candidate field labels.
In some embodiments, the techniques described herein relate to a method, wherein determining, using the sets of similarity scores and the at least one word sequence, the field name analysis scores for the candidate field labels includes, for each of the candidate field labels: identifying words from the sets of candidate words included in the candidate field label; obtaining similarity scores corresponding to the identified words; and determining a field name analysis score for the candidate field label using the similarity scores corresponding to the identified words.
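The per-label scoring just described might be realized as an average over the similarity scores of the candidate words found in each label; averaging is one plausible aggregation, and the word-score values below are illustrative:

```python
def score_candidate_label(label, word_scores):
    """Field name analysis score for a candidate label: mean similarity score of
    the candidate words (expanded from abbreviations) appearing in the label."""
    matched = [word_scores[w] for w in label.lower().split() if w in word_scores]
    return sum(matched) / len(matched) if matched else 0.0
```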
In some embodiments, the techniques described herein relate to a system for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields, the system including: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one processor to perform: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining including: identifying a set of abbreviations in the name of the particular field; determining, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of similarity scores, thereby obtaining sets of candidate words and corresponding sets of similarity scores, the determining including: determining a measure of similarity between the particular abbreviation and each of a plurality of words in a glossary to obtain a plurality of similarity scores for the plurality of words, the measure of similarity between an abbreviation and a word being based on characters in the abbreviation, characters in the word, order of the characters in the abbreviation, and order of the characters in the word; and selecting, using the plurality of similarity scores, the set of candidate words from the plurality of words in the glossary to obtain the set of candidate words for the particular abbreviation and the corresponding set of similarity scores; determining, using the sets of candidate words and the corresponding sets of similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
In some embodiments, the techniques described herein relate to a non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing a dataset including data stored in fields to identify, from a field label glossary, a field label for each field in a set of one or more of the dataset fields of the dataset, the field labels describing data stored in the set of fields, the method including: for each particular field in the set of fields, determining, using a name of the particular field and natural language processing (NLP), candidate field labels for the particular field and field name analysis scores for the candidate field labels, the determining including: identifying a set of abbreviations in the name of the particular field; determining, for each particular abbreviation in the set of abbreviations, a set of candidate words indicated by the particular abbreviation and a corresponding set of similarity scores, thereby obtaining sets of candidate words and corresponding sets of similarity scores, the determining including: determining a measure of similarity between the particular abbreviation and each of a plurality of words in a glossary to obtain a plurality of similarity scores for the plurality of words, the measure of similarity between an abbreviation and a word being based on characters in the abbreviation, characters in the word, order of the characters in the abbreviation, and order of the characters in the word; and selecting, using the plurality of similarity scores, the set of candidate words from the plurality of words in the glossary to obtain the set of candidate words for the particular abbreviation and the corresponding set of similarity scores; determining, using the sets of candidate words and the corresponding sets of similarity scores, the candidate field labels for the particular field and the field name analysis scores for the candidate field labels; and assigning one of the candidate field labels to the particular field using the field name analysis scores determined for the candidate field labels.
This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/615,443, filed on Dec. 28, 2023, entitled “TECHNIQUES FOR ASSIGNING LABELS TO DATASET FIELDS.” This application also claims priority to and the benefit of U.S. Provisional Patent Application No. 63/682,655, filed on Aug. 13, 2024, entitled “TECHNIQUES FOR ASSIGNING LABELS TO DATASET FIELDS.” The contents of these applications are incorporated by reference in their entirety.
| Number | Date | Country |
|---|---|---|
| 63682655 | Aug 2024 | US |
| 63615443 | Dec 2023 | US |