An executable program may be configured to read data from one or more datasets during its execution. For example, the datasets may include data stored on a medium that is retrieved by one or more processes of an executable program. Those processes may modify and write the data to one or more output data storage locations. In some cases, it may be desirable to interpret data from a dataset as being associated with particular data fields (also referred to simply as “fields”). The process of interpreting data and determining values of data fields for one or more data records is generally referred to as “parsing” the data. A particular parsing scheme may be defined by the executable program, by the data itself, or by a combination of the program and the data. A parsing scheme, which typically defines how to interpret data for a number of data fields for a number of data records, is sometimes referred to as a “record format.”
In some cases, a data record could be parsed by assuming that data fields in the record are of fixed length. For instance, a date value can always be expressed by eight digits and therefore a “date” data field could be identified by selecting eight characters. In other cases, a data field could have a variable length, and the data can be configured so that a computer process can identify when the field starts and ends by looking at the data.
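As a non-limiting illustration, fixed-length parsing may be sketched in Python as follows. The function name parse_fixed, the layout (an eight-character date followed by a three-character code) and the sample record are assumptions made for this example only.

```python
def parse_fixed(record: str, layout: list[tuple[str, int]]) -> dict[str, str]:
    """Split one record into fields whose widths are known in advance."""
    values = {}
    pos = 0
    for name, width in layout:
        values[name] = record[pos:pos + width]
        pos += width
    if pos != len(record):
        # The record is shorter or longer than the layout describes.
        raise ValueError(f"record has {len(record)} characters, expected {pos}")
    return values


# Hypothetical layout: an eight-digit date followed by a three-character code.
LAYOUT = [("date", 8), ("code", 3)]
fields = parse_fixed("20240131ABC", LAYOUT)
```

No delimiter is inspected in this sketch; the field boundaries follow entirely from the assumed widths.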
Data can be configured for variable length fields either via delimiters or by length-prefixing the data. In the delimiter approach, a data field is bounded at one or both ends by a predetermined byte value (or byte sequence) that allows for identification of the bounds of the data field. This approach requires that the data fields not include that character and/or byte value (or sequence), which is referred to as the “delimiter”; otherwise the computer process would mistakenly identify a point within the data field as being the beginning or end of the data field. The length-prefix approach provides one or more bytes prior to the data field value that indicate to the computer program the length of the data field that is to be read after the length prefix has ended.
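The length-prefix approach may be sketched as follows, assuming for illustration a prefix of exactly one byte that holds the length of the field value. The function name parse_length_prefixed and the sample bytes are hypothetical.

```python
def parse_length_prefixed(data: bytes) -> list[bytes]:
    """Read fields that are each preceded by a one-byte length prefix (assumed)."""
    fields = []
    pos = 0
    while pos < len(data):
        length = data[pos]          # the prefix byte gives the field length
        start = pos + 1
        end = start + length
        if end > len(data):
            raise ValueError("length prefix runs past the end of the data")
        fields.append(data[start:end])
        pos = end
    return fields


# Two fields: a 3-byte value, then a 5-byte value that itself contains commas.
encoded = b"\x03abc\x05a,b,c"
decoded = parse_length_prefixed(encoded)
```

Because the length is stated up front, the second field in this sketch may contain comma characters without being split, which is the restriction that the delimiter approach imposes on field values.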
According to some aspects, a method is provided of determining a record format for a dataset, the dataset comprising a plurality of bytes, the method comprising, with at least one computing device parsing the dataset using a first record format to determine a sequence of characters represented by the plurality of bytes and determining values of one or more data fields in accordance with the first record format, displaying at least some of the values of the one or more data fields in accordance with the first record format via a user interface, displaying a plurality of the sequence of characters via the user interface as a sequence of user interface elements, wherein each of the plurality of characters is presented as a separate user interface element, receiving user input selecting a user interface element of the sequence of user interface elements, the selected user interface element being associated with a character of the sequence of characters, and generating a second record format based on the received input, wherein the second record format is generated to include a data field delimited by the character associated with the selected user interface element.
According to some aspects, a computer system is provided comprising at least one processor, at least one user interface device, and at least one computer readable medium comprising processor-executable instructions that, when executed, cause the at least one processor to parse a dataset comprising a plurality of bytes using a first record format to determine a sequence of characters represented by the plurality of bytes and determine values of one or more data fields in accordance with the first record format, display, via the at least one user interface device, at least some of the values of the one or more data fields of the first record format via the at least one user interface, display, via the at least one user interface device, a plurality of the sequence of characters via the at least one user interface as a sequence of user interface elements, wherein each of the plurality of characters is presented as a separate user interface element, receive, via the at least one user interface device, user input selecting a user interface element of the sequence of user interface elements, the selected user interface element being associated with a character of the sequence of characters, and generate a second record format based on the received input, wherein the second record format is generated to include a data field delimited by the character associated with the selected user interface element.
According to some aspects, a computer system is provided comprising at least one processor, means for parsing a dataset comprising a plurality of bytes using a first record format to determine a sequence of characters represented by the plurality of bytes and determining values of one or more data fields in accordance with the first record format, means for displaying at least some of the values of the one or more data fields of the first record format via the at least one user interface, means for displaying a portion of the sequence of characters via the at least one user interface as a sequence of user interface elements, wherein each character of the portion of the sequence of characters is presented in sequence as a separate user interface element, means for receiving user input associated with a first user interface element of the sequence of user interface elements, the first user interface element associated with a first character of the sequence of characters, and means for generating a second record format based on the received input, wherein the second record format is generated to include a data field delimited by the first character.
According to some aspects, a method is provided of determining a record format for a dataset, the dataset comprising a plurality of bytes, the method comprising, with at least one computing device iteratively receiving user input and generating record formats based upon the user input, said iterative process continuing until receiving user input indicating a most recently generated record format is to be output, said iterative process comprising repeating steps of parsing the dataset using an initial record format to determine a sequence of characters represented by the plurality of bytes and determining values of one or more data fields in accordance with the initial record format, displaying at least some of the values of the one or more data fields in accordance with the initial record format via a user interface, displaying a plurality of the sequence of characters via the user interface as a sequence of user interface elements, wherein each of the plurality of characters is presented as a separate user interface element, receiving user input selecting a user interface element of the sequence of user interface elements, the selected user interface element being associated with a character of the sequence of characters, and generating a subsequent record format based on the received input, wherein the subsequent record format is generated to include a data field delimited by the character associated with the selected user interface element.
The foregoing is a non-limiting summary of the invention, which is defined by the attached claims.
Various aspects and embodiments will be described with reference to the following figures. It should be appreciated that the figures are not necessarily drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing.
The inventors have recognized and appreciated that errors made by a data processing system may be efficiently reduced by equipping the data processing system with a tool to assist a user in defining a record format for a dataset. The tool may dynamically analyze contents of the dataset based on real-time feedback provided by the user. The data processing system may apply the defined record format to automatically parse the contents of the dataset, with fewer errors.
The inventors have recognized and appreciated that, in practice, a user tasked with writing a program that parses contents of a dataset does not necessarily know the appropriate record format with which to interpret the contents as intended by the creator of the dataset. Since datasets, whether they include fixed-length and/or variable-length fields, are often prepared to be interpreted as a collection of data fields in a particular manner, a program that parses such a dataset must be written taking into account the intended interpretation before the dataset can be appropriately utilized by the program. Such an interpretation cannot generally be determined simply by looking at the contents.
The inventors have recognized and appreciated that, for datasets containing delimited data fields, the delimiters should be present in the dataset and have developed techniques for generating a user interface that allows a user to identify delimiters based on the content of the dataset. Some conventional interfaces may allow a user to select a delimiter from a pre-defined list of commonly-used delimiter characters (e.g., a comma) and interpret fields from the contents of the dataset as each being delimited by that character. The inventors have recognized, however, that datasets are in practice often constructed to be interpreted using a number of different data field delimiters and/or using unprintable byte values or characters that are not commonly used as delimiters. Without knowing the appropriate record format to parse such a dataset, it can be very difficult for a user to program a data processing system to properly interpret the contents of the dataset. By providing a tool having an interface that allows a user to quickly select a potential delimiter and see the resulting interpretation of the contents of the dataset based on this selection, the user can efficiently generate an appropriate record format.
According to some embodiments, the tool may generate a user interface including a number of user interface elements that each represent a character from a dataset, and that are presented in the order in which they appear in the dataset. A user can provide input to the tool by interacting with each of the user interface elements to convey whether the character represented by the user interface element should be, or should not be, treated as a delimiter of a data field. After each such interaction, the tool may automatically generate a record format that includes a data field defined as being delimited by the identified delimiter. Some or all of the contents of the dataset may be parsed and presented on the user interface in accordance with the record format. The resulting effects of parsing the dataset using this newly generated record format may then be examined by visual inspection by a user through the user interface and/or by an automated analysis by the tool. Thus, whether the selected character is, or is not, a delimiter can be quickly determined. Since the characters are displayed in the same order as they appear in the dataset, a user can easily identify which characters are delimiter candidates and, by interacting with the corresponding user interface element of the tool, quickly generate new record formats until the record format used to generate the dataset is determined.
According to some embodiments, the tool's user interface may include a preview of the dataset contents as parsed with the record format defined by the selected delimiters. This preview may be regenerated automatically when any of the displayed delimiters are selected or unselected, or may be regenerated in response to interaction with a user interface element other than the displayed delimiters (e.g., a “refresh” button). In either case, a user selecting or deselecting delimiters from the displayed sequence of characters of the dataset can quickly ascertain the effects upon parsing contents of the dataset and determine whether a character has been inappropriately selected as a delimiter, or whether there is another unselected character that should be selected as a delimiter. Examples of such processes are discussed in further detail below.
As used herein, a “character” of a dataset may be a printable or a non-printable character, and may be represented in the dataset as any number of bits or bytes. For instance, ASCII characters may be represented by a single byte, and include printable characters (e.g., letters, numbers, etc.) as well as non-printable characters (e.g., the byte value of zero). Alternatively, some datasets may be read using character sets that interpret multiple bytes to represent one character. For instance, a UTF-8 character may be represented by one, two, three or four bytes, and could be a printable character or a non-printable character. Datasets may be interpreted using any suitable character set, as the techniques described herein are not so limited. The user interface may represent non-printable characters in any suitable way, including by displaying the byte value of the character (e.g., “\x09” for the tab character) or by displaying a shorthand representation of the character (e.g., “TAB” or “\t” for the tab character).
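One possible way of producing display labels for printable and non-printable characters is sketched below. The shorthand table, the function names display_label and decode_labels, and the choice of UTF-8 as the default character set are assumptions for this example; any suitable representation may be used.

```python
# Hypothetical shorthand labels for a few common non-printable characters.
SHORTHAND = {"\t": "\\t", "\n": "\\n", "\r": "\\r"}


def display_label(ch: str) -> str:
    """Return a label suitable for showing one character in a user interface."""
    if ch in SHORTHAND:
        return SHORTHAND[ch]
    if ch.isprintable():
        return ch
    # Fall back to the byte value(s) of the character, e.g. "\x00".
    return "".join(f"\\x{b:02x}" for b in ch.encode("utf-8"))


def decode_labels(raw: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode bytes into characters and label each one, preserving order."""
    return [display_label(ch) for ch in raw.decode(encoding)]


# "é" occupies two bytes in UTF-8 but is presented as a single character.
labels = decode_labels("1\té\x00\n".encode("utf-8"))
```

In this sketch the number of labels equals the number of characters rather than the number of bytes, consistent with a character being represented by one or more bytes.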
According to some embodiments, an initial selection state of each of the displayed user interface elements representing characters of the dataset may be predetermined upon initial generation of the user interface. That is, whether each of the user interface elements is initially in a selected state, or in an unselected state, may be predetermined. In some embodiments, heuristics may be applied to the dataset to make an initial qualitative estimation of which characters are delimiters, and the corresponding user interface elements of the user interface may be generated to initially be selected, whereas other characters may be generated to initially be unselected. This approach may therefore provide a user with a starting point in selecting the delimiters, which may decrease the time needed for the user to determine the appropriate record format.
Following below are more detailed descriptions of various concepts related to, and embodiments of, techniques for dynamically defining a data record format. It should be appreciated that various aspects described herein may be implemented in any of numerous ways. Examples of specific implementations are provided herein for illustrative purposes only. In addition, the various aspects described in the embodiments below may be used alone or in any combination, and are not limited to the combinations explicitly described herein.
In the example of
When parsing the dataset 101 using the record format 104, a computer-implemented parsing engine may operate in the following manner. Initially, the parsing engine may determine a value of “field 1” in a first record by looking through the characters of the dataset for a “,” character. For instance, the system may read bytes in a sequence from a dataset, such as a flat file or database table, until a byte value of the “,” character is identified. Once this character is found in the dataset (between the “2” and “D” characters), the preceding characters may be identified as the value of “field 1” for the first record, and the parsing engine may then determine a value of “field 2” by looking through the subsequent characters of the dataset for a newline character (sometimes represented by the shorthand “\n”). The system may create a data structure for the records (e.g., in computer memory) and insert the value of each field as it is determined into this data structure. Once the “\n” character is found (between the “s” and “9”), the preceding characters are identified as the value of “field 2” for the first record, and the parsing engine may then attempt to determine a value of “field 1” in a second record. This process may continue until all of the characters in the dataset have been read and the system's record data structure has been filled with data from the dataset.
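The operation of such a parsing engine may be sketched as follows. This is a minimal illustration rather than the engine itself: the function name parse_delimited, the representation of a record format as a list of (field name, delimiter) pairs, and the sample data are all assumptions. The sample data is chosen only so that a comma falls between a “2” and a “D” and a newline falls between an “s” and a “9”, as in the description above.

```python
def parse_delimited(data: str, record_format: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Parse data into records, reading each field up to its delimiter."""
    records = []
    pos = 0
    while pos < len(data):
        record = {}
        for name, delim in record_format:
            end = data.find(delim, pos)      # scan forward for the delimiter
            if end == -1:
                raise ValueError(f"no {delim!r} found for {name!r} after offset {pos}")
            record[name] = data[pos:end]     # preceding characters are the value
            pos = end + len(delim)           # continue after the delimiter
        records.append(record)
    return records


# Assumed record format: "field 1" ends at a comma, "field 2" at a newline.
RECORD_FORMAT = [("field 1", ","), ("field 2", "\n")]
records = parse_delimited("12,Dogs\n9,Cats\n", RECORD_FORMAT)
```

Each field value is inserted into the record structure as soon as its delimiter is found, and the loop ends once every character has been consumed.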
It is important when parsing a dataset using delimiters that there be no missing delimiters in the data, otherwise the parsing engine would either never find the end of a data field or would produce a data field value that would contain values that were intended by the creator of the dataset to instead be placed within other data fields of the record. Similarly, if the record format is inappropriately defined to include a data field delimited by a character that does not appear in the data file, the parsing engine would never find the end of the data field.
In the example of
First, a system executing a parsing engine determines a value of “field 1” in a first record by looking through the characters of the dataset for a tab character, starting with the first character in the dataset. The first-encountered tab character is located after the “1” and before the “A.” The value of “field 1” is therefore defined to be “1” since this character is the only one between the start of the dataset and the identified delimiter. A value of “field 2” is then determined for the first record by looking through the subsequent characters of the dataset for a comma character, which is located after the “A” and before the “B.” The value of “field 2” is therefore defined to be “A.” In the parsing engine's execution, identification of a value for “field 2” completes a first record and the engine then begins a process of identifying a first field of the second record. The parsing engine determines a value of “field 1” in a second record by looking through the characters of the dataset after the end of the first record (after the comma) for a tab character. This is found after the “2” character and before the “X” character, and as a result the value of “field 1” is defined to be “B and C\n2” where “\n” represents a newline character. Then a value of “field 2” is determined for the second record by looking through the subsequent characters of the dataset for a comma character, but there is no such character. As a result, the parsing engine is unable to determine the bounds of the “field 2” data field of the second record. This may produce an error, either because the data field is identified to have exceeded some predefined maximum field size or because a memory or buffer overflow error occurs. In either case, the dataset is not parsed as intended by the creator of the dataset.
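The failure described above may be reproduced with the following sketch, which reports an error when a delimiter is not found within an assumed maximum field size instead of reading without bound. The names ParseError and parse_with_limit, the limit of 64 characters, and the characters of the sample data after the “X” are assumptions for illustration.

```python
class ParseError(Exception):
    """Raised when a field's delimiter cannot be found; keeps the partial record."""

    def __init__(self, message: str, partial: dict):
        super().__init__(message)
        self.partial = partial


def parse_with_limit(data, record_format, max_field_size=64):
    records, pos = [], 0
    while pos < len(data):
        record = {}
        for name, delim in record_format:
            # Only search as far as the largest field allowed (assumed limit).
            end = data.find(delim, pos, pos + max_field_size + len(delim))
            if end == -1:
                raise ParseError(
                    f"record {len(records) + 1}: no {delim!r} found for {name!r} "
                    f"within {max_field_size} characters of offset {pos}",
                    record,
                )
            record[name] = data[pos:end]
            pos = end + len(delim)
        records.append(record)
    return records


# Record format with a tab-delimited field followed by a comma-delimited field.
FORMAT = [("field 1", "\t"), ("field 2", ",")]
SAMPLE = "1\tA,B and C\n2\tX\n"   # text after "X" is assumed for this example

try:
    parse_with_limit(SAMPLE, FORMAT)
    outcome, partial = "parsed", {}
except ParseError as exc:
    outcome, partial = str(exc), exc.partial
```

In this sketch the partial record carried by the error shows the unintended value of “field 1” in the second record, which is the kind of feedback that may help a user recognize that the newline character should have been treated as a delimiter.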
A user faced with the error depicted in
It may be noted that, in some cases, a parsing engine may successfully parse a dataset without producing the type of error illustrated in
To illustrate how the tool as described herein may operate to determine the canonical record format,
A user viewing the user interface 300 shown in
According to some embodiments, to change the record format the user may interact with one of the user interface elements 310 (e.g., by clicking on the element with a mouse pointer) to change its state from selected to unselected, or vice versa. The parsing engine executed by the tool may then reparse the dataset and display the results in user interface element 330; this operation may be performed in response to the user's changing of the state of a user interface element 310, or may be performed in response to the user interacting with another user interface element not shown in the figure (e.g., a button that regenerates the contents of user interface 330 by generating a new record format according to the selected delimiters and reparsing the dataset using this record format).
A user now has visual confirmation that the selected group of delimiters appropriately parses the dataset, as user interface element 330 illustrates values for a number of fields that appear to contain consistent data and generate no errors. In some embodiments, the tool may select a subset of the records to display. In some cases, the tool may parse only a portion of the records in order to display this subset. In some embodiments, a subset of records may be selected by interface elements provided by user interface 300 that enable a user to examine a number of records, which may span across the dataset, to ensure that the dataset is fully parsed from start to finish. For instance, the user interface 300 may depict records from the start, middle and/or end of the dataset, and/or may provide a control that a user may operate to scroll through the records produced by parsing of the dataset using the selected delimiters. Parsing a portion of the records (e.g., the first ten records, the first five records and the last five records, etc.) using the generated record format may efficiently allow the user to obtain visual confirmation that the generated record format appropriately parses the dataset without it being necessary to parse the entire dataset. The user may thereby efficiently select the appropriate delimiters, obtain confirmation of appropriate parsing, and record the resulting record format.
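Parsing only a portion of the records for display may be sketched with a lazily evaluated parser, as below. The names iter_records and preview, the default limit of ten records and the generated sample data are assumptions for illustration.

```python
from itertools import islice


def iter_records(data, record_format):
    """Yield records one at a time so that parsing can stop early."""
    pos = 0
    while pos < len(data):
        record = {}
        for name, delim in record_format:
            end = data.find(delim, pos)
            if end == -1:
                raise ValueError(f"no {delim!r} found for {name!r} after offset {pos}")
            record[name] = data[pos:end]
            pos = end + len(delim)
        yield record


def preview(data, record_format, limit=10):
    """Parse only as many records as are needed for display."""
    return list(islice(iter_records(data, record_format), limit))


# Hypothetical dataset of 1000 two-field records; only three are parsed here.
FORMAT = [("field 1", ","), ("field 2", "\n")]
sample = "".join(f"{i},v{i}\n" for i in range(1000))
rows = preview(sample, FORMAT, limit=3)
```

A preview limited to the first records, as in this sketch, would not reveal a parsing problem later in the dataset, which is one reason records from the middle and/or end of the dataset may also be shown.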
As a result of the above-described process, the tool producing user interface 300 enabled a user to select an appropriate set of delimiters from amongst a finite number of choices. A provisional record format was generated according to this set of delimiters, and feedback was provided through the user interface such that the user could establish whether or not the provisional record format matches the canonical record format. Since the choices of delimiter presented are from the dataset itself, the delimiters of the canonical record format must be present within those choices. Moreover, selection or deselection of a delimiter, and generation of a new provisional record format reflecting the new set of delimiters, can be limited to interaction (e.g., a mouse click) with a single user interface element. Finally, by providing prompt feedback of the results of parsing the dataset with the newly generated provisional record format, the user can obtain direct feedback on the effects the change in delimiter had upon how the data is parsed. Together, these advantages produce a process in which a (potentially complex) record format may be determined quickly and accurately.
In the example of
In the example of
In the example of
In the example of
In the example of
In the example of
In operation, the tool executing the illustrated user interface 400 generates a new provisional record format according to the selection of delimiters identified through user interface element 420 (e.g., generates a new record format whenever the set of selected delimiters changes). The dataset may then be parsed using the new provisional record format by a parsing engine executed by the tool, and results of said parsing are shown by user interface element 440. Parsing of the dataset by the tool using the most recently generated record format may be performed in response to a change in the selected/unselected state of any of the characters shown by user interface elements 420, and/or in response to activation of the “Apply” button 432.
The illustrative user interface 400 includes a “Clear” button 422 which, when activated, deselects all of the characters as delimiters. The interface 400 also includes a “Suggest” button 424 which, when activated, applies heuristics to determine a set of delimiters that may match the data. These heuristics may sometimes produce the appropriate set of characters, and sometimes may not, but they can be used to at least provide a starting point for a user trying to determine the set of delimiters. Examples of such heuristics are described below.
Method 500 begins in act 504 in which a dataset is parsed by a parsing engine executed by the tool according to a first provisional record format. The dataset may be located on any number of non-transitory computer-readable media accessible to the system executing method 500, or may be provided as a data stream being received from an external system. In some cases, the dataset may be a file stored by one or more volatile and/or non-volatile computer readable storage media. In some cases, the dataset may be data stored within a database (e.g., the dataset may be a table or view of a database). Irrespective of how or where the dataset is stored, the system executing method 500 executes in act 504 a parsing engine to produce a data structure containing records and data fields by parsing the dataset according to the first provisional record format. The first provisional record format may, in some cases, be an empty or otherwise undefined record format when no delimiters have as yet been selected. In other cases, the first provisional record format may include a single delimited field to separate records from one another (e.g., “\n” delimiter) but may otherwise not identify separate fields within each record.
In act 506, results of parsing the dataset are displayed via a user interface along with a sequence of characters from the dataset. Displaying results of parsing the dataset may include displaying of some or all of the records and/or data fields produced in act 504, and may include displaying additional results, such as error messages or other feedback messages relating to parsing of the dataset, via the user interface. The sequence of characters displayed in act 506 may be displayed in the user interface in an order matching the order in which the characters appear in the dataset.
In some embodiments, a selected or unselected state in the user interface of each character of the sequence of characters displayed in act 506 may be determined according to the first provisional record format. That is, the delimited fields defined by the first provisional record format may imply which of the characters of the dataset being shown in the user interface have been selected as delimiters, and these characters may be displayed in the user interface in act 506 as being in a selected state. A selected state in the user interface may include any visual approach or approaches to visually distinguish the selected characters from the unselected characters.
In act 508, a user may provide input to the user interface that causes one of the sequence of characters to change from an unselected state to a selected state, or from a selected state to an unselected state. This input may be provided using any suitable input device and in any suitable way (e.g., by clicking on a user interface element with a mouse or other input device). In act 510, a second provisional record format is generated by the system based on the set of selected delimiters amongst the displayed sequence of characters (which includes the change in said set that occurred in act 508). This set of selected delimiters will either include a character selected in act 508 or will not include a character that was unselected in act 508. Accordingly, in cases where the second provisional record format is generated without additional selection or deselection of characters, the second provisional record format may differ from the first provisional record format by either including an additional data field delimited by the character selected in act 508 or by not including a data field delimited by the character that was deselected in act 508. Aside from this field the two record formats may be otherwise identical.
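Acts 508 and 510 may be sketched as follows. The sketch assumes, purely for illustration, that the selection state is held as a set of positions within the displayed sequence of characters and that each selected character, taken in display order, delimits one field of the generated record format. The names toggle and build_record_format and the sample characters are hypothetical.

```python
def toggle(selected: frozenset, index: int) -> frozenset:
    """Flip the selected/unselected state of the character at one position."""
    return selected ^ {index}


def build_record_format(chars: str, selected: frozenset) -> list[tuple[str, str]]:
    """Generate a record format with one delimited field per selected character."""
    return [
        (f"field {n}", chars[i])
        for n, i in enumerate(sorted(selected), start=1)
    ]


displayed = "1\tA,B\n"              # characters shown in dataset order (assumed)

# First provisional record format: only the newline (position 5) is selected.
selected_before = frozenset({5})
first_format = build_record_format(displayed, selected_before)

# Act 508: the user clicks the tab character at position 1.
selected_after = toggle(selected_before, 1)

# Act 510: a second provisional record format is generated from the new set.
second_format = build_record_format(displayed, selected_after)
```

Under these assumptions the two record formats differ only by the field delimited by the newly selected tab character, and toggling the same position again restores the earlier selection.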
In act 512, the dataset is parsed by a parsing engine executed by the tool according to the second provisional record format. The system executing method 500 executes the parsing engine to produce a data structure containing records and data fields by parsing the dataset according to the second record format. In act 514, results of parsing the contents of the dataset in act 512 are displayed via the user interface. Displaying results of parsing the dataset may include displaying of some or all of the records and/or data fields produced in act 512, and may include displaying additional results, such as error messages or other feedback messages relating to parsing of the dataset, via the user interface.
It will be appreciated that method 500 may be repeated any number of times until a user accepts the most recently generated record format. In some embodiments, the user interface may accordingly include one or more controls that, when activated, proceed to a next step in a process that comprises method 500. Such next steps may include recording the accepted record format in a metadata repository or other datastore (e.g., a database) and/or executing a dataflow graph wherein a dataset is parsed using the accepted record format.
Method 600 begins in act 602 in which it is determined that a dataset for which a record format is to be generated contains multiple delimiters, and that therefore the record format may be generated via the techniques described herein. Potential delimiters may be identified from a list of characters that are assumed to be delimiters when they appear in data. As a non-limiting example, potential delimiters may include all characters that are not alphanumeric, a space, a quote, a period, a slash (e.g., “/” or “\”) or a hyphen character. This list of potential delimiters would thus exclude most typical data characters and search for repeated instances of characters that would typically not be found in, for example, business data. Note that such an approach would consider non-printable characters like a newline character a potential delimiter.
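The non-limiting example above may be sketched as follows. The exclusion set treats both single and double quotation marks as the “quote” character, which is an assumption, as are the function names and the sample data.

```python
from collections import Counter

# Characters excluded in the example above: space, quotes, period,
# forward slash and backslash, and hyphen (alphanumerics are tested separately).
EXCLUDED = set(" '\"./\\-")


def is_potential_delimiter(ch: str) -> bool:
    return not (ch.isalnum() or ch in EXCLUDED)


def potential_delimiters(data: str) -> Counter:
    """Count each potential delimiter, in order of first appearance."""
    return Counter(ch for ch in data if is_potential_delimiter(ch))


# Hypothetical data containing hyphens, slashes and spaces inside field values.
counts = potential_delimiters("1\tA-1,B and C\n2\tX/2,Y\n")
```

In this sketch the tab, comma and newline characters are reported as potential delimiters, while the hyphen, slash and space characters within the field values are not.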
In act 604, a first record format is generated by applying heuristics to the dataset. According to some embodiments, the first record format may be generated comprising delimited data fields each delimited by one of the potential delimiters identified in act 602. According to some embodiments, a frequency with which potential delimiters appear in the data file may be analyzed to select delimiters of the record format. For instance, a potential delimiter that appears significantly more often than other potential delimiters in the dataset may have been erroneously identified as a delimiter. According to some embodiments, it may be assumed that records end with a newline character (or a carriage return and a newline). According to some embodiments, a parsing engine may determine whether a candidate record format fully parses the dataset (i.e., parses the dataset into a complete number of records) to determine whether a set of delimiters may be the appropriate set for parsing of the dataset. If the record format does not fully parse the dataset, this indicates the set of delimiters is not the appropriate one.
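The check of whether a candidate record format fully parses a dataset may be sketched as follows, assuming a candidate is given as an ordered list of field delimiters. The function name fully_parses and the sample data are hypothetical.

```python
def fully_parses(data: str, delimiters: list[str]) -> bool:
    """Return True if the data divides into a whole number of records."""
    if not delimiters:
        return not data                 # an empty format only fits empty data
    pos = 0
    while pos < len(data):
        for delim in delimiters:
            end = data.find(delim, pos)
            if end == -1:
                return False            # a record was left incomplete
            pos = end + len(delim)
    return True                         # every character belongs to a record


sample = "1\tA,B and C\n2\tX,Y\n"
complete = fully_parses(sample, ["\t", ",", "\n"])
incomplete = fully_parses(sample, ["\t", ","])
```

A result of True does not by itself establish that a candidate is the intended record format; for instance, a format consisting of a single newline-delimited field also fully parses this sample, so such a check may be combined with other heuristics or with user review.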
Irrespective of how the first record format is generated in act 604, in act 606 method 500 is executed and a new record format generated according to selection and/or deselection of characters as delimiters. Act 606 may be repeated any number of times until the user is satisfied with the current set of delimiters, at which point the final record format may be recorded in act 608.
The technology described herein is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the technology described herein include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The computing environment may execute computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The technology described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 710 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 710 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 710. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 730 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 731 and random access memory (RAM) 732. A basic input/output system 733 (BIOS), containing the basic routines that help to transfer information between elements within computer 710, such as during start-up, is typically stored in ROM 731. RAM 732 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 720. By way of example, and not limitation,
The computer 710 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
The computer 710 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 780. The remote computer 780 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 710, although only a memory storage device 781 has been illustrated in
When used in a LAN networking environment, the computer 710 is connected to the LAN 771 through a network interface or adapter 770. When used in a WAN networking environment, the computer 710 typically includes a modem 772 or other means for establishing communications over the WAN 773, such as the Internet. The modem 772, which may be internal or external, may be connected to the system bus 721 via the user input interface 760, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 710, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Having thus described several aspects of at least one embodiment of this invention, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art.
Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the spirit and scope of the invention. Further, though advantages of the present invention are indicated, it should be appreciated that not every embodiment of the technology described herein will include every described advantage. Some embodiments may not implement any features described as advantageous herein and in some instances one or more of the described features may be implemented to achieve further embodiments. Accordingly, the foregoing description and drawings are by way of example only.
According to some aspects, a method is provided of determining a record format for a dataset, the dataset comprising a plurality of bytes, the method comprising, with at least one computing device parsing the dataset using a first record format to determine a sequence of characters represented by the plurality of bytes and determining values of one or more data fields using the sequence of characters in accordance with the first record format, displaying at least some of the values of the one or more data fields in accordance with the first record format via a user interface, displaying a plurality of the sequence of characters via the user interface as a sequence of user interface elements, wherein each of the plurality of characters is presented as a separate user interface element, receiving user input selecting a user interface element of the sequence of user interface elements, the selected user interface element being associated with a character of the sequence of characters, and generating a second record format based on the received input, wherein the second record format is generated to include a data field delimited by the character associated with the selected user interface element, parsing a portion of the dataset using the second record format, displaying results of said parsing of the portion of the dataset using the second record format via the user interface, receiving user input indicating that the second record format is to be recorded, and recording the second record format on at least one computer readable medium.
According to some embodiments, displaying the plurality of the sequence of characters may comprise displaying a contiguous subset of the sequence of characters via the user interface as the sequence of user interface elements, wherein each character of the subset is presented in sequence as a separate user interface element.
According to some embodiments, the method may further comprise determining that the second record format does not fully parse the dataset by identifying a memory overflow or by identifying a parsed record that comprises one or more unpopulated data fields, and wherein displaying the results of the parsing of the dataset using the second record format via the user interface comprises displaying an alert that the second record format does not fully parse the dataset.
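For illustration only, detection of a parsed record with one or more unpopulated data fields might be sketched as below. The representation of a record format as a list of (field name, delimiter) pairs, the function names, and the alert wording are assumptions of this sketch; detection of a memory overflow is not shown.

```python
def parse_record(record, fields):
    """Parse one record against a list of (name, delimiter) pairs.

    A field is left as None (unpopulated) when the record is exhausted
    before that field is reached.
    """
    values = {}
    rest = record
    for name, delimiter in fields:
        if rest is None:
            values[name] = None
            continue
        head, sep, tail = rest.partition(delimiter)
        values[name] = head
        # If the delimiter was not found, nothing remains for later fields.
        rest = tail if sep else None
    return values


def unpopulated_fields(text, fields, record_end="\n"):
    """Return (record index, missing field names) for each incomplete record."""
    problems = []
    records = [r for r in text.split(record_end) if r]
    for index, record in enumerate(records):
        values = parse_record(record, fields)
        missing = [name for name, value in values.items() if value is None]
        if missing:
            problems.append((index, missing))
    return problems


def alert_message(problems):
    """Build a user-facing alert, or None when every record parsed fully."""
    if not problems:
        return None
    index, missing = problems[0]
    return (
        "Record format does not fully parse the dataset: "
        f"record {index} is missing {', '.join(missing)}"
    )
```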
According to some embodiments, the method may further comprise determining the first record format based at least in part on one or more heuristics to identify one or more characters as a potential delimiter.
According to some embodiments, determining the first record format may comprise identifying a character of the dataset that is not alphanumeric, a space, a quote, a period, a forward-slash or a hyphen, and generating a data field of the first record format that is delimited by the identified character.
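A minimal sketch of this character-based heuristic is given below, by way of example only. The function name and the choice to skip carriage return and newline characters (on the assumption that they terminate records rather than delimit fields) are introduced here for illustration.

```python
# Characters that the heuristic does not treat as potential delimiters:
# space, single and double quote, period, forward-slash and hyphen.
EXCLUDED = set(" '\"./-")


def potential_delimiters(text):
    """Return candidate delimiter characters in order of first appearance."""
    seen = []
    for ch in text:
        # Assumption: newline and carriage return end records, so they are
        # not offered as field delimiters.
        if ch.isalnum() or ch in EXCLUDED or ch in "\r\n":
            continue
        if ch not in seen:
            seen.append(ch)
    return seen
```

Under this sketch, non-printable characters such as `"\x01"` qualify as potential delimiters in the same way as printable punctuation such as `"|"`.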
According to some embodiments, the first character may be a non-printable character.
According to some embodiments, the first record format may include only delimited data fields.
According to some embodiments, the user input may cause the at least one computing device to alter the selected user interface element's appearance in the user interface.
According to some embodiments, displaying the results of said parsing of the dataset using the first record format via the user interface may comprise displaying a list of records of the dataset and data field values of the records.
According to some embodiments, the first record format may include a plurality of delimited data fields having a plurality of different delimiters.
According to some aspects, a computer system is provided comprising at least one processor, at least one user interface device, and at least one computer readable medium comprising processor-executable instructions that, when executed, cause the at least one processor to parse a dataset comprising a plurality of bytes using a first record format to determine a sequence of characters represented by the plurality of bytes and determine values of one or more data fields in accordance with the first record format, display, via the at least one user interface device, at least some of the values of the one or more data fields of the first record format via the at least one user interface, display, via the at least one user interface device, a plurality of the sequence of characters via the at least one user interface as a sequence of user interface elements, wherein each of the plurality of characters is presented as a separate user interface element, receive, via the at least one user interface device, user input selecting a user interface element of the sequence of user interface elements, the selected user interface element being associated with a character of the sequence of characters, generate a second record format based on the received input, wherein the second record format is generated to include a data field delimited by the character associated with the selected user interface element, parse a portion of the dataset using the second record format, display results of said parsing of the portion of the dataset using the second record format via the user interface, receive user input indicating that the second record format is to be recorded, and record the second record format on at least one computer readable medium.
According to some embodiments, displaying the plurality of the sequence of characters may comprise displaying a contiguous subset of the sequence of characters via the user interface as the sequence of user interface elements, wherein each character of the subset is presented in sequence as a separate user interface element.
According to some embodiments, the processor-executable instructions may further cause the at least one processor to determine that the second record format does not fully parse the dataset by identifying a memory overflow or by identifying a parsed record that comprises one or more unpopulated data fields, and wherein displaying the results of the parsing of the dataset using the second record format via the user interface comprises displaying an alert that the second record format does not fully parse the dataset.
According to some embodiments, the processor-executable instructions may further cause the at least one processor to determine the first record format based at least in part on one or more heuristics to identify one or more characters as a potential delimiter.
According to some embodiments, determining the first record format may comprise identifying a character of the dataset that is not alphanumeric, a space, a quote, a period, a forward-slash or a hyphen, and generating a data field of the first record format that is delimited by the identified character.
According to some embodiments, determining the first record format may comprise identifying a data record delimiter.
According to some embodiments, the user input may cause the at least one processor to alter the first user interface element's appearance in the user interface.
According to some embodiments, displaying the results of said parsing of the dataset using the first record format via the at least one user interface device may comprise displaying a list of records of the dataset and data field values of the records.
According to some embodiments, the first record format may include a plurality of delimited data fields having a plurality of different delimiters.
According to some aspects, a computer system is provided comprising at least one processor, means for parsing a dataset comprising a plurality of bytes using a first record format to determine a sequence of characters represented by the plurality of bytes and determining values of one or more data fields in accordance with the first record format, means for displaying at least some of the values of the one or more data fields of the first record format via the at least one user interface, means for displaying a portion of the sequence of characters via the at least one user interface as a sequence of user interface elements, wherein each character of the portion of the sequence of characters is presented in sequence as a separate user interface element, means for receiving user input associated with a first user interface element of the sequence of user interface elements, the first user interface element associated with a first character of the sequence of characters, means for generating a second record format based on the received input, wherein the second record format is generated to include a data field delimited by the first character, means for parsing a portion of the dataset using the second record format, means for displaying results of said parsing of the portion of the dataset using the second record format via the user interface, means for receiving user input indicating that the second record format is to be recorded, and means for recording the second record format on at least one computer readable medium.
According to some aspects, a method is provided of determining a record format for a dataset, the dataset comprising a plurality of bytes, the method comprising, with at least one computing device iteratively receiving user input and generating record formats based upon the user input, said iterative process continuing until receiving user input indicating a most recently generated record format is to be output, said iterative process comprising repeating steps of parsing the dataset using an initial record format to determine a sequence of characters represented by the plurality of bytes and determining values of one or more data fields in accordance with the initial record format, displaying at least some of the values of the one or more data fields in accordance with the initial record format via a user interface, displaying a plurality of the sequence of characters via the user interface as a sequence of user interface elements, wherein each of the plurality of characters is presented as a separate user interface element, receiving user input selecting a user interface element of the sequence of user interface elements, the selected user interface element being associated with a character of the sequence of characters, generating a subsequent record format based on the received input, wherein the subsequent record format is generated to include a data field delimited by the character associated with the selected user interface element, and ending the iterative process upon receiving the user input indicating a most recently generated record format is to be output, and recording the most recently generated record format on at least one computer readable medium.
The above-described embodiments of the technology described herein can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component, including commercially available integrated circuit components known in the art by names such as CPU chips, GPU chips, microprocessors, microcontrollers, or co-processors. Alternatively, a processor may be implemented in custom circuitry, such as an ASIC, or semi-custom circuitry resulting from configuring a programmable logic device. As yet a further alternative, a processor may be a portion of a larger circuit or semiconductor device, whether commercially available, semi-custom or custom. As a specific example, some commercially available microprocessors have multiple cores such that one or a subset of those cores may constitute a processor. However, a processor may be implemented using circuitry in any suitable format.
Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.
Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format.
Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
In this respect, the invention may be embodied as a computer readable storage medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs (CD), optical discs, digital video disks (DVD), magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the invention discussed above. As is apparent from the foregoing examples, a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form. Such a computer readable storage medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present invention as discussed above. As used herein, the term “computer-readable storage medium” encompasses only a non-transitory computer-readable medium that can be considered to be a manufacture (i.e., article of manufacture) or a machine. Alternatively or additionally, the invention may be embodied as a computer readable medium other than a computer-readable storage medium, such as a propagating signal.
The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present invention as discussed above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present invention.
Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that conveys relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
Various aspects of the present invention may be used alone, in combination, or in a variety of arrangements not specifically discussed in the embodiments described in the foregoing, and are therefore not limited in their application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.
Also, the invention may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
Further, some actions are described as taken by a “user.” It should be appreciated that a “user” need not be a single individual, and that in some embodiments, actions attributable to a “user” may be performed by a team of individuals and/or an individual in combination with computer-assisted tools or other mechanisms.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
The present application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 62/542,631, filed Aug. 8, 2017, titled “Techniques for Dynamically Defining a Data Record Format,” which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
62542631 | Aug 2017 | US