The subject disclosure generally relates to recognizing structure in text, and more particularly to a grammar programming language for recognizing structure in text.
Text is often the most natural way to represent information for presentation and editing by people. However, the ability to extract that information for use by software has been an arcane art practiced only by the most advanced developers. The success of XML is evidence that there is significant demand for using text to represent information—this evidence is even more compelling considering the relatively poor readability of XML syntax and the decade-long challenge to make XML-based information easily accessible to programs and stores. The emergence of simpler technologies like JSON and the growing use of meta-programming facilities in Ruby to build textual domain specific languages (DSLs) such as Ruby on Rails or Rake speak to the desire for natural textual representations of information. However, even these technologies limit the expressiveness of the representation by relying on fixed formats to encode all information uniformly, resulting in text that has very few visual cues from the problem domain (much like XML).
The above-described deficiencies are merely intended to provide an overview of some of the problems of conventional systems, and are not intended to be exhaustive. Other problems with conventional systems and corresponding benefits of the various non-limiting embodiments described herein may become further apparent upon review of the following description.
A simplified summary is provided herein to help enable a basic or general understanding of various aspects of exemplary, non-limiting embodiments that follow in the more detailed description and the accompanying drawings. However, this summary is not intended to represent an extensive or exhaustive overview. Instead, the sole purpose of this summary is to present some concepts related to some exemplary non-limiting embodiments in a simplified form as a prelude to the more detailed description of the various embodiments that follow.
Embodiments of a method, system, and computer program product for processing information embedded in a text file with a grammar programming language are described. In various non-limiting embodiments, the method includes receiving a text file having a plurality of input values. Within such embodiment, each of the input values is parsed according to a set of rules. The method also includes compiling a script so as to produce a set of candidate textual shapes such that each of the candidate textual shapes corresponds to a potential interpretation of the input values. Finally, the method concludes with providing an output, which may include either a processed value or a textual representation of the text file. Here, the processed value corresponds to a particular textual shape selected from the candidate textual shapes, and the textual representation includes generic data structures that facilitate providing any of the candidate textual shapes, where the generic data structures are a function of the set of rules.
In another embodiment, a computer-readable storage medium is provided. Within such embodiment, five modules including instructions for executing various tasks are provided. The first module includes instructions for receiving a text file as an input, whereas the second module includes instructions for providing a library of constructs for interpreting a textual shape of the text file. The third module includes instructions for providing a script editor configured to facilitate generating a script of a grammar programming language in which the script includes constructs from the construct library. The fourth module includes instructions for compiling the script against the text file so as to generate candidate textual shapes in which each of the candidate textual shapes corresponds to a potential interpretation of the text file. Finally, the fifth module includes instructions for providing an output, which may include either a processed value or a textual representation of the text file. Here again, the processed value corresponds to a particular textual shape selected from the candidate textual shapes, and the textual representation includes generic data structures that facilitate providing any of the candidate textual shapes, where the generic data structures are a function of a set of rules.
In yet another embodiment, a system for processing information embedded in a text file with a grammar programming language is provided. The system includes means for receiving a text file having a plurality of input values. Within such embodiment, means for parsing each of the input values according to a set of rules is provided. The system also includes means for identifying a syntactical ambiguity, as well as means for identifying a token ambiguity. The system further includes means for prioritizing a set of candidate textual shapes in which at least one candidate resolution to the syntactical ambiguity is included in the candidate textual shapes. Also included are means for resolving the token ambiguity, as well as means for compiling a script so as to produce the candidate textual shapes such that each of the candidate textual shapes corresponds to a potential interpretation of the input values. Finally, the system includes means for providing an output, which may include either a processed value or a textual representation of the text file. Here again, the processed value corresponds to a particular textual shape selected from the candidate textual shapes, and the textual representation includes generic data structures that facilitate providing any of the candidate textual shapes, where the generic data structures are a function of the set of rules.
These and other embodiments are described in more detail below.
Various non-limiting embodiments are further described with reference to the accompanying drawings in which:
Various embodiments are now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be evident, however, that such embodiment(s) may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more embodiments.
In an aspect, a novel grammar programming language (hereinafter sometimes referred to as “Mg”) is provided. As will be discussed in more detail below, particular embodiments described herein enable information to be represented in a textual form that is tuned for both the problem domain and the target audience.
Referring first to
When used as a transformation language, however, Mg scripts may be used to project the textual input of text file 110 into generic data structures that are amenable to further processing or storage such as text file representation 140. Indeed, in an embodiment, data that results from Mg processing is compatible with Mg's sister language, The “Oslo” Modeling Language, “M”, which provides a SQL-compatible schema and query language that can be used to further process the underlying information of text file 110. Here, it should be noted that, although Mg is particularly useful within the context of parsing computer program text, text file 110 may include any file that includes a plurality of characters.
Referring next to
In one aspect, processor component 210 is configured to execute computer-readable instructions related to performing any of a plurality of functions. Such functions may include controlling any of memory component 220, interface component 230, construct library component 240, parser component 250, and/or compiler component 260. Other functions performed by processor component 210 may include analyzing information and/or generating information that can be utilized by any of memory component 220, interface component 230, construct library component 240, parser component 250, and/or compiler component 260. Here, it should also be noted that processor component 210 can be a single processor or a plurality of processors.
In another aspect, memory component 220 is coupled to processor component 210 and configured to store computer-readable instructions executed by processor component 210. Memory component 220 may also be configured to store any of a plurality of other types of data including, for instance, queued text files to be analyzed, compile-time artifacts, etc., as well as data generated by any of interface component 230, construct library component 240, parser component 250, and/or compiler component 260. Memory component 220 can be configured in a number of different configurations, including as random access memory, battery-backed memory, hard disk, magnetic tape, etc. Various features can also be implemented upon memory component 220, such as compression and automatic back up (e.g., use of a Redundant Array of Independent Drives configuration).
As shown, computing system 200 may also include interface component 230. In an embodiment, interface component 230 is coupled to processor component 210 and configured to interface computing system 200 with external entities. For instance, interface component 230 may be configured to receive text files to be analyzed, as well as to provide a script editor tool for authoring Mg scripts. Interface component 230 may also be configured to display an output to a user, as well as to transmit the output to an external entity (e.g., via a network connection).
In another aspect, computing system 200 also includes construct library 240, as shown. Within such embodiment, construct library 240 includes a plurality of constructs that may be utilized to describe the shape of a textual language. Moreover, construct library 240 provides a user with a plurality of constructs that may be used to author Mg scripts designed to ascertain the particular textual shape of a text file. Such constructs may be utilized to enforce particular rules, including rules designed to resolve potential ambiguities encountered while parsing a text file. Various constructs provided in Mg are discussed in more detail later.
Computing system 200 may also include parser component 250. In an embodiment, parser component 250 is configured to parse received text files according to a set of rules, which may include a set of default rules and/or a set of rules explicitly declared by a user. Specifically, parser component 250 is configured to ascertain the textual value of each character, either individually or in combination, so as to determine how such textual value should be represented.
In another aspect, computing system 200 also includes compiler component 260, as shown. In an embodiment, compiler component 260 is coupled to processor component 210 and configured to compile scripts generated by a user. Here, it should be noted that compiler component 260 may be configured to compile any of a plurality of types of compile-time artifacts. For instance, in an aspect, a plurality of candidate textual shapes for a given text file might be compiled, wherein such candidate textual shapes correspond to potential interpretations of parsed text values.
Turning to
Referring next to
Referring next to
Referring next to
Referring next to
In another embodiment, lexical ambiguities are resolved using an ambiguity resolution mechanism provided by the parser. Within such embodiment, each time the parser asks the lexer for a token, the parser provides the lexer with an indication of the last token received and which token patterns it is expecting at that time, and the lexer restricts the token patterns it considers to that set. The lexer then starts at the next character after the previously returned token and tries to apply each pattern to the subsequent input “greedily.” Each pattern that matches then produces a token at the longest length that the pattern supports. This mechanism may be referred to as a “local max-munch” mechanism because each pattern “max-munches” separately, instead of the whole lexer “max-munching” for the union of all acceptable patterns. For instance, if two or more tokens of different lengths are returned, the parser spawns a different “thread” of execution for each possible token; the threads are then no longer synchronized at the same character position and may diverge. Exemplary Mg code for this mechanism may include:
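The referenced Mg code may be sketched as follows; the patterns for the EndHelloWorld and EndGobbler tokens are hypothetical placeholders, chosen only to be consistent with the description in the next paragraph:

```
language AmbiguityExample {
    syntax Main = HelloWorld | Gobbler;
    syntax HelloWorld = Hello World Dash EndHelloWorld;
    syntax Gobbler = EverythingButDash Dash EndGobbler;
    token Hello = "hello";
    token World = "world";
    token Dash = "-";
    token EverythingButDash = (any - "-")+;
    token EndHelloWorld = "hw";
    token EndGobbler = "gobble";
    interleave Whitespace = " "+;
}
```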
This language operates in the following manner. Upon execution, the two alternatives of “Main” start consuming input, wherein the initial tokens allowed are “Hello” and “EverythingButDash.” Therefore, if “hello” is followed by a whitespace, the first tokens for both “Main” alternatives are satisfied. On the “HelloWorld” path, a “World” token (or interleaves) is expected, whereas a “Dash” token (or interleaves) is expected on the “Gobbler” path. If “world” is seen, the text is consumed, wherein a “Dash” token (or interleaves) is now expected by both the “HelloWorld” path and the “Gobbler” path. Once a “Dash” token is seen, only an “EndHelloWorld” token or an “EndGobbler” token is subsequently expected. Based on whether an “EndHelloWorld” or “EndGobbler” token is seen, one or the other syntax is uniquely matched. As a result, a token like “EverythingButDash” may be defined without overwhelming all lexing (i.e., it is only considered when it is expected as a parse state).
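By way of illustration only (the helper below is not part of Mg and its names are assumptions), the local max-munch behavior can be sketched in a few lines: each expected token pattern independently takes its own longest match at the current position, so patterns of different lengths can both succeed.

```python
import re

def local_max_munch(text, pos, expected_patterns):
    # For each expected token pattern, independently take that pattern's
    # longest match at the current position ("local max-munch"), instead
    # of letting a single global longest match win for all patterns.
    candidates = []
    for name, pattern in expected_patterns:
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > pos:
            candidates.append((name, m.group(0)))
    # If candidates of different lengths are returned, a parser would
    # spawn one "thread" of execution per candidate token.
    return candidates

tokens = local_max_munch("hello world-hw", 0, [
    ("Hello", r"hello"),
    ("EverythingButDash", r"[^-]+"),
])
```

Here both patterns match at position 0 but at different lengths, so a parser would pursue both interpretations until a later token (such as the end token) disambiguates them.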
Referring next to
Referring next to
Referring next to
Exemplary Grammar Programming Language
As stated previously, an exemplary grammar language that is compatible with the scope and spirit of the disclosed subject matter is the M Grammar Language (Mg), which was developed by the assignee of the subject application. In addition to Mg, however, it is to be understood that other similar programming languages may be used, and that the utility of the disclosed subject matter is not limited to any single programming language. A brief description of Mg is provided below.
In an embodiment, an Mg-based language definition includes one or more named rules, each of which describe some part of the language. The following fragment is an example of a simple language definition:
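Such a definition may be sketched as follows (the exact fragment may differ):

```
language HelloLanguage {
    syntax Main = "Hello, World";
}
```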
The language being specified is named HelloLanguage and it is described by one rule named Main. A language may contain more than one rule; the name Main is used to designate the initial rule that all input documents must match in order to be considered valid with respect to the language.
In one aspect, rules use patterns to describe the set of input values that the rule applies to. The Main rule above has only one pattern, “Hello, World”, that describes exactly one legal input value:
Hello, World
If that input is fed to the Mg processor for this language, the processor will report that the input is valid. Any other input will cause the processor to report the input as invalid.
Typically, a rule will use multiple patterns to describe alternative input formats that are logically related. For example, consider the following language:
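Such a language may be sketched as follows (the language name is illustrative):

```
language PrimaryColors {
    syntax Main = "Red" | "Green" | "Blue";
}
```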
Here, the Main rule has three patterns—input must conform to one of these patterns in order for the rule to apply. That means that the following is valid:
Red
as well as this:
Green
and this:
Blue
No other input values are valid in this language.
Most patterns in the wild are more expressive than those mentioned thus far—most patterns combine multiple terms. Every pattern consists of a sequence of one or more grammar terms, each of which describes a set of legal text values. Pattern matching has the effect of consuming the input as it sequentially matches the terms in the pattern. Each term in the pattern consumes zero or more initial characters of input—the remainder of the input is then matched against the next term in the pattern. If the terms in a pattern cannot all be matched, the consumption is “undone” and the original input may be used as a candidate for matching against other patterns within the rule.
A pattern term can either specify a literal value (like in the first example) or the name of another rule. The following language definition matches the same input as the first example:
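Such a definition may be sketched as follows (the rule name Prefix anticipates its later use as an argument):

```
language HelloLanguage {
    syntax Main = Prefix ", World";
    syntax Prefix = "Hello";
}
```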
Like functions in a traditional programming language, rules can be declared to accept parameters. A parameterized rule declares one or more “holes” that must be specified to use the rule. The following is a parameterized rule:
syntax Greeting(salutation, separator) = salutation separator "World";
To use a parameterized rule, actual rules may simply be provided as arguments to be substituted for the declared parameters:
syntax Main = Greeting(Prefix, ",");
It should also be noted that a given rule name may be declared multiple times provided each declaration has a different number of parameters. That is, the following is legal:
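For example, the following overloaded declarations may coexist (the patterns are illustrative):

```
syntax Greeting = "Hello" " World";
syntax Greeting(salutation) = salutation " World";
syntax Greeting(salutation, separator) = salutation separator "World";
```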
The selection of which rule is used is determined based on the number of arguments present in the usage of the rule.
A pattern may indicate that a given term may match repeatedly using the standard Kleene operators (e.g., ?, *, and +). For example, consider this language:
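Such a language may be sketched as follows (the exclamation-mark term is illustrative):

```
syntax Main = "Hello, World" "!"*;
// valid: "Hello, World", "Hello, World!", "Hello, World!!", ...
```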
This language considers the following all to be valid:
Terms can be grouped using parentheses to indicate that a group of terms must be repeated:
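A grouped, repeated sequence may be sketched as follows (the terms are illustrative):

```
syntax Main = ("Hello" "!")+;
// valid: "Hello!", "Hello!Hello!", "Hello!Hello!Hello!", ...
```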
which considers the following to all be valid input:
The use of the + operator indicates that the group of terms must match at least once.
In the previous examples of the HelloLanguage, the pattern term for the comma separator included a trailing space. That trailing space was significant, as it allowed the input text to include a space after the comma:
Hello, World
More importantly, the pattern indicates that the space is not only allowed, but is required. That is, the following input is not valid:
Hello,World
Moreover, exactly one space is required, making this input invalid as well:
Hello,  World
To allow any number of spaces to appear either before or after the comma, the rule could have been written like this:
syntax Main = "Hello" " "* "," " "* "World";
While this is correct, in practice most languages have many places where secondary text such as whitespace or comments can be interleaved with constructs that are primary in the language. To simplify specifying such languages, a language may specify one or more named interleave patterns.
An interleave pattern specifies text streams that are not considered part of the primary flow of text. When processing input, the Mg processor implicitly injects interleave patterns between the terms in all syntax patterns. For example, consider this language:
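Such a language may be sketched as follows (the interleave rule name is illustrative):

```
language HelloLanguage {
    syntax Main = "Hello" "," "World";
    interleave Whitespace = " "+;
}
```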
This language now accepts any number of whitespace characters before or after the comma. That is,
are all valid with respect to this language.
Interleave patterns simplify defining languages that have secondary text like whitespace and comments. However, many languages have constructs in which such interleaving needs to be suppressed. To specify that a given rule is not subject to interleave processing, the rule is written as a token rule rather than a syntax rule. Token rules identify the lowest level textual constructs in a language—by analogy token rules identify words and syntax rules identify sentences. Like syntax rules, token rules use patterns to identify sets of input values. Here's a simple token rule:
token BinaryValueToken = ("0" | "1")+;
It identifies sequences of 0 and 1 characters much like this similar syntax rule:
syntax BinaryValueSyntax = ("0" | "1")+;
A distinction between the two rules is that interleave patterns do not apply to token rules. That means that if the following interleave rule was in effect:
interleave IgnorableText = " "+;
then the following input value:
0 1011 1011
would be valid with respect to the BinaryValueSyntax rule but not with respect to the BinaryValueToken rule, as interleave patterns do not apply to token rules.
Mg also provides a shorthand notation for expressing alternatives that consist of a range of Unicode characters. For example, the following rule:
token AtoF = "A" | "B" | "C" | "D" | "E" | "F";
can be rewritten using the range operator as follows:
token AtoF = "A".."F";
Ranges and alternation can compose to specify multiple non-contiguous ranges:
token AtoGnoD = "A".."C" | "E".."G";
which is equivalent to this longhand form:
token AtoGnoD = "A" | "B" | "C" | "E" | "F" | "G";
Note that the range operator only works with text literals that are exactly one character in length.
The patterns in token rules have a few additional features that are not valid in syntax rules. Specifically, token patterns can be negated to match anything not included in the set, by using the difference operator (−). The following example combines “difference” with “any.” “Any” matches any single character. The expression below matches any character that is not a vowel:
any - ('A' | 'E' | 'I' | 'O' | 'U')
Token rules are named and may be referred to by other rules:
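For instance, one token rule may build on another (the token names are illustrative):

```
token Vowel = 'A' | 'E' | 'I' | 'O' | 'U';
token Letter = 'A'..'Z';
token Consonant = Letter - Vowel;
```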
Because token rules are processed before syntax rules, token rules cannot refer to syntax rules:
However, syntax rules may refer to token rules:
The Mg processor treats all literals in syntax patterns as anonymous token rules. That means that the previous example is equivalent to the following:
Operationally, the difference between token rules and syntax rules is when they are processed. Token rules are processed first against the raw character stream to produce a sequence of named tokens. The Mg processor then processes the language's syntax rules against the token stream to determine whether the input is valid and optionally to produce structured data as output. The next section describes how that output is formed.
Mg processing transforms text into structured data. The shape and content of that data is determined by the syntax rules of the language being processed. Each syntax rule consists of a set of productions, each of which consists of a pattern and an optional projection. Patterns were discussed previously and describe a set of legal character sequences that are valid input. Projections describe how the information represented by that input should be produced.
Each production is like a function from text to structured data. The primary way to write projections is to use a simple construction syntax that produces graph-structured data suitable for programs and stores. For example, consider this rule:
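Such a rule may be sketched as follows (the node label is illustrative):

```
syntax Genre = "Rock" => Genre { "Rock" };
// produces the value: Genre { "Rock" }
```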
This rule has one production that has a pattern that matches “Rock” and a projection that produces the following value (using a notation known as D graphs):
Rules can contain more than one production in order to allow different input to produce very different output. Here's an example of a rule that contains three productions with very different projections:
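Such a rule may be sketched as follows (the rule name and node labels are illustrative):

```
syntax Pet = "Dog"     => Pet { Kind { "Canine" } }
           | "Cat"     => Pet { Kind { "Feline" } }
           | "Hamster" => Pet { Kind { "Rodent" } };
// input "Hamster" yields: Pet { Kind { "Rodent" } }
```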
When a rule with more than one production is processed, the input text is tested against all of the productions in the rule to determine whether the rule applies. If the input text matches the pattern from exactly one of the rule's productions, then the corresponding projection is used to produce the result. In this example, when presented with the input text “Hamster”, the rule would yield the following as a result:
To allow a syntax rule to match no matter what input it is presented with, a syntax rule may specify a production that uses the empty pattern, which will be selected if and only if none of the other productions in the rule match:
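Assuming the empty pattern is written with the empty keyword, such a rule may be sketched as follows (the rule name and projections are illustrative):

```
syntax OptionalGreeting = "Hello" => { "Hello" }
                        | empty   => { };
```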
When the production with the empty pattern is chosen, no input is consumed as part of the match.
To allow projections to use the input text that was consumed during pattern matching, a variable name may be associated with an individual pattern term by prefixing the term with an identifier separated by a colon. These variable names are then made available to the projection. For example, consider this language:
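A language consistent with the input and node labels discussed below may be sketched as:

```
language GradientLanguage {
    syntax Main = from:Color ", " to:Color
        => Gradient { Start { from }, End { to } };
    syntax Color = "Red" | "Blue";
}
// input "Red, Blue" produces: Gradient { Start { "Red" }, End { "Blue" } }
```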
Given this input value:
Red, Blue
The Mg processor would produce this output:
As in all projection expressions discussed thus far, literal values may appear in the output graph. The set of literal types supported by Mg, with a few examples of each, follows:
Text literals—“ABC”, ‘ABC’
Integer literals—25, −34
Real literals—0.0, −5.0E15
Logical literals—true, false
Null literal—null
The projections discussed thus far all attach a label to each graph node in the output (e.g., Gradient, Start, etc.). The label is optional and can be omitted:
syntax Naked = t1:First t2:Second => { t1, t2 };
The label can be an arbitrary string—to allow labels to be escaped, one uses the id operator:
syntax Fancy = t1:First t2:Second => id("Label with Spaces!") { t1, t2 };
The id operator works with either literal strings or with variables that are bound to input text:
syntax Fancy = name:Name t1:First t2:Second => id(name) { t1, t2 };
Using id with variables allows the labeling of the output data to be driven dynamically from input text rather than statically defined in the language. This example works when the variable name is bound to a literal value. If the variable was bound to a structured node that was returned by another rule, that node's label can be accessed using the labelof operator:
syntax Fancier = p:Point => id(labelof(p)) { 1, 2, 3 };
The labelof operator returns a string that can be used both in the id operator and as a node value.
The projection expressions shown so far have no notion of order. That is, this projection expression:
A{X{100},Y{200}}
is semantically equivalent to this:
A{Y{200},X{100}}
and implementations of Mg are not required to preserve the order specified by the projection. To indicate that order is significant and must be preserved, brackets are used rather than braces. This means that this projection expression:
A[X{100},Y{200}]
is not semantically equivalent to this:
A[Y{200},X{100}]
The use of brackets is common when the sequential nature of information is important and positional access is desired in downstream processing.
Sometimes it is useful to splice the nodes of a value together into a single collection. The valuesof operator will return the values of a node (labeled or unlabeled) as top-level values that are then combinable with other values as values of a new node.
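Such a splice, consistent with the description below, may be sketched as follows (the rule names are illustrative):

```
syntax Main = "a" "," list:List => ["a", valuesof(list)];
syntax List = i:Item => [i]
            | l:List "," i:Item => [valuesof(l), i];
```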
Here, valuesof(list) returns all the values of the list node, which are then combinable with “a” to form a new list.
Productions that do not specify a projection get the default projection. For example, consider the following language whose productions do not specify projections:
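Such a language may be sketched as follows (the names are illustrative):

```
language Colors {
    syntax Main = Gradient;
    syntax Gradient = Color "on" Color;
    token Color = "Red" | "Green" | "Blue";
    interleave Whitespace = " "+;
}
```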
When presented with the input “Blue on Green” the language processor returns the following output:
Main[Gradient["Blue", "on", "Green"]]
These default semantics allow grammars to be authored rapidly while still yielding understandable output. However, in practice explicit projection expressions provide language designers complete control over the shape and contents of the output.
All of the examples shown so far have been “loose Mg” that is taken out of context. To write a legal Mg document, all source text must appear in the context of a module definition. A module defines a top-level namespace for any languages that are defined. Below is an exemplary module definition:
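Such a module definition may be sketched as follows (the rule bodies are illustrative):

```
module Literals {
    language Number {
        syntax Main = Digit+;
        token Digit = "0".."9";
    }
}
```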
In this example, the module defines one language named Literals.Number. Modules may refer to declarations in other modules by using an import directive to name the module containing the referenced declarations. For a declaration to be referenced by other modules, the declaration must be explicitly exported using an export directive. For example, consider the following module:
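Such a module may be sketched as follows (the language bodies are illustrative):

```
module MyModule {
    import HerModule;
    export MyLanguage1;
    language MyLanguage1 { syntax Main = "m1"; }
    language MyLanguage2 { syntax Main = "m2"; }
}
```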
Note that only MyLanguage1 is visible to other modules. This makes the following definition of HerModule legal:
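Assuming a module named MyModule that exports MyLanguage1 and itself imports HerModule, a sketch of HerModule might be (the qualified-name syntax is illustrative):

```
module HerModule {
    import MyModule;
    export HerLanguage;
    language HerLanguage {
        syntax Main = MyModule.MyLanguage1.Main;
    }
}
```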
As this example shows, modules may have circular dependencies.
Referring next to lexical structure, it should be noted that an Mg program may include one or more source files, known formally as compilation units. A compilation unit file is an ordered sequence of Unicode characters. Compilation units typically have a one-to-one correspondence with files in a file system, but this correspondence is not required. For maximal portability, it is recommended that files in a file system be encoded with the UTF-8 encoding.
Conceptually speaking, a program may be compiled in four steps. First, a lexical analysis translates a stream of Unicode input characters into a stream of tokens. In an embodiment, lexical analysis evaluates and executes pre-processing directives. Second, a syntactic analysis translates the stream of tokens into an abstract syntax tree. Third, a semantic analysis resolves all symbols in the abstract syntax tree, type-checks the structure, and generates a semantic graph. Fourth, a code generation step generates instructions from the semantic graph for some target runtime, producing an image. Further tools may link images and load them into a runtime.
Referring next to grammars, it should be noted that hereinafter the syntax of the Mg programming language will be presented using two grammars. A lexical grammar defines how Unicode characters are combined to form line terminators, white space, comments, tokens, and pre-processing directives, whereas a syntactic grammar defines how the tokens resulting from the lexical grammar are combined to form Mg programs.
In an embodiment, the lexical and syntactic grammars are presented using grammar productions. Each grammar production defines a non-terminal symbol and the possible expansions of that non-terminal symbol into sequences of non-terminal or terminal symbols. In grammar productions, non-terminal symbols are shown in italic type, and terminal symbols are shown in a fixed-width font. The first line of a grammar production is the name of the non-terminal symbol being defined, followed by a colon. Each successive indented line contains a possible expansion of the non-terminal given as a sequence of non-terminal or terminal symbols. For example, the production:
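Consistent with the description that follows, such a production may be rendered as:

```
IdentifierVerbatim:
    [ IdentifierVerbatimCharacters ]
```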
defines an IdentifierVerbatim to consist of the token “[”, followed by IdentifierVerbatimCharacters, followed by the token “]”.
When there is more than one possible expansion of a non-terminal symbol, the alternatives are listed on separate lines. For example, the production:
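Consistent with the description that follows, such a production may be rendered as:

```
DecimalDigits:
    DecimalDigit
    DecimalDigits DecimalDigit
```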
defines DecimalDigits to either consist of a DecimalDigit or consist of DecimalDigits followed by a DecimalDigit. In other words, the definition is recursive and specifies that a decimal-digits list consists of one or more decimal digits.
A subscripted suffix “opt” may be used to indicate an optional symbol. The production:
is shorthand for:
and defines a DecimalLiteral to consist of an IntegerLiteral, followed by a ‘.’, a DecimalDigit, and optional DecimalDigits.
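Both productions may be sketched as follows, with “opt” denoting the subscripted suffix:

```
DecimalLiteral:
    IntegerLiteral . DecimalDigit DecimalDigits(opt)
```

which is shorthand for:

```
DecimalLiteral:
    IntegerLiteral . DecimalDigit
    IntegerLiteral . DecimalDigit DecimalDigits
```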
Alternatives are normally listed on separate lines, though in cases where there are many alternatives, the phrase “one of” may precede a list of expansions given on a single line. This is simply shorthand for listing each of the alternatives on a separate line. For example, the production:
is shorthand for:
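Both forms may be sketched with a hypothetical Sign production:

```
Sign: one of
    + -
```

which is shorthand for:

```
Sign:
    +
    -
```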
Conversely, exclusions are designated with the phrase “none of”. For example, the production:
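Consistent with the description that follows, such a production may be rendered as (the TextSimple name is an assumption):

```
TextSimple: none of
    " \ NewLineCharacter
```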
permits all characters except ‘"’, ‘\’, and new line characters.
Referring next to lexical grammar, it should be noted that the terminal symbols of the lexical grammar are the characters of the Unicode character set, and the lexical grammar specifies how characters are combined to form tokens, white space, and comments. Every source file in an Mg program must conform to the Input production of the lexical grammar.
Referring next to the syntactic grammar, it should be noted that the terminal symbols of the syntactic grammar are the tokens defined by the lexical grammar, and the syntactic grammar specifies how tokens are combined to form Mg programs. Every source file in an Mg program must conform to the CompilationUnit production of the syntactic grammar.
Referring next to lexical analysis, the Input production defines the lexical structure of an Mg source file. Each source file in an Mg program must conform to this lexical grammar production.
Four basic elements make up the lexical structure of an Mg source file: line terminators, white space, comments, and tokens. Of these basic elements, only tokens are significant in the syntactic grammar of an Mg program.
The lexical processing of an Mg source file includes reducing the file into a sequence of tokens that becomes the input to the syntactic analysis. Line terminators, white space, and comments can serve to separate tokens, but otherwise these lexical elements have no impact on the syntactic structure of an Mg program. When several lexical grammar productions match a sequence of characters in a source file, the lexical processing always forms the longest possible lexical element. For example, the character sequence // is processed as the beginning of a single-line comment because that lexical element is longer than a single / token.
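The longest-match rule just described (sometimes called maximal munch) can be sketched in Python; the token set below is hypothetical and chosen only to illustrate why // begins a comment rather than two division tokens, and Python stands in for the Mg language processor.

```python
import re

# Hypothetical token set; each entry is (name, regular expression).
TOKEN_PATTERNS = [
    ("COMMENT", r"//[^\n]*"),    # single-line comment
    ("DIV", r"/"),               # division operator
    ("IDENT", r"[A-Za-z_]\w*"),  # identifier
    ("WS", r"[ \t]+"),           # white space
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (token name, lexeme) of the longest match so far
        for name, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, text[pos:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        name, lexeme = best
        if name != "WS":  # white space separates tokens but is not itself a token
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

# "//x" forms one COMMENT token, not a DIV token followed by more input.
print(tokenize("a //x"))  # → [('IDENT', 'a'), ('COMMENT', '//x')]
```

Because the COMMENT alternative matches three characters where DIV matches only one, the longer lexical element is always formed.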
Line terminators divide the characters of an Mg source file into lines.
For compatibility with source code editing tools that add end-of-file markers, and to enable a source file to be viewed as a sequence of properly terminated lines, the following transformations are applied, in order, to every compilation unit:
Referring next to comments, it should be appreciated that two forms of comments are supported: single-line comments and delimited comments. Single-line comments start with the characters // and extend to the end of the source line. Delimited comments start with the characters /* and end with the characters */. Delimited comments may span multiple lines.
Comments do not nest. The character sequences /* and */ have no special meaning within a // comment, and the character sequences // and /* have no special meaning within a delimited comment.
Also, comments are not processed within text literals. For instance, the following example:
shows three single-line comments, whereas the following example:
includes one delimited comment.
In an embodiment, whitespace is defined as any character with Unicode class Zs (which includes the space character) as well as the horizontal tab character, the vertical tab character, and the form feed character.
With respect to tokens, it should be noted that there are several kinds of tokens: identifiers, keywords, literals, operators, and punctuators. White space and comments are not tokens, though they act as separators for tokens.
With respect to identifiers, a regular identifier begins with a letter or underscore, followed by any sequence of letters, underscores, dollar signs, or digits. An escaped identifier is enclosed in square brackets and contains any sequence of Text literal characters.
Referring next to keywords, a keyword is an identifier-like sequence of characters that is reserved and cannot be used as an identifier except when escaped with square brackets [ ].
The following keywords are reserved for future use:
checkpoint identifier nest override new virtual partial
With respect to literals, it should be noted that a literal is a source code representation of a value. Literals may be ascribed with a type to override the default type ascription.
It should also be noted that decimal literals may be used to write real-number values.
Examples of decimal literals include:
Integer literals may be used to write integral values.
Examples of integer literals include:
Logical literals may be used to write logical values.
Examples of logical literals are:
Referring next to text literals, Mg supports two forms of Text literals: regular text literals and verbatim text literals. In certain contexts, text literals must be of length one (single characters). However, Mg does not distinguish syntactically between strings and characters.
A regular text literal consists of zero or more characters enclosed in single or double quotes, as in “hello” or ‘hello’, and may include both simple escape sequences (such as \t for the tab character), and hexadecimal and Unicode escape sequences. A verbatim text literal includes a ‘commercial at’ character (@) followed by a single- or double-quote character, zero or more characters, and a closing quote character that matches the opening one. A simple example is @“hello”. In a verbatim text literal, the characters between the delimiters are interpreted exactly as they occur in the compilation unit, the only exception being a SingleQuoteEscapeSequence or a DoubleQuoteEscapeSequence, depending on the opening quote. In particular, simple escape sequences, and hexadecimal and Unicode escape sequences, are not processed in verbatim text literals. A verbatim text literal may span multiple lines. A simple escape sequence represents a Unicode character encoding, as described in Table T-1 below.
Since Mg uses a 16-bit encoding of Unicode code points in Text values, a Unicode character in the range U+10000 to U+10FFFF is not considered a Text literal of length one (a single character), but is represented using a Unicode surrogate pair in a Text literal.
Unicode characters with code points above 0x10FFFF are not supported. Multiple translations are not performed. For instance, the text literal \u005Cu005C is equivalent to \u005C rather than \. The Unicode value U+005C is the character \. A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following the prefix.
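The single-pass nature of escape translation, and the surrogate-pair representation of supplementary characters, can be illustrated with Python's own text handling, used here as a stand-in for an Mg processor:

```python
# Escape sequences are translated once, left to right; the decoder does not
# re-scan its own output. Decoding the eleven source characters \u005Cu005C
# therefore yields a backslash followed by the literal text u005C, not a
# single backslash.
source = r"\u005Cu005C"
decoded = source.encode("ascii").decode("unicode_escape")
assert decoded == "\\" + "u005C"

# In a 16-bit encoding of Unicode, a code point in U+10000..U+10FFFF occupies
# two code units (a surrogate pair), so it is not a text value of length one.
ch = "\U00010400"                             # a supplementary-plane character
assert len(ch.encode("utf-16-be")) // 2 == 2  # two 16-bit code units
```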
Examples of text literals include:
The null literal is equal to no other value.
An example of the null literal is:
null
In an embodiment, there are several kinds of operators and punctuators. Operators are used in expressions to describe operations involving one or more operands. For example, the expression a+b uses the + operator to add the two operands a and b. Punctuators are for grouping and separating.
In one aspect, Pre-processing directives provide the ability to conditionally skip sections of source files, to report error and warning conditions, and to delineate distinct regions of source code as a separate pre-processing step.
The following pre-processing directives are available:
A pre-processing directive always occupies a separate line of source code and always begins with a # character and a pre-processing directive name. White space may occur before the # character and between the # character and the directive name. A source line containing a #define, #undef, #if, #else, or #endif directive may end with a single-line comment. Delimited comments (the /* */ style of comments) are not permitted on source lines containing pre-processing directives. Pre-processing directives are neither tokens nor part of the syntactic grammar of Mg. However, pre-processing directives can be used to include or exclude sequences of tokens and can in that way affect the meaning of an Mg program. For example, after pre-processing the source text:
results in the exact same sequence of tokens as the source text:
Thus, whereas lexically, the two programs are quite different, syntactically, they are identical.
Conditional compilation functionality is provided by the #if, #else, and #endif directives and is controlled through pre-processing expressions and conditional compilation symbols.
A conditional compilation symbol has two possible states: defined or undefined. At the beginning of the lexical processing of a source file, a conditional compilation symbol is undefined unless it has been explicitly defined by an external mechanism (such as a command-line compiler option). When a #define directive is processed, the conditional compilation symbol named in that directive becomes defined in that source file. The symbol remains defined until an #undef directive for that same symbol is processed, or until the end of the source file is reached. An implication of this is that #define and #undef directives in one source file have no effect on other source files in the same program.
When referenced in a pre-processing expression, a defined conditional compilation symbol has the Logical value true, and an undefined conditional compilation symbol has the Logical value false. There is no requirement that conditional compilation symbols be explicitly declared before they are referenced in pre-processing expressions. Instead, undeclared symbols are simply undefined and thus have the value false. In an embodiment, conditional compilation symbols can only be referenced in #define and #undef directives and in pre-processing expressions.
Pre-processing expressions can occur in #if directives. The operators !, ==, !=, && and || are permitted in pre-processing expressions, and parentheses may be used for grouping.
Evaluation of a pre-processing expression always yields a Logical value. The rules of evaluation for a pre-processing expression are the same as those for a constant expression, except that the only user-defined entities that can be referenced are conditional compilation symbols.
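The behavior of conditional compilation symbols and pre-processing expressions described above can be sketched as follows. The evaluator is illustrative only, assumes just the operators named in the text (!, ==, !=, && and ||), and uses Python as a stand-in for the language processor; the symbol names are hypothetical.

```python
import re

def eval_pp(expr, defined):
    """Evaluate a pre-processing expression against a set of defined symbols."""
    # Rewrite the expression into Python syntax, then evaluate it with every
    # symbol reference mapped to True (defined) or False (undefined).
    py = expr.replace("&&", " and ").replace("||", " or ")
    py = re.sub(r"!(?!=)", " not ", py)  # logical not, but leave != alone
    names = set(re.findall(r"\b[A-Za-z_]\w*\b", py)) - {"and", "or", "not"}
    env = {name: name in defined for name in names}
    return bool(eval(py, {"__builtins__": {}}, env))

symbols = set()
symbols.add("DEBUG")                                 # effect of: #define DEBUG
assert eval_pp("DEBUG", symbols) is True
assert eval_pp("!DEBUG || TRACE", symbols) is False  # TRACE undefined -> false
symbols.discard("DEBUG")                             # effect of: #undef DEBUG
assert eval_pp("DEBUG == TRACE", symbols) is True    # false == false
```

Note that, as in the text, undeclared symbols need no declaration: any name not in the defined set simply evaluates to false.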
Declaration directives are used to define or undefine conditional compilation symbols.
The processing of a #define directive causes the given conditional compilation symbol to become defined, starting with the source line that follows the directive. Likewise, the processing of an #undef directive causes the given conditional compilation symbol to become undefined, starting with the source line that follows the directive.
A #define may define a conditional compilation symbol that is already defined, without there being any intervening #undef for that symbol. The example below defines a conditional compilation symbol A and then defines it again.
An #undef may “undefine” a conditional compilation symbol that is not defined. The example below defines a conditional compilation symbol A and then undefines it twice; although the second #undef has no effect, it is still valid.
Conditional compilation directives are used to conditionally include or exclude portions of a source file.
As indicated by the syntax, conditional compilation directives must be written as sets consisting of, in order, an #if directive, zero or one #else directive, and an #endif directive. Between the directives are conditional sections of source code. Each section is controlled by the immediately preceding directive. A conditional section may itself contain nested conditional compilation directives provided these directives form complete sets.
A PPConditional selects at most one of the contained ConditionalSections for normal lexical processing:
The selected ConditionalSection, if any, is processed as a normal InputSection: the source code contained in the section must adhere to the lexical grammar; tokens are generated from the source code in the section; and pre-processing directives in the section have the prescribed effects.
The remaining ConditionalSections, if any, are processed as SkippedSections: except for pre-processing directives, the source code in the section need not adhere to the lexical grammar; no tokens are generated from the source code in the section; and pre-processing directives in the section must be lexically correct but are not otherwise processed. Within a ConditionalSection that is being processed as a SkippedSection, any nested ConditionalSections (contained in nested #if . . . #endif and #region . . . #endregion constructs) are also processed as SkippedSections.
Except for pre-processing directives, skipped source code is not subject to lexical analysis. For example, the following is valid despite the unterminated comment in the #else section:
Note that pre-processing directives are required to be lexically correct even in skipped sections of source code.
Pre-processing directives are not processed when they appear inside multi-line input elements. For example, the program:
generates a language which recognizes the value:
In peculiar cases, the set of pre-processing directives that is processed might depend on the evaluation of the PPExpression. The example:
always produces the same token stream (syntax Q=empty;), regardless of whether or not X is defined. If X is defined, the only processed directives are #if and #endif, due to the multi-line comment. If X is undefined, then three directives (#if, #else, #endif) are part of the directive set.
Referring next to text pattern expressions, it should be noted that text pattern expressions perform operations on the sets of possible text values that one or more terms recognize.
With respect to primary expressions, it should be appreciated that a primary expression may be a text literal, a reference to a syntax or token rule, an expression indicating a repeated sequence of primary expressions of a specified length, an expression indicating any of a continuous range of characters, or an inline sequence of pattern declarations. The following grammar reflects this structure.
A character class is a compact syntax for a range of continuous characters. This expression requires that the text literals be of length 1 and that the Unicode offset of the right operand be greater than that of the left.
The expression “0”..“9” is equivalent to:
“0”|“1”|“2”|“3”|“4”|“5”|“6”|“7”|“8”|“9”
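Assuming the length-1 and ordering constraints just stated, the expansion of a character class can be computed directly from the Unicode offsets of its operands; the following Python sketch is illustrative only.

```python
def char_class(lo, hi):
    """Expand a character-class range into its alternative characters."""
    # The text requires length-1 literals with the right offset above the left.
    assert len(lo) == 1 and len(hi) == 1 and ord(hi) > ord(lo)
    return [chr(code) for code in range(ord(lo), ord(hi) + 1)]

print("|".join('"%s"' % c for c in char_class("0", "9")))
# → "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
```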
A reference primary is the name of another rule possibly with arguments for parameterized rules. All rules defined within the same language can be accessed without qualification.
Note that whitespace between a rule name and its argument list is significant: it discriminates between a reference to a parameterized rule and a reference without parameters followed by an inline rule. In a reference to a parameterized rule, no whitespace is permitted between the identifier and the arguments.
In an embodiment, repetition operators recognize a primary expression repeated a specified number of times. The number of repetitions can be stated as a (possibly open) integer range or using one of the Kleene operators, ?, +, *.
The left operand of .. must be greater than or equal to zero and less than the right operand of .., if present.
“A”? is equivalent to “A”#0..1
“A”+ is equivalent to “A”#1..
“A”* is equivalent to “A”#0..
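The three equivalences above parallel counted repetition in regular expressions; the following check uses Python's re module as a stand-in for Mg patterns.

```python
import re

# "A"? ≡ "A"#0..1, "A"+ ≡ "A"#1.., "A"* ≡ "A"#0.. — each Kleene operator
# is shorthand for a counted repetition range.
equivalences = {"A?": "A{0,1}", "A+": "A{1,}", "A*": "A{0,}"}

for shorthand, counted in equivalences.items():
    for text in ["", "A", "AA", "AAA"]:
        # Both forms accept or reject exactly the same text values.
        assert bool(re.fullmatch(shorthand, text)) == bool(re.fullmatch(counted, text))
```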
An inline rule may also be provided as a means to group pattern declarations together as a term.
An inline rule is typically used in conjunction with a range operator:
“A” (“,” “A”)*
recognizes one or more “A”s separated by commas. Although syntactically legal, variable bindings within inline rules are not accessible within the constructor of the containing production.
The “any” term is a wildcard that matches any text value of length 1.
Any:
The error production enables error recovery. Consider the following example:
The language recognizes the text “Hello, Hello, Hello” as expected and produces the following default output:
The text “Hello,hello,Hello” is not in the language because the second “h” is not capitalized (and case sensitivity is true). However, rather than stopping at “h”, the language processor matches “h” to the error token, then matches “e” to the error token, and so on, until it reaches the comma. At this point the text conforms to the language and normal processing can continue. The language processor reports the position of the errors and produces the following output:
Hello occurs twice instead of three times as above, and the text that the error token matched is returned as error[“hello”].
Referring next to term operators, it should be noted that a primary term expression can be thought of as the set of possible text values that it recognizes. The term operators perform the standard set difference, intersection, and negation operations on these sets. (Pattern declarations perform the union operation with |.)
Inverse requires every value in the set of possible text values to be of length 1.
(“11”|“12”)−(“12”|“13”) recognizes “11”.
(“11”|“12”) & (“12”|“13”) recognizes “12”.
^(“11”|“12”) is an error because its values are not of length 1.
^(“1”|“2”) recognizes any text value of length 1 other than “1” or “2”.
Referring next to productions, it should be appreciated that a production is a pattern and an optional constructor. Each production is a scope. The pattern may establish variable bindings which can be referenced in the constructor. A production can be qualified with a precedence that is used to resolve a tie if two productions match the same text.
A pattern declaration is a sequence of term declarations or the built-in pattern empty, which matches the empty text value “”.
A term declaration includes a pattern expression with an optional variable binding, precedence and attributes. The built-in term error is used for error recovery.
A variable associates a name with the output from a term which can be used in the constructor. The error term is used in conjunction with the checkpoint rule modifier to facilitate error recovery.
A term constructor is the syntax for defining the output of a production. A node in a term constructor can be, for example, an atom including a literal, a reference to another term, or an operation on a reference; an ordered collection of successors with an optional label; or an unordered collection of successors with an optional label. The following grammar mirrors this structure.
Each production defines a scope. The variables referenced in a constructor must be defined within the same production's pattern. Variables defined in other productions in the same rule cannot be referenced. The same variable name can be used across alternatives in the same rule. Consider three alternatives for encoding the output of the same production. First, the default constructor:
Processing the text “1+2” yields:
This output reflects the structure of the grammar and may not be the most useful form for further processing. The second alternative cleans the output up considerably:
Processing the text “1+2” with this language yields:
Add[1, 2]
This grammar uses three common patterns: productions with a single term are passed through (this is done for the single production in Main and the first production in E); a label, Add, is used to designate the operator; and position is used to distinguish the left and right operands. The third alternative uses a record-like structure to give the operands names:
Processing the text “1+2” with this language yields:
Add{Left{1}, Right{2}}
Although somewhat more verbose than the prior alternative, this output does not rely on ordering and forces consumers to explicitly name Left or Right operands. Although either option works, this has proven to be more flexible and less error prone.
Referring next to constructor operators, constructor operators allow a constructor to use a variable reference as a label, extract the successors of a variable reference, or extract the label of a variable reference. For instance, consider generalizing the example above to support multiple operators. This could be done by adding a new production for each operator: −, *, /. Alternatively, a single rule can be established to match these operators, and the output of that rule can be used as a label using id:
Processing the text “1+2” with this language yields the same result as above.
Processing “1/2” yields:
Divide {Left{1}, Right{2}}
This language illustrates the id operator.
The valuesof operator extracts the successors of a variable reference. It is used to flatten nested output structures. For instance, consider the language:
Processing the text “1, 2, 3” with this language yields:
The following grammar uses valuesof and the pass through pattern above to simplify the output:
Processing the text “1, 2, 3” with this language yields:
DigitList[1, 2, 3]
This output represents the same information more concisely.
If a constructor is not defined for a production, the language processor defines a default constructor. For a given production, the default projection is formed as follows. First, the label for the result is the name of the production's rule. Next, the successors of the result are an ordered sequence constructed from each term in the pattern. Then, * and ? create an unlabeled sequence with the elements. A “( )” results in an anonymous definition: if it contains constructors (a:A=>a), then the output is the output of the constructor; otherwise, if there are no constructors, the default rule is applied to the anonymous definition and the output is enclosed in square brackets [A's result]. It should also be noted that token rules do not permit a constructor to be specified and output text values, and that interleave rules do not permit a constructor to be specified and do not produce output. For instance, consider the following language:
Given the text “123” the default output of the language processor follows:
The Mg language processor is tolerant of such ambiguity while it is recognizing subsequences of text. However, it is an error to produce more than one output for an entire text value. Precedence qualifiers on productions or terms determine which of the several outputs should be returned. With respect to production precedence, consider, for example, the classic dangling else problem as represented in the following language:
Given the input “if then if then else”, two different outputs are possible. Either the else binds to the first if-then:
Or it binds to the second if-then:
The following language produces the output immediately above, binding the else to the second if-then.
Switching the precedence values produces the first output.
With respect to term precedence, consider a simple expression language which recognizes:
2+3+4
5*6*7
2+3*4
2^3^4
The result of these expressions can depend on the order in which the operators are reduced. 2+3+4 yields 9 whether 2+3 is evaluated first or 3+4 is evaluated first. Likewise, 5*6*7 yields 210 regardless of the order of evaluation. However, this is not the case for 2+3*4. If 2+3 is evaluated first, yielding 5, then 5*4 yields 20; whereas if 3*4 is evaluated first, yielding 12, then 2+12 yields 14. This difference manifests itself in the output of the following grammar:
“2+3*4” can result in two outputs:
According to conventional rules, the result of this expression is 14 because multiplication is performed before addition. This is expressed in Mg by assigning “*” a higher precedence than “+”. In this case, the result of the expression changes with the order in which different operators are evaluated.
The order of evaluation of a single operator can matter as well. Consider 2^3^4. This could result in either 8^4 or 2^81. In terms of output, there are two possibilities:
In this case the issue is not which of several different operators to evaluate first, but which operator in a sequence of the same operator to evaluate first: the leftmost or the rightmost. The rule in this case is less well established, but most languages choose to evaluate the rightmost “^” first, yielding 2^81 in this example.
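The difference between leftmost-first and rightmost-first grouping of “^” can be checked with a short Python calculation:

```python
from functools import reduce

operands = [2, 3, 4]  # the sequence 2 ^ 3 ^ 4

# Leftmost-first grouping: (2^3)^4 = 8^4
left_first = reduce(lambda acc, nxt: acc ** nxt, operands)

# Rightmost-first grouping (right associativity): 2^(3^4) = 2^81
right_first = reduce(lambda acc, nxt: nxt ** acc, reversed(operands))

assert left_first == 8 ** 4   # 4096
assert right_first == 2 ** 81
```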
The following grammar implements these rules using term precedence qualifiers. Term precedence qualifiers may only be applied to literals or references to token rules.
“^” is qualified with right(3). right indicates that the rightmost in a sequence should be grouped together first. 3 is the highest precedence, so “^” will be grouped most strongly.
Referring next to rules, a rule is a named collection of alternative productions. There are three kinds of rules: syntax, token, and interleave. A text value conforms to a rule if it conforms to any one of the productions in the rule. If a text value conforms to more than one production in the rule, then the rule is ambiguous.
The three different kinds of rules differ in how they treat ambiguity and how they handle their output.
The rule Main below recognizes the two text values “Hello” and “Goodbye”.
With respect to token rules, token rules recognize a restricted family of languages. However, token rules can be negated, intersected and subtracted which is not the case for syntax rules. Attempting to perform these operations on a syntax rule results in an error. The output from a token rule is the text matched by the token. No constructor may be defined.
Token rules do not permit precedence directives in the rule body. They have a built in protocol to deal with ambiguous productions. A language processor attempts to match all tokens in the language against a text value starting with the first character, then the first two, etc. If two or more productions within the same token or two different tokens can match the beginning of a text value, a token rule will choose the production with the longest match. If all matches are exactly the same length, the language processor will choose a token rule marked final if present. If no token rule is marked final, all the matches succeed and the language processor evaluates whether each alternative is recognized in a larger context. The language processor retains all of the matches and begins attempting to match a new token starting with the first character that has not already been matched.
An identifier modifier may also be included, which applies only to tokens. It is used to lower the precedence of language identifiers so they do not conflict with language keywords.
In an embodiment, syntax rules recognize all languages that Mg is capable of defining. The main start rule must be a syntax rule. Syntax rules allow all precedence directives and may have constructors.
Interleave rules may also be provided. An interleave rule recognizes the same family of languages as a token rule and also cannot have constructors. Further, interleave rules cannot have parameters, and the name of an interleave rule cannot be referenced. Text that matches an interleave rule is excluded from further processing. The following example demonstrates whitespace handling with an interleave rule:
This language recognizes the text value “Hello World”. It also recognizes variants in which the tokens are separated by differing amounts of whitespace, as well as “HelloWorld”. It does not recognize “He llo world” because “He” does not match any token.
An inline rule may also be provided, which is an anonymous rule embedded within the pattern of a production. The inline rule is processed as any other rule; however, it cannot be reused since it does not have a name. Variables defined within an inline rule are scoped to their productions as usual. A variable may be bound to the output of an inline rule as with any pattern.
In the following, Example 1 and Example 2 recognize the same language and produce the same output. Example 1 uses a named rule AppleOrOrange, while Example 2 states the same rule inline.
Rule parameters may also be included, whereby a rule defines parameters that can be used within the body of the rule.
A single rule identifier may have multiple definitions with different numbers of parameters. The following example uses List(Content, Separator) to define List(Content) with a default separator of “,”.
This language will recognize “Hello”, “Hello,Hello”, “Hello,Hello,Hello”, etc.
A language may also be provided which is a named collection of rules for imposing structure on text.
The language that follows recognizes the single text value “Hello World”:
It should be appreciated that a language may consist of any number of rules. The following language recognizes the single text value “Hello World”:
The three rules Hello, world, and whitespace recognize the three single text values “Hello”, “world”, and “ ” respectively. The rule Main combines these three rules in sequence. Main is the distinguished start rule for a language. A language recognizes a text value if and only if Main recognizes a value. Also, the output for Main is the output for the language.
It should also be noted that rules are members of a language. A language can use rules defined in another language using member access notation. The Helloworld language recognizes the single text value “Hello world” using rules defined in the words language:
All rules defined within the same module are accessible in this way. In an embodiment, rules defined in other modules must be exported and imported.
Referring next to modules, it should be noted that an Mg module is a scope which contains declarations of languages. Declarations exported by an imported module are made available in the importing module. Thus, modules override the lexical scoping that otherwise governs Mg symbol resolution. Modules themselves do not nest. In an embodiment, several modules may be contained within a CompilationUnit, typically a text file.
A ModuleDeclaration is a named container/scope for language declarations.
Each ModuleDeclaration has a QualifiedIdentifier that uniquely qualifies the declarations contained by the module. Each ModuleMemberDeclaration may be referenced either by its Identifier or by its fully qualified name by concatenating the QualifiedIdentifier of the ModuleDeclaration with the Identifier of the ModuleMemberDeclaration (separated by a period). For example, given the following ModuleDeclaration:
The fully qualified name of the language is BaseDefinitions.Logical, or using escaped identifiers, [BaseDefinitions].[Logical]. It is always legal to use a fully qualified name where the name of a declaration is expected. Modules are not hierarchical or nested. That is, there is no implied relationship between modules whose QualifiedIdentifier share a common prefix. For example, consider these two declarations:
Module A.B is in error, as it does not contain a declaration for the identifier L. That is, the members of Module A are not implicitly imported into Module A.B.
In an embodiment, Mg uses ImportDirectives and ExportDirectives to explicitly control which declarations may be used across module boundaries.
A ModuleDeclaration contains zero or more ExportDirectives, each of which makes a ModuleMemberDeclaration available to declarations outside of the current module. A ModuleDeclaration contains zero or more ImportDirectives, each of which names a ModuleDeclaration whose declarations may be referenced by the current module. A ModuleMemberDeclaration may only reference declarations in the current module and declarations that have an explicit ImportDirective in the current module. An ImportDirective is not transitive, that is, importing module A does not import the modules that A imports. For example, consider this ModuleDeclaration:
The definition Language.Core.Internal may only be referenced from within the module Language.Core. The definition Language.Core.Base may be referenced in any module that has an ImportDirective for module Language.Core, as shown in this example:
The example above uses the fully qualified name to refer to Language.Core.Base. An ImportDirective may also specify an ImportAlias that provides a replacement Identifier for the imported declaration:
An ImportAlias replaces the name of the imported declaration. That means that the following is an error:
It is legal for two or more ImportDirectives to import the same declaration, provided they specify distinct aliases. For a given compilation episode, at most one ImportDirective may use a given alias.
If an ImportDirective imports a module without specifying an alias, the declarations in the imported module may be referenced without the qualification of the module name. That means the following is also legal.
When two modules contain same-named declarations, there is a potential for ambiguity. The potential for ambiguity is not an error—ambiguity errors are detected lazily as part of resolving references. For instance, consider the following two modules:
It is legal to import both modules either with or without providing an alias:
This is legal because ambiguity is only an error for references, not declarations. That means that the following is a compile-time error:
This example can be made legal either by fully qualifying the reference to L:
or by adding an alias to one or both of the ImportDirectives:
An ImportDirective may either import all exported declarations from a module or only a selected subset of them. The latter is enabled by specifying ImportMembers as part of the directive. For example, Module Plot2D imports only Point2D and PointPolar from the Module Geometry:
An ImportDirective that contains an ImportMember only imports the named declarations from that module. This means that the following is a compilation error because module Plot3D references Geo3D which is not imported from module Geometry:
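For example (the member Deep is hypothetical):

```
module Plot3D {
    import Geometry {Point2D, PointPolar};
    Deep = Geo3D;  // error: Geo3D is exported by Geometry but is not
                   // among the ImportMembers named in the directive
}
```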
An ImportDirective that contains an ImportAlias on a selected imported member assigns the replacement name to the imported declaration, hiding the original export name.
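A sketch of a member-level alias (the alias P2D is hypothetical):

```
module Plot2D {
    import Geometry {Point2D as P2D, PointPolar};
    // P2D and PointPolar may be referenced here; the original
    // export name Point2D is hidden and would be an error
}
```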
Aliasing an individual imported member is useful for resolving occasional conflicts between imports, while aliasing an entire imported module is useful for resolving a systemic conflict. For example, when importing two modules where one is a different version of the other, many conflicts are likely, and aliasing at the member level would require a correspondingly long list of alias declarations; a single module-level alias avoids this.
Referring next to attributes, it should be noted that attributes provide metadata which can be used to interpret the language feature they modify.
In an embodiment, a casesensitive attribute controls whether tokens are matched with or without case sensitivity. The default value is true. With the attribute set to false, the following language recognizes “Hello world”, “HELLO world”, and “hELLo worLD”.
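A sketch of such a language (the attribute spelling @{casesensitive[false]} and the language/syntax form are assumed for illustration):

```
@{casesensitive[false]}
language HelloWorld {
    syntax Main = "Hello" "world";  // with casesensitive false, any casing
                                    // of the two tokens is recognized
}
```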
One of ordinary skill in the art can appreciate that the various embodiments described herein can be implemented in connection with any computer or other client or server device, which can be deployed as part of a computer network or in a distributed computing environment, and can be connected to any kind of data store. In this regard, the various embodiments described herein can be implemented in any computer system or environment having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units. This includes, but is not limited to, an environment with server computers and client computers deployed in a network environment or a distributed computing environment, having remote or local storage.
Distributed computing provides sharing of computer resources and services by communicative exchange among computing devices and systems. These resources and services include the exchange of information, cache storage and disk storage for objects, such as files. These resources and services also include the sharing of processing power across multiple processing units for load balancing, expansion of resources, specialization of processing, and the like. Distributed computing takes advantage of network connectivity, allowing clients to leverage their collective power to benefit the entire enterprise. In this regard, a variety of devices may have applications, objects or resources that may cooperate to perform one or more aspects of any of the various embodiments of the subject disclosure.
Each object 1110, 1112, etc. and computing objects or devices 1120, 1122, 1124, 1126, 1128, etc. can communicate with one or more other objects 1110, 1112, etc. and computing objects or devices 1120, 1122, 1124, 1126, 1128, etc. by way of the communications network 1140, either directly or indirectly. Even though illustrated as a single element, the communications network 1140 may comprise other computing objects and computing devices that provide services to the illustrated objects, and/or may represent multiple interconnected networks.
There are a variety of systems, components, and network configurations that support distributed computing environments. For example, computing systems can be connected together by wired or wireless systems, by local networks or widely distributed networks. Currently, many networks are coupled to the Internet, which provides an infrastructure for widely distributed computing and encompasses many different networks, though any network infrastructure can be used for exemplary communications made incident to the techniques for recognizing structure in text as described in various embodiments.
Thus, a host of network topologies and network infrastructures, such as client/server, peer-to-peer, or hybrid architectures, can be utilized. The “client” is a member of a class or group that uses the services of another class or group to which it is not related. A client can be a process, i.e., roughly a set of instructions or tasks, that requests a service provided by another program or process. The client process utilizes the requested service without having to “know” any working details about the other program or the service itself.
In a client/server architecture, particularly a networked system, a client is usually a computer that accesses shared network resources provided by another computer, e.g., a server. By way of non-limiting example, the computing objects or devices 1120, 1122, 1124, 1126, 1128, etc. can be thought of as clients and the objects 1110, 1112, etc. can be thought of as servers, although any computer can be considered a client, a server, or both, depending on the circumstances.
A server is typically a remote computer system accessible over a remote or local network, such as the Internet or wireless network infrastructures. The client process may be active in a first computer system, and the server process may be active in a second computer system, communicating with one another over a communications medium, thus providing distributed functionality and allowing multiple clients to take advantage of the information-gathering capabilities of the server. Any software objects utilized pursuant to the techniques described herein can be provided standalone, or distributed across multiple computing devices or objects.
In a network environment in which the communications network/bus 1140 is the Internet, for example, the servers 1110, 1112, etc. can be Web servers with which the clients 1120, 1122, 1124, 1126, 1128, etc. communicate via any of a number of known protocols, such as the hypertext transfer protocol (HTTP). Servers 1110, 1112, etc. may also serve as clients 1120, 1122, 1124, 1126, 1128, etc., as may be characteristic of a distributed computing environment.
As mentioned, advantageously, the techniques described herein can be applied to any device where it is desirable to recognize structure in text. It should be understood, therefore, that handheld, portable and other computing devices and computing objects of all kinds are contemplated for use in connection with the various embodiments, i.e., anywhere that a device may process textual information. Accordingly, the general purpose remote computer described below is but one example of a suitable computing environment.
Although not required, embodiments can partly be implemented via an operating system, for use by a developer of services for a device or object, and/or included within application software that operates to perform one or more functional aspects of the various embodiments described herein. Software may be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers or other devices. Those skilled in the art will appreciate that computer systems have a variety of configurations and protocols that can be used to communicate data, and thus, no particular configuration or protocol should be considered limiting.
With reference to an exemplary remote device for implementing one or more embodiments, a general purpose computing device is provided in the form of a computer 1210. Components of the computer 1210 may include, but are not limited to, a processing unit, a system memory 1230, and a system bus 1222 that couples various system components, including the system memory, to the processing unit.
Computer 1210 typically includes a variety of computer readable media and can be any available media that can be accessed by computer 1210. The system memory 1230 may include computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) and/or random access memory (RAM). By way of example, and not limitation, memory 1230 may also include an operating system, application programs, other program modules, and program data.
A user can enter commands and information into the computer 1210 through input devices 1240. A monitor or other type of display device is also connected to the system bus 1222 via an interface, such as output interface 1250. In addition to a monitor, computers can also include other peripheral output devices such as speakers and a printer, which may be connected through output interface 1250.
The computer 1210 may operate in a networked or distributed environment using logical connections to one or more other remote computers, such as remote computer 1270. The remote computer 1270 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, or any other remote media consumption or transmission device, and may include any or all of the elements described above relative to the computer 1210. The logical connections may include a network, such as a local area network (LAN) or a wide area network (WAN), but may also include other networks/buses.
As mentioned above, while exemplary embodiments have been described in connection with various computing devices and network architectures, the underlying concepts may be applied to any network system and any computing device or system in which it is desirable to recognize structure in text.
Also, there are multiple ways to implement the same or similar functionality, e.g., an appropriate API, tool kit, driver code, operating system, control, standalone or downloadable software object, etc. which enables applications and services to use the structure recognition techniques described herein. Thus, embodiments herein are contemplated from the standpoint of an API (or other software object), as well as from a software or hardware object that recognizes structure in text. Accordingly, various embodiments described herein can have aspects that are wholly in hardware, partly in hardware and partly in software, as well as in software.
The word “exemplary” is used herein to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited by such examples. In addition, any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs, nor is it meant to preclude equivalent exemplary structures and techniques known to those of ordinary skill in the art. Furthermore, to the extent that the terms “includes,” “has,” “contains,” and other similar words are used in either the detailed description or the claims, for the avoidance of doubt, such terms are intended to be inclusive in a manner similar to the term “comprising” as an open transition word without precluding any additional or other elements.
As mentioned, the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. As used herein, the terms “component,” “system” and the like are likewise intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
The aforementioned systems have been described with respect to interaction between several components. It can be appreciated that such systems and components can include those components or specified sub-components, some of the specified components or sub-components, and/or additional components, and according to various permutations and combinations of the foregoing. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent components (hierarchical). Additionally, it should be noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and that any one or more middle layers, such as a management layer, may be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein may also interact with one or more other components not specifically described herein but generally known by those of skill in the art.
In view of the exemplary systems described supra, methodologies that may be implemented in accordance with the described subject matter will be better appreciated with reference to the flowcharts of the various figures. While for purposes of simplicity of explanation, the methodologies are shown and described as a series of blocks, it is to be understood and appreciated that the claimed subject matter is not limited by the order of the blocks, as some blocks may occur in different orders and/or concurrently with other blocks from what is depicted and described herein. Where non-sequential, or branched, flow is illustrated via flowchart, it can be appreciated that various other branches, flow paths, and orders of the blocks, may be implemented which achieve the same or a similar result. Moreover, not all illustrated blocks may be required to implement the methodologies described hereinafter.
In addition to the various embodiments described herein, it is to be understood that other similar embodiments can be used or modifications and additions can be made to the described embodiment(s) for performing the same or equivalent function of the corresponding embodiment(s) without deviating therefrom. Still further, multiple processing chips or multiple devices can share the performance of one or more functions described herein, and similarly, storage can be effected across a plurality of devices. Accordingly, the invention should not be limited to any single embodiment, but rather should be construed in breadth, spirit and scope in accordance with the appended claims.
This application claims the benefit of U.S. Provisional Patent application Ser. No. 61/103,156 entitled “SYSTEM AND METHOD FOR RECOGNIZING STRUCTURE IN TEXT,” which was filed Oct. 6, 2008. The entirety of the aforementioned application is herein incorporated by reference.