Millions of people worldwide use spreadsheets, and the like, for storing and manipulating data. These data manipulation scenarios often involve converting a large quantity of input information from one format to another format, or entail performing computations on the input information to produce a desired output. Typically, these tasks are accomplished manually or with the use of small, often one-off, applications that are either created by the end-user or by a programmer for the end-user.
Inductive synthesis and combination framework technique embodiments described herein generally perform string transformations involving lookup operations in one or more relational tables, either alone or in combination with other non-lookup operations. In one exemplary embodiment where the lookup table string transformations are employed alone, a relational table lookup expression language is established which includes a set of grammar rules defining expressions therein. A synthesis procedure is then generated that learns a set of expressions in the relational table lookup expression language. These learned expressions produce a prescribed output from one or more input string variables using the aforementioned relational table or tables, and are derived using a set of one or more input-output examples. Each input-output example includes one or more input string variables and the prescribed output. Once the synthesis procedure is generated, one or more input string variables of a same type found in the set of input-output examples are received, and the prescribed output is produced using the synthesis procedure.
In an exemplary embodiment where the lookup table string transformations are employed in combination with other non-lookup operations, the aforementioned relational table lookup expression language is accessed. In addition, a second expression language associated with string transformations that do not involve lookup operations in a relational table is accessed. Like the relational table lookup expression language, the second expression language includes a set of grammar rules defining expressions therein. The relational table lookup expression language and the second expression language are then combined to establish a combined expression language. A synthesis procedure that learns a set of expressions in the combined expression language is generated. These learned expressions produce a prescribed output from one or more input string variables using lookup operations and non-lookup operations, and are based on a set of one or more input-output examples. As before, each input-output example includes one or more input string variables and the prescribed output. Once the synthesis procedure is generated, one or more input string variables of a same type found in the set of input-output examples are received, and the prescribed output is produced using the synthesis procedure.
It should be noted that this Summary is provided to introduce a selection of concepts, in a simplified form, that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of inductive synthesis and combination framework technique embodiments reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the technique.
In general, the inductive synthesis and combination framework technique embodiments described herein involve performing semantic transformations on strings, and more particularly, transformations on strings involving lookup in relational tables and transformations on strings representing one or more other data-types in addition to relational table lookups.
More particularly, it is observed that semantic transformations can often be expressed as a combination of transformations involving lookup in relational tables and other transformations such as syntactic transformations, number transformations, and so on. Given this, the sections to follow first describe a semantic string lookup transformation language, which can be used to generate an inductive synthesis procedure that synthesizes a set of transformations involving lookup operations that are consistent with the given set of input-output examples. This is followed by a description of a combination framework for combining the lookup transformation language and its synthesis procedure with other transformation languages and their associated synthesis procedures. The resulting combined synthesis procedures enable the combination framework to synthesize transformations on a rich variety of data-types.
The inductive synthesis and combination framework technique embodiments described herein in general present a programming by example technology that allows end-users to automate repetitive tasks, such as converting spreadsheet input data into a prescribed output by simply providing one or more input-output examples. Thus, a tool is provided that is ready to be deployed for use by end-users in the real world.
A string transformation expression language L describes expressions e that map an input state σ, which holds values for m string variables v1, . . . , vm (for example, denoting the multiple input columns in a spreadsheet), to a single output string s, such that,
$e : (\mathrm{String} \times \cdots \times \mathrm{String}) \to \mathrm{String}$
The above formalism can also be used for string processing tasks that require generating a tuple of n strings as an output by simply solving n independent problems.
An expression language L as defined for the purposes of the present disclosure is characterized by the following components:
1) A set of grammar rules R; and
2) A start symbol e, which is a uniquely distinguished non-terminal symbol occurring in R.
A synthesis procedure “Synthesize” for an expression language L learns the set of expressions in L that are consistent with a given set of input-output examples. The synthesis procedure is based on a framework characterized by the following components:
1) A data-structure D for succinctly representing sets of expressions in language L. This is important because the number of expressions that are consistent with a given input-output example(s) may be huge and their explicit representation is not feasible in many cases. D itself is described using a set of grammar rules $\tilde{R}$ with start symbol $\tilde{e}$;
2) A “GenerateStr” or learning procedure for computing the set of expressions (represented using data-structure D) that are consistent with a given input-output example; and
3) An Intersect procedure for intersecting two sets of expressions (represented using data-structure D). The Intersect procedure is also described using a set of rules.
The synthesis procedure Synthesize involves invoking the GenerateStr procedure on each input-output example, and intersecting the results using the Intersect procedure as follows:
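The loop just described can be sketched in Python as follows. This is a minimal illustration rather than the listing from the original disclosure: the function names, the use of plain Python sets as the data-structure D, and the toy language (in which an expression is simply the index of an input variable) are all assumptions made for the example.

```python
from functools import reduce

def synthesize(examples, generate_str, intersect):
    """Learn the expressions consistent with every (state, output) example."""
    # One set of consistent expressions per example ...
    per_example = [generate_str(sigma, s) for sigma, s in examples]
    # ... folded pairwise into a single set consistent with all of them.
    return reduce(intersect, per_example)

# Toy language (an assumption for illustration): an expression is the index i
# of an input variable, and its value on state sigma is sigma[i].
def toy_generate_str(sigma, s):
    return {i for i, value in enumerate(sigma) if value == s}

def toy_intersect(a, b):
    return a & b

examples = [(("a", "b", "a"), "a"), (("x", "y", "z"), "x")]
learned = synthesize(examples, toy_generate_str, toy_intersect)
```

On the first example both index 0 and index 2 are consistent; the second example rules out index 2, which is why additional examples narrow the learned set.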
It is noted that with regard to the GenerateStr procedure, it is defined to be sound and k-complete as follows. Let $\tilde{e}_t = \mathrm{GenerateStr}(\sigma, s)$. GenerateStr is defined to be sound if:
$\forall e_t \in [\![\tilde{e}_t]\!] : [\![e_t]\!]\sigma = s$
GenerateStr is defined to be complete if $\tilde{e}_t$ includes all expressions that are consistent with the input-output example (σ,s), and GenerateStr is defined to be k-complete if $\tilde{e}_t$ includes all expressions of depth at most k that are consistent with the input-output example (σ,s).
With regard to the Intersect procedure, it is defined to be sound and complete as follows. Let $\tilde{e}'' = \mathrm{Intersect}(\tilde{e}, \tilde{e}')$. Intersect is defined to be sound and complete iff $[\![\tilde{e}'']\!] = [\![\tilde{e}]\!] \cap [\![\tilde{e}']\!]$.
String transformations often require access to real-world knowledge. In some cases, this knowledge can be modeled as relational tables (e.g., a table that maps telephone country codes to country names). In many cases, this knowledge is already available in the form of existing spreadsheet tables, organizational databases, relational data available over the web, etc. A language Lt for performing string transformations that require lookup operations in given relational tables will now be described.
The syntax of an expression language for lookup transformations over a given set of relational tables T is defined in one embodiment as follows:
Expression $e_t := v_i \mid \mathrm{Select}(C, T, b)$
Boolean Condition $b := p_1 \wedge \cdots \wedge p_n$
Predicate $p := C = s \mid C = e_t$
In accordance with the foregoing syntax, an expression et is either an input string variable vi, or a select expression Select(C,T,b), where T is a table identifier and C is a column identifier within that table. The Boolean condition b is a conjunction of various predicates p1 . . . pn and is often treated as a set of predicates {p1, . . . , pn} for notational convenience. A predicate p is an equality comparison between the content of some column of some table with a constant or an expression. The symbol s denotes a string constant.
In addition, the semantics of this expression language for lookup transformations is defined in one embodiment as follows:
where set S is the result of the relational algebra query “select C from T where b”. It is noted that the notation Choose(S) is used to refer to non-deterministic selection of any element from S (in case S is not a singleton set).
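The semantics can be illustrated with a small Python evaluator. This is a sketch under stated assumptions, not the formal definition from the original disclosure: the tuple encoding of expressions, the `evaluate` function, the `Codes` table, and the use of `min` as a deterministic stand-in for Choose are all hypothetical choices made for the example.

```python
def evaluate(expr, sigma, tables, choose=min):
    """Evaluate a lookup expression on input state sigma.

    expr is ("var", name) or ("select", column, table, predicates), where
    predicates is a list of (column, operand) pairs and each operand is either
    a string constant or a nested expression tuple.
    """
    if expr[0] == "var":
        return sigma[expr[1]]
    _, column, table, predicates = expr
    # Resolve each predicate's right-hand side to a concrete string first.
    resolved = [
        (c, evaluate(rhs, sigma, tables, choose) if isinstance(rhs, tuple) else rhs)
        for c, rhs in predicates
    ]
    # S = "select column from table where b", with b a conjunction of equalities.
    S = {row[column] for row in tables[table]
         if all(row[c] == value for c, value in resolved)}
    # Choose(S): any element of S; min() raises ValueError if S is empty.
    return choose(S)

# Hypothetical table for illustration only.
tables = {"Codes": [
    {"Code": "44", "Country": "United Kingdom"},
    {"Code": "91", "Country": "India"},
]}
expr = ("select", "Country", "Codes", [("Code", ("var", "v1"))])
country = evaluate(expr, {"v1": "44"}, tables)
```

Because predicate operands may themselves be expressions, the evaluator recurses before filtering rows, which is what allows one lookup to feed another.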
The set of all expressions in language Lt that are consistent with a given input-output example can be exponential in the size of the reference table T. This set can be represented succinctly using the data structure syntax and semantics described below. More particularly, the syntax of the data structure for lookup transformations is defined in one embodiment as follows:
$\tilde{e}_t := (\tilde{\eta}_1, \tilde{\eta}_2, \mathrm{Progs})$ where $\mathrm{Progs} : (\tilde{\eta}_1 \cup \tilde{\eta}_2) \to f$
$f := v_i \mid \mathrm{Select}(C, T, \tilde{b})$
$\tilde{b} := (\{\beta_i\}_i, \beta)$
$\beta := q_1 \wedge \cdots \wedge q_n$
$q := C = s \mid C = \eta$
Thus, the data structure consists of a tuple $(\tilde{\eta}_1, \tilde{\eta}_2, \mathrm{Progs})$ where $\tilde{\eta}_1$ and $\tilde{\eta}_2$ are sets of nodes η, and Progs[η] represents a set of expressions from language Lt. Two aspects of this data structure are: (a) the use of temporary nodes $\tilde{\eta}_2$ to achieve sharing among different expressions (similar to the use of extra variables during compilation to perform common sub-expression elimination, except that here it is done across different programs or expressions); and (b) exploiting the conjunctive normal form (CNF) of boolean conditions to represent a huge set $\tilde{b}$ of conditions by simply maintaining a few minimal sets $\{\beta_i\}_i$ and the maximum set β.
The semantics of the data structure for lookup transformations is defined in one embodiment as follows:
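$[\![(\tilde{\eta}_1, \tilde{\eta}_2, \mathrm{Progs})]\!] = \{ e_t \mid e_t \in [\![\mathrm{Progs}[\eta]]\!],\ \eta \in \tilde{\eta}_1 \}$
$[\![v_i]\!] = \{ v_i \}$
$[\![\mathrm{Select}(C, T, \tilde{b})]\!] = \{ \mathrm{Select}(C, T, b) \mid b \in [\![\tilde{b}]\!] \}$
$[\![(\{\beta_i\}_i, \beta)]\!] = \{ p_1 \wedge \cdots \wedge p_n \mid p_j \in [\![q_j]\!],\ \{q_j\}_j = \beta_i \cup \beta',\ \beta' \subset \beta \}$
$[\![C = s]\!] = \{ C = s \}$
$[\![C = \eta]\!] = \{ C = e_t \mid e_t \in [\![\mathrm{Progs}[\eta]]\!] \}$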
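To make the saving from the minimal-sets/maximum-set representation concrete, the following Python sketch enumerates every conjunction denoted by such a pair. The function name and the single-letter predicate labels are hypothetical; predicates are treated as opaque labels, so only the set-level part of the semantics is illustrated.

```python
from itertools import combinations

def expand_conditions(minimal_sets, maximal_set):
    """Enumerate every conjunction (as a frozenset of predicates) denoted by
    the pair (minimal sets, maximum set): each one is the union of some
    minimal set with some subset of the maximum set."""
    beta = list(maximal_set)
    denoted = set()
    for beta_i in minimal_sets:
        for size in range(len(beta) + 1):
            for extra in combinations(beta, size):
                denoted.add(frozenset(beta_i) | frozenset(extra))
    return denoted

# Two minimal sets and one maximum set stand in for ten explicit conditions.
conditions = expand_conditions([{"A"}, {"B", "C"}], {"A", "B", "C", "D"})
```

Here two minimal sets and one maximum set of four predicates denote ten distinct conjunctions, and the gap widens quickly as the maximum set grows.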
In one embodiment, the GenerateStrt procedure for lookup transformations is as follows:
The GenerateStrt procedure operates by iteratively computing a set of nodes $\tilde{\eta}$ and updating two maps Progs and Id in the loop at Line 5. The map Id associates every node η to its corresponding source, and is used to avoid generation of duplicate nodes corresponding to the same source (using the check at Line 10). Id[η] is either an input variable vi or some table cell (T,C,r). The helper function val(η) converts every node to the corresponding string value. The map Progs associates every node η to a set of expressions (of depth at most k steps), each of which evaluates to val(η) on the input state σ. The purpose of each iteration of the loop at Line 5 is to perform an iterative forward reachability analysis of the string values that can be generated in a single step (i.e., using a single Select expression) from the string values computed in previous steps, with the base case being the values of the input string variables.
Each iteration of the loop at Line 5 results in consideration of expressions whose depth is one larger than that of the expressions considered in the previous step. The depth of the expressions in language Lt can be as much as the total number of entries in all of the relational tables combined. Since it has not been observed that any intended program has a large depth in practice, the depth consideration is limited to a parameter k (which is set to 5 in one tested embodiment) for efficiency reasons. One might be tempted to use the predicate $(s \in \{\mathrm{val}(\eta) \mid \eta \in \tilde{\eta}\}) \vee (\tilde{\eta}_{Old} = \tilde{\eta})$ as a termination condition for the loop. However, this has two issues. One is that the output s might coincidentally be computable by a program of depth smaller than the depth of the intended program on a given example, in which case the procedure would fail to discover the correct program. The other is that the intended program might not belong to the language Lt, in which case the search would fail, but possibly only after consideration of all expressions whose depth is as large as the total number of entries in all relational tables combined.
The procedure GenerateBool($\tilde{\eta}$,T,r) generates the set of all boolean conditions, each of which uniquely identifies row r in table T. In other words, each such condition is satisfied by row r of table T, but is not satisfied by any other row of table T. The set β computed at Line 1 denotes the set of all predicates q that are satisfied by row r of table T.
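The forward reachability analysis described above can be sketched in Python as follows. This is a simplified reading of the prose and not the original listing, so its line structure does not correspond to the line numbers cited in the text. Assumptions made for the sketch: tables are lists of row dictionaries, nodes are integer ids, only node-valued predicates (C = η) are generated (constant predicates C = s are omitted), minimal sets are recorded at the level of columns, and all table contents are hypothetical.

```python
from itertools import combinations

def generate_bool(nodes, val, rows, r):
    """Conditions that single out row r, as (minimal column sets, beta)."""
    row = rows[r]
    # beta: every predicate "column = node" that row r satisfies.
    beta = frozenset((c, n) for c in row for n in nodes if val[n] == row[c])
    cols = sorted({c for c, _ in beta})
    minimal = []
    for size in range(1, len(cols) + 1):
        for cs in combinations(cols, size):
            if any(m <= set(cs) for m in minimal):
                continue  # a subset already identifies the row
            if all(any(o[c] != row[c] for c in cs)
                   for j, o in enumerate(rows) if j != r):
                minimal.append(frozenset(cs))
    return (frozenset(minimal), beta) if minimal else None

def generate_str_t(sigma, s, tables, k=5):
    val, progs, ident = {}, {}, {}  # node -> value, node -> expressions, source -> node
    for v, x in sigma.items():      # base case: one node per input variable
        n = len(val)
        val[n], progs[n], ident[("var", v)] = x, {("var", v)}, n
    for _ in range(k):              # each pass adds one level of Select depth
        old = list(val)             # nodes reachable before this pass
        for tname, rows in tables.items():
            for r, row in enumerate(rows):
                cond = generate_bool(old, val, rows, r)
                if cond is None:
                    continue        # row r cannot be uniquely reached yet
                for col in row:
                    src = (tname, col, r)
                    if src not in ident:  # avoid duplicate nodes per cell
                        n = len(val)
                        ident[src], val[n], progs[n] = n, row[col], set()
                    progs[ident[src]].add(("select", col, tname, cond))
    out = {n for n in val if val[n] == s}
    return out, set(val) - out, progs

# Hypothetical tables and example.
tables = {
    "MarkupRec": [{"Item-Id": "S30", "Item-Name": "Stroller"},
                  {"Item-Id": "B56", "Item-Name": "Bib"}],
    "CostRec": [{"Item-Id": "S30", "Date": "12/2010", "Cost": "$145.67"},
                {"Item-Id": "S30", "Date": "11/2010", "Cost": "$142.38"},
                {"Item-Id": "B56", "Date": "12/2010", "Cost": "$3.56"}],
}
out, temp, progs = generate_str_t({"v1": "Stroller", "v2": "12/2010"}, "$145.67", tables)
```

In this run the cost cell only becomes reachable on the second pass, once the item id has been looked up from the item name, which mirrors the depth-by-depth exploration described in the text.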
In one embodiment, the Intersectt procedure for intersecting the sets of expressions computed by the GenerateStrt procedure is defined as follows:
$\mathrm{Intersect}_t((\tilde{\eta}_1, \tilde{\eta}_2, \mathrm{Progs}), (\tilde{\eta}'_1, \tilde{\eta}'_2, \mathrm{Progs}')) = (\tilde{\eta}_1 \times \tilde{\eta}'_1,\ \tilde{\eta}_2 \times \tilde{\eta}'_2,\ \mathrm{Progs}'')$
where
Project1(βs)⊃βi,Project2(βs)⊃β′j}i,j
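$\mathrm{Intersect}_t(C = s, C = s) = (C = s)$
$\mathrm{Intersect}_t(C = \eta, C = \eta') = (C = (\eta, \eta'))$
$\mathrm{Project}_k(C = s) = (C = s)$
$\mathrm{Project}_1(C = (\eta, \eta')) = (C = \eta)$
$\mathrm{Project}_2(C = (\eta, \eta')) = (C = \eta')$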
Note that the procedure GenerateStrt is sound and k-complete, and the procedure Intersectt is sound and complete.
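The predicate-level rules of the intersection can be sketched in Python as follows. The tuple encoding of predicates and both function names are hypothetical, and returning None for every pairing not covered by the rules above (different columns, different constants, or a constant against a node) is an assumption of this sketch.

```python
def intersect_pred(p, q):
    """Intersect two predicates of the form (column, (kind, operand)), where
    kind is "const" or "node". Returns None when nothing is shared."""
    (c1, (kind1, x1)), (c2, (kind2, x2)) = p, q
    if c1 != c2 or kind1 != kind2:
        return None
    if kind1 == "const":
        return p if x1 == x2 else None
    # Two node predicates intersect to a predicate over the product node.
    return (c1, ("node", (x1, x2)))

def project(k, pred):
    """Recover the k-th (1 or 2) original predicate from an intersected one."""
    column, (kind, x) = pred
    if kind == "const":
        return pred
    return (column, ("node", x[k - 1]))

p = ("Item-Id", ("node", 2))
q = ("Item-Id", ("node", 7))
pq = intersect_pred(p, q)
```

Keeping the pair of source nodes in the intersected predicate is what lets the projections map an intersected condition back to the two original conditions.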
The foregoing aspects of the inductive synthesis and combination framework technique embodiments described herein can be realized in one general implementation outlined in
With regard to generating the synthesis procedure in the foregoing process, in one implementation shown in
The aforementioned intersect procedure is employed to intersect the expressions generated from each input-output example by the string generation procedure (as exemplified by the previously described Intersectt procedure embodiment, which intersects the expressions generated by the GenerateStrt procedure embodiment). Generally, the represented sets of expressions and the intersected sets of expressions are intersected in pairs until a single, comprehensive intersected set of expressions is obtained. Thus, in view of the procedure outlined in
It is also possible to modify the process of
The foregoing processes can be employed to learn a lookup table string transformation for the following scenario. The goal of this example is to output a cost value of an item from the item name and date in the input columns of a spreadsheet 400 shown in
The expression synthesized using the inductive synthesis and combination framework technique embodiments described herein is as follows:
Select(Cost, CostRec, Date = v2 ∧ Item-Id = et),
where et = Select(Item-Id, MarkupRec, Item-Name = v1)
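The nested lookup expressed by this expression can be traced with a short Python sketch. The table rows below are hypothetical stand-ins for the contents of the referenced spreadsheet and tables, and `lookup_cost` is an illustrative name, not part of the described embodiments.

```python
# Hypothetical table contents for illustration only.
markup_rec = [
    {"Item-Id": "S30", "Item-Name": "Stroller"},
    {"Item-Id": "B56", "Item-Name": "Bib"},
]
cost_rec = [
    {"Item-Id": "S30", "Date": "12/2010", "Cost": "$145.67"},
    {"Item-Id": "S30", "Date": "11/2010", "Cost": "$142.38"},
    {"Item-Id": "B56", "Date": "12/2010", "Cost": "$3.56"},
]

def lookup_cost(v1, v2):
    # Inner lookup: Select(Item-Id, MarkupRec, Item-Name = v1)
    ids = [r["Item-Id"] for r in markup_rec if r["Item-Name"] == v1]
    if not ids:
        return None
    # Outer lookup: Select(Cost, CostRec, Date = v2 AND Item-Id = <inner result>)
    costs = [r["Cost"] for r in cost_rec
             if r["Date"] == v2 and r["Item-Id"] == ids[0]]
    return costs[0] if costs else None

cost = lookup_cost("Stroller", "12/2010")
```

Neither the date nor the item id alone identifies a single cost row in this data; only their conjunction does, which is why the outer Select carries both predicates.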
A combination framework will now be presented for combining the previously described lookup transformation language and its synthesis procedures, with other transformation languages and their associated synthesis procedures. More particularly, given the above-described language Lt, it is possible to construct a generic learning procedure for the combination of Lt with any other string transformation language.
For example, but without limitation, the second expression language can be a number transformation expression language that performs formatting and rounding operations on numbers, or a syntactic transformation expression language that performs syntactic string transformations. Further, the second expression language can be a combined language itself. For instance, but without limitation, the second language can be a combined expression language that performs formatting and rounding operations on numbers as well as syntactic string transformations.
Let La be an expression language whose grammar consists of rules Ra with start symbol ea. Without loss of generality, it is assumed that the grammar rules of different languages do not share non-terminals.
In general, the combination of Lt with another string transformation language La produces a third language Lt⊕La, whose expression grammar consists of rules Rt∪Ra along with the following new rules, and with e as the start symbol:
$e_t := \text{let } u = e_a \text{ in } e_t$
$e_a := \text{let } u = e_t \text{ in } e_a$
$e := e_t \mid e_a$
The semantics of the let rule is defined as:
$[\![\text{let } u = e_a \text{ in } e_t]\!]\sigma = [\![e_t]\!]\sigma'$ where $\sigma' = \sigma[u \leftarrow [\![e_a]\!]\sigma]$
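The let rule amounts to evaluating the bound expression on the current state and then evaluating the body on a state extended with the result. A minimal Python sketch follows, in which expressions are modelled as functions from a state dictionary to a string; the helper `let` and the two stand-in expressions are hypothetical.

```python
def let(u, bound, body):
    """Build the expression "let u = bound in body"."""
    def run(sigma):
        # sigma' is sigma with u mapped to the value of the bound expression;
        # the caller's state is left unmodified.
        sigma_prime = {**sigma, u: bound(sigma)}
        return body(sigma_prime)
    return run

def upper(sigma):            # stands in for an expression of the second language
    return sigma["v1"].upper()

def echo_u(sigma):           # stands in for a body expression that reads u
    return sigma["u"] + "!"

expr = let("u", upper, echo_u)
state = {"v1": "abc"}
result = expr(state)
```

Binding the intermediate value to a fresh variable is what lets an expression of one language consume the output of the other without either language knowing the other's grammar.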
It will now be shown how the synthesis procedure for language Lt can be extended to a synthesis procedure for the combination language Lt⊕La where the synthesis procedure for La is considered a black-box.
The data-structure for representing sets of expressions in language Lt⊕La consists of the union of rules {tilde over (R)}t and {tilde over (R)}a along with the following additional rules:
$q := C = \tilde{e}_a$
$\tilde{e}_a := \text{let } u = \tilde{e}_t \text{ in } \tilde{e}_a$
where the set of variables that occur in $\tilde{e}_a$ come from the set $\tilde{\eta}_2$.
The semantics of these rules are as follows:
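$[\![C = \tilde{e}_a]\!] = \{ C = (\text{let } u = e_a[\eta_i \leftarrow e_{t,i}]_i \text{ in } u) \mid e_a \in [\![\tilde{e}_a]\!],\ e_{t,i} \in [\![\mathrm{Progs}(\eta_i)]\!] \}$
$[\![\text{let } u = \tilde{e}_t \text{ in } \tilde{e}_a]\!] = \{ \text{let } u = e_t \text{ in } e_a \mid e_t \in [\![\tilde{e}_t]\!],\ e_a \in [\![\tilde{e}_a]\!] \}$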
The GenerateStr procedure for language Lt⊕La consists of the following two generalizations to the GenerateStrt procedure of language Lt described previously. Let s=T[C,r] and σ be an input state that maps η to val(η) for all $\eta \in \tilde{\eta}_2$.
The condition “val(η)=T[C,r]” at Line 7 in GenerateStrt is replaced by the condition “GenerateStra(σ,s) contains any expression that uses a variable from $\tilde{\eta}_{diff}$”. In addition, the following set is added to β at Line 1 in helper function GenerateBool: {C=GenerateStra(σ, s)}.
The Intersect procedure here consists of the union of rules for the Intersectt procedure of language Lt, rules for Intersecta procedure of language La, and the following additional rules.
$\mathrm{Intersect}_t(\text{let } u = \tilde{e}_a \text{ in } f,\ \text{let } u' = \tilde{e}'_a \text{ in } f') = \text{let } u = \mathrm{Intersect}(\tilde{e}_a, \tilde{e}'_a) \text{ in } \mathrm{Intersect}_t(f, f'[u/u'])$
$\mathrm{Intersect}(\text{let } u = \tilde{e}_t \text{ in } \tilde{e}_a,\ \text{let } u' = \tilde{e}'_t \text{ in } \tilde{e}'_a) = \text{let } u = \mathrm{Intersect}_t(\tilde{e}_t, \tilde{e}'_t) \text{ in } \mathrm{Intersect}(\tilde{e}_a, \tilde{e}'_a[u/u'])$
1.3.3 Exemplary Process for Combining Lookup Table String Transformations with Other String Transformations
The foregoing combination aspects of the inductive synthesis and combination framework technique embodiments described herein can be realized in one general implementation outlined in
With regard to combining the relational table lookup expression language and the second expression language to establish a combined expression language in the foregoing process, in one implementation, this involves combining a first set of grammar rules associated with the relational table lookup expression language and a second set of grammar rules associated with the second expression language without repeating any of the rules. In addition, whenever an expression in the relational table lookup expression language also involves a non-lookup operation, a second expression language expression corresponding to the non-lookup operation is included in the relational table lookup language expression. And, whenever an expression in the second expression language also involves a lookup operation, a relational table lookup language expression corresponding to the lookup operation is included in the second language expression.
It is further noted that the synthesis procedure for the combined expression language is generated in the same way as described previously in connection with the relational table lookup language. The only exceptions are the few additions and substitutions identified above for the data structure, GenerateStr procedure and Intersect procedure employed with the Lt⊕La language.
The foregoing processes can be employed to learn a combined lookup table string transformation and other transformations for the following scenario. In this case the other transformations involve a combined expression language directed to rounding numbers and syntactic operations. The goal of this example is to perform a set of currency conversion tasks for reimbursement purposes, taking as input the amount of dollars or other currency to be converted, the type of currency into which the money is to be converted, and the date of the conversion. These inputs (v1, v2, v3) are provided in the first three columns 802, 804, 806, respectively, of the spreadsheet 800 shown in
The inductive synthesis and combination framework technique embodiments described herein can be used to generate a synthesis procedure to produce the desired output (shown in bold in the output column of the last four rows of the
The desired transformation is synthesized as a combination of syntactic transformation, lookup transformation over the provided currency conversion table, and number transformation for rounding off the currency conversion rate to two decimal digits. The desired transformation can be expressed in a combined expression language Lt⊕La (where La is a given combined number and syntactic transformation language) as follows:
Concatenate(Round1(u,0,0.01),ConstStr(“*”),SubStr2(v1,NumTok,1))
where
Note that the Concatenate expression comes from the combined number and syntactic transformation language La, where the first term rounds the number “u” to two decimal places, the second term introduces the symbol “*”, and the third term parses the number portion (NumTok) of the v1 input. The number u is the conversion rate looked up in the CurrTab table, and is defined by the embedded Lt language expression “Select(ExRate, CurrTab, [b ∧ Dst=v2 ∧ Date=v3])”. This expression selects the number in the ExRate column (C) of the CurrTab table (T) that satisfies the conditions (q) specified as “[b ∧ Dst=v2 ∧ Date=v3]”. Note that b refers to an embedded combined number and syntactic transformation language La expression defined as “(Src=SubStr2(v1,alphaTok,1))”, which parses the text portion (alphaTok) of the v1 input.
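The combined transformation can be traced end to end with the following Python sketch. It is an illustration under stated assumptions: the rows of `curr_tab`, the input layout (a currency code and an amount in v1), the regular expressions standing in for alphaTok and NumTok, and the reading of Round1(u,0,0.01) as rounding to the nearest 0.01 are all hypothetical.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical conversion table for illustration only.
curr_tab = [
    {"Src": "USD", "Dst": "EUR", "Date": "01/05/2010", "ExRate": "0.6953"},
    {"Src": "USD", "Dst": "GBP", "Date": "01/05/2010", "ExRate": "0.6249"},
    {"Src": "EUR", "Dst": "USD", "Date": "01/06/2010", "ExRate": "1.4382"},
]

def convert(v1, v2, v3):
    # Syntactic step feeding the lookup: SubStr2(v1, alphaTok, 1)
    src = re.search(r"[A-Za-z]+", v1).group()
    # Syntactic step for the output: SubStr2(v1, NumTok, 1)
    amount = re.search(r"\d+(?:\.\d+)?", v1).group()
    # Lookup step: Select(ExRate, CurrTab, Src = src AND Dst = v2 AND Date = v3)
    rates = [r["ExRate"] for r in curr_tab
             if r["Src"] == src and r["Dst"] == v2 and r["Date"] == v3]
    if not rates:
        return None
    # Number step: round the looked-up rate to two decimal places.
    u = Decimal(rates[0]).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # Concatenate(rounded rate, "*", amount)
    return f"{u}*{amount}"

formula = convert("USD 20", "EUR", "01/05/2010")
```

The sketch shows the two directions of embedding: a syntactic expression supplies the Src value inside the lookup condition, and the lookup result u is in turn consumed by the number and syntactic operations that build the output.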
The inductive synthesis and combination framework technique embodiments described herein are operational within numerous types of general purpose or special purpose computing system environments or configurations.
For example,
To allow a device to implement the inductive synthesis and combination framework technique embodiments described herein, the device should have a sufficient computational capability and system memory to enable basic computational operations. In particular, as illustrated by
In addition, the simplified computing device of
The simplified computing device of
Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, etc., can also be accomplished by using any of a variety of the aforementioned communication media to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, RF, infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
Further, software, programs, and/or computer program products embodying some or all of the various inductive synthesis and combination framework technique embodiments described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer or machine readable media or storage devices and communication media in the form of computer executable instructions or other data structures.
Finally, the inductive synthesis and combination framework technique embodiments described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
The combination framework described previously was employed to combine the previously described lookup transformation language and its synthesis procedures with other transformation languages and their associated synthesis procedures. However, it is noted that in general any expression language and its synthesis procedures can be combined with another transformation language using the combination framework by employing the same procedures as described in connection with combining the lookup transformation language with another language.
It is further noted that any or all of the aforementioned embodiments throughout the description may be used in any combination desired to form additional hybrid embodiments. In addition, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.