source: trunk/yao/share/antlr-2.7.7/doc/trees.html

Last change on this file was 1, checked in by lnalod, 15 years ago

Initial import of YAO sources

File size: 40.1 KB
Line 
1<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
2<html>
3<head>
4        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
5        <title>ANTLR Tree Construction</title>
6</head>
7<body bgcolor="#FFFFFF" text="#000000">
8<h1><a name="_bb1">ANTLR Tree Construction</a></h1>
9<p>
10        ANTLR helps you build intermediate form trees, or abstract syntax trees
11        (ASTs), by providing grammar annotations that indicate what tokens are to
12        be treated as subtree roots, which are to be leaves, and which are to be
13        ignored with respect to tree construction.&nbsp; As with PCCTS 1.33, you
14        may manipulate trees using tree grammar actions.     
15</p>
16
17<p>
18        It is often the case that programmers either have existing
19        tree definitions or need a special physical structure, thus,
20        preventing ANTLR from specifically defining the implementation
21        of AST nodes. ANTLR specifies only an interface describing
22        minimum behavior. Your tree implementation must implement this
23        interface so ANTLR knows how to work with your trees. Further,
24        you must tell the parser the name of your tree nodes or
25        provide a tree &quot;factory&quot; so that ANTLR knows how to
26        create nodes with the correct type (rather than hardcoding in
27        a <tt>new AST()</tt> expression everywhere). &nbsp; ANTLR can
28        construct and walk any tree that satisfies the AST
29        interface.&nbsp; A number of common tree definitions are
30        provided.  Unfortunately, ANTLR cannot parse XML DOM trees since our
31        method names conflict (e.g., <tt>getFirstChild()</tt>); ANTLR was here
32        first &lt;wink>.  Argh!
33</p>
34
35<h2><a name="_bb2"></a><a name="Notation">Notation</a></h2>
36<p>
37        In this and other documents, tree structures are represented by a
38        LISP-like notation, for example:
39</p>
40<pre><tt>#(A B C)</tt></pre>
41<p>
42        is a tree with A at the root, and children B and C. This notation can be
43        nested to describe trees of arbitrary structure, for example:
44</p>
45<pre><tt>#(A B #(C D E))</tt></pre>
46<p>
47        is a tree with A at the root, B as a first child, and an entire subtree
48        as the second child. The subtree, in turn, has C at the root and D,E as
49        children.
50</p>
51<h2><a name="_bb3"></a><a name="Controlling AST construction">Controlling AST construction</a></h2>
52<p>
53        AST construction in an ANTLR Parser, or AST transformation in a
54        Tree-Parser, is turned on and off by the <a href="options.html#buildAST">
55        <tt>buildAST</tt> option</a>.
56</p>
57<p>
58        From an AST construction and walking point of view, ANTLR considers all
59        tree nodes to look the same (i.e., they appear to be homogeneous).&nbsp;
60        Through a tree factory or by specification, however, you can instruct ANTLR
61        to create nodes of different types. &nbsp; See the section below on
62        heterogeneous trees.
63</p>
64<h2><a name="_bb4"></a><a name="Grammar annotations for building ASTs">Grammar annotations for building ASTs</a></h2> <h3><a name="_bb5"></a><a name="Leaf nodes">Leaf nodes</a></h3>
65<p>
66        ANTLR assumes that any nonsuffixed token reference or token-range is a
67        leaf node in the resulting tree for the enclosing rule. If no suffixes at
68        all are specified in a grammar, then a Parser will construct a linked-list
69        of the tokens (a degenerate AST), and a Tree-Parser will copy the input
70        AST.
71</p>
72<h3><a name="_bb6"></a><a name="Root nodes">Root nodes</a></h3>
73<p>
74        Any token suffixed with the &quot;<tt>^</tt>&quot; operator is
75        considered a root token. A tree node is constructed for that token and is
76        made the root of whatever portion of the tree has been built
77</p>
78<pre><tt>a : A B^ C^ ;</tt></pre>
79<p>
80        results in tree <tt>#(C #(B A))</tt>.
81</p>
82<p>
83        First A is matched and made a lonely child, followed by B which is made the parent of the current tree, A. Finally, C is matched and made the parent of the current tree, making it the parent of the B node. Note that the same rule without any operators results in the flat tree <tt>A B C</tt>.
84</p>
85<h3><a name="_bb7"></a><a name="Turning off standard tree construction">Turning off standard tree construction</a></h3>
86<p>
87        Suffix a token reference with &quot;<tt>!</tt>&quot; to prevent
88        incorporation of the node for that token into the resulting tree (the AST
89        node for the token is still constructed and may be referenced in actions,
90        it is just not added to the result tree automatically). Suffix a rule
91        reference &quot;<tt>!</tt>&quot; to indicate that the tree constructed by
92        the invoked rule should not be linked into the tree constructed for the
93        current rule.
94</p>
95<p>
96        Suffix a rule definition with &quot;<tt>!</tt>&quot; to indicate that
97        tree construction for the rule is to be turned off. Rules and tokens
98        referenced within that rule still create ASTs, but they are not linked into
99        a result tree. The following rule does no automatic tree construction.
100        Actions must be used to set the return AST value, for example:
101</p>
102<pre><tt>begin!
103    :   INT PLUS i:INT
104        { #begin = #(PLUS INT i); }
105    ;</tt></pre>
106<p>
107        For finer granularity, prefix alternatives with &quot;<tt>!</tt>&quot;
108        to shut off tree construction for that alternative only. This granularity
109        is useful, for example, if you have a large number of alternatives and you
110        only want one to have manual tree construction:
111</p>
112<pre><tt>stat:
113        ID EQUALS^ expr   // auto construction
114    ... some alternatives ...
115    |!  RETURN expr
116        {#stat = #([IMAGINARY_TOKEN_TYPE] expr);}
117    ... more alternatives ...
118    ;</tt> </pre> <h3><a name="_bb8"></a><a name="Tree and tree node construction">Tree node construction</a></h3>
119<p>
120        With automatic tree construction off (but with <code>buildAST</code>
121        on), you must construct your own tree nodes and combine them into tree
122        structures within embedded actions. There are several ways to create a tree
123        node in an action:
124<ol>
125        <li>
126                use <tt>new <i>T</i>(<i>arg</i>)</tt> where <i>T</i> is your tree
127                node type and <i>arg</i> is either a single token type, a token type and
128                token text, or a <tt>Token</tt>.
129        </li>
130        <li>
131                use <tt>ASTFactory.create(<i>arg</i>)</tt> where <i>T</i> is your
132                tree node type and <i>arg</i> is either a single token type, a token type
133                and token text, or a <tt>Token</tt>. Using the factory is more
134                general than creating a new node directly, as it defers the node-type
135                decision to the factory, and can be easily changed for the entire
136                grammar.
137        </li>
138        <li>
139                use the shorthand notation #[TYPE] or #[TYPE,&quot;text&quot;] or
140                #[TYPE,&quot;text&quot;,ASTClassNameToConstruct]. The shorthand notation
141                results in a call to ASTFactory.create() with any specified arguments.
142        </li>
143        <li>
144                use the shorthand notation #<i>id</i>, where <i>id</i> is either a
145                token matched in the rule, a label, or a rule-reference.
146        </li>
147</ol>
148<p>
149        To construct a tree structure from a set of nodes, you can set the
150        first-child and next-sibling references yourself or call the factory
151        <tt>make</tt> method or use <tt>#(...)</tt> notation described below.
152</p>
153<h3><a name="_bb9"></a><a name="ActionTranslation">AST Action Translation</a></h3>
154<p>
155        In parsers and tree parsers with <tt>buildAST</tt> set to true, ANTLR
156        will translate portions of user actions in order to make it easier to build
157        ASTs within actions. In particular, the following constructs starting with
158        '#' will be translated:
159<dl>
160        <dt>
161                <tt>#<i>label</i></tt>
162        </dt>
163        <dd>
164                The AST associated with a labeled token-reference or rule-reference
165                may be accessed as <tt>#<i>label</i></tt>. The translation is to a
166                variable containing the AST node built from that token, or the AST
167                returned from the       rule.
168        </dd>
169        <dt>
170                <tt>#<i>rule</i></tt>
171        </dt>
172        <dd>
173                When <i>rule</i> is the name of the enclosing rule, ANTLR will
174                translate this into the variable containing the result AST for the rule.
175                This allows you to set the return AST for a rule or examine it from
176                within an action. This can be used when AST generation is on or
177                suppressed for the rule or alternate. For example:
178                <pre><tt>r! : a:A       { #r = #a; }</tt></pre>
179
180        </dd>
181        <dd>
182                <font face="Times New Roman">Setting the return tree is very useful
183                in combination with normal tree construction because you can have
184                ANTLR do all the work of building a tree and then add an imaginary
185                root node such as:</font>
186        </dd>
187        <dd>
188                &nbsp;
189        </dd>
190        <dd>
191<pre><tt>decl : ( TYPE ID )+
192       { #decl = #([DECL,&quot;decl&quot;], #decl); }
193     ;</tt></pre>
194        </dd>
195        <dd>
196                ANTLR allows you to assign to <tt>#rule</tt> anywhere within an
197      alternative of the rule. ANTLR ensures that references of and
198                assignments to <tt>#rule</tt> within an action force the parser's
199                internal AST construction variables into a stable state. After you
200                assign to <tt>#rule</tt>, the state of the parser's automatic AST
201                construction variables will be set as if ANTLR had generated the tree
202                rooted at <tt>#rule</tt>. For example, any children nodes added after
203                the action will be added to the children of <tt>#rule</tt>.
204        </dd>
205        <dt>
206                <tt>#<i>label</i>_in</tt>
207        </dt>
208        <dd>
209                In a tree parser, the <b>input</b> AST associated with a labeled
210                token reference or rule reference may be accessed as
211                <tt>#<i>label</i>_in</tt>. The translation is to a variable containing the
212                input-tree AST node from which the rule or token was extracted. Input
213                variables are seldom used. You almost always want to use
214                <tt>#<i>label</i></tt> instead of <tt>#<i>label</i>_in</tt>.
215        </dd>
216        <dt>
217                &nbsp;
218        </dt>
219        <dt>
220                <tt>#<i>id</i></tt>
221        </dt>
222        <dd>
223                ANTLR supports the translation of unlabeled token references as a
224                shorthand notation, as long as the token is unique within the scope
225                of a single alternative. In these cases, the use of an unlabeled
226                token reference identical to using a label. For example, this:
227<pre><tt>
228r! : A { #r = #A; }
229</tt></pre>
230                <p>
231                        is equivalent to:
232                </p>
233<pre><tt>
234r! : a:A { #r = #a; }</tt></pre>
235        </dd>
236        <dd>
237                <tt>#<i>id</i>_in</tt> is given similar treatment to
238                <tt>#<i>label</i>_in.</tt>
239        </dd>
240        <dt>
241                &nbsp;
242        </dt>
243        <dt>
244                <tt>#[<i>TOKEN_TYPE</i>]</tt> or <tt>#[<i>TOKEN_TYPE</i>,&quot;text&quot;] or #[TYPE,&quot;text&quot;,ASTClassNameToConstruct]</tt>
245        </dt>
246        <dd>
247                AST node constructor shorthand. The translation is a call to the <tt>ASTFactory.create()</tt> method.&nbsp; For example, <tt>#[T]</tt> is translated to: <pre><tt>ASFFactory.create(T)</tt></pre>
248        </dd>
249        <dt>
250                <tt>#(<i>root</i>, <i>c1</i>, ..., <i>cn</i>)</tt>
251        </dt>
252        <dd>
253                AST tree construction shorthand. ANTLR looks for the comma character
254to separate the tree arguments. Commas within method call tree elements are
255handled properly; i.e., an element of &quot;<tt>foo(#a,34)</tt>&quot; is ok
256and will not conflict with the comma separator between the other tree
257elements in the tree. This tree construct is translated to a &quot;make
258tree&quot; call. The &quot;make-tree&quot; call is complex due to the need
259to simulate variable arguments in languages like Java, but the result will
260be something like: <pre><tt>ASTFactory.make(<i>root</i>, <i>c1</i>, ...,
261<i>cn</i>);</tt></pre>
262                <p>
263                        In addition to the translation of the <tt>#(...)</tt> as a whole,
264the root and each child <tt><i>c1</i>..<i>cn</i></tt> will be translated.
265Within the context of a <tt>#(...)</tt> construct, you may use:
266                <ul>
267                        <li>
268                                <i><tt>id</tt></i> or <i><tt>label</tt></i> as a shorthand for
269                                <tt>#<i>id</i></tt> or <i><tt>#label</tt></i>.
270                        </li>
271                        <li>
272                                <tt>[...]</tt> as a shorthand for <tt>#[...]</tt>.
273                        </li>
274                        <li>
275                                <tt>(...)</tt> as a shorthand for <tt>#(...)</tt>.
276                        </li>
277                </ul>
278        </dd>
279</dl>
280<p>
281        The target code generator performs this translation with the help of a
282special lexer that parses the actions and asks the code-generator to create
283appropriate substitutions for each translated item. This lexer might impose
284some restrictions on label names (think of C/C++ preprocessor directives)
285</p>
286<h2><a name="_bb10"></a><a name="Invoking parsers that build trees">Invoking parsers that build trees</a></h2>
287<p>
288        Assuming that you have defined a lexer <tt>L</tt> and a parser <tt>P</tt> in your grammar, you can invoke them sequentially on the system input stream as follows.
289</p>
290<pre><tt><i>L</i> lexer = new <i>L</i>(System.in);
291<i>P</i> parser = new <i>P</i>(lexer);
292parser.setASTNodeType(&quot;MyAST&quot;);
293parser.<i>startRule</i>();</tt>   </pre>
294<p>
295        If you have set <tt>buildAST=true</tt> in your parser grammar, then it will build an AST, which can be accessed via <tt>parser.getAST()</tt>. If you have defined a tree parser called <tt>T</tt>, you can invoke it with:
296</p>
297<pre><tt>T walker = new T();
298walker.<i>startRule</i>(parser.getAST()); // walk tree</tt>  </pre>
299<p>
300        If, in addition, you have set <tt>buildAST=true</tt> in your tree-parser to turn on transform mode, then you can access the resulting AST of the tree-walker:
301</p>
302<pre><tt>AST results = walker.getAST();
303DumpASTVisitor visitor = new DumpASTVisitor();
304visitor.visit(results);</tt></pre>
305<p>
306        Where <tt>DumpASTVisitor</tt> is a predefined <tt>ASTVisitor</tt> implementation that simply prints the tree to the standard output.
307</p>
308<p>
309        You can also use get a LISP-like print out of a tree via
310</p>
311<pre>String s = parser.getAST().toStringList();</pre> <h2><a name="_bb11"></a><a name="AST Factories">AST Factories</a></h2>
312<p>
313        ANTLR uses a factory pattern to create and connect AST nodes. This is done to primarily to separate out the tree construction facility from the parser, but also gives you a hook in between the parser and the tree node construction.&nbsp; Subclass <tt>ASTFactory</tt> to alter the <tt>create</tt> methods.
314</p>
315<p>
316        If you are only interested in specifying the AST node type at runtime, use the
317</p>
318<pre><tt>setASTNodeType(String className)</tt></pre>
319<p>
320        method on the parser or factory.&nbsp; By default, trees are constructed of nodes of type <tt>antlr.CommonAST</tt>.  (You must use the fully-qualified class name).
321</p>
322
323<p>
324You can also specify a different class name for each token type to generate heterogeneous trees:
325
326<pre>
327/** Specify an "override" for the Java AST object created for a
328 *  specific token.  It is provided as a convenience so
329 *  you can specify node types dynamically.  ANTLR sets
330 *  the token type mapping automatically from the tokens{...}
331 *  section, but you can change that mapping with this method.
332 *  ANTLR does it's best to statically determine the node
333 *  type for generating parsers, but it cannot deal with
334 *  dynamic values like #[LT(1)].  In this case, it relies
335 *  on the mapping.  Beware differences in the tokens{...}
336 *  section and what you set via this method.  Make sure
337 *  they are the same.
338 *
339 *  Set className to null to remove the mapping.
340 *
341 *  @since 2.7.2
342 */
343public void setTokenTypeASTNodeType(int tokenType, String className)
344        throws IllegalArgumentException;
345</pre>
346
347<p>
348        The ASTFactory has some generically useful methods:
349</p>
350<pre>
351/** Copy a single node with same Java AST objec type.
352 *  Ignore the tokenType->Class mapping since you know
353 *  the type of the node, t.getClass(), and doing a dup.
354 *
355 *  clone() is not used because we want all AST creation
356 *  to go thru the factory so creation can be
357 *  tracked.  Returns null if t is null.
358 */
359public AST dup(AST t);</pre>
360<pre>
361/** Duplicate tree including siblings
362 * of root.
363 */
364public AST dupList(AST t);</pre> <pre>/**Duplicate a tree, assuming this is a
365 * root node of a tree--duplicate that node
366 * and what's below; ignore siblings of root
367 * node.
368 */
369public AST dupTree(AST t);</pre> <h2><a name="Heterogeneous ASTs">Heterogeneous ASTs</a></h2>
370<p>
371        Each node in an AST must encode information about the kind of node it is; for example, is it an ADD operator or a leaf node such as an INT?&nbsp; There are two ways to encode this: with a token type or with a Java (or C++ etc...) class type.&nbsp; In other words, do you have a single class type with numerous token types or no token types and numerous classes?&nbsp; For lack of better terms, I (Terence) have been calling ASTs with a single class type <em>homogeneous</em> trees and ASTs with many class types <em>heterogeneous</em> trees.
372</p>
373<p>
374        The only reason to have a different class type for the various kinds of nodes is for the case where you want to execute a bunch of hand-coded tree walks or your nodes store radically different kinds of data.&nbsp; The example I use below demonstrates an expression tree where each node overrides <font face="Courier New">value()</font> so that <font face="Courier New">root.value()</font> is the result of evaluating the input expression. &nbsp; From the perspective of building trees and walking them with a generated tree parser, it is best to consider every node as an identical AST node.&nbsp; Hence, the schism that exists between the hetero- and homogeneous AST camps.
375</p>
376<p>
377        ANTLR supports both kinds of tree nodes--at the same time!&nbsp; If you do nothing but turn on the &quot;<font face="Courier New">buildAST=true</font>&quot; option, you get a homogeneous tree.&nbsp; Later, if you want to use physically separate class types for some of the nodes, just specify that in the grammar that builds the tree.&nbsp; Then you can have the best of both worlds--the trees are built automatically, but you can apply different methods to and store different data in the various nodes.&nbsp; Note that the structure of the tree is unaffected; just the type of the nodes changes.
378</p>
379<p>
380        ANTLR applies a &quot;scoping&quot; sort of algorithm for determining the class type of a particular AST node that it needs to create.&nbsp; The default type is <font face="Courier New">CommonAST</font> unless, prior to parser invocation, you override that with a call to:
381</p>
382<pre>  <em>myParser</em>.setASTNodeType(&quot;<em>com.acme.MyAST</em>&quot;);</pre>
383<p>
384where you must use a fully qualified class name.
385<p>
386        In the grammar, you can override the default class type by setting the type for nodes created from a particular input token.&nbsp; Use the element option <font face="Courier New">&lt;AST=<em>typename</em>&gt;</font> in the <font face="Courier New">tokens</font> section:
387</p>
388<pre>tokens {
389    PLUS&lt;AST=PLUSNode&gt;;
390    ...
391}</pre>
392<p>
393        You may further override the class type by annotating a particular token reference in your parser grammar:
394</p>
395<pre>anInt : INT&lt;AST=INTNode&gt; ;</pre>
396<p>
397        This reference override is super useful for tokens such as <font face="Courier New">ID</font> that you might want converted to a <font face="Courier New">TYPENAME</font> node in one context and a <font face="Courier New">VARREF</font> in another context.
398</p>
399<p>
400        ANTLR uses the AST factory to create all AST nodes even if it knows the specific type. &nbsp; In other words, ANTLR generates code similar to the following:
401</p>
402<pre>ANode tmp1_AST = (ANode)astFactory.create(LT(1),"ANode");
403</pre>
404
405from
406
407<pre>a : A&lt;AST=ANode&gt; ;</pre>.
408
409<h3><a name="An Expression Tree Example"><font size="3">An Expression Tree Example</font></a></h3>
410<p>
411        <font size="3">This example includes a parser that constructs expression ASTs, the usual lexer, and some AST node class definitions.</font>
412</p>
413<p>
414        <font size="3">Let's start by describing the AST structure and node types. &nbsp; Expressions have plus and multiply operators and integers.&nbsp; The operators will be subtree roots (nonleaf nodes) and integers will be leaf nodes.&nbsp; For example, input 3+4*5+21 yields a tree with structure:</font>
415</p>
416<p>
417        (&nbsp; + (&nbsp; +&nbsp; 3 (&nbsp; *&nbsp; 4&nbsp; 5 ) )&nbsp; 21 )
418</p>
419<p>
420        or:
421</p>
422<pre>  +
423  |
424  +--21
425  |
426  3--*
427     |
428     4--5</pre>
429<p>
430        All AST nodes are subclasses of <font face="Courier New">CalcAST</font>, which are <font face="Courier New">BaseAST</font>'s that also answer method <font face="Courier New">value()</font>. &nbsp; Method <font face="Courier New">value()</font> evaluates the tree starting at that node.&nbsp; Naturally, for integer nodes, <font face="Courier New">value()</font> will simply return the value stored within that node.&nbsp; Here is <font face="Courier New">CalcAST:</font>
431</p>
432<pre>public abstract class CalcAST
433    extends antlr.BaseAST
434{
435    public abstract int value();
436}</pre>
437<p>
438        The AST operator nodes must combine the results of computing the value of their two subtrees.&nbsp; They must perform a depth-first walk of the tree below them.&nbsp; For fun and to make the operations more obvious, the operator nodes define left() and right() instead, making them appear even more different than the normal child-sibling tree representation.&nbsp; Consequently, these expression trees can be treated as both homogeneous child-sibling trees and heterogeneous expression trees.
439</p>
440<pre>public abstract class BinaryOperatorAST extends
441    CalcAST
442{
443    /** Make me look like a heterogeneous tree */
444    public CalcAST left() {
445        return (CalcAST)getFirstChild();
446    }
447
448    public CalcAST right() {
449        CalcAST t = left();
450        if ( t==null ) return null;
451        return (CalcAST)t.getNextSibling();
452    }
453}</pre>
454<p>
455        The simplest node in the tree looks like:
456</p>
457<pre>import antlr.BaseAST;
458import antlr.Token;
459import antlr.collections.AST;
460import java.io.*;
461
462/** A simple node to represent an INT */
463public class INTNode extends CalcAST {
464    int v=0;
465
466    public INTNode(Token tok) {
467        v = Integer.parseInt(tok.getText());
468    }
469
470    /** Compute value of subtree; this is
471     *  heterogeneous part :)
472     */
473    public int value() {
474        return v;
475    }
476
477    public String toString() {
478        return &quot; &quot;+v;
479    }
480
481    // satisfy abstract methods from BaseAST
482    public void initialize(int t, String txt) {
483    }
484    public void initialize(AST t) {
485    }
486    public void initialize(Token tok) {
487    }
488}</pre>
489<p>
490        The operators derive from <font face="Courier New">BinaryOperatorAST</font> and define <font face="Courier New">value()</font> in terms of <font face="Courier New">left()</font> and <font face="Courier New">right()</font>.&nbsp; For example, here is <font face="Courier New">PLUSNode</font>:
491</p>
492<pre>import antlr.BaseAST;
493import antlr.Token;
494import antlr.collections.AST;
495import java.io.*;
496
497/** A simple node to represent PLUS operation */
498public class PLUSNode extends BinaryOperatorAST {
499    public PLUSNode(Token tok) {
500    }
501
502    /** Compute value of subtree;
503     * this is heterogeneous part :)
504     */
505    public int value() {
506        return left().value() + right().value();
507    }
508
509    public String toString() {
510        return &quot; +&quot;;
511    }
512
513    // satisfy abstract methods from BaseAST
514    public void initialize(int t, String txt) {
515    }
516    public void initialize(AST t) {
517    }
518    public void initialize(Token tok) {
519    }
520}</pre>
521<p>
522        The parser is pretty straightforward except that you have to add the options to tell ANTLR what node types you want to create for which token matched on the input stream. &nbsp; The <font face="Courier New">tokens</font> section lists the operators with element option AST appended to their definitions.&nbsp; This tells ANTLR to build <font face="Courier New">PLUSNode</font> objects for any <font face="Courier New">PLUS</font> tokens seen on the input stream, for example.&nbsp; For demonstration purposes, <font face="Courier New">INT</font> is not included in the <font face="Courier New">tokens</font> section--the specific token references is suffixed with the element option to specify that nodes created from that <font face="Courier New">INT</font> should be of type <font face="Courier New">INTNode</font> (of course, the effect is the same as there is only that one reference to <font face="Courier New">INT</font>).
523</p>
524<pre>class CalcParser extends Parser;
525options {
526    buildAST = true; // uses CommonAST by default
527}
528
529// define a bunch of specific AST nodes to build.
530// can override at actual reference of tokens in
531// grammar below.
532tokens {
533    PLUS&lt;AST=PLUSNode&gt;;
534    STAR&lt;AST=MULTNode&gt;;
535}
536
537expr:   mexpr (PLUS^ mexpr)* SEMI!
538    ;
539
540mexpr
541    :   atom (STAR^ atom)*
542    ;
543
544// Demonstrate token reference option
545atom:   INT&lt;AST=INTNode&gt;
546    ;</pre>
547<p>
548        Invoking the parser is done as usual.&nbsp; Computing the value of the resulting AST is accomplished by simply calling method <font face="Courier New">value()</font> on the root.
549</p>
550<pre>import java.io.*;
551import antlr.CommonAST;
552import antlr.collections.AST;
553
554class Main {
555    public static void main(String[] args) {
556        try {
557            CalcLexer lexer =
558                new CalcLexer(
559                  new DataInputStream(System.in)
560                );
561            CalcParser parser =
562                new CalcParser(lexer);
563            // Parse the input expression
564            parser.expr();
565            CalcAST t = (CalcAST)parser.getAST();
566
567            System.out.println(t.toStringTree());
568
569            // Compute value and return
570            int r = t.value();
571            System.out.println(&quot;value is &quot;+r);
572        } catch(Exception e) {
573            System.err.println(&quot;exception: &quot;+e);
574            e.printStackTrace();
575        }
576    }
577}</pre>
578<p>
579        For completeness, here is the lexer:
580</p>
581<pre>class CalcLexer extends Lexer;
582
583WS  :   (' '
584    |   '\t'
585    |   '\n'
586    |   '\r')
587        { $setType(Token.SKIP); }
588    ;
589
590LPAREN: '(' ;
591
592RPAREN: ')' ;
593
594STAR:   '*' ;
595
596PLUS:   '+' ;
597
598SEMI:   ';' ;
599
600protected
601DIGIT
602    :   '0'..'9' ;
603
604INT :   (DIGIT)+ ;</pre> <h3><a name="Describing Heterogeneous Trees With Grammars">Describing Heterogeneous Trees With Grammars</a></h3>
605<p>
606        So what's the difference between this approach and default homogeneous tree construction?&nbsp; The big difference is that you need a tree grammar to describe the expression tree and compute resulting values.&nbsp; But, that's a good thing as it's &quot;executable documentation&quot; and negates the need to handcode the tree parser (the <font face="Courier New">value()</font> methods).&nbsp; If you used homogeneous trees, here is all you would need beyond the parser/lexer to evaluate the expressions:&nbsp; [<em>This code comes from the <font face="Courier New">examples/java/calc</font> directory</em>.]
607</p>
608<pre>class CalcTreeWalker extends TreeParser;
609
610expr returns [float r]
611{
612    float a,b;
613    r=0;
614}
615    :   #(PLUS a=expr b=expr)   {r = a+b;}
616    |   #(STAR a=expr b=expr)   {r = a*b;}
617    |   i:INT
618        {r = (float)
619         Integer.parseInt(i.getText());}
620    ;</pre>
621<p>
622        Because Terence wants you to use tree grammars even when constructing heterogeneous ASTs (to avoid handcoding methods that implement a depth-first-search), implement the following methods in your various heterogeneous AST node class definitions:
623</p>
624<pre>    /** Get the token text for this node */
625    public String getText();
626    /** Get the token type for this node */
627    public int getType();</pre>
628<p>
629        That is how you can use heterogeneous trees with a tree grammar.&nbsp; Note that your token types must match the <font face="Courier New">PLUS</font> and <font face="Courier New">STAR</font> token types imported from your parser.&nbsp; I.e., make sure <font face="Courier New">PLUSNode.getType()</font> returns <font face="Courier New">CalcParserTokenTypes.PLUS</font>. &nbsp; The token types are generated by ANTLR in interface files that look like:
630</p>
631<pre>public interface CalcParserTokenTypes {
632        ...
633        int PLUS = 4;
634        int STAR = 5;
635        ...
636}</pre> <h2><a name="AST Serialization">AST (XML) Serialization</a></h2>
637<p>
638        [<font size="2">Oliver Zeigermann <a href="mailto:olli@zeigermann.de">olli@zeigermann.de</a> provided the initial implementation of this serialization.&nbsp; His <a href="http://www.zeigermann.de/xtal.html">XTAL</a> XML translation code is worth checking out; particularly for reading XML-serialized ASTs back in.]</font>
639</p>
640<p>
641        For a variety of reasons, you may want to store an AST or pass it to another program or computer.&nbsp; Class antlr.BaseAST is Serializable using the Java code generator, which means you can write ASTs to the disk using the standard Java stuff.&nbsp; You can also write the ASTs out in XML form using the following methods from <font face="Courier New">BaseAST</font>:
642<ul>
643        <li>
644                <font face="Courier New">public void xmlSerialize(Writer out)</font>
645        </li>
646        <li>
647                <font face="Courier New">public void xmlSerializeNode(Writer out)</font>
648        </li>
649        <li>
650                <font face="Courier New">public void xmlSerializeRootOpen(Writer out)</font>
651        </li>
652        <li>
653                <font face="Courier New">public void xmlSerializeRootClose(Writer out)</font>
654        </li>
655</ul>
656<p>
657        All methods throw <font face="Courier New">IOException</font>.
658</p>
659<p>
660        You can override <font face="Courier New">xmlSerializeNode</font> and so on to change the way nodes are written out.&nbsp; By default the serialization uses the class type name as the tag name and has attributes <font face="Courier New">text</font> and <font face="Courier New">type</font> to store the text and token type of the node.
661</p>
662<p>
663        The output from running the simple heterogeneous tree example, examples/java/heteroAST, yields:
664</p>
665<pre> (  + (  +  3 (  *  4  5 ) )  21 )
666&lt;PLUS&gt;&lt;PLUS&gt;&lt;int&gt;3&lt;/int&gt;&lt;MULT&gt;
667&lt;int&gt;4&lt;/int&gt;&lt;int&gt;5&lt;/int&gt;
668&lt;/MULT&gt;&lt;/PLUS&gt;&lt;int&gt;21&lt;/int&gt;&lt;/PLUS&gt;
669value is 44</pre>
670<p>
671        The LISP-form of the tree shows the structure and contents.&nbsp; The various heterogeneous nodes override the open and close tags and change the way leaf nodes are serialized to use <font face="Courier New">&lt;int&gt;<em>value</em>&lt;/int&gt;</font> instead of tag attributes of a single node.
672</p>
673<p>
674        Here is the code that generates the XML:
675</p>
676<pre>Writer w = new OutputStreamWriter(System.out);
677t.xmlSerialize(w);
678w.write(&quot;\n&quot;);
679w.flush();</pre> <h2><a name="_bb12">AST enumerations</a></h2>
680<p>
681        The AST <tt>findAll</tt> and <tt>findAllPartial</tt> methods return enumerations of tree nodes that you can walk.&nbsp; Interface
682</p>
683<pre>antlr.collections.ASTEnumeration</pre>
684<p>
685        and
686</p>
687<pre>class antlr.Collections.impl.ASTEnumerator</pre>
688<p>
689        implement this functionality.&nbsp; Here is an example:
690</p>
691<pre>// Print out all instances of
692// <em>a-subtree-of-interest
693// </em>found within tree 't'.
694ASTEnumeration enum;
695enum = t.findAll(<em>a-subtree-of-interest</em>);
696while ( enum.hasMoreNodes() ) {
697  System.out.println(
698    enum.nextNode().toStringList()
699  );
700}</pre> <h2><a name="_bb13"></a><a name="A few examples">A few examples</a></h2> <pre><tt>
701sum :term ( PLUS^ term)*
702    ;</tt> </pre>
703<p>
704        The &quot;<tt>^</tt>&quot; suffix on the <tt>PLUS</tt> tells ANTLR to create an additional node and place it as the root of whatever subtree has been constructed up until that point for rule <tt>sum</tt>. The subtrees returned by the <tt>term</tt> references are collected as children of the addition nodes.&nbsp; If the subrule is not matched, the associated nodes would not be added to the tree. The rule returns either the tree matched for the first <tt>term</tt> reference or a <tt>PLUS</tt>-rooted tree.
705</p>
706<p>
707        The grammar annotations should be viewed as operators, not static specifications. In the above example, each iteration of the (...)* will create a new PLUS root, with the previous tree on the left, and the tree from the new <tt>term</tt> on the right, thus preserving the usual associatively for &quot;+&quot;.
708</p>
709<p>
710        Look at the following rule that turns off default tree construction.
711</p>
712<pre><tt>decl!:
713    modifiers type ID SEMI;
714        { #decl = #([DECL], ID, ([TYPE] type),
715                    ([MOD] modifiers) ); }
716    ;</tt></pre>
717<p>
718        In this example, a declaration is matched. The resulting AST has an &quot;imaginary&quot; <tt>DECL</tt> node at the root, with three children. The first child is the <tt>ID</tt> of the declaration. The second child is a subtree with an imaginary <tt>TYPE</tt> node at the root and the AST from the <tt>type</tt> rule as its child. The third child is a subtree with an imaginary <tt>MOD</tt> at the root and the results of the <tt>modifiers</tt> rule as its child.
719</p>
720<h2><a name="_bb14"></a><a name="Labeled subrules">Labeled subrules</a></h2>
721<p>
722        [<big><i>THIS WILL NOT BE IMPLEMENTED AS LABELED SUBRULES...We'll do something else eventually.</i></big>]
723</p>
724<p>
725        In 2.00 ANTLR, each rule has exactly one tree associated with it. Subrules simply add elements to the tree for the enclosing rule, which is normally what you want. For example, expression trees are easily built via:
726</p>
727<pre><tt>
728expr: ID ( PLUS^ ID )*
729    ;
730</tt>    </pre>
731<p>
732        However, many times you want the elements of a subrule to produce a tree that is independent of the rule's tree. Recall that exponents must be computed before coefficients are multiplied in for exponent terms. The following grammar matches the correct syntax.
733</p>
734<pre><tt>
735// match exponent terms such as &quot;3*x^4&quot;
736eterm
737    :   expr MULT ID EXPONENT expr
738    ;
739</tt>    </pre>
740<p>
741        However, to produce the correct AST, you would normally split the <tt>ID EXPONENT expr</tt> portion into another rule like this:
742</p>
743<pre><tt>
744eterm:
745    expr MULT^ exp
746    ;
747
748exp:
749        ID EXPONENT^ expr
750    ;
751</tt>    </pre>
752<p>
753        In this manner, each operator would be the root of the appropriate subrule. For input <tt>3*x^4</tt>, the tree would look like:
754</p>
755<pre><tt>
756#(MULT 3 #(EXPONENT ID 4))
757</tt>    </pre>
758<p>
759        However, if you attempted to keep this grammar in the same rule:
760</p>
761<pre><tt>
762eterm
763    :   expr MULT^ (ID EXPONENT^ expr)
764    ;
765</tt>    </pre>
766<p>
767        both &quot;<tt>^</tt>&quot; root operators would modify the same tree yielding
768</p>
769<pre><tt>
770#(EXPONENT #(MULT 3 ID) 4)
771</tt>    </pre>
772<p>
773        This tree has the operators as roots, but they are associated with the wrong operands.
774</p>
775<p>
776        Using a labeled subrule allows the original rule to generate the correct tree.
777</p>
778<pre><tt>
779eterm
780    :   expr MULT^ e:(ID EXPONENT^ expr)
781    ;
782</tt>    </pre>
783<p>
784        In this case, for the same input <tt>3*x^4</tt>, the labeled subrule would build up its own subtree and make it the operand of the <tt>MULT</tt> tree of the <tt>eterm</tt> rule. The presence of the label alters the AST code generation for the elements within the subrule, making it operate more like a normal rule. Annotations of &quot;<tt>^</tt>&quot; make the node created for that token reference the root of the tree for the <tt>e</tt> subrule.
785</p>
786<p>
787        Labeled subrules have a result AST that can be accessed just like the result AST for a rule. For example, we could rewrite the above decl example using labeled subrules (note the use of <tt>!</tt> at the start of the subrules to suppress automatic construction for the subrule):
788</p>
789<pre><tt>
790decl!:
791    m:(! modifiers { #m = #([MOD] modifiers); } )
792    t:(! type { #t = #([TYPE] type); } )
793    ID
794    SEMI;
795    { #decl = #( [DECL] ID t m ); }
796    ;
797</tt>    </pre>
798<p>
799        What about subrules that are closure loops? The same rules apply to a closure subrule--there is a single tree for that loop that is built up according to the AST operators annotating the elements of that loop. For example, consider the following rule.
800</p>
801<pre><tt>
802term:   T^ i:(OP^ expr)+
803    ;
804</tt>    </pre>
805<p>
806        For input <tt>T OP A OP B OP C</tt>, the following tree structure would be created:
807</p>
808<pre><tt>
809#(T #(OP #(OP #(OP A) B) C) )
810</tt>    </pre>
811<p>
812        which can be drawn graphically as
813</p>
814<pre><tt>
815T
816|
817OP
818|
819OP--C
820|
821OP--B
822|
823A
824</tt>    </pre>
825<p>
826        The first important thing to note is that each iteration of the loop in the subrule operates on the same tree. The resulting tree, after all iterations of the loop, is associated with the subrule label. The result tree for the above labeled subrule is:
827</p>
828<pre><tt>
829#(OP #(OP #(OP A) B) C)
830</tt>    </pre>
831<p>
832        The second thing to note is that, because <tt>T</tt> is matched first and there is a root operator after it in the rule, <tt>T</tt> would be at the bottom of the tree if it were not for the label on the subrule.
833</p>
834<p>
835        Loops will generally be used to build up lists of subtree. For example, if you want a list of polynomial assignments to produce a sibling list of <tt>ASSIGN</tt> subtrees, then the following rule you would normally split into two rules.
836</p>
837<pre><tt>
838interp
839    :   ( ID ASSIGN poly &quot;;&quot; )+
840    ;
841</tt>    </pre>
842<p>
843        Normally, the following would be required
844</p>
845<pre><tt>
846interp
847    :   ( assign )+
848    ;
849assign
850    :   ID ASSIGN^ poly &quot;;&quot;!
851    ;
852</tt>    </pre>
853<p>
854        Labeling a subrule allows you to write the above example more easily as:
855</p>
856<pre><tt>
857interp
858    :   ( r:(ID ASSIGN^ poly &quot;;&quot;) )+
859    ;
860</tt>    </pre>
861<p>
862        Each recognition of a subrule results in a tree and if the subrule is nested in a loop, all trees are returned as a list of trees (i.e., the roots of the subtrees are siblings). If the labeled subrule is suffixed with a &quot;<tt>!</tt>&quot;, then the tree(s) created by the subrule are not linked into the tree for the enclosing rule or subrule.
863</p>
864<p>
865        Labeled subrules within labeled subrules result in trees that are linked into the surrounding subrule's tree. For example, the following rule results in a tree of the form <tt>X #( A #(B C) D) Y</tt>.
866</p>
867<pre><tt>
868a   :   X r:( A^ s:(B^ C) D) Y
869    ;
870</tt>    </pre>
871<p>
872        Labeled subrules within nonlabeled subrules result in trees that are linked into the surrounding rule's tree. For example, the following rule results in a tree of the form <tt>#(A X #(B C) D Y)</tt>.
873</p>
874<pre><tt>
875a   :   X ( A^ s:(B^ C) D) Y
876    ;</tt>    </pre> <h2><a name="_bb15"></a><a name="Reference nodes">Reference nodes</a></h2>
877<p>
878        <b>Not implemented.</b> A node that does nothing but refer to another node in the tree. Nice for embedding the same tree in multiple lists.
879</p>
880<h2><a name="_bb16"></a><a name="Required AST functionality and form">Required AST functionality and form</a></h2>
881<p>
882        The data structure representing your trees can have any form or type name as long as they implement the <tt>AST</tt> interface:
883</p>
884<pre><tt>package antlr.collections;
885
886/** Minimal AST node interface used by ANTLR
887 *  AST generation and tree-walker.
888 */
889public interface AST {
890    /** Get the token type for this node */
891    public int getType();
892
893    /** Set the token type for this node */
894    public void setType(int ttype);
895
896    /** Get the token text for this node */
897    public String getText();
898
899    /** Set the token text for this node */
900    public void setText(String text);
901
902    /** Get the first child of this node;
903     *  null if no children */
904    public AST getFirstChild();
905
906    /** Set the first child of a node */
907    public void setFirstChild(AST c);
908
909    /** Get the next sibling in line after this
910     * one
911     */
912    public AST getNextSibling();
913
914    /** Set the next sibling after this one */
915    public void setNextSibling(AST n);
916
917    /** Add a (rightmost) child to this node */
918    public void addChild(AST node);</tt></pre> <pre>    /** Are two nodes exactly equal? */
919    public boolean equals(AST t);</pre> <pre>    /** Are two lists of nodes/subtrees exactly
920     *  equal in structure and content? */
921    public boolean equalsList(AST t);</pre> <pre>    /** Are two lists of nodes/subtrees
922     *  partially equal? In other words, 'this'
923     *  can be bigger than 't'
924     */
925    public boolean equalsListPartial(AST t);</pre> <pre>    /** Are two nodes/subtrees exactly equal? */
926    public boolean equalsTree(AST t);</pre> <pre>    /** Are two nodes/subtrees exactly partially
927     *  equal? In other words, 'this' can be
928     *  bigger than 't'.
929     */
930    public boolean equalsTreePartial(AST t);</pre> <pre>    /** Return an enumeration of all exact tree
931     * matches for tree within 'this'.
932     */
933    public ASTEnumeration findAll(AST tree);</pre> <pre>    /** Return an enumeration of all partial
934     *  tree matches for tree within 'this'.
935     */
936    public ASTEnumeration findAllPartial(
937        AST subtree);</pre> <pre>    /** Init a node with token type and text */
938    public void initialize(int t, String txt);</pre> <pre>    /** Init a node using content from 't' */
939    public void initialize(AST t);</pre> <pre>    /** Init a node using content from 't' */
940    public void initialize(Token t);</pre> <pre>    /** Convert node to printable form */
941    public String toString();</pre> <pre>    /** Treat 'this' as list (i.e.,
942     *  consider 'this'
943     *  siblings) and convert to printable
944     *  form
945     */
946    public String toStringList();</pre> <pre>    /** Treat 'this' as tree root
947     *  (i.e., don't consider
948     *  'this' siblings) and convert
949     *   to printable form */
950    public String toStringTree();<tt>
951}</tt></pre>
952<p>
953        This scheme does not preclude the use of heterogeneous trees versus homogeneous trees. However, you will need to write extra code to create heterogeneous trees (via a subclass of <tt>ASTFactory</tt>) or by specifying the node types at the token reference sites or in the <font face="Courier New">tokens</font> section, whereas the homogeneous trees are free.
954</p>
955<pre><font face="Arial" size="2">Version: $Id: //depot/code/org.antlr/release/antlr-2.7.7/doc/trees.html#2 $</font></pre>
956</body>
957</html>
Note: See TracBrowser for help on using the repository browser.