Context Navigation

trees.html

Last change on this file was 1, checked in by lnalod, 15 years ago
Initial import of YAO sources
File size: 40.1 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
2	<html>
3	<head>
4	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
5	<title>ANTLR Tree Construction</title>
6	</head>
7	<body bgcolor="#FFFFFF" text="#000000">
8	<h1><a name="_bb1">ANTLR Tree Construction</a></h1>
9	<p>
10	ANTLR helps you build intermediate form trees, or abstract syntax trees
11	(ASTs), by providing grammar annotations that indicate what tokens are to
12	be treated as subtree roots, which are to be leaves, and which are to be
13	ignored with respect to tree construction.  As with PCCTS 1.33, you
14	may manipulate trees using tree grammar actions.
15	</p>
16
17	<p>
18	It is often the case that programmers either have existing
19	tree definitions or need a special physical structure, thus,
20	preventing ANTLR from specifically defining the implementation
21	of AST nodes. ANTLR specifies only an interface describing
22	minimum behavior. Your tree implementation must implement this
23	interface so ANTLR knows how to work with your trees. Further,
24	you must tell the parser the name of your tree nodes or
25	provide a tree "factory" so that ANTLR knows how to
26	create nodes with the correct type (rather than hardcoding in
27	a <tt>new AST()</tt> expression everywhere).   ANTLR can
28	construct and walk any tree that satisfies the AST
29	interface.  A number of common tree definitions are
30	provided. Unfortunately, ANTLR cannot parse XML DOM trees since our
31	method names conflict (e.g., <tt>getFirstChild()</tt>); ANTLR was here
32	first <wink>. Argh!
33	</p>
34
35	<h2><a name="_bb2"></a><a name="Notation">Notation</a></h2>
36	<p>
37	In this and other documents, tree structures are represented by a
38	LISP-like notation, for example:
39	</p>
40	<pre><tt>#(A B C)</tt></pre>
41	<p>
42	is a tree with A at the root, and children B and C. This notation can be
43	nested to describe trees of arbitrary structure, for example:
44	</p>
45	<pre><tt>#(A B #(C D E))</tt></pre>
46	<p>
47	is a tree with A at the root, B as a first child, and an entire subtree
48	as the second child. The subtree, in turn, has C at the root and D,E as
49	children.
50	</p>
51	<h2><a name="_bb3"></a><a name="Controlling AST construction">Controlling AST construction</a></h2>
52	<p>
53	AST construction in an ANTLR Parser, or AST transformation in a
54	Tree-Parser, is turned on and off by the <a href="options.html#buildAST">
55	<tt>buildAST</tt> option</a>.
56	</p>
57	<p>
58	From an AST construction and walking point of view, ANTLR considers all
59	tree nodes to look the same (i.e., they appear to be homogeneous).
60	Through a tree factory or by specification, however, you can instruct ANTLR
61	to create nodes of different types.   See the section below on
62	heterogeneous trees.
63	</p>
64	<h2><a name="_bb4"></a><a name="Grammar annotations for building ASTs">Grammar annotations for building ASTs</a></h2> <h3><a name="_bb5"></a><a name="Leaf nodes">Leaf nodes</a></h3>
65	<p>
66	ANTLR assumes that any nonsuffixed token reference or token-range is a
67	leaf node in the resulting tree for the enclosing rule. If no suffixes at
68	all are specified in a grammar, then a Parser will construct a linked-list
69	of the tokens (a degenerate AST), and a Tree-Parser will copy the input
70	AST.
71	</p>
72	<h3><a name="_bb6"></a><a name="Root nodes">Root nodes</a></h3>
73	<p>
74	Any token suffixed with the "<tt>^</tt>" operator is
75	considered a root token. A tree node is constructed for that token and is
76	made the root of whatever portion of the tree has been built
77	</p>
78	<pre><tt>a : A B^ C^ ;</tt></pre>
79	<p>
80	results in tree <tt>#(C #(B A))</tt>.
81	</p>
82	<p>
83	First A is matched and made a lonely child, followed by B which is made the parent of the current tree, A. Finally, C is matched and made the parent of the current tree, making it the parent of the B node. Note that the same rule without any operators results in the flat tree <tt>A B C</tt>.
84	</p>
85	<h3><a name="_bb7"></a><a name="Turning off standard tree construction">Turning off standard tree construction</a></h3>
86	<p>
87	Suffix a token reference with "<tt>!</tt>" to prevent
88	incorporation of the node for that token into the resulting tree (the AST
89	node for the token is still constructed and may be referenced in actions,
90	it is just not added to the result tree automatically). Suffix a rule
91	reference "<tt>!</tt>" to indicate that the tree constructed by
92	the invoked rule should not be linked into the tree constructed for the
93	current rule.
94	</p>
95	<p>
96	Suffix a rule definition with "<tt>!</tt>" to indicate that
97	tree construction for the rule is to be turned off. Rules and tokens
98	referenced within that rule still create ASTs, but they are not linked into
99	a result tree. The following rule does no automatic tree construction.
100	Actions must be used to set the return AST value, for example:
101	</p>
102	<pre><tt>begin!
103	: INT PLUS i:INT
104	{ #begin = #(PLUS INT i); }
105	;</tt></pre>
106	<p>
107	For finer granularity, prefix alternatives with "<tt>!</tt>"
108	to shut off tree construction for that alternative only. This granularity
109	is useful, for example, if you have a large number of alternatives and you
110	only want one to have manual tree construction:
111	</p>
112	<pre><tt>stat:
113	ID EQUALS^ expr // auto construction
114	... some alternatives ...
115	\|! RETURN expr
116	{#stat = #([IMAGINARY_TOKEN_TYPE] expr);}
117	... more alternatives ...
118	;</tt> </pre> <h3><a name="_bb8"></a><a name="Tree and tree node construction">Tree node construction</a></h3>
119	<p>
120	With automatic tree construction off (but with <code>buildAST</code>
121	on), you must construct your own tree nodes and combine them into tree
122	structures within embedded actions. There are several ways to create a tree
123	node in an action:
124	<ol>
125	<li>
126	use <tt>new <i>T</i>(<i>arg</i>)</tt> where <i>T</i> is your tree
127	node type and <i>arg</i> is either a single token type, a token type and
128	token text, or a <tt>Token</tt>.
129	</li>
130	<li>
131	use <tt>ASTFactory.create(<i>arg</i>)</tt> where <i>T</i> is your
132	tree node type and <i>arg</i> is either a single token type, a token type
133	and token text, or a <tt>Token</tt>. Using the factory is more
134	general than creating a new node directly, as it defers the node-type
135	decision to the factory, and can be easily changed for the entire
136	grammar.
137	</li>
138	<li>
139	use the shorthand notation #[TYPE] or #[TYPE,"text"] or
140	#[TYPE,"text",ASTClassNameToConstruct]. The shorthand notation
141	results in a call to ASTFactory.create() with any specified arguments.
142	</li>
143	<li>
144	use the shorthand notation #<i>id</i>, where <i>id</i> is either a
145	token matched in the rule, a label, or a rule-reference.
146	</li>
147	</ol>
148	<p>
149	To construct a tree structure from a set of nodes, you can set the
150	first-child and next-sibling references yourself or call the factory
151	<tt>make</tt> method or use <tt>#(...)</tt> notation described below.
152	</p>
153	<h3><a name="_bb9"></a><a name="ActionTranslation">AST Action Translation</a></h3>
154	<p>
155	In parsers and tree parsers with <tt>buildAST</tt> set to true, ANTLR
156	will translate portions of user actions in order to make it easier to build
157	ASTs within actions. In particular, the following constructs starting with
158	'#' will be translated:
159	<dl>
160	<dt>
161	<tt>#<i>label</i></tt>
162	</dt>
163	<dd>
164	The AST associated with a labeled token-reference or rule-reference
165	may be accessed as <tt>#<i>label</i></tt>. The translation is to a
166	variable containing the AST node built from that token, or the AST
167	returned from the rule.
168	</dd>
169	<dt>
170	<tt>#<i>rule</i></tt>
171	</dt>
172	<dd>
173	When <i>rule</i> is the name of the enclosing rule, ANTLR will
174	translate this into the variable containing the result AST for the rule.
175	This allows you to set the return AST for a rule or examine it from
176	within an action. This can be used when AST generation is on or
177	suppressed for the rule or alternate. For example:
178	<pre><tt>r! : a:A { #r = #a; }</tt></pre>
179
180	</dd>
181	<dd>
182	<font face="Times New Roman">Setting the return tree is very useful
183	in combination with normal tree construction because you can have
184	ANTLR do all the work of building a tree and then add an imaginary
185	root node such as:</font>
186	</dd>
187	<dd>
188
189	</dd>
190	<dd>
191	<pre><tt>decl : ( TYPE ID )+
192	{ #decl = #([DECL,"decl"], #decl); }
193	;</tt></pre>
194	</dd>
195	<dd>
196	ANTLR allows you to assign to <tt>#rule</tt> anywhere within an
197	alternative of the rule. ANTLR ensures that references of and
198	assignments to <tt>#rule</tt> within an action force the parser's
199	internal AST construction variables into a stable state. After you
200	assign to <tt>#rule</tt>, the state of the parser's automatic AST
201	construction variables will be set as if ANTLR had generated the tree
202	rooted at <tt>#rule</tt>. For example, any children nodes added after
203	the action will be added to the children of <tt>#rule</tt>.
204	</dd>
205	<dt>
206	<tt>#<i>label</i>_in</tt>
207	</dt>
208	<dd>
209	In a tree parser, the <b>input</b> AST associated with a labeled
210	token reference or rule reference may be accessed as
211	<tt>#<i>label</i>_in</tt>. The translation is to a variable containing the
212	input-tree AST node from which the rule or token was extracted. Input
213	variables are seldom used. You almost always want to use
214	<tt>#<i>label</i></tt> instead of <tt>#<i>label</i>_in</tt>.
215	</dd>
216	<dt>
217
218	</dt>
219	<dt>
220	<tt>#<i>id</i></tt>
221	</dt>
222	<dd>
223	ANTLR supports the translation of unlabeled token references as a
224	shorthand notation, as long as the token is unique within the scope
225	of a single alternative. In these cases, the use of an unlabeled
226	token reference identical to using a label. For example, this:
227	<pre><tt>
228	r! : A { #r = #A; }
229	</tt></pre>
230	<p>
231	is equivalent to:
232	</p>
233	<pre><tt>
234	r! : a:A { #r = #a; }</tt></pre>
235	</dd>
236	<dd>
237	<tt>#<i>id</i>_in</tt> is given similar treatment to
238	<tt>#<i>label</i>_in.</tt>
239	</dd>
240	<dt>
241
242	</dt>
243	<dt>
244	<tt>#[<i>TOKEN_TYPE</i>]</tt> or <tt>#[<i>TOKEN_TYPE</i>,"text"] or #[TYPE,"text",ASTClassNameToConstruct]</tt>
245	</dt>
246	<dd>
247	AST node constructor shorthand. The translation is a call to the <tt>ASTFactory.create()</tt> method.  For example, <tt>#[T]</tt> is translated to: <pre><tt>ASFFactory.create(T)</tt></pre>
248	</dd>
249	<dt>
250	<tt>#(<i>root</i>, <i>c1</i>, ..., <i>cn</i>)</tt>
251	</dt>
252	<dd>
253	AST tree construction shorthand. ANTLR looks for the comma character
254	to separate the tree arguments. Commas within method call tree elements are
255	handled properly; i.e., an element of "<tt>foo(#a,34)</tt>" is ok
256	and will not conflict with the comma separator between the other tree
257	elements in the tree. This tree construct is translated to a "make
258	tree" call. The "make-tree" call is complex due to the need
259	to simulate variable arguments in languages like Java, but the result will
260	be something like: <pre><tt>ASTFactory.make(<i>root</i>, <i>c1</i>, ...,
261	<i>cn</i>);</tt></pre>
262	<p>
263	In addition to the translation of the <tt>#(...)</tt> as a whole,
264	the root and each child <tt><i>c1</i>..<i>cn</i></tt> will be translated.
265	Within the context of a <tt>#(...)</tt> construct, you may use:
266	<ul>
267	<li>
268	<i><tt>id</tt></i> or <i><tt>label</tt></i> as a shorthand for
269	<tt>#<i>id</i></tt> or <i><tt>#label</tt></i>.
270	</li>
271	<li>
272	<tt>[...]</tt> as a shorthand for <tt>#[...]</tt>.
273	</li>
274	<li>
275	<tt>(...)</tt> as a shorthand for <tt>#(...)</tt>.
276	</li>
277	</ul>
278	</dd>
279	</dl>
280	<p>
281	The target code generator performs this translation with the help of a
282	special lexer that parses the actions and asks the code-generator to create
283	appropriate substitutions for each translated item. This lexer might impose
284	some restrictions on label names (think of C/C++ preprocessor directives)
285	</p>
286	<h2><a name="_bb10"></a><a name="Invoking parsers that build trees">Invoking parsers that build trees</a></h2>
287	<p>
288	Assuming that you have defined a lexer <tt>L</tt> and a parser <tt>P</tt> in your grammar, you can invoke them sequentially on the system input stream as follows.
289	</p>
290	<pre><tt><i>L</i> lexer = new <i>L</i>(System.in);
291	<i>P</i> parser = new <i>P</i>(lexer);
292	parser.setASTNodeType("MyAST");
293	parser.<i>startRule</i>();</tt> </pre>
294	<p>
295	If you have set <tt>buildAST=true</tt> in your parser grammar, then it will build an AST, which can be accessed via <tt>parser.getAST()</tt>. If you have defined a tree parser called <tt>T</tt>, you can invoke it with:
296	</p>
297	<pre><tt>T walker = new T();
298	walker.<i>startRule</i>(parser.getAST()); // walk tree</tt> </pre>
299	<p>
300	If, in addition, you have set <tt>buildAST=true</tt> in your tree-parser to turn on transform mode, then you can access the resulting AST of the tree-walker:
301	</p>
302	<pre><tt>AST results = walker.getAST();
303	DumpASTVisitor visitor = new DumpASTVisitor();
304	visitor.visit(results);</tt></pre>
305	<p>
306	Where <tt>DumpASTVisitor</tt> is a predefined <tt>ASTVisitor</tt> implementation that simply prints the tree to the standard output.
307	</p>
308	<p>
309	You can also use get a LISP-like print out of a tree via
310	</p>
311	<pre>String s = parser.getAST().toStringList();</pre> <h2><a name="_bb11"></a><a name="AST Factories">AST Factories</a></h2>
312	<p>
313	ANTLR uses a factory pattern to create and connect AST nodes. This is done to primarily to separate out the tree construction facility from the parser, but also gives you a hook in between the parser and the tree node construction.  Subclass <tt>ASTFactory</tt> to alter the <tt>create</tt> methods.
314	</p>
315	<p>
316	If you are only interested in specifying the AST node type at runtime, use the
317	</p>
318	<pre><tt>setASTNodeType(String className)</tt></pre>
319	<p>
320	method on the parser or factory.  By default, trees are constructed of nodes of type <tt>antlr.CommonAST</tt>. (You must use the fully-qualified class name).
321	</p>
322
323	<p>
324	You can also specify a different class name for each token type to generate heterogeneous trees:
325
326	<pre>
327	/** Specify an "override" for the Java AST object created for a
328	* specific token. It is provided as a convenience so
329	* you can specify node types dynamically. ANTLR sets
330	* the token type mapping automatically from the tokens{...}
331	* section, but you can change that mapping with this method.
332	* ANTLR does it's best to statically determine the node
333	* type for generating parsers, but it cannot deal with
334	* dynamic values like #[LT(1)]. In this case, it relies
335	* on the mapping. Beware differences in the tokens{...}
336	* section and what you set via this method. Make sure
337	* they are the same.
338	*
339	* Set className to null to remove the mapping.
340	*
341	* @since 2.7.2
342	*/
343	public void setTokenTypeASTNodeType(int tokenType, String className)
344	throws IllegalArgumentException;
345	</pre>
346
347	<p>
348	The ASTFactory has some generically useful methods:
349	</p>
350	<pre>
351	/** Copy a single node with same Java AST objec type.
352	* Ignore the tokenType->Class mapping since you know
353	* the type of the node, t.getClass(), and doing a dup.
354	*
355	* clone() is not used because we want all AST creation
356	* to go thru the factory so creation can be
357	* tracked. Returns null if t is null.
358	*/
359	public AST dup(AST t);</pre>
360	<pre>
361	/** Duplicate tree including siblings
362	* of root.
363	*/
364	public AST dupList(AST t);</pre> <pre>/**Duplicate a tree, assuming this is a
365	* root node of a tree--duplicate that node
366	* and what's below; ignore siblings of root
367	* node.
368	*/
369	public AST dupTree(AST t);</pre> <h2><a name="Heterogeneous ASTs">Heterogeneous ASTs</a></h2>
370	<p>
371	Each node in an AST must encode information about the kind of node it is; for example, is it an ADD operator or a leaf node such as an INT?  There are two ways to encode this: with a token type or with a Java (or C++ etc...) class type.  In other words, do you have a single class type with numerous token types or no token types and numerous classes?  For lack of better terms, I (Terence) have been calling ASTs with a single class type <em>homogeneous</em> trees and ASTs with many class types <em>heterogeneous</em> trees.
372	</p>
373	<p>
374	The only reason to have a different class type for the various kinds of nodes is for the case where you want to execute a bunch of hand-coded tree walks or your nodes store radically different kinds of data.  The example I use below demonstrates an expression tree where each node overrides <font face="Courier New">value()</font> so that <font face="Courier New">root.value()</font> is the result of evaluating the input expression.   From the perspective of building trees and walking them with a generated tree parser, it is best to consider every node as an identical AST node.  Hence, the schism that exists between the hetero- and homogeneous AST camps.
375	</p>
376	<p>
377	ANTLR supports both kinds of tree nodes--at the same time!  If you do nothing but turn on the "<font face="Courier New">buildAST=true</font>" option, you get a homogeneous tree.  Later, if you want to use physically separate class types for some of the nodes, just specify that in the grammar that builds the tree.  Then you can have the best of both worlds--the trees are built automatically, but you can apply different methods to and store different data in the various nodes.  Note that the structure of the tree is unaffected; just the type of the nodes changes.
378	</p>
379	<p>
380	ANTLR applies a "scoping" sort of algorithm for determining the class type of a particular AST node that it needs to create.  The default type is <font face="Courier New">CommonAST</font> unless, prior to parser invocation, you override that with a call to:
381	</p>
382	<pre> <em>myParser</em>.setASTNodeType("<em>com.acme.MyAST</em>");</pre>
383	<p>
384	where you must use a fully qualified class name.
385	<p>
386	In the grammar, you can override the default class type by setting the type for nodes created from a particular input token.  Use the element option <font face="Courier New"><AST=<em>typename</em>></font> in the <font face="Courier New">tokens</font> section:
387	</p>
388	<pre>tokens {
389	PLUS<AST=PLUSNode>;
390	...
391	}</pre>
392	<p>
393	You may further override the class type by annotating a particular token reference in your parser grammar:
394	</p>
395	<pre>anInt : INT<AST=INTNode> ;</pre>
396	<p>
397	This reference override is super useful for tokens such as <font face="Courier New">ID</font> that you might want converted to a <font face="Courier New">TYPENAME</font> node in one context and a <font face="Courier New">VARREF</font> in another context.
398	</p>
399	<p>
400	ANTLR uses the AST factory to create all AST nodes even if it knows the specific type.   In other words, ANTLR generates code similar to the following:
401	</p>
402	<pre>ANode tmp1_AST = (ANode)astFactory.create(LT(1),"ANode");
403	</pre>
404
405	from
406
407	<pre>a : A<AST=ANode> ;</pre>.
408
409	<h3><a name="An Expression Tree Example"><font size="3">An Expression Tree Example</font></a></h3>
410	<p>
411	<font size="3">This example includes a parser that constructs expression ASTs, the usual lexer, and some AST node class definitions.</font>
412	</p>
413	<p>
414	<font size="3">Let's start by describing the AST structure and node types.   Expressions have plus and multiply operators and integers.  The operators will be subtree roots (nonleaf nodes) and integers will be leaf nodes.  For example, input 3+4*5+21 yields a tree with structure:</font>
415	</p>
416	<p>
417	(  + (  +  3 (  *  4  5 ) )  21 )
418	</p>
419	<p>
420	or:
421	</p>
422	<pre> +
423	\|
424	+--21
425	\|
426	3--*
427	\|
428	4--5</pre>
429	<p>
430	All AST nodes are subclasses of <font face="Courier New">CalcAST</font>, which are <font face="Courier New">BaseAST</font>'s that also answer method <font face="Courier New">value()</font>.   Method <font face="Courier New">value()</font> evaluates the tree starting at that node.  Naturally, for integer nodes, <font face="Courier New">value()</font> will simply return the value stored within that node.  Here is <font face="Courier New">CalcAST:</font>
431	</p>
432	<pre>public abstract class CalcAST
433	extends antlr.BaseAST
434	{
435	public abstract int value();
436	}</pre>
437	<p>
438	The AST operator nodes must combine the results of computing the value of their two subtrees.  They must perform a depth-first walk of the tree below them.  For fun and to make the operations more obvious, the operator nodes define left() and right() instead, making them appear even more different than the normal child-sibling tree representation.  Consequently, these expression trees can be treated as both homogeneous child-sibling trees and heterogeneous expression trees.
439	</p>
440	<pre>public abstract class BinaryOperatorAST extends
441	CalcAST
442	{
443	/** Make me look like a heterogeneous tree */
444	public CalcAST left() {
445	return (CalcAST)getFirstChild();
446	}
447
448	public CalcAST right() {
449	CalcAST t = left();
450	if ( t==null ) return null;
451	return (CalcAST)t.getNextSibling();
452	}
453	}</pre>
454	<p>
455	The simplest node in the tree looks like:
456	</p>
457	<pre>import antlr.BaseAST;
458	import antlr.Token;
459	import antlr.collections.AST;
460	import java.io.*;
461
462	/** A simple node to represent an INT */
463	public class INTNode extends CalcAST {
464	int v=0;
465
466	public INTNode(Token tok) {
467	v = Integer.parseInt(tok.getText());
468	}
469
470	/** Compute value of subtree; this is
471	* heterogeneous part :)
472	*/
473	public int value() {
474	return v;
475	}
476
477	public String toString() {
478	return " "+v;
479	}
480
481	// satisfy abstract methods from BaseAST
482	public void initialize(int t, String txt) {
483	}
484	public void initialize(AST t) {
485	}
486	public void initialize(Token tok) {
487	}
488	}</pre>
489	<p>
490	The operators derive from <font face="Courier New">BinaryOperatorAST</font> and define <font face="Courier New">value()</font> in terms of <font face="Courier New">left()</font> and <font face="Courier New">right()</font>.  For example, here is <font face="Courier New">PLUSNode</font>:
491	</p>
492	<pre>import antlr.BaseAST;
493	import antlr.Token;
494	import antlr.collections.AST;
495	import java.io.*;
496
497	/** A simple node to represent PLUS operation */
498	public class PLUSNode extends BinaryOperatorAST {
499	public PLUSNode(Token tok) {
500	}
501
502	/** Compute value of subtree;
503	* this is heterogeneous part :)
504	*/
505	public int value() {
506	return left().value() + right().value();
507	}
508
509	public String toString() {
510	return " +";
511	}
512
513	// satisfy abstract methods from BaseAST
514	public void initialize(int t, String txt) {
515	}
516	public void initialize(AST t) {
517	}
518	public void initialize(Token tok) {
519	}
520	}</pre>
521	<p>
522	The parser is pretty straightforward except that you have to add the options to tell ANTLR what node types you want to create for which token matched on the input stream.   The <font face="Courier New">tokens</font> section lists the operators with element option AST appended to their definitions.  This tells ANTLR to build <font face="Courier New">PLUSNode</font> objects for any <font face="Courier New">PLUS</font> tokens seen on the input stream, for example.  For demonstration purposes, <font face="Courier New">INT</font> is not included in the <font face="Courier New">tokens</font> section--the specific token references is suffixed with the element option to specify that nodes created from that <font face="Courier New">INT</font> should be of type <font face="Courier New">INTNode</font> (of course, the effect is the same as there is only that one reference to <font face="Courier New">INT</font>).
523	</p>
524	<pre>class CalcParser extends Parser;
525	options {
526	buildAST = true; // uses CommonAST by default
527	}
528
529	// define a bunch of specific AST nodes to build.
530	// can override at actual reference of tokens in
531	// grammar below.
532	tokens {
533	PLUS<AST=PLUSNode>;
534	STAR<AST=MULTNode>;
535	}
536
537	expr: mexpr (PLUS^ mexpr)* SEMI!
538	;
539
540	mexpr
541	: atom (STAR^ atom)*
542	;
543
544	// Demonstrate token reference option
545	atom: INT<AST=INTNode>
546	;</pre>
547	<p>
548	Invoking the parser is done as usual.  Computing the value of the resulting AST is accomplished by simply calling method <font face="Courier New">value()</font> on the root.
549	</p>
550	<pre>import java.io.*;
551	import antlr.CommonAST;
552	import antlr.collections.AST;
553
554	class Main {
555	public static void main(String[] args) {
556	try {
557	CalcLexer lexer =
558	new CalcLexer(
559	new DataInputStream(System.in)
560	);
561	CalcParser parser =
562	new CalcParser(lexer);
563	// Parse the input expression
564	parser.expr();
565	CalcAST t = (CalcAST)parser.getAST();
566
567	System.out.println(t.toStringTree());
568
569	// Compute value and return
570	int r = t.value();
571	System.out.println("value is "+r);
572	} catch(Exception e) {
573	System.err.println("exception: "+e);
574	e.printStackTrace();
575	}
576	}
577	}</pre>
578	<p>
579	For completeness, here is the lexer:
580	</p>
581	<pre>class CalcLexer extends Lexer;
582
583	WS : (' '
584	\| '\t'
585	\| '\n'
586	\| '\r')
587	{ $setType(Token.SKIP); }
588	;
589
590	LPAREN: '(' ;
591
592	RPAREN: ')' ;
593
594	STAR: '*' ;
595
596	PLUS: '+' ;
597
598	SEMI: ';' ;
599
600	protected
601	DIGIT
602	: '0'..'9' ;
603
604	INT : (DIGIT)+ ;</pre> <h3><a name="Describing Heterogeneous Trees With Grammars">Describing Heterogeneous Trees With Grammars</a></h3>
605	<p>
606	So what's the difference between this approach and default homogeneous tree construction?  The big difference is that you need a tree grammar to describe the expression tree and compute resulting values.  But, that's a good thing as it's "executable documentation" and negates the need to handcode the tree parser (the <font face="Courier New">value()</font> methods).  If you used homogeneous trees, here is all you would need beyond the parser/lexer to evaluate the expressions:  [<em>This code comes from the <font face="Courier New">examples/java/calc</font> directory</em>.]
607	</p>
608	<pre>class CalcTreeWalker extends TreeParser;
609
610	expr returns [float r]
611	{
612	float a,b;
613	r=0;
614	}
615	: #(PLUS a=expr b=expr) {r = a+b;}
616	\| #(STAR a=expr b=expr) {r = a*b;}
617	\| i:INT
618	{r = (float)
619	Integer.parseInt(i.getText());}
620	;</pre>
621	<p>
622	Because Terence wants you to use tree grammars even when constructing heterogeneous ASTs (to avoid handcoding methods that implement a depth-first-search), implement the following methods in your various heterogeneous AST node class definitions:
623	</p>
624	<pre> /** Get the token text for this node */
625	public String getText();
626	/** Get the token type for this node */
627	public int getType();</pre>
628	<p>
629	That is how you can use heterogeneous trees with a tree grammar.  Note that your token types must match the <font face="Courier New">PLUS</font> and <font face="Courier New">STAR</font> token types imported from your parser.  I.e., make sure <font face="Courier New">PLUSNode.getType()</font> returns <font face="Courier New">CalcParserTokenTypes.PLUS</font>.   The token types are generated by ANTLR in interface files that look like:
630	</p>
631	<pre>public interface CalcParserTokenTypes {
632	...
633	int PLUS = 4;
634	int STAR = 5;
635	...
636	}</pre> <h2><a name="AST Serialization">AST (XML) Serialization</a></h2>
637	<p>
638	[<font size="2">Oliver Zeigermann <a href="mailto:olli@zeigermann.de">olli@zeigermann.de</a> provided the initial implementation of this serialization.  His <a href="http://www.zeigermann.de/xtal.html">XTAL</a> XML translation code is worth checking out; particularly for reading XML-serialized ASTs back in.]</font>
639	</p>
640	<p>
641	For a variety of reasons, you may want to store an AST or pass it to another program or computer.  Class antlr.BaseAST is Serializable using the Java code generator, which means you can write ASTs to the disk using the standard Java stuff.  You can also write the ASTs out in XML form using the following methods from <font face="Courier New">BaseAST</font>:
642	<ul>
643	<li>
644	<font face="Courier New">public void xmlSerialize(Writer out)</font>
645	</li>
646	<li>
647	<font face="Courier New">public void xmlSerializeNode(Writer out)</font>
648	</li>
649	<li>
650	<font face="Courier New">public void xmlSerializeRootOpen(Writer out)</font>
651	</li>
652	<li>
653	<font face="Courier New">public void xmlSerializeRootClose(Writer out)</font>
654	</li>
655	</ul>
656	<p>
657	All methods throw <font face="Courier New">IOException</font>.
658	</p>
659	<p>
660	You can override <font face="Courier New">xmlSerializeNode</font> and so on to change the way nodes are written out.  By default the serialization uses the class type name as the tag name and has attributes <font face="Courier New">text</font> and <font face="Courier New">type</font> to store the text and token type of the node.
661	</p>
662	<p>
663	The output from running the simple heterogeneous tree example, examples/java/heteroAST, yields:
664	</p>
665	<pre> ( + ( + 3 ( * 4 5 ) ) 21 )
666	<PLUS><PLUS><int>3</int><MULT>
667	<int>4</int><int>5</int>
668	</MULT></PLUS><int>21</int></PLUS>
669	value is 44</pre>
670	<p>
671	The LISP-form of the tree shows the structure and contents.  The various heterogeneous nodes override the open and close tags and change the way leaf nodes are serialized to use <font face="Courier New"><int><em>value</em></int></font> instead of tag attributes of a single node.
672	</p>
673	<p>
674	Here is the code that generates the XML:
675	</p>
676	<pre>Writer w = new OutputStreamWriter(System.out);
677	t.xmlSerialize(w);
678	w.write("\n");
679	w.flush();</pre> <h2><a name="_bb12">AST enumerations</a></h2>
680	<p>
681	The AST <tt>findAll</tt> and <tt>findAllPartial</tt> methods return enumerations of tree nodes that you can walk.  Interface
682	</p>
683	<pre>antlr.collections.ASTEnumeration</pre>
684	<p>
685	and
686	</p>
687	<pre>class antlr.Collections.impl.ASTEnumerator</pre>
688	<p>
689	implement this functionality.  Here is an example:
690	</p>
691	<pre>// Print out all instances of
692	// <em>a-subtree-of-interest
693	// </em>found within tree 't'.
694	ASTEnumeration enum;
695	enum = t.findAll(<em>a-subtree-of-interest</em>);
696	while ( enum.hasMoreNodes() ) {
697	System.out.println(
698	enum.nextNode().toStringList()
699	);
700	}</pre> <h2><a name="_bb13"></a><a name="A few examples">A few examples</a></h2> <pre><tt>
701	sum :term ( PLUS^ term)*
702	;</tt> </pre>
703	<p>
704	The "<tt>^</tt>" suffix on the <tt>PLUS</tt> tells ANTLR to create an additional node and place it as the root of whatever subtree has been constructed up until that point for rule <tt>sum</tt>. The subtrees returned by the <tt>term</tt> references are collected as children of the addition nodes.  If the subrule is not matched, the associated nodes would not be added to the tree. The rule returns either the tree matched for the first <tt>term</tt> reference or a <tt>PLUS</tt>-rooted tree.
705	</p>
706	<p>
707	The grammar annotations should be viewed as operators, not static specifications. In the above example, each iteration of the (...)* will create a new PLUS root, with the previous tree on the left, and the tree from the new <tt>term</tt> on the right, thus preserving the usual associatively for "+".
708	</p>
709	<p>
710	Look at the following rule that turns off default tree construction.
711	</p>
712	<pre><tt>decl!:
713	modifiers type ID SEMI;
714	{ #decl = #([DECL], ID, ([TYPE] type),
715	([MOD] modifiers) ); }
716	;</tt></pre>
717	<p>
718	In this example, a declaration is matched. The resulting AST has an "imaginary" <tt>DECL</tt> node at the root, with three children. The first child is the <tt>ID</tt> of the declaration. The second child is a subtree with an imaginary <tt>TYPE</tt> node at the root and the AST from the <tt>type</tt> rule as its child. The third child is a subtree with an imaginary <tt>MOD</tt> at the root and the results of the <tt>modifiers</tt> rule as its child.
719	</p>
720	<h2><a name="_bb14"></a><a name="Labeled subrules">Labeled subrules</a></h2>
721	<p>
722	[<big><i>THIS WILL NOT BE IMPLEMENTED AS LABELED SUBRULES...We'll do something else eventually.</i></big>]
723	</p>
724	<p>
725	In 2.00 ANTLR, each rule has exactly one tree associated with it. Subrules simply add elements to the tree for the enclosing rule, which is normally what you want. For example, expression trees are easily built via:
726	</p>
727	<pre><tt>
728	expr: ID ( PLUS^ ID )*
729	;
730	</tt> </pre>
731	<p>
732	However, many times you want the elements of a subrule to produce a tree that is independent of the rule's tree. Recall that exponents must be computed before coefficients are multiplied in for exponent terms. The following grammar matches the correct syntax.
733	</p>
734	<pre><tt>
735	// match exponent terms such as "3*x^4"
736	eterm
737	: expr MULT ID EXPONENT expr
738	;
739	</tt> </pre>
740	<p>
741	However, to produce the correct AST, you would normally split the <tt>ID EXPONENT expr</tt> portion into another rule like this:
742	</p>
743	<pre><tt>
744	eterm:
745	expr MULT^ exp
746	;
747
748	exp:
749	ID EXPONENT^ expr
750	;
751	</tt> </pre>
752	<p>
753	In this manner, each operator would be the root of the appropriate subrule. For input <tt>3*x^4</tt>, the tree would look like:
754	</p>
755	<pre><tt>
756	#(MULT 3 #(EXPONENT ID 4))
757	</tt> </pre>
758	<p>
759	However, if you attempted to keep this grammar in the same rule:
760	</p>
761	<pre><tt>
762	eterm
763	: expr MULT^ (ID EXPONENT^ expr)
764	;
765	</tt> </pre>
766	<p>
767	both "<tt>^</tt>" root operators would modify the same tree yielding
768	</p>
769	<pre><tt>
770	#(EXPONENT #(MULT 3 ID) 4)
771	</tt> </pre>
772	<p>
773	This tree has the operators as roots, but they are associated with the wrong operands.
774	</p>
775	<p>
776	Using a labeled subrule allows the original rule to generate the correct tree.
777	</p>
778	<pre><tt>
779	eterm
780	: expr MULT^ e:(ID EXPONENT^ expr)
781	;
782	</tt> </pre>
783	<p>
784	In this case, for the same input <tt>3*x^4</tt>, the labeled subrule would build up its own subtree and make it the operand of the <tt>MULT</tt> tree of the <tt>eterm</tt> rule. The presence of the label alters the AST code generation for the elements within the subrule, making it operate more like a normal rule. Annotations of "<tt>^</tt>" make the node created for that token reference the root of the tree for the <tt>e</tt> subrule.
785	</p>
786	<p>
787	Labeled subrules have a result AST that can be accessed just like the result AST for a rule. For example, we could rewrite the above decl example using labeled subrules (note the use of <tt>!</tt> at the start of the subrules to suppress automatic construction for the subrule):
788	</p>
789	<pre><tt>
790	decl!:
791	m:(! modifiers { #m = #([MOD] modifiers); } )
792	t:(! type { #t = #([TYPE] type); } )
793	ID
794	SEMI;
795	{ #decl = #( [DECL] ID t m ); }
796	;
797	</tt> </pre>
798	<p>
799	What about subrules that are closure loops? The same rules apply to a closure subrule--there is a single tree for that loop that is built up according to the AST operators annotating the elements of that loop. For example, consider the following rule.
800	</p>
801	<pre><tt>
802	term: T^ i:(OP^ expr)+
803	;
804	</tt> </pre>
805	<p>
806	For input <tt>T OP A OP B OP C</tt>, the following tree structure would be created:
807	</p>
808	<pre><tt>
809	#(T #(OP #(OP #(OP A) B) C) )
810	</tt> </pre>
811	<p>
812	which can be drawn graphically as
813	</p>
814	<pre><tt>
815	T
816	\|
817	OP
818	\|
819	OP--C
820	\|
821	OP--B
822	\|
823	A
824	</tt> </pre>
825	<p>
826	The first important thing to note is that each iteration of the loop in the subrule operates on the same tree. The resulting tree, after all iterations of the loop, is associated with the subrule label. The result tree for the above labeled subrule is:
827	</p>
828	<pre><tt>
829	#(OP #(OP #(OP A) B) C)
830	</tt> </pre>
831	<p>
832	The second thing to note is that, because <tt>T</tt> is matched first and there is a root operator after it in the rule, <tt>T</tt> would be at the bottom of the tree if it were not for the label on the subrule.
833	</p>
834	<p>
835	Loops will generally be used to build up lists of subtree. For example, if you want a list of polynomial assignments to produce a sibling list of <tt>ASSIGN</tt> subtrees, then the following rule you would normally split into two rules.
836	</p>
837	<pre><tt>
838	interp
839	: ( ID ASSIGN poly ";" )+
840	;
841	</tt> </pre>
842	<p>
843	Normally, the following would be required
844	</p>
845	<pre><tt>
846	interp
847	: ( assign )+
848	;
849	assign
850	: ID ASSIGN^ poly ";"!
851	;
852	</tt> </pre>
853	<p>
854	Labeling a subrule allows you to write the above example more easily as:
855	</p>
856	<pre><tt>
857	interp
858	: ( r:(ID ASSIGN^ poly ";") )+
859	;
860	</tt> </pre>
861	<p>
862	Each recognition of a subrule results in a tree and if the subrule is nested in a loop, all trees are returned as a list of trees (i.e., the roots of the subtrees are siblings). If the labeled subrule is suffixed with a "<tt>!</tt>", then the tree(s) created by the subrule are not linked into the tree for the enclosing rule or subrule.
863	</p>
864	<p>
865	Labeled subrules within labeled subrules result in trees that are linked into the surrounding subrule's tree. For example, the following rule results in a tree of the form <tt>X #( A #(B C) D) Y</tt>.
866	</p>
867	<pre><tt>
868	a : X r:( A^ s:(B^ C) D) Y
869	;
870	</tt> </pre>
871	<p>
872	Labeled subrules within nonlabeled subrules result in trees that are linked into the surrounding rule's tree. For example, the following rule results in a tree of the form <tt>#(A X #(B C) D Y)</tt>.
873	</p>
874	<pre><tt>
875	a : X ( A^ s:(B^ C) D) Y
876	;</tt> </pre> <h2><a name="_bb15"></a><a name="Reference nodes">Reference nodes</a></h2>
877	<p>
878	<b>Not implemented.</b> A node that does nothing but refer to another node in the tree. Nice for embedding the same tree in multiple lists.
879	</p>
880	<h2><a name="_bb16"></a><a name="Required AST functionality and form">Required AST functionality and form</a></h2>
881	<p>
882	The data structure representing your trees can have any form or type name as long as they implement the <tt>AST</tt> interface:
883	</p>
884	<pre><tt>package antlr.collections;
885
886	/** Minimal AST node interface used by ANTLR
887	* AST generation and tree-walker.
888	*/
889	public interface AST {
890	/** Get the token type for this node */
891	public int getType();
892
893	/** Set the token type for this node */
894	public void setType(int ttype);
895
896	/** Get the token text for this node */
897	public String getText();
898
899	/** Set the token text for this node */
900	public void setText(String text);
901
902	/** Get the first child of this node;
903	* null if no children */
904	public AST getFirstChild();
905
906	/** Set the first child of a node */
907	public void setFirstChild(AST c);
908
909	/** Get the next sibling in line after this
910	* one
911	*/
912	public AST getNextSibling();
913
914	/** Set the next sibling after this one */
915	public void setNextSibling(AST n);
916
917	/** Add a (rightmost) child to this node */
918	public void addChild(AST node);</tt></pre> <pre> /** Are two nodes exactly equal? */
919	public boolean equals(AST t);</pre> <pre> /** Are two lists of nodes/subtrees exactly
920	* equal in structure and content? */
921	public boolean equalsList(AST t);</pre> <pre> /** Are two lists of nodes/subtrees
922	* partially equal? In other words, 'this'
923	* can be bigger than 't'
924	*/
925	public boolean equalsListPartial(AST t);</pre> <pre> /** Are two nodes/subtrees exactly equal? */
926	public boolean equalsTree(AST t);</pre> <pre> /** Are two nodes/subtrees exactly partially
927	* equal? In other words, 'this' can be
928	* bigger than 't'.
929	*/
930	public boolean equalsTreePartial(AST t);</pre> <pre> /** Return an enumeration of all exact tree
931	* matches for tree within 'this'.
932	*/
933	public ASTEnumeration findAll(AST tree);</pre> <pre> /** Return an enumeration of all partial
934	* tree matches for tree within 'this'.
935	*/
936	public ASTEnumeration findAllPartial(
937	AST subtree);</pre> <pre> /** Init a node with token type and text */
938	public void initialize(int t, String txt);</pre> <pre> /** Init a node using content from 't' */
939	public void initialize(AST t);</pre> <pre> /** Init a node using content from 't' */
940	public void initialize(Token t);</pre> <pre> /** Convert node to printable form */
941	public String toString();</pre> <pre> /** Treat 'this' as list (i.e.,
942	* consider 'this'
943	* siblings) and convert to printable
944	* form
945	*/
946	public String toStringList();</pre> <pre> /** Treat 'this' as tree root
947	* (i.e., don't consider
948	* 'this' siblings) and convert
949	* to printable form */
950	public String toStringTree();<tt>
951	}</tt></pre>
952	<p>
953	This scheme does not preclude the use of heterogeneous trees versus homogeneous trees. However, you will need to write extra code to create heterogeneous trees (via a subclass of <tt>ASTFactory</tt>) or by specifying the node types at the token reference sites or in the <font face="Courier New">tokens</font> section, whereas the homogeneous trees are free.
954	</p>
955	<pre><font face="Arial" size="2">Version: $Id: //depot/code/org.antlr/release/antlr-2.7.7/doc/trees.html#2 $</font></pre>
956	</body>
957	</html>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/yao/share/antlr-2.7.7/doc/trees.html

Download in other formats: