source: trunk/yao/share/antlr-2.7.7/doc/streams.html

Last change on this file was 1, checked in by lnalod, 15 years ago

Initial import of YAO sources

1<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
2<html>
3<head>
4        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
5        <title>ANTLR Specification: Token Streams</title> 
6</head>
7<body bgcolor="#FFFFFF" text="#000000">
8<h2>Token Streams</h2> 
9<p>
10        Traditionally, a lexer and parser are tightly coupled objects; that is, one does not imagine anything sitting between the parser and the lexer, modifying the stream of tokens. &nbsp; However, language recognition and translation can benefit greatly from treating the connection between lexer and parser as a <em>token stream</em>.&nbsp; This idea is analogous to Java I/O streams, where you can pipeline lots of stream objects to produce highly-processed data streams.
11</p>
12<h3><a name="Introduction">Introduction</a></h3> 
13<p>
14        ANTLR identifies a stream of Token objects as any object that satisfies the <font face="Courier New">TokenStream</font> interface (prior to 2.6, this interface was called <font face="Courier New">Tokenizer</font>); i.e., any object that implements the following method.
15</p>
16<pre>Token nextToken();</pre> 
17<p>
18        Graphically, a normal stream of tokens from a lexer (producer) to a parser (consumer) might look like the following at some point during the parse.
19</p>
20<p>
21        <img src="lexer.to.parser.tokens.gif" width="564" height="81" alt="lexer.to.parser.tokens.gif (3585 bytes)">
22</p>
23<p>
24        The most common token stream is a lexer, but once you imagine a physical stream between the lexer and parser, you start imagining interesting things that you can do.&nbsp; For example, you can:
25<ul>
26        <li>
27                filter a stream of tokens to strip out unwanted tokens
28        </li>
29        <li>
30                insert imaginary tokens to help the parser recognize certain nasty structures
31        </li>
32        <li>
33                split a single stream into multiple streams, sending certain tokens of interest down the various streams
34        </li>
35        <li>
36                multiplex multiple token streams onto one stream, thus, &quot;simulating&quot; the lexer states of tools like PCCTS, lex, and so on.
37        </li>
38</ul>
39<p>
40        The beauty of the token stream concept is that parsers and lexers are not affected--they are merely consumers and producers of streams.&nbsp; Stream objects are filters that produce, process, combine, or separate token streams for use by consumers. &nbsp; Existing lexers and parsers may be combined in new and interesting ways without modification.
41</p>
42<p>
43        This document formalizes the notion of a token stream and describes in detail some very useful stream filters.
44</p>
45<h3><a name="Pass-Through Token Stream">Pass-Through Token Stream</a></h3> 
46<p>
47        A token stream is any object satisfying the following interface.
48</p>
49<pre><small>public interface TokenStream {
50  public Token nextToken()</small>
51<small>    throws java.io.IOException;
52}</small></pre> 
53<p>
54        For example, a &quot;no-op&quot; or pass-through filter stream looks like:
55</p>
56<pre><small>import antlr.*;
57import java.io.IOException;
58
59class TokenStreamPassThrough</small>
60<small>    implements TokenStream {
61  protected TokenStream input;
62
63  /** Stream to read tokens from */</small>
64<small>  public TokenStreamPassThrough(TokenStream in) {
65    input = in;
66  }
67
68  /** This makes us a stream */</small>
69<small>  public Token nextToken() throws IOException {
70    return input.nextToken(); // &quot;short circuit&quot;
71  }
72}</small></pre> 
73<p>
74        You would use this simple stream by having it pull tokens from the lexer and then have the parser pull tokens from it as in the following main() program.
75</p>
76<pre><small>public static void main(String[] args) {
77&nbsp;&nbsp;MyLexer&nbsp;lexer&nbsp;=
78    new&nbsp;MyLexer(new&nbsp;DataInputStream(System.in));
79  TokenStreamPassThrough filter =</small>
80<small>    new TokenStreamPassThrough(lexer);
81  MyParser parser = new MyParser(filter);</small>
82<small>  parser.<em>startRule</em>();
83}</small></pre> <h3><a name="Token Stream Filtering">Token Stream Filtering</a></h3> 
84<p>
        Most of the time, you want the lexer to discard whitespace and comments. But what if you also want to reuse the lexer in situations where the parser must see the comments?&nbsp; You can design a single lexer to cover many situations by having the lexer emit comments and whitespace along with the normal tokens.&nbsp; Then, when you want to discard whitespace, put a filter between the lexer and the parser to kill whitespace tokens.
86</p>
87<p>
88        ANTLR provides <small><font face="Courier New">TokenStreamBasicFilter</font></small> for such situations.&nbsp; You can instruct it to discard any token type or types without having to modify the lexer.&nbsp; Here is an example usage of <small><font face="Courier New">TokenStreamBasicFilter</font></small> that filters out comments and whitespace.
89</p>
90<pre><small>public static void main(String[] args) {
91&nbsp;&nbsp;MyLexer&nbsp;lexer&nbsp;=
92    new&nbsp;MyLexer(new&nbsp;DataInputStream(System.in));
  TokenStreamBasicFilter filter =</small>
<small>    new TokenStreamBasicFilter(lexer);
95  filter.discard(MyParser.WS);
96  filter.discard(MyParser.COMMENT);</small>
97<small>  MyParser parser = new MyParser(filter);</small>
98<small>  parser.<em>startRule</em>();
99}</small></pre> 
100<p>
101        Note that it is more efficient to have the lexer immediately discard lexical structures you do not want because you do not have to construct a Token object.&nbsp; On the other hand, filtering the stream leads to more flexible lexers.
102</p>
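<p>
        The discarding behavior described above can be sketched in a few lines.&nbsp; This is only an illustration, not the source of TokenStreamBasicFilter: to keep it runnable without the ANTLR jar, the types Tok, Stream, ListStream and DiscardFilter and the token type constants are hypothetical stand-ins for the ANTLR classes, and the checked exceptions of the real interface are omitted.
</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Sketch of a discarding token filter; every type here is a stand-in, not an ANTLR class. */
final class DiscardFilterSketch {
    // Hypothetical token type numbers chosen for this example only.
    static final int EOF = 1, INT = 4, ID = 5, SEMI = 6, WS = 7, COMMENT = 8;

    /** Stand-in for a token: a type and the matched text. */
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    /** Stand-in for the one-method token stream interface. */
    interface Stream {
        Tok nextToken();
    }

    /** Plays the role of the lexer: replays a fixed list, then returns EOF forever. */
    static final class ListStream implements Stream {
        private final Iterator<Tok> it;
        ListStream(List<Tok> toks) { it = toks.iterator(); }
        public Tok nextToken() { return it.hasNext() ? it.next() : new Tok(EOF, "<EOF>"); }
    }

    /** Pulls from an upstream stream and silently skips discarded token types. */
    static final class DiscardFilter implements Stream {
        private final Stream input;
        private final Set<Integer> discarded = new HashSet<>();
        DiscardFilter(Stream input) { this.input = input; }
        void discard(int type) { discarded.add(type); }
        public Tok nextToken() {
            Tok t = input.nextToken();
            // Assumes EOF is never placed in the discard set.
            while (discarded.contains(t.type)) {
                t = input.nextToken();
            }
            return t;
        }
    }

    /** Filters the tokens of "int n; // list length" and returns what the consumer sees. */
    static List<String> run() {
        Stream lexer = new ListStream(Arrays.asList(
            new Tok(INT, "int"), new Tok(WS, " "), new Tok(ID, "n"), new Tok(SEMI, ";"),
            new Tok(WS, " "), new Tok(COMMENT, "// list length")));
        DiscardFilter filter = new DiscardFilter(lexer);
        filter.discard(WS);
        filter.discard(COMMENT);
        List<String> seen = new ArrayList<>();
        for (Tok t = filter.nextToken(); t.type != EOF; t = filter.nextToken()) {
            seen.add(t.text);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

<p>
        Because the filter is itself a stream, the consumer cannot tell whether it is reading from the lexer or from the filter, which is what lets one lexer serve both the comment-preserving and the comment-discarding case.
</p>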
103<h3><a name="Token Stream Splitting">Token Stream Splitting</a></h3> 
104<p>
105        Sometimes you want a translator to ignore but not discard portions of the input during the recognition phase.&nbsp;&nbsp; For example, you want to ignore comments vis-a-vis parsing, but you need the comments for translation.&nbsp;&nbsp; The solution is to send the comments to the parser on a <em>hidden</em> token stream--one that the parser is not &quot;listening&quot; to.&nbsp; During recognition, actions can then examine the hidden stream or streams, collecting the comments and so on.&nbsp; Stream-splitting filters are like prisms that split white light into rainbows.
106</p>
107<p>
108        The following diagram illustrates a situation in which a single stream of tokens is split into three.
109</p>
110<p>
111        <img src="stream.splitter.gif" width="546" height="232" alt="stream.splitter.gif (5527 bytes)">
112</p>
113<p>
114        You would have the parser pull tokens from the topmost stream.
115</p>
116<p>
117        There are many possible capabilities and implementations of a stream splitter. &nbsp; For example, you could have a &quot;Y-splitter&quot; that actually duplicated a stream of tokens like a cable-TV Y-connector.&nbsp; If the filter were thread-safe and buffered, you could have multiple parsers pulling tokens from the filter at the same time.
118</p>
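<p>
        As a rough sketch of the buffered &quot;Y-splitter&quot; idea, the following duplicates one stream onto two branches.&nbsp; ANTLR does not ship this class; Tee, Stream and the string tokens are hypothetical names used only for illustration, and the synchronization shown is the simplest possible choice rather than a tuned design.
</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Sketch of a buffered Y-splitter; Tee and Stream are hypothetical, not ANTLR classes. */
final class TeeSketch {
    /** Stand-in token stream whose tokens are plain strings. */
    interface Stream {
        String nextToken();
    }

    /** Duplicates one upstream stream onto two branches, buffering for the slower one. */
    static final class Tee {
        private final Stream input;
        private final Deque<String> leftBuf = new ArrayDeque<>();
        private final Deque<String> rightBuf = new ArrayDeque<>();
        Tee(Stream input) { this.input = input; }

        Stream left()  { return () -> pull(leftBuf, rightBuf); }
        Stream right() { return () -> pull(rightBuf, leftBuf); }

        // One lock guards both buffers and the upstream read.
        private synchronized String pull(Deque<String> mine, Deque<String> other) {
            if (!mine.isEmpty()) {
                return mine.removeFirst();   // the other branch already read this token
            }
            String t = input.nextToken();
            other.addLast(t);                // remember it for the other branch
            return t;
        }
    }

    /** Reads the two branches at different speeds and logs what each one sees. */
    static List<String> run() {
        Iterator<String> src = Arrays.asList("INT", "ID", "SEMI").iterator();
        Tee tee = new Tee(() -> src.hasNext() ? src.next() : "EOF");
        Stream a = tee.left();
        Stream b = tee.right();
        List<String> log = new ArrayList<>();
        log.add("a:" + a.nextToken());
        log.add("a:" + a.nextToken());
        log.add("b:" + b.nextToken());
        log.add("b:" + b.nextToken());
        log.add("b:" + b.nextToken());
        log.add("a:" + a.nextToken());
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

<p>
        Each branch sees every token in the original order, regardless of which branch happens to pull first.
</p>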
119<p>
        This section describes a stream filter supplied with ANTLR called <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> that behaves like a coin sorter, sending pennies to one bin, dimes to another, and so on.&nbsp; This filter splits the input stream into two streams, a main stream with the majority of the tokens and a <em>hidden</em> stream that is buffered so that you can ask it questions later about its contents. &nbsp; Because of the implementation, however, you cannot attach a parser to the hidden stream.&nbsp; The filter actually weaves the hidden tokens among the main tokens as you will see below.
121</p>
122<h4><a name="Example">Example</a></h4> 
123<p>
124        Consider the following simple grammar that reads in integer variable declarations.
125</p>
126<pre>decls: (decl)+
127     ;
128decl : begin:INT ID end:SEMI
129     ; </pre> 
130<p>
131        Now assume input:
132</p>
133<pre>int n; // list length
134/** doc */
135int f;</pre> 
136<p>
        Imagine that whitespace is ignored by the lexer and that you have instructed the filter to split comments onto the hidden stream.&nbsp; Now if the parser is pulling tokens from the main stream, it will see only &quot;INT ID SEMI INT ID SEMI&quot; even though the comments are hanging around on the hidden stream.&nbsp; So the parser effectively ignores the comments, but your actions can query the filter for tokens on the hidden stream.
138</p>
139<p>
140        The first time through rule <font face="Courier New">decl</font>, the <font face="Courier New">begin</font> token reference has no hidden tokens before or after, but
141</p>
142<pre><font face="Courier New">filter.getHiddenAfter(end)</font></pre> 
143<p>
144        returns a reference to token
145</p>
146<pre><font face="Courier New">// list length</font></pre> 
147<p>
148        which in turn provides access to
149</p>
150<p>
151        <font face="Courier New">/** doc */</font>
152</p>
153<p>
154        The second time through <font face="Courier New">decl</font>
155</p>
156<pre><font face="Courier New">filter.getHiddenBefore(begin)</font></pre> 
157<p>
158        refers to the
159</p>
160<pre><font face="Courier New">/** doc */</font></pre> 
161<p>
162        comment.
163</p>
164<h4><a name="Filter Implementation">Filter Implementation</a></h4> 
165<p>
        The following diagram illustrates how the Token objects are physically woven together to simulate two different streams.
167</p>
168<p align="center">
169        <img src="hidden.stream.gif" width="377" height="148" alt="hidden.stream.gif (3667 bytes)">
170</p>
171<p align="center">
172        &nbsp;
173</p>
174<p>
        As the tokens are consumed, the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> object hooks the hidden tokens to the main tokens via a linked list.&nbsp; There is only one physical TokenStream of tokens emanating from this filter, and the interwoven pointers maintain sequence information.
176</p>
177<p>
178        Because of the extra pointers required to link the tokens together, you must use a special token object called <small><font face="Courier New">CommonHiddenStreamToken</font></small> (the normal object is called <small><font face="Courier New">CommonToken</font></small>). &nbsp; Recall that you can instruct a lexer to build tokens of a particular class with
179</p>
180<pre><small>lexer.setTokenObjectClass(&quot;<em>classname</em>&quot;);</small></pre> 
181<p>
182        Technically, this exact filter functionality could be implemented without requiring a special token object, but this filter implementation is extremely efficient and it is easy to tell the lexer what kind of tokens to create.&nbsp; Further, this implementation makes it very easy to automatically have tree nodes built that preserve the hidden stream information.
183</p>
184<p>
        This filter affects the lazy-consume of ANTLR.&nbsp; After recognizing every main stream token, the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> must grab the next Token to see if it is a hidden token. Consequently, this filter is not very workable for interactive (e.g., command-line) applications.
186</p>
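<p>
        The weaving and the one-token read-ahead can be sketched as follows.&nbsp; This is a simplified illustration under stated assumptions, not the code of TokenStreamHiddenTokenFilter: Tok, Stream, ListStream and WeavingFilter are hypothetical stand-ins, the links are exposed as plain fields rather than accessor methods, and whitespace is assumed to have been dropped by the lexer already.&nbsp; It replays the integer declaration example from above.
</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Sketch of weaving hidden tokens among main tokens; all types are stand-ins. */
final class HiddenWeaveSketch {
    // Hypothetical token type numbers chosen for this example only.
    static final int EOF = 1, INT = 4, ID = 5, SEMI = 6, COMMENT = 7;

    /** Stand-in token carrying the two extra links to neighboring hidden tokens. */
    static final class Tok {
        final int type;
        final String text;
        Tok hiddenBefore;
        Tok hiddenAfter;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    interface Stream {
        Tok nextToken();
    }

    /** Plays the role of the lexer: replays a fixed list, then returns EOF forever. */
    static final class ListStream implements Stream {
        private final Iterator<Tok> it;
        ListStream(List<Tok> toks) { it = toks.iterator(); }
        public Tok nextToken() { return it.hasNext() ? it.next() : new Tok(EOF, "<EOF>"); }
    }

    static final class WeavingFilter implements Stream {
        private final Stream input;
        private final Set<Integer> hidden = new HashSet<>();
        private Tok lookahead;   // next main token, already read from upstream
        WeavingFilter(Stream input) { this.input = input; }
        void hide(int type) { hidden.add(type); }

        /** Reads up to the next main token, chaining any hidden tokens on the way. */
        private Tok readMain(Tok previousMain) {
            Tok lastHidden = null;
            Tok t = input.nextToken();
            while (hidden.contains(t.type)) {
                if (lastHidden == null) {
                    if (previousMain != null) previousMain.hiddenAfter = t;
                } else {
                    lastHidden.hiddenAfter = t;
                    t.hiddenBefore = lastHidden;
                }
                lastHidden = t;
                t = input.nextToken();
            }
            t.hiddenBefore = lastHidden;   // hidden tokens never point at main tokens
            return t;
        }

        public Tok nextToken() {
            if (lookahead == null) lookahead = readMain(null);
            Tok current = lookahead;
            // Read ahead now so current.hiddenAfter is set before the caller sees current.
            if (current.type != EOF) lookahead = readMain(current);
            return current;
        }
    }

    /** Replays "int n; // list length", a doc comment, then "int f;". */
    static List<String> run() {
        Stream lexer = new ListStream(Arrays.asList(
            new Tok(INT, "int"), new Tok(ID, "n"), new Tok(SEMI, ";"),
            new Tok(COMMENT, "// list length"), new Tok(COMMENT, "/** doc */"),
            new Tok(INT, "int"), new Tok(ID, "f"), new Tok(SEMI, ";")));
        WeavingFilter filter = new WeavingFilter(lexer);
        filter.hide(COMMENT);
        filter.nextToken();                 // first INT
        filter.nextToken();                 // first ID
        Tok end = filter.nextToken();       // first SEMI
        Tok begin = filter.nextToken();     // second INT
        return Arrays.asList(
            end.hiddenAfter.text,
            end.hiddenAfter.hiddenAfter.text,
            begin.hiddenBefore.text);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

<p>
        The read-ahead in nextToken() is what makes the hidden tokens after a main token available as soon as that token is returned, and it is also the reason the filter has to consume one token more than the parser has asked for.
</p>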
187<h4><a name="How To Use This Filter">How To Use This Filter</a></h4> 
188<p>
189        To use <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small>, all you have to do is:
190<ul>
191        <li type="disc" value="2">
192                Create the lexer and tell it to create token objects augmented with links to hidden tokens.
193        </li>
194</ul>
195<pre><small>MyLexer lexer = new MyLexer(<em>some-input-stream</em>);
196lexer.setTokenObjectClass(</small>
197<small>  &quot;antlr.CommonHiddenStreamToken&quot;</small>
198<small>);</small></pre> 
199<ul>
200        <li>
201                Create a <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> object that pulls tokens from the lexer.
202        </li>
203</ul>
204<pre><small><font face="Courier New">TokenStreamHiddenTokenFilter</font> filter =</small>
205<small>  new <font face="Courier New">TokenStreamHiddenTokenFilter</font>(lexer);</small></pre> 
206<ul>
207        <li>
208                Tell the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> which tokens to hide, and which to discard.&nbsp; For example,
209        </li>
210</ul>
211<pre><small>filter.discard(MyParser.WS);
212filter.hide(MyParser.SL_COMMENT);</small></pre> 
213<ul>
214        <li>
215                Create a parser that pulls tokens from the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> rather than the lexer.
216        </li>
217</ul>
218<pre><small>MyParser parser = new MyParser(filter);
219try {
220  parser.<em>startRule</em>(); // parse as usual
221}
222catch (Exception e) {
223  System.err.println(e.getMessage());
224}</small></pre> 
225<p>
226        See the ANTLR fieldguide entry on <a href="http://www.antlr.org/fieldguide/whitespace">preserving whitespace</a> for a complete example.
227</p>
228<h4><a name="Tree Construction">Tree Construction</a></h4> 
229<p>
230        Ultimately, hidden stream tokens are needed during the translation phase, which normally means while tree walking.&nbsp; How do we pass the hidden stream info to the translator without mucking up the tree grammar?&nbsp; Easy: use AST nodes that save the hidden stream tokens.&nbsp; ANTLR defines <small><font face="Courier New">CommonASTWithHiddenTokens</font></small> for you that hooks the hidden stream tokens onto the tree nodes automatically; methods are available to access the hidden tokens associated with a tree node.&nbsp; All you have to do is tell the parser to create nodes of this node type rather than the default <small><font face="Courier New">CommonAST</font></small>.
231</p>
232<pre><small>parser.setASTNodeClass(&quot;antlr.CommonASTWithHiddenTokens&quot;);</small></pre> 
233<p>
234        Tree nodes are created as functions of Token objects.&nbsp; The <small><font face="Courier New">initialize()</font></small> method of the tree node is called with a Token object when the ASTFactory creates the tree node.&nbsp; Tree nodes created from tokens with hidden tokens before or after will have the same hidden tokens.&nbsp; You do not have to use this node definition, but it works for many translation tasks:
235</p>
236<pre><small>package antlr;
237
238/** A CommonAST whose initialization copies
239 *  hidden token information from the Token
240 *  used to create a node.
241 */
242public class CommonASTWithHiddenTokens
243  extends CommonAST {
244  // references to hidden tokens
  protected CommonHiddenStreamToken hiddenBefore, hiddenAfter;
246
247  public CommonHiddenStreamToken <strong>getHiddenAfter</strong>() {</small>
248<small>    return hiddenAfter;</small>
249<small>  }
250  public CommonHiddenStreamToken <strong>getHiddenBefore</strong>() {</small>
251<small>    return hiddenBefore;</small>
252<small>  }
253  public void <strong>initialize</strong>(Token tok) {
254    CommonHiddenStreamToken t =</small>
255<small>      (CommonHiddenStreamToken)tok;
256    super.initialize(t);
257    hiddenBefore = t.getHiddenBefore();
258    hiddenAfter  = t.getHiddenAfter();
259  }
260}</small></pre> 
261<p>
        Notice that this node definition assumes that you are using <small><font face="Courier New">CommonHiddenStreamToken</font></small> objects.&nbsp; A runtime <small><font face="Courier New">ClassCastException</font></small> occurs if you do not have the lexer create <small><font face="Courier New">CommonHiddenStreamToken</font></small> objects.
263</p>
264<h4><a name="Garbage Collection Issues">Garbage Collection Issues</a></h4> 
265<p>
266        By partitioning up the input stream and preventing hidden stream tokens from referring to main stream tokens, GC is allowed to work on the Token stream. In the integer declaration example above, when there are no more references to the first SEMI token and the second INT token, the comment tokens are candidates for garbage collection.&nbsp; If all tokens were linked together, a single reference to any token would prevent GC of any tokens.&nbsp; This is not the case in ANTLR's implementation.
267</p>
268<h4><a name="Notes">Notes</a></h4> 
269<p>
270        This filter works great for preserving whitespace and comments during translation, but is not always the best solution for handling comments in situations where the output is very dissimilar to the input.&nbsp; For example, there may be 3 comments interspersed within an input statement that you want to combine at the head of the output statement during translation.&nbsp; Rather than having to ask each parsed token for the comments surrounding it, it would be better to have a real, physically-separate stream that buffered the comments and a means of associating groups of parsed tokens with groups of comment stream tokens.&nbsp; You probably want to support questions like &quot;<em>give me all of the tokens on the comment stream that originally appeared between this beginning parsed token and this ending parsed token</em>.&quot;
271</p>
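<p>
        One way such a physically separate comment stream might look is sketched below.&nbsp; Nothing like this is described as part of ANTLR here; CommentSplitter, its commentsBetween() query and the per-token index are hypothetical, and the sketch assumes that numbering tokens in input order is an acceptable way to relate the two streams.
</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch of a separate, buffered comment stream; all names are hypothetical. */
final class CommentBufferSketch {
    /** Stand-in token that records its position in the original input order. */
    static final class Tok {
        final String type;
        final String text;
        int index = -1;
        Tok(String type, String text) { this.type = type; this.text = text; }
    }

    interface Stream {
        Tok nextToken();
    }

    /** Plays the role of the lexer: replays a fixed list, then returns EOF forever. */
    static final class ListStream implements Stream {
        private final Iterator<Tok> it;
        ListStream(List<Tok> toks) { it = toks.iterator(); }
        public Tok nextToken() { return it.hasNext() ? it.next() : new Tok("EOF", "<EOF>"); }
    }

    /** Passes main tokens through and buffers comments on the side. */
    static final class CommentSplitter implements Stream {
        private final Stream input;
        private final List<Tok> comments = new ArrayList<>();
        private int nextIndex = 0;
        CommentSplitter(Stream input) { this.input = input; }

        public Tok nextToken() {
            Tok t = input.nextToken();
            while (t.type.equals("COMMENT")) {
                t.index = nextIndex++;
                comments.add(t);             // buffered, never shown to the parser
                t = input.nextToken();
            }
            t.index = nextIndex++;
            return t;
        }

        /** Comments that originally appeared strictly between two parsed tokens. */
        List<String> commentsBetween(Tok begin, Tok end) {
            List<String> out = new ArrayList<>();
            for (Tok c : comments) {
                if (c.index > begin.index && c.index < end.index) out.add(c.text);
            }
            return out;
        }
    }

    /** Tokens for: x = (comment a) y + (comment b) z ; followed by a trailing comment. */
    static List<String> run() {
        CommentSplitter splitter = new CommentSplitter(new ListStream(Arrays.asList(
            new Tok("ID", "x"), new Tok("ASSIGN", "="), new Tok("COMMENT", "/* a */"),
            new Tok("ID", "y"), new Tok("PLUS", "+"), new Tok("COMMENT", "/* b */"),
            new Tok("ID", "z"), new Tok("SEMI", ";"), new Tok("COMMENT", "// c"))));
        List<Tok> parsed = new ArrayList<>();
        for (Tok t = splitter.nextToken(); !t.type.equals("EOF"); t = splitter.nextToken()) {
            parsed.add(t);
        }
        Tok begin = parsed.get(0);                  // x
        Tok end = parsed.get(parsed.size() - 1);    // ;
        return splitter.commentsBetween(begin, end);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

<p>
        With this arrangement a translator can gather the comments of a whole statement in one query and emit them together at the head of the output statement, instead of asking each parsed token for its neighbors.
</p>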
272<p>
273        This filter implements the exact same functionality as JavaCC's <em>special</em> tokens.&nbsp; Sriram Sankar (father of JavaCC) had a great idea with the special tokens and, at the 1997 <a href="http://www.antlr.org/workshop97/summary.html">Dr. T's Traveling Parsing Revival and Beer Tasting Festival</a>, the revival attendees extended the idea to the more general token stream concept.&nbsp; Now, the JavaCC special token functionality is just another ANTLR stream filter with the bonus that you do not have to modify the lexer to specify which tokens are special.
274</p>
275<h3><a name="lexerstates">Token Stream Multiplexing (aka &quot;Lexer states&quot;)</a></h3> 
276<p>
277        Now, consider the opposite problem where you want to combine multiple streams rather than splitting a single stream.&nbsp; When your input contains sections or slices that are radically diverse such as Java and JavaDoc comments, you will find that it is hard to make a single lexer recognize all slices of the input.&nbsp; This is primarily because merging the token definitions of the various slices results in an ambiguous lexical language or allows invalid tokens.&nbsp; For example, &quot;final&quot; may be a keyword in one section, but an identifier in another.&nbsp; Also, &quot;@author&quot; is a valid javadoc tag within a comment, but is invalid in the surrounding Java code.
278</p>
279<p>
280        Most people solve this problem by having the lexer sit in one of multiple states (for example, &quot;reading Java stuff&quot; vs &quot;reading JavaDoc stuff&quot;).&nbsp; The lexer starts out in Java mode and then, upon &quot;/**&quot;, switches to JavaDoc mode; &quot;*/&quot; forces the lexer to switch back to Java mode.
281</p>
282<h4><a name="Multiple Lexers">Multiple Lexers</a></h4> 
283<p>
284        Having a single lexer with multiple states works, but having multiple lexers that are multiplexed onto the same token stream solves the same problem better because the separate lexers are easier to reuse (no cutting and pasting into a new lexer--just tell the stream multiplexor to switch to it).&nbsp; For example, the JavaDoc lexer could be reused for any language problem that had JavaDoc comments.
285</p>
286<p>
287        ANTLR provides a predefined token stream called <small><font face="Courier New">TokenStreamSelector</font></small> that lets you switch between multiple lexers.&nbsp; Actions in the various lexers control how the selector switches input streams.&nbsp; Consider the following Java fragment.
288</p>
289<pre>/** Test.
290 *  @author Terence
291 */
292int n;</pre> 
293<p>
294        Given two lexers, JavaLexer and JavaDocLexer, the sequence of actions by the two lexers might look like this:
295</p>
296<p>
297        <small><font face="Arial">JavaLexer: match JAVADOC_OPEN, switch to JavaDocLexer
298                        <br>
299                        JavaDocLexer: match AUTHOR
300                        <br>
301                        JavaDocLexer: match ID
302                        <br>
303                        JavaDocLexer: match JAVADOC_CLOSE, switch back to JavaLexer
304                        <br>
305                        JavaLexer: match INT
306                        <br>
307                        JavaLexer: match ID
308                        <br>
309                        JavaLexer: match SEMI</font></small>
310</p>
311<p>
312        In the Java lexer grammar, you will need a rule to perform the switch to the JavaDoc lexer (recording on the stack of streams the &quot;return lexer&quot;):
313</p>
314<pre><small>JAVADOC_OPEN
315    :    &quot;/**&quot; {selector.push(&quot;doclexer&quot;);}
316    ;</small></pre> 
317<p>
318        Similarly, you will need a rule in the JavaDoc lexer to switch back:
319</p>
320<pre><small>JAVADOC_CLOSE
321    :    &quot;*/&quot; {selector.pop();}
322    ;</small></pre> 
323<p>
324        The selector has a stack of streams so the JavaDoc lexer does not need to know who invoked it.
325</p>
326<p>
327        Graphically, the selector combines the two lexer streams into a single stream presented to the parser.
328</p>
329<p>
330        <img src="stream.selector.gif" width="538" height="238" alt="stream.selector.gif (5976 bytes)">
331</p>
332<p>
        The selector can maintain a list of streams for you so that you can switch to another input stream by name, or you can tell it to switch to an actual stream object.
334</p>
335<pre><small>public class TokenStreamSelector implements TokenStream {
336  public <strong>TokenStreamSelector</strong>() {...}
337  public void <strong>addInputStream</strong>(TokenStream stream,</small>
338<small>    String key) {...}
339  public void <strong>pop</strong>() {...}
340  public void <strong>push</strong>(TokenStream stream) {...}
341  public void <strong>push</strong>(String sname) {...}
342  /** Set the stream without pushing old stream */
343  public void <strong>select</strong>(TokenStream stream) {...}
344  public void <strong>select</strong>(String sname)</small>
345<small>    throws IllegalArgumentException {...}
346}</small></pre> 
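<p>
        The stack behavior behind push() and pop() can be sketched as follows.&nbsp; This is a simplified model, not the source of TokenStreamSelector: Selector and ScriptedLexer are hypothetical stand-ins, tokens are plain strings, and each scripted lexer replays a canned token list instead of reading a shared character stream.&nbsp; It reproduces the JavaLexer and JavaDocLexer sequence shown earlier.
</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Sketch of a stream selector with a return stack; all types are stand-ins. */
final class SelectorSketch {
    /** Stand-in token stream whose tokens are plain strings. */
    interface Stream {
        String nextToken();
    }

    /** Forwards nextToken() to whichever stream is currently selected. */
    static final class Selector implements Stream {
        private final Map<String, Stream> byName = new HashMap<>();
        private final Deque<Stream> returnStack = new ArrayDeque<>();
        private Stream current;

        void addInputStream(Stream stream, String key) { byName.put(key, stream); }
        /** Switch without remembering the old stream. */
        void select(String name) { current = lookup(name); }
        /** Switch and remember the old stream so pop() can return to it. */
        void push(String name) { returnStack.push(current); current = lookup(name); }
        void pop() { current = returnStack.pop(); }

        private Stream lookup(String name) {
            Stream s = byName.get(name);
            if (s == null) throw new IllegalArgumentException("unknown stream: " + name);
            return s;
        }

        public String nextToken() { return current.nextToken(); }
    }

    /** Stand-in lexer: replays a token script and runs an action when a token matches. */
    static final class ScriptedLexer implements Stream {
        private final Iterator<String> script;
        private final Map<String, Runnable> actions = new HashMap<>();
        ScriptedLexer(String... tokens) { script = Arrays.asList(tokens).iterator(); }
        void onMatch(String token, Runnable action) { actions.put(token, action); }
        public String nextToken() {
            if (!script.hasNext()) return "EOF";
            String t = script.next();
            Runnable action = actions.get(t);
            if (action != null) action.run();   // plays the role of a lexer rule action
            return t;
        }
    }

    /** Multiplexes a "Java" script and a "JavaDoc" script onto one stream. */
    static List<String> run() {
        Selector selector = new Selector();
        ScriptedLexer mainLexer = new ScriptedLexer("JAVADOC_OPEN", "INT", "ID", "SEMI");
        ScriptedLexer docLexer = new ScriptedLexer("AUTHOR", "ID", "JAVADOC_CLOSE");
        mainLexer.onMatch("JAVADOC_OPEN", () -> selector.push("doclexer"));
        docLexer.onMatch("JAVADOC_CLOSE", selector::pop);
        selector.addInputStream(mainLexer, "main");
        selector.addInputStream(docLexer, "doclexer");
        selector.select("main");
        List<String> seen = new ArrayList<>();
        for (String t = selector.nextToken(); !t.equals("EOF"); t = selector.nextToken()) {
            seen.add(t);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

<p>
        Because the return stream is kept on the stack, the JavaDoc stand-in only ever calls pop() and never names the lexer that invoked it.
</p>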
347<p>
348        Using the selector is easy:
349<ul>
350        <li>
351                Create a selector.
352        </li>
353</ul>
354<pre><small>TokenStreamSelector selector =
355  new TokenStreamSelector();</small></pre> 
356<ul>
357        <li>
                Name the streams.&nbsp; Naming is optional: you can use stream object references instead, which avoids the hashtable lookup on each switch.
359        </li>
360</ul>
361<pre><small>selector.addInputStream(mainLexer, &quot;main&quot;);
362selector.addInputStream(doclexer, &quot;doclexer&quot;);</small></pre> 
363<ul>
364        <li>
365                Select which lexer reads from the char stream first.
366        </li>
367</ul>
368<pre><small>// start with main java lexer
369selector.select(&quot;main&quot;);</small></pre> 
370<ul>
371        <li>
372                Attach your parser to the selector instead of one of the lexers.
373        </li>
374</ul>
375<pre><small>JavaParser parser = new JavaParser(selector);</small></pre> <h4><a name="Lexers Sharing Same Character Stream">Lexers Sharing Same Character Stream</a></h4> 
376<p>
377        Before moving on to how the parser uses the selector, note that the two lexers have to read characters from the same input stream.&nbsp; Prior to ANTLR 2.6.0, each lexer had its own line number variable, input char stream variable and so on.&nbsp; In order to share the same input state, ANTLR 2.6.0 factors the portion of a lexer dealing with the character input into an object, <small><font face="Courier New">LexerSharedInputState</font></small>, that can be shared among n lexers (single-threaded).&nbsp; To get multiple lexers to share state, you create the first lexer, ask for its input state object, and then use that when constructing any further lexers that need to share that input state:
378</p>
379<pre><small>// create Java lexer</small>
380<small>JavaLexer mainLexer = new JavaLexer(input);
381// create javadoc lexer; attach to shared</small>
382<small>// input state of java lexer
383JavaDocLexer doclexer =</small>
384<small>  new JavaDocLexer(mainLexer.getInputState());</small></pre> <h4><a name="Parsing Multiplexed Token Streams">Parsing Multiplexed Token Streams</a></h4> 
385<p>
386        Just as a single lexer may have trouble producing a single stream of tokens from diverse input slices or sections, a single parser may have trouble handling the multiplexed token stream.&nbsp; Again, a token that is a keyword in one lexer's vocabulary may be an identifier in another lexer's vocabulary.&nbsp; Factoring the parser into separate subparsers for each input section makes sense to handle the separate vocabularies as well as for promoting grammar reuse.
387</p>
388<p>
389        The following parser grammar uses the main lexer token vocabulary (specified with the importVocab option) and upon <small><font face="Courier New">JAVADOC_OPEN</font></small> it creates and invokes a JavaDoc parser to handle the subsequent stream of tokens from within the comment.
390</p>
391<pre><small>class JavaParser extends Parser;
392options {
393    importVocab=Java;
394}
395
396input
397    :   ( (javadoc)? INT ID SEMI )+
398    ;
399
400javadoc
401    :   JAVADOC_OPEN
402        {</small>
403<small>        // create a parser to handle the javadoc comment
404        JavaDocParser jdocparser =</small>
405<small>          new JavaDocParser(getInputState());
406        jdocparser.content(); // go parse the comment
407        }</small>
408<small>        JAVADOC_CLOSE
409    ;</small></pre> 
410<p>
411        You will note that ANTLR parsers from 2.6.0 also share token input stream state. &nbsp; When creating the &quot;subparser&quot;, <small><font face="Courier New">JavaParser</font></small> tells it to pull tokens from the same input state object.
412</p>
413<p>
414        The JavaDoc parser matches a bunch of tags:
415</p>
416<pre><small>class JavaDocParser extends Parser;
417options {
418    importVocab=JavaDoc;
419}
420
421content
422    :   (   PARAM // includes ID as part of PARAM
423        |   EXCEPTION</small>
424<small>        |   AUTHOR
425        )*
426    ;</small></pre> 
427<p>
428        When the subparser rule <small><font face="Courier New">content</font></small> finishes, control is naturally returned to the invoking method, <small><font face="Courier New">javadoc</font></small>, in the Java parser.
429</p>
430<h4><a name="The Effect of Lookahead Upon Multiplexed Token Streams">The Effect of Lookahead Upon Multiplexed Token Streams</a></h4> 
431<p>
432        What would happen if the parser needed to look two tokens ahead at the start of the JavaDoc comment?&nbsp; In other words, from the perspective of the main parser, what is the token following <small><font face="Courier New">JAVADOC_OPEN</font></small>? &nbsp; Token <small><font face="Courier New">JAVADOC_CLOSE</font></small>, naturally!&nbsp; The main parser treats any JavaDoc comment, no matter how complicated, as a single entity; it does not see into the token stream of the comment nor should it--the subparser handles that stream.
433</p>
434<p>
435        What is the token following the <small><font face="Courier New">content</font></small> rule in the subparser?&nbsp; &quot;End of file&quot;.&nbsp; The analysis of the subparser cannot determine what random method will call it from your code.&nbsp; This is not an issue because there is normally a single token that signifies the termination of the subparser.&nbsp; Even if EOF gets pulled into the analysis somehow, EOF will not be present on the token stream.
436</p>
437<h4><a name="Multiple Lexers Versus Calling Another Lexer Rule">Multiple Lexers Versus Calling Another Lexer Rule</a></h4> 
438<p>
439        Multiple lexer states are also often used to handle very complicated single &nbsp; tokens such as strings with embedded escape characters where input &quot;\t&quot; should not be allowed outside of a string.&nbsp; Typically, upon the initial quote, the lexer switches to a &quot;string state&quot; and then switches back to the &quot;normal state&quot; after having matched the guts of the string.
440</p>
441<p>
        So-called &quot;modal&quot; programming, where your code does something different depending on a mode, is often a bad practice.&nbsp; In the situation of complex tokens, it is better to explicitly specify the complicated token with more rules.&nbsp; Here is the golden rule of when to and when not to use multiplexed token streams:
443</p>
444<blockquote>
445        <p>
446                <em>Complicated single tokens should be matched by calling another (protected) lexer rule whereas streams of tokens from diverse slices or sections should be handled by different lexers multiplexed onto the same stream that feeds the parser.</em>
447        </p>
448</blockquote>
449<p>
450        For example, the definition of a string in a lexer should simply call another rule to handle the nastiness of escape characters:
451</p>
452<pre><small>STRING_LITERAL
453    :    '&quot;' (ESC|~('&quot;'|'\\'))* '&quot;'
454    ;
455
456protected // not a token; only invoked by another rule.
457ESC
458    :    '\\'
459        (    'n'
460        |    'r'
461        |    't'
462        |    'b'
463        |    'f'
464        |    '&quot;'
465        |    '\''
466        |    '\\'
467        |    ('u')+</small>
468<small>             HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
469        ...</small>
470       )
471<small>    ;</small></pre>
472
<h3><a name="rewriteengine">TokenStreamRewriteEngine: Easy Syntax-Directed Translation</a></h3>
474
There are many common situations where you want to tweak or augment
a program or data file.  ANTLR 2.7.3 introduced (in the Java and C# versions only) a very simple but powerful <tt>TokenStream</tt> targeted at the class of problems where:
477
478<ol>
479<li>the output language and the input language are similar
480<li>the relative order of language elements does not change
481</ol>
482
483See the <a href="http://www.antlr.org/article/rewrite.engine/index.tml"><b>Syntax Directed TokenStream Rewriting</b></a> article on the antlr website.
484
485<h3><a name="The Future">The Future</a></h3> 
486<p>
487        The ANTLR 2.6 release provides the basic structure for using token streams--future versions will be more sophisticated once we have experience using them.
488</p>
489<p>
490        The current &quot;hidden token&quot; stream filter clearly solves the &quot;ignore but preserve whitespace&quot; problem really well, but it does not handle comments too well in most situations.&nbsp; For example, in real translation problems you want to collect comments at various single tree nodes (like DECL or METHOD) for interpretation rather than leaving them strewn throughout the tree.&nbsp; You really need a stream splitter that buffers up the comments on a separate stream so you can say &quot;<em>give me all comments &nbsp; consumed during the recognition of this rule</em>&quot; or &quot;<em>give me all comments found between these two real tokens</em>.&quot; That is almost certainly something you need for translation of comments.
491</p>
492<p>
        Token streams will lead to fascinating possibilities.&nbsp; Most folks are not used to thinking about token streams so it is hard to imagine what else they could be good for.&nbsp; Let your mind go wild.&nbsp; What about embedded languages where you see slices (aspects) of the input such as Java and SQL?&nbsp; Each portion of the input could be sliced off and put through on a different stream.&nbsp; What about parsing Java .class files with and without debugging information?&nbsp; If you have a parser for .class files without debug info and you want to handle .class files with debug info, leave the parser alone and augment the lexer to see the new debug structures.&nbsp; Have a filter split the debug tokens off onto a different stream and the same parser will work for both types of .class files.
494</p>
495<p>
496        Later, I would like to add &quot;perspectives&quot;, which are really just another way to look at filters.&nbsp; Imagine a raw stream of tokens emanating from a lexer--the root perspective.&nbsp; I can build up a tree of perspectives very easily from there.&nbsp; For example, given a Java program with embedded SQL, you might want multiple perspectives on the input stream for parsing or translation reasons:
497</p>
498<p align="center">
499        <img src="stream.perspectives.gif" width="306" height="202" alt="stream.perspectives.gif (2679 bytes)">
500</p>
501<p align="left">
502        You could attach a parser to the SQL stream or the Java stream minus comments, with actions querying the comment stream.
503</p>
504<p align="left">
505        In the future, I would also like to add the ability of a parser to generate a stream of tokens (or text) as output just like it can build trees now.&nbsp; In this manner, multipass parsing becomes a very natural and simple problem because parsers become stream producers also.&nbsp; The output of one parser can be the input to another.
506</p>
507<p align="left">
508        <font face="Arial" size="2">Version: $Id: //depot/code/org.antlr/release/antlr-2.7.7/doc/streams.html#2 $</font> 
509</body>
510</html>