streams.html

Last change on this file was 1, checked in by lnalod, 15 years ago
Initial import of YAO sources
File size: 30.4 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
2	<html>
3	<head>
4	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
5	<title>ANTLR Specification: Token Streams</title>
6	</head>
7	<body bgcolor="#FFFFFF" text="#000000">
8	<h2>Token Streams</h2>
9	<p>
10	Traditionally, a lexer and parser are tightly coupled objects; that is, one does not imagine anything sitting between the parser and the lexer, modifying the stream of tokens.   However, language recognition and translation can benefit greatly from treating the connection between lexer and parser as a <em>token stream</em>.  This idea is analogous to Java I/O streams, where you can pipeline lots of stream objects to produce highly-processed data streams.
11	</p>
12	<h3><a name="Introduction">Introduction</a></h3>
13	<p>
14	ANTLR identifies a stream of Token objects as any object that satisfies the <font face="Courier New">TokenStream</font> interface (prior to 2.6, this interface was called <font face="Courier New">Tokenizer</font>); i.e., any object that implements the following method.
15	</p>
16	<pre>Token nextToken();</pre>
17	<p>
18	Graphically, a normal stream of tokens from a lexer (producer) to a parser (consumer) might look like the following at some point during the parse.
19	</p>
20	<p>
21	<img src="lexer.to.parser.tokens.gif" width="564" height="81" alt="lexer.to.parser.tokens.gif (3585 bytes)">
22	</p>
23	<p>
24	The most common token stream is a lexer, but once you imagine a physical stream between the lexer and parser, you start imagining interesting things that you can do.  For example, you can:
25	<ul>
26	<li>
27	filter a stream of tokens to strip out unwanted tokens
28	</li>
29	<li>
30	insert imaginary tokens to help the parser recognize certain nasty structures
31	</li>
32	<li>
33	split a single stream into multiple streams, sending certain tokens of interest down the various streams
34	</li>
35	<li>
36	multiplex multiple token streams onto one stream, thus, "simulating" the lexer states of tools like PCCTS, lex, and so on.
37	</li>
38	</ul>
39	<p>
40	The beauty of the token stream concept is that parsers and lexers are not affected--they are merely consumers and producers of streams.  Stream objects are filters that produce, process, combine, or separate token streams for use by consumers.   Existing lexers and parsers may be combined in new and interesting ways without modification.
41	</p>
42	<p>
43	This document formalizes the notion of a token stream and describes in detail some very useful stream filters.
44	</p>
45	<h3><a name="Pass-Through Token Stream">Pass-Through Token Stream</a></h3>
46	<p>
47	A token stream is any object satisfying the following interface.
48	</p>
49	<pre><small>public interface TokenStream {
50	public Token nextToken()</small>
51	<small> throws java.io.IOException;
52	}</small></pre>
53	<p>
54	For example, a "no-op" or pass-through filter stream looks like:
55	</p>
56	<pre><small>import antlr.*;
57	import java.io.IOException;
58
59	class TokenStreamPassThrough</small>
60	<small> implements TokenStream {
61	protected TokenStream input;
62
63	/** Stream to read tokens from */</small>
64	<small> public TokenStreamPassThrough(TokenStream in) {
65	input = in;
66	}
67
68	/** This makes us a stream */</small>
69	<small> public Token nextToken() throws IOException {
70	return input.nextToken(); // "short circuit"
71	}
72	}</small></pre>
73	<p>
74	You would use this simple stream by having it pull tokens from the lexer and then have the parser pull tokens from it as in the following main() program.
75	</p>
76	<pre><small>public static void main(String[] args) {
77	MyLexer lexer =
78	new MyLexer(new DataInputStream(System.in));
79	TokenStreamPassThrough filter =</small>
80	<small> new TokenStreamPassThrough(lexer);
81	MyParser parser = new MyParser(filter);</small>
82	<small> parser.<em>startRule</em>();
83	}</small></pre> <h3><a name="Token Stream Filtering">Token Stream Filtering</a></h3>
84	<p>
85	Most of the time, you want the lexer to discard whitespace and comments.  However, what if you also want to reuse the lexer in situations where the parser must see the comments?  You can design a single lexer to cover many situations by having the lexer emit comments and whitespace along with the normal tokens.  Then, when you want to discard whitespace, put a filter between the lexer and the parser to kill whitespace tokens.
86	</p>
87	<p>
88	ANTLR provides <small><font face="Courier New">TokenStreamBasicFilter</font></small> for such situations.  You can instruct it to discard any token type or types without having to modify the lexer.  Here is an example usage of <small><font face="Courier New">TokenStreamBasicFilter</font></small> that filters out comments and whitespace.
89	</p>
90	<pre><small>public static void main(String[] args) {
91	MyLexer lexer =
92	new MyLexer(new DataInputStream(System.in));
93	TokenStreamBasicFilter filter =</small>
94	<small> new TokenStreamBasicFilter(lexer);
95	filter.discard(MyParser.WS);
96	filter.discard(MyParser.COMMENT);</small>
97	<small> MyParser parser = new MyParser(filter);</small>
98	<small> parser.<em>startRule</em>();
99	}</small></pre>
100	<p>
101	Note that it is more efficient to have the lexer immediately discard lexical structures you do not want because you do not have to construct a Token object.  On the other hand, filtering the stream leads to more flexible lexers.
102	</p>
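<p>
The discard behaviour described above can be sketched in a few lines.  This is only an illustration under stated assumptions: <font face="Courier New">SketchToken</font>, <font face="Courier New">SketchTokenStream</font>, <font face="Courier New">DiscardingFilter</font> and the token type numbers are stand-ins invented for this example so that it runs without the ANTLR jar; it is not the source of <font face="Courier New">TokenStreamBasicFilter</font>.
</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Stand-in types for illustration only; these are NOT antlr.Token / antlr.TokenStream.
class SketchToken {
    static final int EOF = 1, ID = 4, WS = 5, COMMENT = 6; // made-up type numbers
    final int type;
    final String text;
    SketchToken(int type, String text) { this.type = type; this.text = text; }
}

interface SketchTokenStream {
    SketchToken nextToken();
}

/** Hypothetical discard filter: skips every token whose type was registered with discard(). */
class DiscardingFilter implements SketchTokenStream {
    private final SketchTokenStream input;
    private final Set<Integer> discarded = new HashSet<Integer>();

    DiscardingFilter(SketchTokenStream input) { this.input = input; }

    void discard(int tokenType) { discarded.add(tokenType); }

    public SketchToken nextToken() {
        SketchToken t = input.nextToken();
        // Keep pulling until a token of interest shows up; EOF is never discarded.
        while (t.type != SketchToken.EOF && discarded.contains(t.type)) {
            t = input.nextToken();
        }
        return t;
    }
}

class DiscardingFilterDemo {
    /** Wraps a fixed list as a stream that returns EOF forever once exhausted. */
    static SketchTokenStream streamOf(List<SketchToken> tokens) {
        final Iterator<SketchToken> it = tokens.iterator();
        return new SketchTokenStream() {
            public SketchToken nextToken() {
                return it.hasNext() ? it.next() : new SketchToken(SketchToken.EOF, "<EOF>");
            }
        };
    }

    /** Drains a stream and joins the token texts with single spaces. */
    static String drain(SketchTokenStream s) {
        StringBuilder sb = new StringBuilder();
        for (SketchToken t = s.nextToken(); t.type != SketchToken.EOF; t = s.nextToken()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        DiscardingFilter f = new DiscardingFilter(streamOf(Arrays.asList(
            new SketchToken(SketchToken.ID, "n"),
            new SketchToken(SketchToken.WS, " "),
            new SketchToken(SketchToken.COMMENT, "// c"),
            new SketchToken(SketchToken.ID, "f"))));
        f.discard(SketchToken.WS);
        f.discard(SketchToken.COMMENT);
        System.out.println(drain(f)); // prints: n f
    }
}
```

<p>
Because the filter is itself a stream, the consumer cannot tell whether it is reading from the lexer or from the filter, which is what lets the same lexer serve both kinds of parser.
</p>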
103	<h3><a name="Token Stream Splitting">Token Stream Splitting</a></h3>
104	<p>
105	Sometimes you want a translator to ignore but not discard portions of the input during the recognition phase.   For example, you want to ignore comments vis-a-vis parsing, but you need the comments for translation.   The solution is to send the comments to the parser on a <em>hidden</em> token stream--one that the parser is not "listening" to.  During recognition, actions can then examine the hidden stream or streams, collecting the comments and so on.  Stream-splitting filters are like prisms that split white light into rainbows.
106	</p>
107	<p>
108	The following diagram illustrates a situation in which a single stream of tokens is split into three.
109	</p>
110	<p>
111	<img src="stream.splitter.gif" width="546" height="232" alt="stream.splitter.gif (5527 bytes)">
112	</p>
113	<p>
114	You would have the parser pull tokens from the topmost stream.
115	</p>
116	<p>
117	There are many possible capabilities and implementations of a stream splitter.   For example, you could have a "Y-splitter" that actually duplicated a stream of tokens like a cable-TV Y-connector.  If the filter were thread-safe and buffered, you could have multiple parsers pulling tokens from the filter at the same time.
118	</p>
119	<p>
120	This section describes a stream filter supplied with ANTLR called <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> that behaves like a coin sorter, sending pennies to one bin, dimes to another, and so on.  This filter splits the input stream into two streams, a main stream with the majority of the tokens and a <em>hidden</em> stream that is buffered so that you can ask it questions later about its contents.   Because of the implementation, however, you cannot attach a parser to the hidden stream.  The filter actually weaves the hidden tokens among the main tokens as you will see below.
121	</p>
122	<h4><a name="Example">Example</a></h4>
123	<p>
124	Consider the following simple grammar that reads in integer variable declarations.
125	</p>
126	<pre>decls: (decl)+
127	;
128	decl : begin:INT ID end:SEMI
129	; </pre>
130	<p>
131	Now assume input:
132	</p>
133	<pre>int n; // list length
134	/** doc */
135	int f;</pre>
136	<p>
137	Imagine that whitespace is ignored by the lexer and that you have instructed the filter to split comments onto the hidden stream.  Now if the parser is pulling tokens from the main stream, it will see only "INT ID SEMI INT ID SEMI" even though the comments are hanging around on the hidden stream.  So the parser effectively ignores the comments, but your actions can query the filter for tokens on the hidden stream.
138	</p>
139	<p>
140	The first time through rule <font face="Courier New">decl</font>, the <font face="Courier New">begin</font> token reference has no hidden tokens before or after, but
141	</p>
142	<pre><font face="Courier New">filter.getHiddenAfter(end)</font></pre>
143	<p>
144	returns a reference to token
145	</p>
146	<pre><font face="Courier New">// list length</font></pre>
147	<p>
148	which in turn provides access to
149	</p>
150	<p>
151	<font face="Courier New">/** doc */</font>
152	</p>
153	<p>
154	The second time through <font face="Courier New">decl</font>
155	</p>
156	<pre><font face="Courier New">filter.getHiddenBefore(begin)</font></pre>
157	<p>
158	refers to the
159	</p>
160	<pre><font face="Courier New">/** doc */</font></pre>
161	<p>
162	comment.
163	</p>
164	<h4><a name="Filter Implementation">Filter Implementation</a></h4>
165	<p>
166	The following diagram illustrates how the Token objects are physically woven together to simulate two different streams.
167	</p>
168	<p align="center">
169	<img src="hidden.stream.gif" width="377" height="148" alt="hidden.stream.gif (3667 bytes)">
170	</p>
171	<p align="center">
172
173	</p>
174	<p>
175	As the tokens are consumed, the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> object hooks the hidden tokens to the main tokens via a linked list.  There is only one physical TokenStream of tokens emanating from this filter and the interwoven pointers maintain sequence information.
176	</p>
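<p>
To make the weaving concrete, here is a rough sketch of how such links could be built while reading one token ahead.  Everything in it (<font face="Courier New">WovenToken</font>, <font face="Courier New">WeavingFilter</font>, the token type numbers) is hypothetical and written for this example only; it is not the code of <font face="Courier New">TokenStreamHiddenTokenFilter</font>, just one plausible way to get the behaviour described for the integer declaration example.
</p>

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Stand-in token with the two extra links; NOT antlr.CommonHiddenStreamToken.
class WovenToken {
    static final int EOF = 1, INT = 4, ID = 5, SEMI = 6, COMMENT = 7; // made-up type numbers
    final int type;
    final String text;
    WovenToken hiddenBefore; // hidden token immediately preceding this token, if any
    WovenToken hiddenAfter;  // hidden token immediately following this token, if any
    WovenToken(int type, String text) { this.type = type; this.text = text; }
}

/** Hypothetical weaving filter: returns main tokens only, with hidden ones linked to them. */
class WeavingFilter {
    private final Iterator<WovenToken> input;
    private final Set<Integer> hiddenTypes = new HashSet<Integer>();
    private WovenToken lookahead;  // one token read ahead of the caller
    private WovenToken lastHidden; // most recent hidden token not yet attached as hiddenBefore

    WeavingFilter(List<WovenToken> tokens) { this.input = tokens.iterator(); }

    void hide(int type) { hiddenTypes.add(type); }

    private WovenToken pull() {
        return input.hasNext() ? input.next() : new WovenToken(WovenToken.EOF, "<EOF>");
    }

    private boolean isHidden(WovenToken t) {
        return t.type != WovenToken.EOF && hiddenTypes.contains(t.type);
    }

    WovenToken nextToken() {
        if (lookahead == null) lookahead = pull();
        // Hidden tokens that come before the very first main token.
        while (isHidden(lookahead)) {
            if (lastHidden != null) {
                lastHidden.hiddenAfter = lookahead;
                lookahead.hiddenBefore = lastHidden;
            }
            lastHidden = lookahead;
            lookahead = pull();
        }
        WovenToken main = lookahead;
        if (lastHidden != null) main.hiddenBefore = lastHidden;
        lastHidden = null;
        if (main.type == WovenToken.EOF) return main; // keep EOF as the lookahead
        // Read ahead eagerly: everything hidden after `main` is chained off it now.
        lookahead = pull();
        WovenToken p = main;
        while (isHidden(lookahead)) {
            p.hiddenAfter = lookahead;
            if (p != main) lookahead.hiddenBefore = p; // hidden tokens never point at main tokens
            p = lookahead;
            lastHidden = lookahead;
            lookahead = pull();
        }
        return main;
    }

    /** Runs the declaration example and returns the main-stream tokens in order. */
    static List<WovenToken> demo() {
        List<WovenToken> in = new ArrayList<WovenToken>();
        in.add(new WovenToken(WovenToken.INT, "int"));
        in.add(new WovenToken(WovenToken.ID, "n"));
        in.add(new WovenToken(WovenToken.SEMI, ";"));
        in.add(new WovenToken(WovenToken.COMMENT, "// list length"));
        in.add(new WovenToken(WovenToken.COMMENT, "/** doc */"));
        in.add(new WovenToken(WovenToken.INT, "int"));
        in.add(new WovenToken(WovenToken.ID, "f"));
        in.add(new WovenToken(WovenToken.SEMI, ";"));
        WeavingFilter filter = new WeavingFilter(in);
        filter.hide(WovenToken.COMMENT);
        List<WovenToken> main = new ArrayList<WovenToken>();
        for (WovenToken t = filter.nextToken(); t.type != WovenToken.EOF; t = filter.nextToken()) {
            main.add(t);
        }
        return main;
    }
}
```

<p>
In this sketch the first SEMI ends up with <font face="Courier New">hiddenAfter</font> pointing at the "// list length" comment, which in turn links to "/** doc */", and the second INT has that same doc comment as its <font face="Courier New">hiddenBefore</font>.  Note that the filter has to pull one token beyond each main token before it can return it.
</p>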
177	<p>
178	Because of the extra pointers required to link the tokens together, you must use a special token object called <small><font face="Courier New">CommonHiddenStreamToken</font></small> (the normal object is called <small><font face="Courier New">CommonToken</font></small>).   Recall that you can instruct a lexer to build tokens of a particular class with
179	</p>
180	<pre><small>lexer.setTokenObjectClass("<em>classname</em>");</small></pre>
181	<p>
182	Technically, this exact filter functionality could be implemented without requiring a special token object, but this filter implementation is extremely efficient and it is easy to tell the lexer what kind of tokens to create.  Further, this implementation makes it very easy to automatically have tree nodes built that preserve the hidden stream information.
183	</p>
184	<p>
185	This filter affects the lazy-consume of ANTLR.  After recognizing every main stream token, the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> must grab the next Token to see if it is a hidden token. Consequently, the use of this filter is not very workable for interactive (e.g., command-line) applications.
186	</p>
187	<h4><a name="How To Use This Filter">How To Use This Filter</a></h4>
188	<p>
189	To use <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small>, all you have to do is:
190	<ul>
191	<li type="disc" value="2">
192	Create the lexer and tell it to create token objects augmented with links to hidden tokens.
193	</li>
194	</ul>
195	<pre><small>MyLexer lexer = new MyLexer(<em>some-input-stream</em>);
196	lexer.setTokenObjectClass(</small>
197	<small> "antlr.CommonHiddenStreamToken"</small>
198	<small>);</small></pre>
199	<ul>
200	<li>
201	Create a <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> object that pulls tokens from the lexer.
202	</li>
203	</ul>
204	<pre><small><font face="Courier New">TokenStreamHiddenTokenFilter</font> filter =</small>
205	<small> new <font face="Courier New">TokenStreamHiddenTokenFilter</font>(lexer);</small></pre>
206	<ul>
207	<li>
208	Tell the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> which tokens to hide, and which to discard.  For example,
209	</li>
210	</ul>
211	<pre><small>filter.discard(MyParser.WS);
212	filter.hide(MyParser.SL_COMMENT);</small></pre>
213	<ul>
214	<li>
215	Create a parser that pulls tokens from the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> rather than the lexer.
216	</li>
217	</ul>
218	<pre><small>MyParser parser = new MyParser(filter);
219	try {
220	parser.<em>startRule</em>(); // parse as usual
221	}
222	catch (Exception e) {
223	System.err.println(e.getMessage());
224	}</small></pre>
225	<p>
226	See the ANTLR fieldguide entry on <a href="http://www.antlr.org/fieldguide/whitespace">preserving whitespace</a> for a complete example.
227	</p>
228	<h4><a name="Tree Construction">Tree Construction</a></h4>
229	<p>
230	Ultimately, hidden stream tokens are needed during the translation phase, which normally means while tree walking.  How do we pass the hidden stream info to the translator without mucking up the tree grammar?  Easy: use AST nodes that save the hidden stream tokens.  ANTLR defines <small><font face="Courier New">CommonASTWithHiddenTokens</font></small> for you that hooks the hidden stream tokens onto the tree nodes automatically; methods are available to access the hidden tokens associated with a tree node.  All you have to do is tell the parser to create nodes of this node type rather than the default <small><font face="Courier New">CommonAST</font></small>.
231	</p>
232	<pre><small>parser.setASTNodeClass("antlr.CommonASTWithHiddenTokens");</small></pre>
233	<p>
234	Tree nodes are created as functions of Token objects.  The <small><font face="Courier New">initialize()</font></small> method of the tree node is called with a Token object when the ASTFactory creates the tree node.  Tree nodes created from tokens with hidden tokens before or after will have the same hidden tokens.  You do not have to use this node definition, but it works for many translation tasks:
235	</p>
236	<pre><small>package antlr;
237
238	/** A CommonAST whose initialization copies
239	* hidden token information from the Token
240	* used to create a node.
241	*/
242	public class CommonASTWithHiddenTokens
243	extends CommonAST {
244	// references to hidden tokens
245	protected CommonHiddenStreamToken hiddenBefore, hiddenAfter;
246
247	public CommonHiddenStreamToken <strong>getHiddenAfter</strong>() {</small>
248	<small> return hiddenAfter;</small>
249	<small> }
250	public CommonHiddenStreamToken <strong>getHiddenBefore</strong>() {</small>
251	<small> return hiddenBefore;</small>
252	<small> }
253	public void <strong>initialize</strong>(Token tok) {
254	CommonHiddenStreamToken t =</small>
255	<small> (CommonHiddenStreamToken)tok;
256	super.initialize(t);
257	hiddenBefore = t.getHiddenBefore();
258	hiddenAfter = t.getHiddenAfter();
259	}
260	}</small></pre>
261	<p>
262	Notice that this node definition assumes that you are using <small><font face="Courier New">CommonHiddenStreamToken</font></small> objects.  A runtime class cast exception occurs if you do not have the lexer create <small><font face="Courier New">CommonHiddenStreamToken</font></small> objects.
263	</p>
264	<h4><a name="Garbage Collection Issues">Garbage Collection Issues</a></h4>
265	<p>
266	By partitioning up the input stream and preventing hidden stream tokens from referring to main stream tokens, GC is allowed to work on the Token stream. In the integer declaration example above, when there are no more references to the first SEMI token and the second INT token, the comment tokens are candidates for garbage collection.  If all tokens were linked together, a single reference to any token would prevent GC of any tokens.  This is not the case in ANTLR's implementation.
267	</p>
268	<h4><a name="Notes">Notes</a></h4>
269	<p>
270	This filter works great for preserving whitespace and comments during translation, but is not always the best solution for handling comments in situations where the output is very dissimilar to the input.  For example, there may be 3 comments interspersed within an input statement that you want to combine at the head of the output statement during translation.  Rather than having to ask each parsed token for the comments surrounding it, it would be better to have a real, physically-separate stream that buffered the comments and a means of associating groups of parsed tokens with groups of comment stream tokens.  You probably want to support questions like "<em>give me all of the tokens on the comment stream that originally appeared between this beginning parsed token and this ending parsed token</em>."
271	</p>
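<p>
One way such a physically separate comment stream might look is sketched here.  It is purely an assumption about a possible design, not something ANTLR ships: <font face="Courier New">IndexedToken</font> and <font face="Courier New">CommentSideBuffer</font> are invented names, tokens are reduced to strings, and each token is stamped with its position in the original sequence so that a range query can be answered later.
</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token that remembers where it sat in the original, unsplit sequence.
class IndexedToken {
    final int index;
    final boolean comment;
    final String text;
    IndexedToken(int index, boolean comment, String text) {
        this.index = index;
        this.comment = comment;
        this.text = text;
    }
}

/** Hypothetical splitter: main tokens go to one list, comments are buffered on another. */
class CommentSideBuffer {
    private final List<IndexedToken> main = new ArrayList<IndexedToken>();
    private final List<IndexedToken> comments = new ArrayList<IndexedToken>();

    /** Routes each token text to the main list or the comment buffer. */
    CommentSideBuffer(List<String> texts) {
        for (int i = 0; i < texts.size(); i++) {
            String s = texts.get(i);
            boolean isComment = s.startsWith("//") || s.startsWith("/*");
            IndexedToken t = new IndexedToken(i, isComment, s);
            (isComment ? comments : main).add(t);
        }
    }

    /** The tokens a parser would see. */
    List<IndexedToken> mainTokens() { return main; }

    /** All comments that originally appeared strictly between two main tokens. */
    List<String> commentsBetween(IndexedToken begin, IndexedToken end) {
        List<String> out = new ArrayList<String>();
        for (IndexedToken c : comments) {
            if (c.index > begin.index && c.index < end.index) out.add(c.text);
        }
        return out;
    }
}
```

<p>
With input tokens <font face="Courier New">x</font>, <font face="Courier New">/* a */</font>, <font face="Courier New">=</font>, <font face="Courier New">// b</font>, <font face="Courier New">1</font>, <font face="Courier New">;</font>, asking for the comments between the first and last main tokens yields both comments in their original order, so a translator could emit them together at the head of the output statement.
</p>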
272	<p>
273	This filter implements the exact same functionality as JavaCC's <em>special</em> tokens.  Sriram Sankar (father of JavaCC) had a great idea with the special tokens and, at the 1997 <a href="http://www.antlr.org/workshop97/summary.html">Dr. T's Traveling Parsing Revival and Beer Tasting Festival</a>, the revival attendees extended the idea to the more general token stream concept.  Now, the JavaCC special token functionality is just another ANTLR stream filter with the bonus that you do not have to modify the lexer to specify which tokens are special.
274	</p>
275	<h3><a name="lexerstates">Token Stream Multiplexing (aka "Lexer states")</a></h3>
276	<p>
277	Now, consider the opposite problem where you want to combine multiple streams rather than splitting a single stream.  When your input contains sections or slices that are radically diverse such as Java and JavaDoc comments, you will find that it is hard to make a single lexer recognize all slices of the input.  This is primarily because merging the token definitions of the various slices results in an ambiguous lexical language or allows invalid tokens.  For example, "final" may be a keyword in one section, but an identifier in another.  Also, "@author" is a valid javadoc tag within a comment, but is invalid in the surrounding Java code.
278	</p>
279	<p>
280	Most people solve this problem by having the lexer sit in one of multiple states (for example, "reading Java stuff" vs "reading JavaDoc stuff").  The lexer starts out in Java mode and then, upon "/**", switches to JavaDoc mode; "*/" forces the lexer to switch back to Java mode.
281	</p>
282	<h4><a name="Multiple Lexers">Multiple Lexers</a></h4>
283	<p>
284	Having a single lexer with multiple states works, but having multiple lexers that are multiplexed onto the same token stream solves the same problem better because the separate lexers are easier to reuse (no cutting and pasting into a new lexer--just tell the stream multiplexor to switch to it).  For example, the JavaDoc lexer could be reused for any language problem that had JavaDoc comments.
285	</p>
286	<p>
287	ANTLR provides a predefined token stream called <small><font face="Courier New">TokenStreamSelector</font></small> that lets you switch between multiple lexers.  Actions in the various lexers control how the selector switches input streams.  Consider the following Java fragment.
288	</p>
289	<pre>/** Test.
290	* @author Terence
291	*/
292	int n;</pre>
293	<p>
294	Given two lexers, JavaLexer and JavaDocLexer, the sequence of actions by the two lexers might look like this:
295	</p>
296	<p>
297	<small><font face="Arial">JavaLexer: match JAVADOC_OPEN, switch to JavaDocLexer
298	<br>
299	JavaDocLexer: match AUTHOR
300	<br>
301	JavaDocLexer: match ID
302	<br>
303	JavaDocLexer: match JAVADOC_CLOSE, switch back to JavaLexer
304	<br>
305	JavaLexer: match INT
306	<br>
307	JavaLexer: match ID
308	<br>
309	JavaLexer: match SEMI</font></small>
310	</p>
311	<p>
312	In the Java lexer grammar, you will need a rule to perform the switch to the JavaDoc lexer (recording on the stack of streams the "return lexer"):
313	</p>
314	<pre><small>JAVADOC_OPEN
315	: "/**" {selector.push("doclexer");}
316	;</small></pre>
317	<p>
318	Similarly, you will need a rule in the JavaDoc lexer to switch back:
319	</p>
320	<pre><small>JAVADOC_CLOSE
321	: "*/" {selector.pop();}
322	;</small></pre>
323	<p>
324	The selector has a stack of streams so the JavaDoc lexer does not need to know who invoked it.
325	</p>
326	<p>
327	Graphically, the selector combines the two lexer streams into a single stream presented to the parser.
328	</p>
329	<p>
330	<img src="stream.selector.gif" width="538" height="238" alt="stream.selector.gif (5976 bytes)">
331	</p>
332	<p>
333	The selector can maintain a list of streams for you so that you can switch to another input stream by name or you can tell it to switch to an actual stream object.
334	</p>
335	<pre><small>public class TokenStreamSelector implements TokenStream {
336	public <strong>TokenStreamSelector</strong>() {...}
337	public void <strong>addInputStream</strong>(TokenStream stream,</small>
338	<small> String key) {...}
339	public void <strong>pop</strong>() {...}
340	public void <strong>push</strong>(TokenStream stream) {...}
341	public void <strong>push</strong>(String sname) {...}
342	/** Set the stream without pushing old stream */
343	public void <strong>select</strong>(TokenStream stream) {...}
344	public void <strong>select</strong>(String sname)</small>
345	<small> throws IllegalArgumentException {...}
346	}</small></pre>
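<p>
The stack behaviour behind <font face="Courier New">push</font> and <font face="Courier New">pop</font> can be illustrated with a small stand-alone sketch.  The names <font face="Courier New">SimpleTokenSource</font>, <font face="Courier New">SelectorSketch</font> and <font face="Courier New">SelectorDemo</font> are made up for this example, tokens are reduced to their type names, and the two "lexers" split on whole words rather than characters; the real <font face="Courier New">TokenStreamSelector</font> may well be implemented differently.
</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Stand-in for a token stream; tokens are just their type names here.
interface SimpleTokenSource {
    String nextToken();
}

/** Hypothetical selector: delegates to the current stream and keeps a stack of return streams. */
class SelectorSketch implements SimpleTokenSource {
    private final Map<String, SimpleTokenSource> byName = new HashMap<String, SimpleTokenSource>();
    private final Deque<SimpleTokenSource> returnStack = new ArrayDeque<SimpleTokenSource>();
    private SimpleTokenSource current;

    void addInputStream(SimpleTokenSource stream, String key) { byName.put(key, stream); }
    void select(String key) { current = lookup(key); }                        // no stack change
    void push(String key) { returnStack.push(current); current = lookup(key); }
    void pop() { current = returnStack.pop(); }                                // back to the caller
    public String nextToken() { return current.nextToken(); }

    private SimpleTokenSource lookup(String key) {
        SimpleTokenSource s = byName.get(key);
        if (s == null) throw new IllegalArgumentException("no stream named " + key);
        return s;
    }
}

class SelectorDemo {
    /** Feeds pre-split words through two toy lexers that share one input iterator. */
    static List<String> run(List<String> words) {
        final Iterator<String> in = words.iterator(); // shared input, like a shared char stream
        final SelectorSketch selector = new SelectorSketch();
        SimpleTokenSource javaLexer = new SimpleTokenSource() {
            public String nextToken() {
                if (!in.hasNext()) return "EOF";
                String w = in.next();
                if (w.equals("/**")) { selector.push("doclexer"); return "JAVADOC_OPEN"; }
                if (w.equals("int")) return "INT";
                if (w.equals(";")) return "SEMI";
                return "ID";
            }
        };
        SimpleTokenSource docLexer = new SimpleTokenSource() {
            public String nextToken() {
                if (!in.hasNext()) return "EOF";
                String w = in.next();
                if (w.equals("*/")) { selector.pop(); return "JAVADOC_CLOSE"; }
                if (w.equals("@author")) return "AUTHOR";
                return "ID";
            }
        };
        selector.addInputStream(javaLexer, "main");
        selector.addInputStream(docLexer, "doclexer");
        selector.select("main"); // start with the Java lexer
        List<String> out = new ArrayList<String>();
        for (String t = selector.nextToken(); !t.equals("EOF"); t = selector.nextToken()) {
            out.add(t);
        }
        return out;
    }
}
```

<p>
Running the toy lexers over the words <font face="Courier New">/** @author Terence */ int n ;</font> produces JAVADOC_OPEN, AUTHOR, ID, JAVADOC_CLOSE, INT, ID, SEMI, the same sequence as in the trace above.  The doc lexer only calls <font face="Courier New">pop()</font>, so it never needs to know which lexer switched to it.
</p>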
347	<p>
348	Using the selector is easy:
349	<ul>
350	<li>
351	Create a selector.
352	</li>
353	</ul>
354	<pre><small>TokenStreamSelector selector =
355	new TokenStreamSelector();</small></pre>
356	<ul>
357	<li>
358	Name the streams (naming is optional; you can use stream object references instead, which avoids the hashtable lookup on each switch).
359	</li>
360	</ul>
361	<pre><small>selector.addInputStream(mainLexer, "main");
362	selector.addInputStream(doclexer, "doclexer");</small></pre>
363	<ul>
364	<li>
365	Select which lexer reads from the char stream first.
366	</li>
367	</ul>
368	<pre><small>// start with main java lexer
369	selector.select("main");</small></pre>
370	<ul>
371	<li>
372	Attach your parser to the selector instead of one of the lexers.
373	</li>
374	</ul>
375	<pre><small>JavaParser parser = new JavaParser(selector);</small></pre> <h4><a name="Lexers Sharing Same Character Stream">Lexers Sharing Same Character Stream</a></h4>
376	<p>
377	Before moving on to how the parser uses the selector, note that the two lexers have to read characters from the same input stream.  Prior to ANTLR 2.6.0, each lexer had its own line number variable, input char stream variable and so on.  In order to share the same input state, ANTLR 2.6.0 factors the portion of a lexer dealing with the character input into an object, <small><font face="Courier New">LexerSharedInputState</font></small>, that can be shared among n lexers (single-threaded).  To get multiple lexers to share state, you create the first lexer, ask for its input state object, and then use that when constructing any further lexers that need to share that input state:
378	</p>
379	<pre><small>// create Java lexer</small>
380	<small>JavaLexer mainLexer = new JavaLexer(input);
381	// create javadoc lexer; attach to shared</small>
382	<small>// input state of java lexer
383	JavaDocLexer doclexer =</small>
384	<small> new JavaDocLexer(mainLexer.getInputState());</small></pre> <h4><a name="Parsing Multiplexed Token Streams">Parsing Multiplexed Token Streams</a></h4>
385	<p>
386	Just as a single lexer may have trouble producing a single stream of tokens from diverse input slices or sections, a single parser may have trouble handling the multiplexed token stream.  Again, a token that is a keyword in one lexer's vocabulary may be an identifier in another lexer's vocabulary.  Factoring the parser into separate subparsers for each input section makes sense to handle the separate vocabularies as well as for promoting grammar reuse.
387	</p>
388	<p>
389	The following parser grammar uses the main lexer token vocabulary (specified with the importVocab option) and upon <small><font face="Courier New">JAVADOC_OPEN</font></small> it creates and invokes a JavaDoc parser to handle the subsequent stream of tokens from within the comment.
390	</p>
391	<pre><small>class JavaParser extends Parser;
392	options {
393	importVocab=Java;
394	}
395
396	input
397	: ( (javadoc)? INT ID SEMI )+
398	;
399
400	javadoc
401	: JAVADOC_OPEN
402	{</small>
403	<small> // create a parser to handle the javadoc comment
404	JavaDocParser jdocparser =</small>
405	<small> new JavaDocParser(getInputState());
406	jdocparser.content(); // go parse the comment
407	}</small>
408	<small> JAVADOC_CLOSE
409	;</small></pre>
410	<p>
411	You will note that ANTLR parsers from 2.6.0 also share token input stream state.   When creating the "subparser", <small><font face="Courier New">JavaParser</font></small> tells it to pull tokens from the same input state object.
412	</p>
413	<p>
414	The JavaDoc parser matches a bunch of tags:
415	</p>
416	<pre><small>class JavaDocParser extends Parser;
417	options {
418	importVocab=JavaDoc;
419	}
420
421	content
422	: ( PARAM // includes ID as part of PARAM
423	| EXCEPTION</small>
424	<small> | AUTHOR
425	)*
426	;</small></pre>
427	<p>
428	When the subparser rule <small><font face="Courier New">content</font></small> finishes, control is naturally returned to the invoking method, <small><font face="Courier New">javadoc</font></small>, in the Java parser.
429	</p>
430	<h4><a name="The Effect of Lookahead Upon Multiplexed Token Streams">The Effect of Lookahead Upon Multiplexed Token Streams</a></h4>
431	<p>
432	What would happen if the parser needed to look two tokens ahead at the start of the JavaDoc comment?  In other words, from the perspective of the main parser, what is the token following <small><font face="Courier New">JAVADOC_OPEN</font></small>?   Token <small><font face="Courier New">JAVADOC_CLOSE</font></small>, naturally!  The main parser treats any JavaDoc comment, no matter how complicated, as a single entity; it does not see into the token stream of the comment nor should it--the subparser handles that stream.
433	</p>
434	<p>
435	What is the token following the <small><font face="Courier New">content</font></small> rule in the subparser?  "End of file".  The analysis of the subparser cannot determine what random method will call it from your code.  This is not an issue because there is normally a single token that signifies the termination of the subparser.  Even if EOF gets pulled into the analysis somehow, EOF will not be present on the token stream.
436	</p>
437	<h4><a name="Multiple Lexers Versus Calling Another Lexer Rule">Multiple Lexers Versus Calling Another Lexer Rule</a></h4>
438	<p>
439	Multiple lexer states are also often used to handle very complicated single   tokens such as strings with embedded escape characters where input "\t" should not be allowed outside of a string.  Typically, upon the initial quote, the lexer switches to a "string state" and then switches back to the "normal state" after having matched the guts of the string.
440	</p>
441	<p>
442	So-called "modal" programming, where your code does something different depending on a mode, is often a bad practice.  In the situation of complex tokens, it is better to explicitly specify the complicated token with more rules.  Here is the golden rule of when to and when not to use multiplexed token streams:
443	</p>
444	<blockquote>
445	<p>
446	<em>Complicated single tokens should be matched by calling another (protected) lexer rule whereas streams of tokens from diverse slices or sections should be handled by different lexers multiplexed onto the same stream that feeds the parser.</em>
447	</p>
448	</blockquote>
449	<p>
450	For example, the definition of a string in a lexer should simply call another rule to handle the nastiness of escape characters:
451	</p>
452	<pre><small>STRING_LITERAL
453	: '"' (ESC|~('"'|'\\'))* '"'
454	;
455
456	protected // not a token; only invoked by another rule.
457	ESC
458	: '\\'
459	( 'n'
460	| 'r'
461	| 't'
462	| 'b'
463	| 'f'
464	| '"'
465	| '\''
466	| '\\'
467	| ('u')+</small>
468	<small> HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
469	...</small>
470	)
471	<small> ;</small></pre>
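<p>
For readers who think more easily in hand-written code, the same structure can be sketched in plain Java: one method for the string literal that calls a private helper for escapes, with no "string mode" anywhere.  <font face="Courier New">StringLiteralScanner</font> is a hypothetical class for this illustration, not code generated by ANTLR, and it handles only the escapes listed explicitly in the grammar above (the alternatives elided by "..." are left out).
</p>

```java
/** Hypothetical hand-written scanner mirroring the STRING_LITERAL and ESC rules. */
class StringLiteralScanner {
    private final String src;
    private int pos;

    StringLiteralScanner(String src) { this.src = src; }

    /** STRING_LITERAL: opening quote, then escapes or ordinary characters, then closing quote. */
    String stringLiteral() {
        expect('"');
        StringBuilder out = new StringBuilder();
        while (pos < src.length() && src.charAt(pos) != '"') {
            if (src.charAt(pos) == '\\') {
                out.append(esc());              // delegate the nasty part to the helper
            } else {
                out.append(src.charAt(pos++));
            }
        }
        expect('"');
        return out.toString();
    }

    /** ESC: a helper only, never a token of its own (the role of a protected rule). */
    private char esc() {
        expect('\\');
        if (pos >= src.length()) throw new IllegalStateException("dangling backslash");
        char c = src.charAt(pos++);
        switch (c) {
            case 'n': return '\n';
            case 'r': return '\r';
            case 't': return '\t';
            case 'b': return '\b';
            case 'f': return '\f';
            case '"': return '"';
            case '\'': return '\'';
            case '\\': return '\\';
            case 'u':
                // ('u')+ : one 'u' was consumed above, skip any further ones
                while (pos < src.length() && src.charAt(pos) == 'u') pos++;
                if (pos + 4 > src.length()) throw new IllegalStateException("short unicode escape");
                String hex = src.substring(pos, pos + 4);
                pos += 4;
                return (char) Integer.parseInt(hex, 16); // throws NumberFormatException on bad digits
            default:
                throw new IllegalStateException("bad escape \\" + c);
        }
    }

    private void expect(char c) {
        if (pos >= src.length() || src.charAt(pos) != c) {
            throw new IllegalStateException("expected " + c + " at position " + pos);
        }
        pos++;
    }
}
```

<p>
A backslash is only ever interpreted inside <font face="Courier New">esc()</font>, and <font face="Courier New">esc()</font> is only reachable from <font face="Courier New">stringLiteral()</font>, so an input such as "\t" cannot be accepted outside a string without any mode flag.
</p>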
472
473	<h3><a name=rewriteengine>TokenStreamRewriteEngine: Easy Syntax-Directed Translation</a></h3>
474
475	There are many common situations where you want to tweak or augment
476	a program or data file. ANTLR 2.7.3 introduced (Java/C# versions only) a very simple but powerful <tt>TokenStream</tt> targeted at the class of problems where:
477
478	<ol>
479	<li>the output language and the input language are similar
480	<li>the relative order of language elements does not change
481	</ol>
482
483	See the <a href="http://www.antlr.org/article/rewrite.engine/index.tml"><b>Syntax Directed TokenStream Rewriting</b></a> article on the antlr website.
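<p>
The general idea of rewriting over a buffered token stream can be sketched as follows.  This is a deliberately tiny illustration with made-up names (<font face="Courier New">RewriteBufferSketch</font> and its methods); it does not reproduce the API of <font face="Courier New">TokenStreamRewriteEngine</font>, for which the article above is the reference.  The sketch assumes that every token, whitespace included, is kept in the buffer, and that edits are recorded by token index and applied only when the text is rendered.
</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical rewriter: buffers token texts and applies index-keyed edits at render time. */
class RewriteBufferSketch {
    private final List<String> tokens = new ArrayList<String>();
    private final Map<Integer, String> insertions = new HashMap<Integer, String>();
    private final Map<Integer, String> replacements = new HashMap<Integer, String>();

    RewriteBufferSketch(List<String> tokenTexts) { tokens.addAll(tokenTexts); }

    /** Records text to emit just before the token at the given index. */
    void insertBefore(int index, String text) {
        checkIndex(index);
        insertions.put(index, text);
    }

    /** Records replacement text for the token at the given index. */
    void replace(int index, String text) {
        checkIndex(index);
        replacements.put(index, text);
    }

    /** Emits the tokens in their original order with the recorded edits applied. */
    String render() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (insertions.containsKey(i)) sb.append(insertions.get(i));
            sb.append(replacements.containsKey(i) ? replacements.get(i) : tokens.get(i));
        }
        return sb.toString();
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= tokens.size()) {
            throw new IndexOutOfBoundsException("no token at index " + index);
        }
    }
}
```

<p>
For the buffered tokens <font face="Courier New">int</font>, a space, <font face="Courier New">n</font> and <font face="Courier New">;</font>, replacing token 2 with <font face="Courier New">count</font> and inserting <font face="Courier New">final </font> before token 0 renders as <font face="Courier New">final int count;</font>.  Untouched tokens pass through verbatim, which is why this style suits problems where input and output are similar and element order is preserved.
</p>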
484
485	<h3><a name="The Future">The Future</a></h3>
486	<p>
487	The ANTLR 2.6 release provides the basic structure for using token streams--future versions will be more sophisticated once we have experience using them.
488	</p>
489	<p>
490	The current "hidden token" stream filter clearly solves the "ignore but preserve whitespace" problem really well, but it does not handle comments too well in most situations.  For example, in real translation problems you want to collect comments at various single tree nodes (like DECL or METHOD) for interpretation rather than leaving them strewn throughout the tree.  You really need a stream splitter that buffers up the comments on a separate stream so you can say "<em>give me all comments   consumed during the recognition of this rule</em>" or "<em>give me all comments found between these two real tokens</em>." That is almost certainly something you need for translation of comments.
491	</p>
492	<p>
493	Token streams will lead to fascinating possibilities.  Most folks are not used to thinking about token streams so it is hard to imagine what else they could be good for.  Let your mind go wild.  What about embedded languages where you see slices (aspects) of the input such as Java and SQL?  Each portion of the input could be sliced off and put through on a different stream.  What about parsing Java .class files with and without debugging information?  If you have a parser for .class files without debug info and you want to handle .class files with debug info, leave the parser alone and augment the lexer to see the new debug structures.  Have a filter split the debug tokens off onto a different stream and the same parser will work for both types of .class files.
494	</p>
495	<p>
496	Later, I would like to add "perspectives", which are really just another way to look at filters.  Imagine a raw stream of tokens emanating from a lexer--the root perspective.  I can build up a tree of perspectives very easily from there.  For example, given a Java program with embedded SQL, you might want multiple perspectives on the input stream for parsing or translation reasons:
497	</p>
498	<p align="center">
499	<img src="stream.perspectives.gif" width="306" height="202" alt="stream.perspectives.gif (2679 bytes)">
500	</p>
501	<p align="left">
502	You could attach a parser to the SQL stream or the Java stream minus comments, with actions querying the comment stream.
503	</p>
504	<p align="left">
505	In the future, I would also like to add the ability of a parser to generate a stream of tokens (or text) as output just like it can build trees now.  In this manner, multipass parsing becomes a very natural and simple problem because parsers become stream producers also.  The output of one parser can be the input to another.
506	</p>
507	<p align="left">
508	<font face="Arial" size="2">Version: $Id: //depot/code/org.antlr/release/antlr-2.7.7/doc/streams.html#2 $</font>
509	</body>
510	</html>
