1<html>
2<head>
3        <title>ANTLR Specification: Run-time</title> 
4</head>
5<body bgcolor="#FFFFFF" text="#000000">
6<h1><a name="_bb1">Java Runtime Model</a></h1> 
7<hr>
8<h2><a name="_bb2">Programmer's Interface</a></h2> 
9<p>
10        In this section, we describe what ANTLR generates after reading your grammar file and how to use that output to parse input. The classes from which your lexer, token, and parser classes are derived are provided as well.
11</p>
12<h3><a name="_bb3">What ANTLR generates</a></h3> 
13<p>
14        ANTLR generates the following types of files, where <i>MyParser</i>, <i>MyLexer</i>, and <i>MyTreeParser</i> are the names of the grammar classes specified in the grammar file. You may have an arbitrary number of parsers, lexers, and tree-parsers per grammar file; a separate class file will be generated for each. In addition, token type files will be generated containing the token vocabularies used in the parsers and lexers. One or more token vocabularies may be defined in a grammar file and shared between different grammars. For example, given the grammar file: <tt>
15</p>
16<pre>class MyParser extends Parser;
17options {
18  exportVocab=My;
19}
20... rules ...
21
22class MyLexer extends Lexer;
23options {
24  exportVocab=My;
25}
26... rules ...
27
28class MyTreeParser extends TreeParser;
29options {
30  exportVocab=My;
31}
32... rules ...</tt></pre> 
33<p>
34        The following files will be generated:
35<ul>
36        <li>
37                <tt><i>MyParser</i>.java</tt>. The parser with member methods for the parser rules.
38        </li>
39        <li>
40                <tt><i>MyLexer</i>.java</tt>. The lexer with the member methods for the lexical rules.
41        </li>
42        <li>
43                <tt><i>MyTreeParser</i>.java</tt>. The tree-parser with the member methods for the tree-parser rules.
44        </li>
45        <li>
46                <tt><i>My</i>TokenTypes.java</tt>. An interface containing all of the token types defined by your parsers and lexers using the exported vocabulary named <tt>My</tt>.
47        </li>
48        <li>
49                <tt><i>My</i>TokenTypes.txt</tt>. A text file containing all of the token types, literals, and paraphrases defined by parsers and lexers contributing vocabulary <tt>My</tt>.
50        </li>
51</ul>
52<p>
53        The programmer uses the classes by referring to them:
54<ol>
55        <li>
56                Create a lexical analyzer. The constructor with no arguments implies that you want to read from standard input.
57        </li>
58        <li>
59                Create a parser and attach it to the lexer (or other TokenStream).
60        </li>
61        <li>
62                Call one of the methods in the parser to begin parsing.
63        </li>
64</ol>
65<p>
66        If your parser generates an AST, then get the AST value, create a tree-parser, and invoke one of the tree-parser rules using the AST.
67</p>
68<tt><pre>MyLexer lex = new MyLexer();
69MyParser p =
70  new MyParser(lex,<i>user-defined-args-if-any</i>);
71p.<i>start-rule</i>();
72// and, if you are tree parsing the result...
73MyTreeParser tp = new MyTreeParser();
74tp.<i>start-rule</i>(p.getAST());</tt></pre> 
75<p>
76        You can also specify the name of the token and/or AST objects that you want the lexer/parser to create. Java's support for dynamic class loading makes this quite painless:
77</p>
78<pre><tt>MyLexer lex = new MyLexer();
79lex.setTokenObjectClass(&quot;mypackage.MyToken&quot;);
80  // defaults to &quot;antlr.CommonToken&quot;
81...
82parser.setASTNodeClass(&quot;mypackage.MyASTNode&quot;);
83  // defaults to &quot;antlr.CommonAST&quot;</tt></pre> 
84<p>
85Make sure you give a fully-qualified class name.
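<p>Under the hood this is plain Java reflection: ANTLR can be assumed to load the class you name via <tt>Class.forName</tt>, which is why the fully-qualified name is required. A minimal sketch of the mechanism (the helper class here is hypothetical, not part of ANTLR):</p>

```java
// Hypothetical sketch of the reflection mechanism that setTokenObjectClass /
// setASTNodeClass rely on: load a class by its fully-qualified name and
// instantiate it with its no-argument constructor.
class TokenClassLoader {
    // className must be fully qualified, e.g. "mypackage.MyToken"
    static Object instantiate(String className) throws Exception {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }
}
```

If the name is not fully qualified, <tt>Class.forName</tt> throws <tt>ClassNotFoundException</tt> at run time rather than failing at compile time.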
86<p>
87The lexer and parser can throw IOExceptions as well as RecognitionExceptions, which you must catch:
88</p>
89<pre>  CalcLexer lexer =
90    new CalcLexer(new DataInputStream(System.in));
91  CalcParser parser = new CalcParser(lexer);
92  // Parse the input expression
93  try {
94    parser.expr();
95  }
96  catch (IOException io) {
97    System.err.println(&quot;IOException&quot;);
98  }
99  catch(RecognitionException e) {
100    System.err.println(&quot;exception: &quot;+e);
101  }</pre> <h3><a name="sharingstate">Multiple Lexers/Parsers With Shared Input State</a></h3> 
102<p>
103        Occasionally, you will want two parsers or two lexers to share input state; that is, you will want them to pull input from the same source (token stream or character stream). The section on <a href="streams.html#lexerstates">multiple lexer &quot;states&quot;</a> describes such a situation.
104</p>
105<p>
106        ANTLR factors input variables such as the line number, guessing state, and input stream into a separate object so that another lexer or parser can share that state.  The <tt>LexerSharedInputState</tt> and <tt>ParserSharedInputState</tt> classes embody this factoring.  Method <tt>getInputState()</tt> can be used on either <tt>CharScanner</tt> or <tt>Parser</tt> objects.  Here is how to construct two lexers sharing the same input stream:
107</p>
108<pre>// create Java lexer
109JavaLexer mainLexer = new JavaLexer(input);
110// create javadoc lexer; attach to shared
111// input state of java lexer
112JavaDocLexer doclexer =
113  new JavaDocLexer(mainLexer.getInputState());</pre> 
114<p>
115        Parsers with shared input state can be created similarly:
116</p>
117<pre>JavaDocParser jdocparser =
118  new JavaDocParser(getInputState());
119jdocparser.content(); // go parse the comment</pre> 
120<p>
121        Sharing state is easy, but what happens upon an exception during the execution of the &quot;subparser&quot;?  What about syntactic predicate execution?  It turns out that invoking a subparser with the same input state is exactly the same as calling another rule in the same parser as far as error handling and syntactic predicate guessing are concerned.  If the parser is guessing before the call to the subparser, the subparser must continue guessing, right?  Exceptions thrown inside the subparser must exit the subparser and return to the enclosing error handler or syntactic predicate handler.
122</p>
123<h2><a name="_bb4">Parser Implementation</a></h2> <h3><a name="_bb5">Parser Class</a></h3> 
124<p>
125        ANTLR generates a parser class (an extension of <tt>LLkParser</tt>) that contains a method for every rule in your grammar. The general format looks like: <tt>
126</p>
127<pre>
128public class MyParser extends LLkParser
129    implements MyLexerTokenTypes
130{
131  protected MyParser(TokenBuffer tokenBuf, int k) {
132    super(tokenBuf,k);
133    tokenNames = _tokenNames;
134  }
135  public MyParser(TokenBuffer tokenBuf) {
136    this(tokenBuf,1);
137  }
138  protected MyParser(TokenStream lexer, int k) {
139    super(lexer,k);
140    tokenNames = _tokenNames;
141  }
142  public MyParser(TokenStream lexer) {
143    this(lexer,1);
144  }
145  public MyParser(ParserSharedInputState state) {
146    super(state,1);
147    tokenNames = _tokenNames;
148  }
149  ...
150  // add your own constructors here...
151  <i>rule-definitions</i>
152}
153</tt>  </pre> <h3><a name="_bb6">Parser Methods</a></h3> 
154<p>
155        ANTLR generates recursive-descent parsers; therefore, every rule in the grammar results in a method that applies the specified grammatical structure to the input token stream. The general form of a parser method looks like: <tt>
156</p>
157<pre>
158public void rule()
159  throws RecognitionException,
160         TokenStreamException
161{
162  <i>init-action-if-present</i>
163  if ( <i>lookahead-predicts-production-1</i> ) {
164     <i>code-to-match-production-1</i>
165  }
166  else if ( <i>lookahead-predicts-production-2</i> ) {
167     <i>code-to-match-production-2</i>
168  }
169  ...
170  else if ( <i>lookahead-predicts-production-n</i> ) {
171     <i>code-to-match-production-n</i>
172  }
173  else {
174    // syntax error
175    throw new NoViableAltException(LT(1));
176  }
177}
178</tt></pre>  This code results from a rule of the form:  <tt><pre>
179rule:   <i>production-1</i>
180    |   <i>production-2</i>
181   ...
182    |   <i>production-n</i>
183    ;
184</tt>  </pre> 
185<p>
186        If you have specified arguments and a return type for the rule, the method header changes to: <tt>
187</p>
188<pre>
189/* generated from:
190 *    rule(<i>user-defined-args</i>)
191 *      returns <i>return-type</i> : ... ;
192 */
193public <i>return-type</i> rule(<i>user-defined-args</i>)
194  throws RecognitionException,
195         TokenStreamException
196{
197  ...
198}
199</tt>  </pre> 
200<p>
201        Token types are integers, and we make heavy use of bit sets and range comparisons to avoid excessively long test expressions.
202</p>
203<h3><a name="_bb7">EBNF Subrules</a></h3> 
204<p>
205        Subrules are like unlabeled rules; consequently, the code generated for an EBNF subrule mirrors that generated for a rule. The only difference is induced by the EBNF subrule operators that imply optionality or looping.
206</p>
207<p>
208        <b><tt>(...)?</tt> optional subrule</b>. The only difference between the code generated for an optional subrule and a rule is that there is no default <tt>else</tt>-clause to throw an exception--the recognition continues on having ignored the optional subrule. <tt>
209</p>
210<pre>
211{
212  <i>init-action-if-present</i>
213  if ( <i>lookahead-predicts-production-1</i> ) {
214     <i>code-to-match-production-1</i>
215  }
216  else if ( <i>lookahead-predicts-production-2</i> ) {
217     <i>code-to-match-production-2</i>
218  }
219  ...
220  else if ( <i>lookahead-predicts-production-n</i> ) {
221     <i>code-to-match-production-n</i>
222  }
223}
224</tt>  </pre> 
225<p>
226        Not testing the optional paths of optional blocks has the potential to delay the detection of syntax errors.
227</p>
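<p>As a hand-written illustration (not actual ANTLR output), here is the optional-subrule pattern applied to the toy rule <tt>rule : 'a' ('b')? 'c' ;</tt>; note the absence of an <tt>else</tt>-clause that throws:</p>

```java
// Hand-written illustration of the (...)? pattern for the toy rule
//   rule : 'a' ('b')? 'c' ;
// (hypothetical example, not actual ANTLR output)
class OptSubruleParser {
    private final String input;
    private int pos = 0;
    OptSubruleParser(String input) { this.input = input; }
    private char LA(int i) {               // lookahead; '\0' signals end of input
        int idx = pos + i - 1;
        return idx < input.length() ? input.charAt(idx) : '\0';
    }
    private void match(char c) {
        if (LA(1) != c) throw new IllegalArgumentException("expected '" + c + "'");
        pos++;
    }
    void rule() {
        match('a');
        if (LA(1) == 'b') {                // optional subrule: no else-clause that throws
            match('b');
        }
        match('c');
    }
}
```

Both <tt>"ac"</tt> and <tt>"abc"</tt> are accepted; recognition simply continues when the optional alternative is not predicted.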
228<p>
229        <b><tt>(...)*</tt> closure subrule</b>. A closure subrule is like an optional looping subrule; therefore, we wrap the code for a simple subrule in a &quot;forever&quot; loop that exits whenever the lookahead is not consistent with any of the alternative productions. <tt>
230</p>
231<pre>
232{
233  <i>init-action-if-present</i>
234loop:
235  do {
236    if ( <i>lookahead-predicts-production-1</i> ) {
237       <i>code-to-match-production-1</i>
238    }
239    else if ( <i>lookahead-predicts-production-2</i> ) {
240       <i>code-to-match-production-2</i>
241    }
242    ...
243    else if ( <i>lookahead-predicts-production-n</i> ) {
244       <i>code-to-match-production-n</i>
245    }
246    else {
247      break loop;
248    }
249  }
250  while (true);
251}
252</tt>  </pre> 
253<p>
254        While there is no need to explicitly test the lookahead for consistency with the exit path, the grammar analysis phase computes the lookahead of what follows the block. The lookahead of what follows must be disjoint from the lookahead of each alternative; otherwise, the loop will not know when to terminate. For example, consider the following subrule, which is nondeterministic upon token <tt>A</tt>. <tt>
255</p>
256<pre>
257( A | B )* A
258</tt>  </pre> 
259<p>
260        Upon <tt>A</tt>, should the loop continue or exit? One must also ask if the loop should even begin. Because you cannot answer these questions with only one symbol of lookahead, the decision is non-LL(1).
261</p>
262<p>
263        Not testing the exit paths of closure loops has the potential to delay the detection of syntax errors.
264</p>
265<p>
266        As a special case, a closure subrule with one alternative production results in: <tt>
267</p>
268<pre>
269{
270  <i>init-action-if-present</i>
271loop:
272  while ( <i>lookahead-predicts-production-1</i> ) {
273       <i>code-to-match-production-1</i>
274  }
275}
276 </tt>  </pre> 
277<p>
278        This special case results in smaller, faster, and more readable code.
279</p>
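<p>A hand-written analogue of this single-alternative special case, for the toy lexical rule <tt>INT : ('0'..'9')* ;</tt> (a sketch, not actual ANTLR output):</p>

```java
// Hand-written analogue of the single-alternative closure special case for
//   INT : ('0'..'9')* ;
// (hypothetical example, not actual ANTLR output)
class DigitsScanner {
    static int countDigits(String input) {
        int pos = 0;
        // while the lookahead predicts the single alternative, keep matching
        while (pos < input.length()
               && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') {
            pos++;                         // corresponds to matchRange('0','9')
        }
        return pos;                        // number of characters consumed
    }
}
```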
280<p>
281        <b><tt>(...)+</tt> positive closure subrule</b>. A positive closure subrule is a loop around a series of production prediction tests like a closure subrule. However, we must guarantee that at least one iteration of the loop is done before proceeding to the construct beyond the subrule.
282</p>
283<tt><pre>
284{
285  int _cnt = 0;
286  <i>init-action-if-present</i>
287loop:
288  do {
289    if ( <i>lookahead-predicts-production-1</i> ) {
290       <i>code-to-match-production-1</i>
291    }
292    else if ( <i>lookahead-predicts-production-2</i> ) {
293       <i>code-to-match-production-2</i>
294    }
295    ...
296    else if ( <i>lookahead-predicts-production-n</i> ) {
297       <i>code-to-match-production-n</i>
298    }
299    else if ( _cnt&gt;=1 ) {
300      // lookahead predicted nothing and we've
301      // done an iteration
302      break loop;
303    }
304    else {
305      throw new NoViableAltException(LT(1));
306    }
307    _cnt++;  // track times through the loop
308  }
309  while (true);
310}
311</tt>  </pre> 
312<p>
313        While there is no need to explicitly test the lookahead for consistency with the exit path, the grammar analysis phase computes the lookahead of what follows the block. The lookahead of what follows must be disjoint from the lookahead of each alternative; otherwise, the loop will not know when to terminate. For example, consider the following subrule, which is nondeterministic upon token <tt>A</tt>. <tt>
314</p>
315<pre>
316( A | B )+ A
317</tt>  </pre> 
318<p>
319        Upon <tt>A</tt>, should the loop continue or exit? Because you cannot answer this with only one symbol of lookahead, the decision is non-LL(1).
320</p>
321<p>
322        Not testing the exit paths of closure loops has the potential to delay the detection of syntax errors.
323</p>
324<p>
325        You might ask why we do not have a <tt>while</tt> loop that tests whether the lookahead is consistent with any of the alternatives (rather than having a series of tests inside the loop with a <tt>break</tt>). It turns out that we can generate smaller code for a series of tests than for one big test. Moreover, the individual tests must be done anyway to distinguish between alternatives, so a <tt>while</tt> condition would be redundant.
326</p>
327<p>
328        As a special case, if there is only one alternative, the following is generated: <tt>
329</p>
330<pre>
331{
332  <i>init-action-if-present</i>
333  do {
334    <i>code-to-match-production-1</i>
335  }
336  while ( <i>lookahead-predicts-production-1</i> );
337}
338</tt>  </pre> 
339<p>
340        <b>Optimization.</b> When there are a large (where large is user-definable) number of strictly LL(1) prediction alternatives, then a <tt>switch</tt>-statement can be used rather than a sequence of <tt>if</tt>-statements. The non-LL(1) cases are handled by generating the usual <tt>if</tt>-statements in the <tt>default</tt> case. For example: <tt>
341</p>
342<pre>
343switch ( LA(1) ) {
344  case KEY_WHILE :
345  case KEY_IF :
346  case KEY_DO :
347    statement();
348    break;
349  case KEY_INT :
350  case KEY_FLOAT :
351    declaration();
352    break;
353  default :
354    // do whatever else-clause is appropriate
355}
356</tt>  </pre> 
357<p>
358        This optimization relies on the compiler generating a more direct jump (via a jump table or hash table) to the code matching the ith production. It is also more readable and faster than a series of bit set membership tests.
359</p>
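<p>The switch above can be sketched as self-contained Java; the token type constants below are hypothetical stand-ins for the generated token types interface, and the rule invocations are replaced by returned labels:</p>

```java
// Self-contained sketch of the switch-based prediction shown above.
// Token type values are hypothetical; in a real parser they come from the
// generated <vocab>TokenTypes interface.
class SwitchPrediction {
    static final int KEY_WHILE = 4, KEY_IF = 5, KEY_DO = 6,
                     KEY_INT = 7, KEY_FLOAT = 8;
    static String predict(int la1) {
        switch (la1) {
        case KEY_WHILE:
        case KEY_IF:
        case KEY_DO:
            return "statement";            // would call statement();
        case KEY_INT:
        case KEY_FLOAT:
            return "declaration";          // would call declaration();
        default:
            return "default";              // whatever else-clause is appropriate
        }
    }
}
```

The fall-through <tt>case</tt> labels let several LL(1)-predictable token types share one jump-table entry.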
360<h3><a name="_bb8">Production Prediction</a> </h3> 
361<p>
362        <b>LL(1) prediction.</b> Any LL(1) prediction test is a simple set membership test. If the set is a singleton set (a set with only one element), then an integer token type <tt>==</tt> comparison is done. If the set degree is greater than one, a bit set is created and the single input token type is tested for membership against that set. For example, consider the following rule: <tt>
363</p>
364<pre>
365a : A | b ;
366b : B | C | D | E | F;
367</tt>  </pre> 
368<p>
369        The lookahead that predicts production one is {<tt>A</tt>} and the lookahead that predicts production two is {<tt>B,C,D,E,F</tt>}. The following code would be generated by ANTLR for rule <tt>a</tt> (slightly cleaned up for clarity):
370</p>
371<tt><pre>
372public void a() {
373  if ( LA(1)==A ) {
374    match(A);
375  }
376  else if (token_set1.member(LA(1))) {
377    b();
378  }
379}
380</tt>  </pre> 
381<p>
382        The prediction for the first production can be done with a simple integer comparison, but the second alternative uses a bit set membership test for speed, which you would probably not recognize as a test for <tt>LA(1) member {B,C,D,E,F}</tt>. The complexity threshold above which bit set tests are generated is user-definable.
383</p>
384<p>
385        We use arrays of <tt>long</tt>s (64 bits each) to hold bit sets. The ith element of a bit set is stored in word number <tt>i/64</tt> at bit position <tt>i % 64</tt>. The divide and modulo operations are relatively expensive but, fortunately, a strength reduction can be done: dividing by a power of two is the same as shifting right, and modulo a power of two is the same as masking with that power minus one. All of these details are hidden inside the implementation of the <tt>BitSet</tt> class in the package <tt>antlr.collections.impl</tt>.
386</p>
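<p>A minimal sketch of that shift-and-mask strength reduction (the real implementation is <tt>antlr.collections.impl.BitSet</tt>; this simplified class is for illustration only):</p>

```java
// Minimal sketch of the shift/mask strength reduction described above.
// The real implementation lives in antlr.collections.impl.BitSet; this is a
// simplified illustration, not ANTLR code.
class SimpleBitSet {
    private final long[] bits;                        // 64 set elements per word
    SimpleBitSet(int maxElement) { bits = new long[(maxElement >> 6) + 1]; }
    void add(int el) {
        bits[el >> 6] |= (1L << (el & 63));           // el/64 and el%64 via shift/mask
    }
    boolean member(int el) {
        int word = el >> 6;                           // divide by 64 == shift right by 6
        return word < bits.length
            && (bits[word] & (1L << (el & 63))) != 0; // modulo 64 == mask with 63
    }
}
```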
387<p>
388        The various bit sets needed by ANTLR are created and initialized in the generated parser (or lexer) class.
389</p>
390<p>
391        <b>Approximate LL(k) prediction.</b> This is an extension of LL(1): we do a series of up to k bit set tests rather than the single test used in LL(1) prediction. Each decision uses a different amount of lookahead, with LL(1) being the dominant decision type.
392</p>
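<p>For example, an approximate LL(2) decision can be sketched as two consecutive set tests. The token types and sets below are hypothetical illustrations, using <tt>java.util.BitSet</tt> rather than ANTLR's own bit set class:</p>

```java
// Sketch of an approximate LL(2) decision: a production is predicted only when
// LA(1) is in its depth-1 set AND LA(2) is in its depth-2 set.
// Token type values and set contents are hypothetical.
class LL2Decision {
    static final java.util.BitSet DEPTH1 = new java.util.BitSet();
    static final java.util.BitSet DEPTH2 = new java.util.BitSet();
    static {
        DEPTH1.set(4);   // e.g. KEY_IF at depth 1
        DEPTH2.set(7);   // e.g. LPAREN at depth 2
        DEPTH2.set(8);   // e.g. ID at depth 2
    }
    static boolean predicts(int la1, int la2) {
        // up to k membership tests, one per lookahead depth
        return DEPTH1.get(la1) && DEPTH2.get(la2);
    }
}
```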
393<h3><a name="_bb9"></a><a name="Production Element Recognition">Production Element Recognition</a> </h3> 
394<p>
395        <b>Token references.</b> Token references are translated to: <tt>
396</p>
397<pre>
398match(<i>token-type</i>);
399</tt>  </pre> 
400<p>
401        For example, a reference to token <tt>KEY_BEGIN</tt> results in: <tt>
402</p>
403<pre>
404match(KEY_BEGIN);
405</tt>  </pre> 
406<p>
407        where <tt>KEY_BEGIN</tt> will be an integer constant defined in the <tt><i>MyParser</i>TokenTypes</tt> interface generated by ANTLR.
408</p>
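<p>A simplified sketch of what <tt>match</tt> does, based on the description above (the real <tt>antlr.Parser.match</tt> does additional bookkeeping; the class below is illustrative only):</p>

```java
// Simplified sketch of a parser's match(token-type): test the lookahead
// against the expected type, then consume. Not actual ANTLR code.
class MiniParser {
    private final int[] tokens;            // token-type stream
    private int p = 0;
    MiniParser(int[] tokens) { this.tokens = tokens; }
    int LA(int i) { return tokens[p + i - 1]; }
    void consume() { p++; }
    void match(int t) {
        if (LA(1) != t) throw new IllegalStateException("mismatched token");
        consume();
    }
}
```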
409<p>
410        <b>String literal references.</b> String literal references are references to automatically generated tokens, to which ANTLR assigns a token type (one per unique string). String references are translated to:
411</p>
412<tt><pre>
413match(<i>T</i>);
414</tt>  </pre> 
415<p>
416        where <tt><i>T</i></tt> is the token type assigned by ANTLR to that token.
417</p>
418<p>
419        <b>Character literal references.</b> Referencing a character literal implies that the current rule is a lexical rule. Single characters, '<i>t</i>', are translated to:
420</p>
421<tt><pre>
422match('<i>t</i>');
423</tt>  </pre> 
424<p>
425        which can be manually inlined with: <tt>
426</p>
427<pre>
428if ( c=='<i>t</i>' ) consume();
429else throw new MismatchedCharException(
430               &quot;mismatched char: '&quot;+(char)c+&quot;'&quot;);
431 </tt>  </pre> 
432<p>
433        if the method call proves slow (at the cost of space).
434</p>
435<p>
436        <b>Wildcard references.</b> In lexical rules, the wildcard is translated to: <tt>
437</p>
438<pre>
439consume();
440</tt>  </pre> 
441<p>
442        which simply gets the next character of input without doing a test.
443</p>
444<p>
445        A reference to the wildcard in a parser rule results in the same thing, except that the <tt>consume</tt> call will be with respect to the parser.
446</p>
447<p>
448        <b>Not operator.</b> When operating on a token, <tt>~<i>T</i></tt> is translated to:
449</p>
450<tt><pre>
451matchNot(<i>T</i>);
452</tt> </pre> 
453<p>
454        When operating on a character literal, <tt>~'<i>t</i>'</tt> is translated to: <tt>
455</p>
456<pre>
457matchNot('<i>t</i>');
458</tt>  </pre> 
459<p>
460        <b>Range operator.</b> In parser rules, the range operator (<tt><i>T1</i>..<i>T2</i></tt>) is translated to: <tt>
461</p>
462<pre>
463matchRange(<i>T1</i>,<i>T2</i>);
464</tt>   </pre> 
465<p>
466        In a lexical rule, the range operator for characters <tt><i>c1..c2</i></tt> is translated to: <tt>
467</p>
468<pre>
469matchRange(<i>c1</i>,<i>c2</i>);
470</tt>  </pre> 
471<p>
472        <b>Labels.</b> Element labels on atom references become <tt>Token</tt> references in parser rules and <tt>int</tt>s in lexical rules. For example, the parser rule: <tt>
473</p>
474<pre>
475a : id:ID {System.out.println(&quot;id is &quot;+id);} ;
476</tt></pre>  would be translated to:  <tt><pre>
477public void a() {
478  Token id = null;
479  id = LT(1);
480  match(ID);
481  System.out.println(&quot;id is &quot;+id);
482}
483</tt></pre>  For lexical rules such as:  <tt><pre>
484ID : w:. {System.out.println(&quot;w is &quot;+(char)w);};
485</tt></pre>  the following code would result:  <tt><pre>
486public void ID() {
487  int w = 0;
488  w = c;
489  consume(); // match wildcard (anything)
490  System.out.println(&quot;w is &quot;+(char)w);
491}
492</tt>  </pre> 
493<p>
494        Labels on rule references result in <tt>AST</tt> references, when generating trees, of the form <tt><i>label</i>_ast</tt>.
495</p>
496<p>
497        <b>Rule references.</b> Rule references become method calls. Arguments to rules become arguments to the invoked methods. Return values are assigned like Java assignments. Consider rule reference <tt>i=list[1]</tt> to rule: <tt>
498</p>
499<pre>
500list[int scope] returns int
501    :   { return scope+3; }
502    ;
503</tt></pre>  The rule reference would be translated to:  <tt><pre>
504i = list(1);
505</tt>  </pre> 
506<p>
507        <b>Semantic actions.</b> Actions are translated verbatim to the output parser or lexer except for the <a href="trees.html#Action Translation">translations required for AST generation</a> and the following:
508
509<ul>
510<li><tt>$FOLLOW(r)</tt>: FOLLOW set name for rule r
511<li><tt>$FIRST(r)</tt>: FIRST set name for rule r
512</ul>
513
514<p>
515Omitting the rule argument implies you mean the current rule.  The result type is a <tt>BitSet</tt>, which you can test via <tt>$FIRST(a).member(LBRACK)</tt>, etc.
516
517<p>
518Here is a sample rule:
519
520<pre>
521a : A {System.out.println($FIRST(a));} B
522  exception
523    catch [RecognitionException e] {
524        if ( $FOLLOW.member(SEMICOLON) ) {
525            consumeUntil(SEMICOLON);
526        }
527        else {
528            consume();
529        }
530    }
531  ;
532</pre>
533
534This results in:
535
536<pre>
537public final void a() throws RecognitionException, TokenStreamException { 
538    try {
539        match(A);
540        System.out.println(_tokenSet_0);
541        match(B);
542    }
543    catch (RecognitionException e) {
544        if ( _tokenSet_1.member(SEMICOLON) ) {
545            consumeUntil(SEMICOLON);
546        }
547        else {
548            consume();
549        }
550    }
551}
552</pre>
553
554<p>
555        To add members to a lexer or parser class definition, add the class member definitions enclosed in {} immediately following the class specification, for example: <tt>
556</p>
557<pre>
558class MyParser;
559{
560   protected int i;
561   public MyParser(TokenStream lexer,
562        int aUsefulArgument) {
563      i = aUsefulArgument;
564   }
565}
566... rules ...
567</tt></pre> 
568<p>
569        ANTLR collects everything inside the {...} and inserts it in the class definition before the rule-method definitions. When generating C++, this may have to be extended to allow actions after the rules due to the wacky ordering restrictions of C++.
570</p>
571
572<h3><a name="_bb10">Standard Classes</a></h3> 
573<p>
574        ANTLR constructs parser classes that are subclasses of the <tt>antlr.LLkParser</tt> class, which is a subclass of the <tt>antlr.Parser</tt> class. We summarize the more important members of these classes here. See Parser.java and LLkParser.java for details of the implementation. <tt>
575</p>
576<pre>
577public abstract class Parser {
578   protected ParserSharedInputState inputState;
579   protected ASTFactory astFactory;
580   public abstract int LA(int i);
581   public abstract Token LT(int i);
582   public abstract void consume();
583   public void consumeUntil(BitSet set) { ... }
584   public void consumeUntil(int tokenType) { ... }
585   public void match(int t)
586      throws MismatchedTokenException { ... }
587   public void matchNot(int t)
588      throws MismatchedTokenException { ... }
589   ...
590}
591
592public class LLkParser extends Parser {
593   public LLkParser(TokenBuffer tokenBuf, int k_)
594     { ... }
595   public LLkParser(TokenStream lexer, int k_)
596     { ... }
597   public int LA(int i) { return input.LA(i); }
598   public Token LT(int i) { return input.LT(i); }
599   public void consume() { input.consume(); }
600   ...
601}
602</pre> </tt><h2><a name="_bb11">Lexer Implementation</a></h2> <h3><a name="_bb12">Lexer Form</a></h3> 
603<p>
604        The lexers produced by ANTLR are a lot like the parsers produced by ANTLR. The only major differences are that (a) scanners use characters instead of tokens, and (b) ANTLR generates a special <tt>nextToken</tt> rule for each scanner, which is a production containing each public lexer rule as an alternative. The name of the lexical grammar class provided by the programmer results in a subclass of <tt>CharScanner</tt>; for example: <tt>
605</p>
606<pre>
607public class MyLexer extends antlr.CharScanner
608  implements MyLexerTokenTypes, TokenStream
609{
610  public MyLexer(InputStream in) {
611          this(new ByteBuffer(in));
612  }
613  public MyLexer(Reader in) {
614          this(new CharBuffer(in));
615  }
616  public MyLexer(InputBuffer ib) {
617          this(new LexerSharedInputState(ib));
618  }
619  public MyLexer(LexerSharedInputState state) {
620          super(state);
621          caseSensitiveLiterals = true;
622          setCaseSensitive(true);
623          literals = new Hashtable();
624  }
625
626  public Token nextToken() throws TokenStreamException {
627     <i>scanning logic</i>
628    ...
629  }
630  <i>recursive and other non-inlined lexical methods</i>
631  ...
632}
633</tt>  </pre> 
634<p>
635        When an ANTLR-generated parser needs another token from its lexer, it calls a method called <tt>nextToken</tt>. The general form of the <tt>nextToken</tt> method is: <tt>
636</p>
637<pre>
638public Token nextToken()
639  throws TokenStreamException {
640  int tt;
641  for (;;) {
642     try {
643        resetText();
644        switch ( c ) {
645        <i>case for each char predicting lexical rule</i>
646           <i>call lexical rule gets token type -&gt;</i> tt
647        default :
648           throw new NoViableAltForCharException(
649               &quot;bad char: '&quot;+(char)c+&quot;'&quot;);
650        }
651        if ( tt!=Token.SKIP ) {
652           return makeToken(tt);
653        }
654     }
655     catch (RecognitionException ex) {
656        reportError(ex.toString());
657     }
658  }
659}
660</tt>  </pre> 
661<p>
662        For example, the lexical rules: <tt>
663</p>
664<pre>
665lexclass Lex;
666
667WS   : ('\t' | '\r' | ' ') {_ttype=Token.SKIP;} ;
668PLUS : '+';
669MINUS: '-';
670INT  : ( '0'..'9' )+ ;
671ID   : ( 'a'..'z' )+ ;
672UID  : ( 'A'..'Z' )+ ;
673</tt></pre>  would result in something like:  <tt><pre>
674public class Lex extends CharScanner
675  implements TTokenTypes {
676...
677public Token nextToken()
678    throws TokenStreamException {
679    int _tt = Token.EOF_TYPE;
680    for (;;) {
681    try {
682       resetText();
683       switch ( _c ) {
684       case '\t': case '\r': case ' ':
685           _tt=mWS();
686           break;
687       case '+':
688           _tt=mPLUS();
689           break;
690       case '-':
691           _tt=mMINUS();
692           break;
693       case '0': case '1': case '2': case '3':
694       case '4': case '5': case '6': case '7':
695       case '8': case '9':
696           _tt=mINT();
697           break;
698       case 'a': case 'b': case 'c': case 'd':
699       case 'e': case 'f': case 'g': case 'h':
700       case 'i': case 'j': case 'k': case 'l':
701       case 'm': case 'n': case 'o': case 'p':
702       case 'q': case 'r': case 's': case 't':
703       case 'u': case 'v': case 'w': case 'x':
704       case 'y': case 'z':
705           _tt=mID();
706           break;
707       case 'A': case 'B': case 'C': case 'D':
708       case 'E': case 'F': case 'G': case 'H':
709       case 'I': case 'J': case 'K': case 'L':
710       case 'M': case 'N': case 'O': case 'P':
711       case 'Q': case 'R': case 'S': case 'T':
712       case 'U': case 'V': case 'W': case 'X':
713       case 'Y': case 'Z':
714           _tt=mUID();
715           break;
716       case EOF_CHAR :
717           _tt = Token.EOF_TYPE;
718           break;
719       default :
720          throw new NoViableAltForCharException(
721               &quot;invalid char &quot;+_c);
722       }
723       if ( _tt!=Token.SKIP ) {
724           return makeToken(_tt);
725       }
726    }  // try
727        catch (RecognitionException ex) {
728          reportError(ex.toString());
729        }
730        }  // for
731}
732
733public int mWS()
734    throws RecognitionException,
735           CharStreamException,
736           TokenStreamException {
737    int _ttype = WS;
738    switch ( _c) {
739    case '\t':
740        match('\t');
741        break;
742    case '\r':
743        match('\r');
744        break;
745    case ' ':
746        match(' ');
747        break;
748    default :
749    {
750        throw new NoViableAltForCharException(
751               &quot;no viable alternative for char: &quot;+(char)_c);
752    }
753    }
754     _ttype = Token.SKIP;
755    return _ttype;
756}
757
758public int mPLUS()
759    throws RecognitionException,
760           CharStreamException,
761           TokenStreamException {
762    int _ttype = PLUS;
763    match('+');
764    return _ttype;
765}
766
767public int mMINUS()
768    throws RecognitionException,
769           CharStreamException,
770           TokenStreamException {
771
772    int _ttype = MINUS;
773    match('-');
774    return _ttype;
775}
776
777public int mINT()
778    throws RecognitionException,
779           CharStreamException,
780           TokenStreamException {
781
782    int _ttype = INT;
783    {
784    int _cnt=0;
785    _loop:
786    do {
787        if ( _c&gt;='0' &amp;&amp; _c&lt;='9')
788          { matchRange('0','9'); }
789        else
790        if ( _cnt&gt;=1 ) break _loop;
791        else {
792           throw new NoViableAltForCharException(
793              &quot;no viable alternative for char: &quot;+
794                (char)_c);
795        }
        _cnt++;
    } while (true);
    }
    return _ttype;
}

public int mID()
    throws RecognitionException,
           CharStreamException,
           TokenStreamException {
    int _ttype = ID;
    {
    int _cnt=0;
    _loop:
    do {
        if ( _c&gt;='a' &amp;&amp; _c&lt;='z')
        { matchRange('a','z'); }
        else
        if ( _cnt&gt;=1 ) break _loop;
        else {
            throw new NoViableAltForCharException(
               &quot;no viable alternative for char: &quot;+
                 (char)_c);
        }
        _cnt++;
    } while (true);
    }
    return _ttype;
}

public int mUID()
    throws RecognitionException,
           CharStreamException,
           TokenStreamException {

    int _ttype = UID;
    {
    int _cnt=0;
    _loop:
    do {
        if ( _c&gt;='A' &amp;&amp; _c&lt;='Z')
        { matchRange('A','Z'); }
        else
        if ( _cnt&gt;=1 ) break _loop;
        else {
            throw new NoViableAltForCharException(
               &quot;no viable alternative for char: &quot;+
                 (char)_c);
        }
        _cnt++;
    } while (true);
    }
    return _ttype;
}

}
</tt>  </pre> 
<p>
        ANTLR-generated lexers assume that you will be reading streams of characters. If this is not the case, you must create your own lexer.
</p>
<h3><a name="_bb13">Creating Your Own Lexer</a></h3> 
<p>
        To create your own lexer, the Java class that will do the lexing must implement interface <tt>TokenStream</tt>, which simply states that you must be able to return a stream of tokens via <tt>nextToken</tt>: <tt>
</p>
<pre>
/**This interface allows any object to
 * pretend it is a stream of tokens.
 * @author Terence Parr, MageLang Institute
 */
public interface TokenStream {
  public Token nextToken();
}
</tt>  </pre> 
<p>
        ANTLR will not generate a lexer if you do not specify a lexical class.
</p>
<p>
        Launching a parser with a non-ANTLR-generated lexer is the same as launching a parser with an ANTLR-generated lexer:
</p>
<tt><pre>HandBuiltLexer lex = new HandBuiltLexer(...);
MyParser p = new MyParser(lex);
p.<i>start-rule</i>();</tt></pre> 
<p>
        The parser does not care what kind of object you use for scanning as long as it can answer <tt>nextToken</tt>.
</p>
<p>
        If you build your own lexer, and the token values are also generated by that lexer, then you should inform the ANTLR-generated parsers about the token type values generated by that lexer. Use the <a href="options.html#importVocab">importVocab</a> option in the parsers that use the externally-generated token set, and create a token definition file following the requirements of the importVocab option.
</p>
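<p>
        For illustration, here is a minimal hand-built lexer of this kind. It is only a sketch: the <tt>SimpleToken</tt> and <tt>SimpleTokenStream</tt> classes below are simplified stand-ins for ANTLR's <tt>Token</tt> and <tt>TokenStream</tt>, and the <tt>WordLexer</tt> class and <tt>WORD</tt> token type are invented for this example:
</p>

```java
// Simplified stand-ins for ANTLR's Token and TokenStream types, so this
// sketch is self-contained; the real classes live in the antlr package.
class SimpleToken {
    public static final int EOF_TYPE = 1;
    private final int type;
    private final String text;
    SimpleToken(int type, String text) { this.type = type; this.text = text; }
    public int getType() { return type; }
    public String getText() { return text; }
}

interface SimpleTokenStream {
    SimpleToken nextToken();
}

// A hand-built lexer: splits its input on whitespace and returns one WORD
// token per chunk, then EOF tokens forever. WORD is a hypothetical type.
class WordLexer implements SimpleTokenStream {
    public static final int WORD = 4; // user types start at MIN_USER_TYPE (3)
    private final String[] words;
    private int i = 0;
    WordLexer(String input) {
        String s = input.trim();
        words = s.isEmpty() ? new String[0] : s.split("\\s+");
    }
    public SimpleToken nextToken() {
        if (i >= words.length) return new SimpleToken(SimpleToken.EOF_TYPE, "<EOF>");
        return new SimpleToken(WORD, words[i++]);
    }
}
```

<p>
        A parser would consume this object exactly as it would an ANTLR-generated lexer, calling <tt>nextToken</tt> until it sees an EOF-typed token.
</p>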
<h3><a name="_bb14"></a><a name="Lexical Rules">Lexical Rules</a> </h3> 
<p>
        Lexical rules are essentially the same as parser rules except that lexical rules apply a structure to a series of characters rather than a series of tokens. As with parser rules, each lexical rule results in a method in the output lexer class.
</p>
<p>
        <b>Alternative blocks.</b> Consider a simple series of alternatives within a block: <tt>
</p>
<pre>
FORMAT : 'x' | 'f' | 'd';
</tt>  </pre> 
<p>
        The lexer would contain the following method: <tt>
</p>
<pre>
public int mFORMAT() {
  if ( c=='x' ) {
    match('x');
  }
  else if ( c=='f' ) {
    match('f');
  }
  else if ( c=='d' ) {
    match('d');
  }
  else {
    throw new NoViableAltForCharException(
        &quot;no viable alternative: '&quot;+(char)c+&quot;'&quot;);
  }
  return FORMAT;
}
</tt>  </pre> 
<p>
        The only real differences between lexical methods and grammar methods are that lookahead prediction expressions do character comparisons rather than <tt>LA(i)</tt> comparisons, <tt>match</tt> matches characters instead of tokens, a <tt>return</tt> is added to the bottom of the rule, and lexical methods throw <tt>CharStreamException</tt> objects in addition to <font face="Courier New">TokenStreamException</font> and <font face="Courier New">RecognitionException</font> objects.
</p>
<p>
        <b>Optimization: Non-Recursive lexical rules.</b> Rules that do not directly or indirectly call themselves can be inlined into the lexer entry method: <tt>nextToken</tt>. For example, a simple integer rule would be placed directly into the <tt>nextToken</tt> method. That is, rule: <tt>
</p>
<pre>
INT :   ( '0'..'9' )+
    ;
</tt> </pre> 
<p>
        would not result in a method in your lexer class. This rule would become part of the resulting lexer, as it would probably be inlined by ANTLR: <tt>
</p>
<pre>
public Token nextToken() {
  switch ( c ) {
  <i>cases for operators and such here</i>
  case '0': // chars that predict INT token
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9':
    while ( c&gt;='0' &amp;&amp; c&lt;='9' ) {
      matchRange('0','9');
    }
    return makeToken(INT);
  default :
    <i>check harder stuff here like rules
      beginning with a..z</i>
  }
}
</tt>  </pre> 
<p>
        If not inlined, the method for scanning integers would look like: <tt>
</p>
<pre>
public int mINT() {
  while ( c&gt;='0' &amp;&amp; c&lt;='9' ) {
    matchRange('0','9');
  }
  return INT;
}
</tt>  </pre> 
<p>
        where token names are converted to method names by prefixing them with the letter <tt>m</tt>. The <tt>nextToken</tt> method would become: <tt>
</p>
<pre>
public Token nextToken() {
  switch ( c ) {
  <i>cases for operators and such here</i>
  case '0': // chars that predict INT token
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9':
    return makeToken(mINT());
  default :
    <i>check harder stuff here like rules
      beginning with a..z</i>
  }
}
</tt>  </pre> 
<p>
        Note that this type of range loop is so common that it should probably be optimized to: <tt>
</p>
<pre>
while ( c&gt;='0' &amp;&amp; c&lt;='9' ) {
  consume();
}
</tt>  </pre> 
<p>
        <b>Optimization: Recursive lexical rules.</b> Lexical rules that are directly or indirectly recursive are not inlined. For example, consider the following rule that matches nested actions: <tt>
</p>
<pre>
ACTION
    :   '{' ( ACTION | ~'}' )* '}'
    ;
</tt>  </pre> 
<p>
        <tt>ACTION</tt> would result in (assuming a character vocabulary of 'a'..'z', '{', '}'): <tt>
</p>
<pre>
public int mACTION()
    throws RecognitionException,
           CharStreamException,
           TokenStreamException {

    int _ttype = ACTION;
    match('{');
    {
    _loop:
    do {
        switch ( _c) {
        case '{':
            mACTION();
            break;
        case 'a': case 'b': case 'c': case 'd':
        case 'e': case 'f': case 'g': case 'h':
        case 'i': case 'j': case 'k': case 'l':
        case 'm': case 'n': case 'o': case 'p':
        case 'q': case 'r': case 's': case 't':
        case 'u': case 'v': case 'w': case 'x':
        case 'y': case 'z':
            matchNot('}');
            break;
        default :
            break _loop;
        }
    } while (true);
    }
    match('}');
    return _ttype;
}
</tt>       </pre> <h2><a name="_bb15">Token Objects</a></h2> 
<p>
        The basic token knows only about a token type:
</p>
<pre><tt>public class Token {
  // constants
  public static final int MIN_USER_TYPE = 3;
  public static final int INVALID_TYPE = 0;
  public static final int EOF_TYPE = 1;
  public static final int SKIP = -1;
 
  // each Token has at least a token type
  int type=INVALID_TYPE;
 
  // the illegal token object
  public static Token badToken =
    new Token(INVALID_TYPE, &quot;&lt;no text&gt;&quot;);
 
  public Token() {;}
  public Token(int t) { type = t; }
  public Token(int t, String txt) {
    type = t; setText(txt);
  }

  public void setType(int t) { type = t; }
  public void setLine(int l) {;}
  public void setColumn(int c) {;}
  public void setText(String t) {;}
 
  public int getType() { return type; }
  public int getLine() { return 0; }
  public int getColumn() { return 0; }
  public String getText() {...}
}
</tt></pre> 
        <p>
                The raw <tt>Token</tt> class is not very useful.&nbsp; ANTLR supplies a &quot;common&quot; token class that it uses by default, which contains the line number and text associated with the token:
        </p>
<pre><tt>public class CommonToken extends Token {
  // most tokens will want line, text information
  int line;
  String text = null;

  public CommonToken() {}
  public CommonToken(String s)  { text = s; }
  public CommonToken(int t, String txt) {
    type = t;
    setText(txt);
  }

  public void setLine(int l)    { line = l; }
  public int  getLine()         { return line; }
  public void setText(String s) { text = s; }
  public String getText()       { return text; }
}</tt></pre> 
        <p>
                ANTLR will generate an interface that defines the types of tokens in a token vocabulary. Parsers and lexers that share this token vocabulary are generated such that they implement the resulting token types interface: <tt>
        </p>
<pre>public interface MyLexerTokenTypes {
  public static final int ID = 2;
  public static final int BEGIN = 3;
  ...
}</tt></pre> 
        <p>
                ANTLR defines a token object for use with the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> object called <small><font face="Courier New">CommonHiddenStreamToken</font></small>:
        </p>
<pre>public class CommonHiddenStreamToken
  extends CommonToken {
  protected CommonHiddenStreamToken hiddenBefore;
  protected CommonHiddenStreamToken hiddenAfter;

  public CommonHiddenStreamToken
    <strong>getHiddenAfter</strong>() {...}
  public CommonHiddenStreamToken
    <strong>getHiddenBefore</strong>() {...}
}</pre> 
        <p>
                Hidden tokens are woven among the normal tokens.&nbsp; Note that, for garbage collection reasons, hidden tokens never point back to normal tokens (preventing a linked list of the entire token stream).
        </p>
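<p>
        As a sketch of how the <tt>hiddenBefore</tt> chain can be used (the <tt>HiddenToken</tt> class below is a simplified stand-in for <tt>CommonHiddenStreamToken</tt>, and the helper method is hypothetical), an action can walk backwards through the hidden tokens preceding a normal token to recover, say, the whitespace and comments in front of it:
</p>

```java
// Simplified stand-in for ANTLR's CommonHiddenStreamToken: just the text
// and the hiddenBefore link used in this sketch.
class HiddenToken {
    private final String text;
    HiddenToken hiddenBefore; // previous hidden token, or null
    HiddenToken(String text) { this.text = text; }
    public HiddenToken getHiddenBefore() { return hiddenBefore; }
    public String getText() { return text; }
}

class HiddenTokenDemo {
    // Hypothetical helper: concatenate the text of all hidden tokens that
    // immediately precede t, restoring their original left-to-right order.
    static String hiddenTextBefore(HiddenToken t) {
        StringBuilder sb = new StringBuilder();
        for (HiddenToken h = t.getHiddenBefore(); h != null; h = h.getHiddenBefore()) {
            sb.insert(0, h.getText()); // we walk backwards, so prepend
        }
        return sb.toString();
    }
}
```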
        <h2><a name="_bb16">Token Lookahead Buffer</a></h2> 
        <p>
                The parser must always have fast access to k symbols of lookahead. In a world without syntactic predicates, a simple buffer of k <tt>Token</tt> references would suffice. However, given that even LL(1) ANTLR parsers must be able to backtrack, an arbitrarily-large buffer of <tt>Token</tt> references must be maintained. <tt>LT(i)</tt> looks into the token buffer.
        </p>
        <p>
                Fortunately, the parser itself does not implement the token-buffering and lookahead algorithm. That is handled by the <tt>TokenBuffer</tt> object. We begin the discussion of lookahead by providing an LL(k) parser framework: <tt>
        </p>
<pre>
public class LLkParser extends Parser {
   TokenBuffer input;
   public int LA(int i) {
      return input.LA(i);
   }
   public Token LT(int i) {
      return input.LT(i);
   }
   public void consume() {
      input.consume();
   }
}
</tt>       </pre> 
        <p>
                All lookahead-related calls are simply forwarded to the <tt>TokenBuffer</tt> object. In the future, some simple caching may be performed in the parser itself to avoid the extra indirection, or ANTLR may generate the call to input.LT(i) directly.
        </p>
        <p>
                The <tt>TokenBuffer</tt> object caches the token stream emitted by the scanner. It supplies <tt>LT()</tt> and <tt>LA()</tt> methods for accessing the k<sup>th</sup> lookahead token or token type, as well as methods for consuming tokens, guessing, and backtracking. <tt>
        </p>
<pre>
public class TokenBuffer {
   ...
   /** Mark another token for
    *  deferred consumption */
   public final void consume() {...}

   /** Get a lookahead token */
   public final Token LT(int i) { ... }

   /** Get a lookahead token value */
   public final int LA(int i) { ... }

   /**Return an integer marker that can be used to
    * rewind the buffer to its current state. */
   public final int mark() { ... }

   /**Rewind the token buffer to a marker.*/
   public final void rewind(int mark) { ... }
}
</tt></pre> 
        <p>
                To begin backtracking, a <tt>mark</tt> is issued, which makes the <tt>TokenBuffer</tt> record the current position so that it can rewind the token stream. A subsequent <tt>rewind</tt> directive will reset the internal state to the point before the last <tt>mark</tt>.
        </p>
        <p>
                Consider the following rule that employs backtracking:
        </p>
        <tt><pre>
stat:   (list EQUAL) =&gt; list EQUAL list
    |   list
    ;
list:   LPAREN (ID)* RPAREN
    ;
</tt> </pre> 
<p>
        Something like the following code would be generated: <tt>
</p>
<pre>
public void stat()
  throws RecognitionException,
         TokenStreamException
{
  boolean synPredFailed = false;
  if ( LA(1)==LPAREN ) { // check lookahead
    int marker = tokenBuffer.mark();
    try {
      list();
      match(EQUAL);
    }
    catch (RecognitionException e) {
      synPredFailed = true;
    }
    // rewind whether or not the guess worked; the
    // chosen alternative reparses from the mark
    tokenBuffer.rewind(marker);
  }
  if ( LA(1)==LPAREN &amp;&amp; !synPredFailed ) {
    // test prediction of alt 1
    list();
    match(EQUAL);
    list();
  }
  else if ( LA(1)==LPAREN ) {
    list();
  }
}
</tt>      </pre> 
<p>
        The token lookahead buffer uses a circular token buffer to perform quick indexed access to the lookahead tokens. The circular buffer is expanded as necessary to calculate LT(i) for arbitrary i. <tt>TokenBuffer.consume()</tt> does not actually read more tokens. Instead, it defers the read by counting how many tokens have been consumed, and then adjusts the token buffer and/or reads new tokens when LA() or LT() is called.
</p>
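<p>
        The deferred-consumption scheme can be sketched as follows. This is a much-simplified, non-circular stand-in for <tt>TokenBuffer</tt> that tracks token types only; the real class buffers <tt>Token</tt> objects in a growable circular array:
</p>

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Much-simplified sketch of TokenBuffer's scheme: consume() merely advances
// an index; tokens are pulled from the source lazily, when LA() needs them.
class SimpleTokenBuffer {
    private static final int EOF_TYPE = 1;    // as in antlr.Token
    private final Iterator<Integer> source;   // stands in for the lexer
    private final List<Integer> buf = new ArrayList<>(); // types read so far
    private int p = 0;                        // index of next unconsumed token

    SimpleTokenBuffer(List<Integer> tokenTypes) { source = tokenTypes.iterator(); }

    /** i-th (1-based) lookahead token type; reads from the source on demand. */
    public int LA(int i) {
        while (buf.size() < p + i && source.hasNext()) buf.add(source.next());
        return (p + i <= buf.size()) ? buf.get(p + i - 1) : EOF_TYPE;
    }

    /** Deferred consumption: no read happens here, only bookkeeping. */
    public void consume() { p++; }

    /** Remember the current position so a failed guess can be undone. */
    public int mark() { return p; }

    /** Rewind to a marker, "un-consuming" everything since mark(). */
    public void rewind(int marker) { p = marker; }
}
```

<p>
        Because <tt>mark</tt> and <tt>rewind</tt> only save and restore an index, backtracking is cheap; the cost of buffering is paid once, when each token is first read.
</p>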
<p>
        <font face="Arial" size="2">Version: $Id: //depot/code/org.antlr/release/antlr-2.7.7/doc/runtime.html#2 $</font> 
</p>
</body>
</html>