1<html>
2<head>
3        <title>ANTLR Specification: Run-time</title> 
4</head>
5<body bgcolor="#FFFFFF" text="#000000">
6<h1><a name="_bb1">Java Runtime Model</a></h1> 
7<hr>
8<h2><a name="_bb2">Programmer's Interface</a></h2> 
9<p>
10        In this section, we describe what ANTLR generates after reading your grammar file and how to use that output to parse input. The classes from which your lexer, token, and parser classes are derived are provided as well.
11</p>
12<h3><a name="_bb3">What ANTLR generates</a></h3> 
13<p>
14        ANTLR generates the following types of files, where <i>MyParser</i>, <i>MyLexer</i>, and <i>MyTreeParser</i> are the names of the grammar classes specified in the grammar file. You may have an arbitrary number of parsers, lexers, and tree-parsers per grammar file; a separate class file will be generated for each. In addition, token type files will be generated containing the token vocabularies used in the parsers and lexers. One or more token vocabularies may be defined in a grammar file and shared between different grammars. For example, given the grammar file: <tt>
15</p>
16<pre>class MyParser extends Parser;
17options {
18  exportVocab=My;
19}
20... rules ...
21
22class MyLexer extends Lexer;
23options {
24  exportVocab=My;
25}
26... rules ...
27
28class MyTreeParser extends TreeParser;
29options {
30  exportVocab=My;
31}
32... rules ...</tt></pre> 
33<p>
34        The following files will be generated:
35<ul>
36        <li>
37                <tt><i>MyParser</i>.java</tt>. The parser with member methods for the parser rules.
38        </li>
39        <li>
40                <tt><i>MyLexer</i>.java</tt>. The lexer with the member methods for the lexical rules.
41        </li>
42        <li>
43                <tt><i>MyTreeParser</i>.java</tt>. The tree-parser with the member methods for the tree-parser rules.
44        </li>
45        <li>
46                <tt><i>My</i>TokenTypes.java</tt>. An interface containing all of the token types defined by your parsers and lexers using the exported vocabulary named <tt>My</tt>.
47        </li>
48        <li>
49                <tt><i>My</i>TokenTypes.txt</tt>. A text file containing all of the token types, literals, and paraphrases defined by parsers and lexers contributing vocabulary <tt>My</tt>.
50        </li>
51</ul>
52<p>
53        The programmer uses the classes by referring to them:
54<ol>
55        <li>
56                Create a lexical analyzer. The constructor with no arguments implies that you want to read from standard input.
57        </li>
58        <li>
59                Create a parser and attach it to the lexer (or other TokenStream).
60        </li>
61        <li>
62                Call one of the methods in the parser to begin parsing.
63        </li>
64</ol>
65<p>
66        If your parser generates an AST, then get the AST value, create a tree-parser, and invoke one of the tree-parser rules using the AST.
67</p>
68<tt><pre>MyLexer lex = new MyLexer();
69MyParser p =
70  new MyParser(lex,<i>user-defined-args-if-any</i>);
71p.<i>start-rule</i>();
72// and, if you are tree parsing the result...
73MyTreeParser tp = new MyTreeParser();
74tp.<i>start-rule</i>(p.getAST());</tt></pre> 
75<p>
76        You can also specify the name of the token and/or AST objects that you want the lexer/parser to create. Java's support for dynamic class loading makes this quite painless:
77</p>
78<pre><tt>MyLexer lex = new MyLexer();
79lex.setTokenObjectClass(&quot;mypackage.MyToken&quot;);
80  // defaults to &quot;antlr.CommonToken&quot;
81...
82parser.setASTNodeClass(&quot;mypackage.MyASTNode&quot;);
83  // defaults to &quot;antlr.CommonAST&quot;</tt></pre> 
84<p>
85Make sure you give a fully-qualified class name.
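<p>Under the hood this is plain Java reflection: ANTLR can be assumed to load the class you name via <tt>Class.forName</tt>, which is why the fully-qualified name is required. A minimal sketch of the mechanism (the helper class here is hypothetical, not part of ANTLR):</p>

```java
// Hypothetical sketch of the reflection mechanism that setTokenObjectClass /
// setASTNodeClass rely on: load a class by its fully-qualified name and
// instantiate it with its no-argument constructor.
class TokenClassLoader {
    // className must be fully qualified, e.g. "mypackage.MyToken"
    static Object instantiate(String className) throws Exception {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }
}
```

If the name is not fully qualified, <tt>Class.forName</tt> throws <tt>ClassNotFoundException</tt> at run time rather than failing at compile time.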
86<p>
87The lexer and parser can throw IOExceptions as well as RecognitionExceptions, which you must catch:
88</p>
89<pre>  CalcLexer lexer =
90    new CalcLexer(new DataInputStream(System.in));
91  CalcParser parser = new CalcParser(lexer);
92  // Parse the input expression
93  try {
94    parser.expr();
95  }
96  catch (IOException io) {
97    System.err.println(&quot;IOException&quot;);
98  }
99  catch(RecognitionException e) {
100    System.err.println(&quot;exception: &quot;+e);
101  }</pre> <h3><a name="sharingstate">Multiple Lexers/Parsers With Shared Input State</a></h3> 
102<p>
103        Occasionally, you will want two parsers or two lexers to share input state; that is, you will want them to pull input from the same source (token stream or character stream). The section on <a href="streams.html#lexerstates">multiple lexer &quot;states&quot;</a> describes such a situation.
104</p>
105<p>
106        ANTLR factors input variables such as the line number, guessing state, and input stream into a separate object so that another lexer or parser can share that state.  The <tt>LexerSharedInputState</tt> and <tt>ParserSharedInputState</tt> classes embody this factoring.  Method <tt>getInputState()</tt> can be used on either <tt>CharScanner</tt> or <tt>Parser</tt> objects.  Here is how to construct two lexers sharing the same input stream:
107</p>
108<pre>// create Java lexer
109JavaLexer mainLexer = new JavaLexer(input);
110// create javadoc lexer; attach to shared
111// input state of java lexer
112JavaDocLexer doclexer =
113  new JavaDocLexer(mainLexer.getInputState());</pre> 
114<p>
115        Parsers with shared input state can be created similarly:
116</p>
117<pre>JavaDocParser jdocparser =
118  new JavaDocParser(getInputState());
119jdocparser.content(); // go parse the comment</pre> 
120<p>
121        Sharing state is easy, but what happens upon an exception during the execution of the &quot;subparser&quot;?  What about syntactic predicate execution?  It turns out that invoking a subparser with the same input state is exactly the same as calling another rule in the same parser as far as error handling and syntactic predicate guessing are concerned.  If the parser is guessing before the call to the subparser, the subparser must continue guessing, right?  Exceptions thrown inside the subparser must exit the subparser and return to the enclosing error handler or syntactic predicate handler.
122</p>
123<h2><a name="_bb4">Parser Implementation</a></h2> <h3><a name="_bb5">Parser Class</a></h3> 
124<p>
125        ANTLR generates a parser class (an extension of <tt>LLkParser</tt>) that contains a method for every rule in your grammar. The general format looks like: <tt>
126</p>
127<pre>
128public class MyParser extends LLkParser
129    implements MyLexerTokenTypes
130{
131  protected MyParser(TokenBuffer tokenBuf, int k) {
132    super(tokenBuf,k);
133    tokenNames = _tokenNames;
134  }
135  public MyParser(TokenBuffer tokenBuf) {
136    this(tokenBuf,1);
137  }
138  protected MyParser(TokenStream lexer, int k) {
139    super(lexer,k);
140    tokenNames = _tokenNames;
141  }
142  public MyParser(TokenStream lexer) {
143    this(lexer,1);
144  }
145  public MyParser(ParserSharedInputState state) {
146    super(state,1);
147    tokenNames = _tokenNames;
148  }
149  ...
150  // add your own constructors here...
151  <i>rule-definitions</i>
152}
153</tt>  </pre> <h3><a name="_bb6">Parser Methods</a></h3> 
154<p>
155        ANTLR generates recursive-descent parsers; therefore, every rule in the grammar results in a method that applies the specified grammatical structure to the input token stream. The general form of a parser method looks like: <tt>
156</p>
157<pre>
158public void rule()
159  throws RecognitionException,
160         TokenStreamException
161{
162  <i>init-action-if-present</i>
163  if ( <i>lookahead-predicts-production-1</i> ) {
164     <i>code-to-match-production-1</i>
165  }
166  else if ( <i>lookahead-predicts-production-2</i> ) {
167     <i>code-to-match-production-2</i>
168  }
169  ...
170  else if ( <i>lookahead-predicts-production-n</i> ) {
171     <i>code-to-match-production-n</i>
172  }
173  else {
174    // syntax error
175    throw new NoViableAltException(LT(1));
176  }
177}
178</tt></pre>  This code results from a rule of the form:  <tt><pre>
179rule:   <i>production-1</i>
180    |   <i>production-2</i>
181   ...
182    |   <i>production-n</i>
183    ;
184</tt>  </pre> 
185<p>
186        If you have specified arguments and a return type for the rule, the method header changes to: <tt>
187</p>
188<pre>
189/* generated from:
190 *    rule(<i>user-defined-args</i>)
191 *      returns <i>return-type</i> : ... ;
192 */
193public <i>return-type</i> rule(<i>user-defined-args</i>)
194  throws RecognitionException,
195         TokenStreamException
196{
197  ...
198}
199</tt>  </pre> 
200<p>
201        Token types are integers, and we make heavy use of bit sets and range comparisons to avoid excessively long test expressions.
202</p>
203<h3><a name="_bb7">EBNF Subrules</a></h3> 
204<p>
205        Subrules are like unlabeled rules; consequently, the code generated for an EBNF subrule mirrors that generated for a rule. The only difference is induced by the EBNF subrule operators that imply optionality or looping.
206</p>
207<p>
208        <b><tt>(...)?</tt> optional subrule</b>. The only difference between the code generated for an optional subrule and a rule is that there is no default <tt>else</tt>-clause to throw an exception--the recognition continues on having ignored the optional subrule. <tt>
209</p>
210<pre>
211{
212  <i>init-action-if-present</i>
213  if ( <i>lookahead-predicts-production-1</i> ) {
214     <i>code-to-match-production-1</i>
215  }
216  else if ( <i>lookahead-predicts-production-2</i> ) {
217     <i>code-to-match-production-2</i>
218  }
219  ...
220  else if ( <i>lookahead-predicts-production-n</i> ) {
221     <i>code-to-match-production-n</i>
222  }
223}
224</tt>  </pre> 
225<p>
226        Not testing the optional paths of optional blocks has the potential to delay the detection of syntax errors.
227</p>
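<p>As a hand-written illustration (not actual ANTLR output), here is the optional-subrule pattern applied to the toy rule <tt>rule : 'a' ('b')? 'c' ;</tt>; note the absence of an <tt>else</tt>-clause that throws:</p>

```java
// Hand-written illustration of the (...)? pattern for the toy rule
//   rule : 'a' ('b')? 'c' ;
// (hypothetical example, not actual ANTLR output)
class OptSubruleParser {
    private final String input;
    private int pos = 0;
    OptSubruleParser(String input) { this.input = input; }
    private char LA(int i) {               // lookahead; '\0' signals end of input
        int idx = pos + i - 1;
        return idx < input.length() ? input.charAt(idx) : '\0';
    }
    private void match(char c) {
        if (LA(1) != c) throw new IllegalArgumentException("expected '" + c + "'");
        pos++;
    }
    void rule() {
        match('a');
        if (LA(1) == 'b') {                // optional subrule: no else-clause that throws
            match('b');
        }
        match('c');
    }
}
```

Both <tt>"ac"</tt> and <tt>"abc"</tt> are accepted; recognition simply continues when the optional alternative is not predicted.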
228<p>
229        <b><tt>(...)*</tt> closure subrule</b>. A closure subrule is like an optional looping subrule; therefore, we wrap the code for a simple subrule in a &quot;forever&quot; loop that exits whenever the lookahead is not consistent with any of the alternative productions. <tt>
230</p>
231<pre>
232{
233  <i>init-action-if-present</i>
234loop:
235  do {
236    if ( <i>lookahead-predicts-production-1</i> ) {
237       <i>code-to-match-production-1</i>
238    }
239    else if ( <i>lookahead-predicts-production-2</i> ) {
240       <i>code-to-match-production-2</i>
241    }
242    ...
243    else if ( <i>lookahead-predicts-production-n</i> ) {
244       <i>code-to-match-production-n</i>
245    }
246    else {
247      break loop;
248    }
249  }
250  while (true);
251}
252</tt>  </pre> 
253<p>
254        While there is no need to explicitly test the lookahead for consistency with the exit path, the grammar analysis phase computes the lookahead of what follows the block. The lookahead of what follows must be disjoint from the lookahead of each alternative; otherwise, the loop will not know when to terminate. For example, consider the following subrule, which is nondeterministic upon token <tt>A</tt>. <tt>
255</p>
256<pre>
257( A | B )* A
258</tt>  </pre> 
259<p>
260        Upon <tt>A</tt>, should the loop continue or exit? One must also ask if the loop should even begin. Because you cannot answer these questions with only one symbol of lookahead, the decision is non-LL(1).
261</p>
262<p>
263        Not testing the exit paths of closure loops has the potential to delay the detection of syntax errors.
264</p>
265<p>
266        As a special case, a closure subrule with one alternative production results in: <tt>
267</p>
268<pre>
269{
270  <i>init-action-if-present</i>
271loop:
272  while ( <i>lookahead-predicts-production-1</i> ) {
273       <i>code-to-match-production-1</i>
274  }
275}
276 </tt>  </pre> 
277<p>
278        This special case results in smaller, faster, and more readable code.
279</p>
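<p>A hand-written analogue of this single-alternative special case, for the toy lexical rule <tt>INT : ('0'..'9')* ;</tt> (a sketch, not actual ANTLR output):</p>

```java
// Hand-written analogue of the single-alternative closure special case for
//   INT : ('0'..'9')* ;
// (hypothetical example, not actual ANTLR output)
class DigitsScanner {
    static int countDigits(String input) {
        int pos = 0;
        // while the lookahead predicts the single alternative, keep matching
        while (pos < input.length()
               && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') {
            pos++;                         // corresponds to matchRange('0','9')
        }
        return pos;                        // number of characters consumed
    }
}
```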
280<p>
281        <b><tt>(...)+</tt> positive closure subrule</b>. A positive closure subrule is a loop around a series of production prediction tests like a closure subrule. However, we must guarantee that at least one iteration of the loop is done before proceeding to the construct beyond the subrule.
282</p>
283<tt><pre>
284{
285  int _cnt = 0;
286  <i>init-action-if-present</i>
287loop:
288  do {
289    if ( <i>lookahead-predicts-production-1</i> ) {
290       <i>code-to-match-production-1</i>
291    }
292    else if ( <i>lookahead-predicts-production-2</i> ) {
293       <i>code-to-match-production-2</i>
294    }
295    ...
296    else if ( <i>lookahead-predicts-production-n</i> ) {
297       <i>code-to-match-production-n</i>
298    }
299    else if ( _cnt&gt;=1 ) {
300      // lookahead predicted nothing and we've
301      // done an iteration
302      break loop;
303    }
304    else {
305      throw new NoViableAltException(LT(1));
306    }
307    _cnt++;  // track times through the loop
308  }
309  while (true);
310}
311</tt>  </pre> 
312<p>
313        While there is no need to explicitly test the lookahead for consistency with the exit path, the grammar analysis phase computes the lookahead of what follows the block. The lookahead of what follows must be disjoint from the lookahead of each alternative; otherwise, the loop will not know when to terminate. For example, consider the following subrule, which is nondeterministic upon token <tt>A</tt>. <tt>
314</p>
315<pre>
316( A | B )+ A
317</tt>  </pre> 
318<p>
319        Upon <tt>A</tt>, should the loop continue or exit? Because you cannot answer this with only one symbol of lookahead, the decision is non-LL(1).
320</p>
321<p>
322        Not testing the exit paths of closure loops has the potential to delay the detection of syntax errors.
323</p>
324<p>
325        You might ask why we do not have a <tt>while</tt> loop that tests whether the lookahead is consistent with any of the alternatives (rather than having a series of tests inside the loop with a <tt>break</tt>). It turns out that we can generate smaller code for a series of tests than for one big test. Moreover, the individual tests must be done anyway to distinguish between alternatives, so a <tt>while</tt> condition would be redundant.
326</p>
327<p>
328        As a special case, if there is only one alternative, the following is generated: <tt>
329</p>
330<pre>
331{
332  <i>init-action-if-present</i>
333  do {
334    <i>code-to-match-production-1</i>
335  }
336  while ( <i>lookahead-predicts-production-1</i> );
337}
338</tt>  </pre> 
339<p>
340        <b>Optimization.</b> When there are a large (where large is user-definable) number of strictly LL(1) prediction alternatives, then a <tt>switch</tt>-statement can be used rather than a sequence of <tt>if</tt>-statements. The non-LL(1) cases are handled by generating the usual <tt>if</tt>-statements in the <tt>default</tt> case. For example: <tt>
341</p>
342<pre>
343switch ( LA(1) ) {
344  case KEY_WHILE :
345  case KEY_IF :
346  case KEY_DO :
347    statement();
348    break;
349  case KEY_INT :
350  case KEY_FLOAT :
351    declaration();
352    break;
353  default :
354    // do whatever else-clause is appropriate
355}
356</tt>  </pre> 
357<p>
358        This optimization relies on the compiler generating a more direct jump (via a jump table or hash table) to the code matching the ith production. It is also more readable and faster than a series of bit set membership tests.
359</p>
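<p>The switch above can be sketched as self-contained Java; the token type constants below are hypothetical stand-ins for the generated token types interface, and the rule invocations are replaced by returned labels:</p>

```java
// Self-contained sketch of the switch-based prediction shown above.
// Token type values are hypothetical; in a real parser they come from the
// generated <vocab>TokenTypes interface.
class SwitchPrediction {
    static final int KEY_WHILE = 4, KEY_IF = 5, KEY_DO = 6,
                     KEY_INT = 7, KEY_FLOAT = 8;
    static String predict(int la1) {
        switch (la1) {
        case KEY_WHILE:
        case KEY_IF:
        case KEY_DO:
            return "statement";            // would call statement();
        case KEY_INT:
        case KEY_FLOAT:
            return "declaration";          // would call declaration();
        default:
            return "default";              // whatever else-clause is appropriate
        }
    }
}
```

The fall-through <tt>case</tt> labels let several LL(1)-predictable token types share one jump-table entry.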
360<h3><a name="_bb8">Production Prediction</a> </h3> 
361<p>
362        <b>LL(1) prediction.</b> Any LL(1) prediction test is a simple set membership test. If the set is a singleton set (a set with only one element), then an integer token type <tt>==</tt> comparison is done. If the set degree is greater than one, a bit set is created and the single input token type is tested for membership against that set. For example, consider the following rule: <tt>
363</p>
364<pre>
365a : A | b ;
366b : B | C | D | E | F;
367</tt>  </pre> 
368<p>
369        The lookahead that predicts production one is {<tt>A</tt>} and the lookahead that predicts production two is {<tt>B,C,D,E,F</tt>}. The following code would be generated by ANTLR for rule <tt>a</tt> (slightly cleaned up for clarity):
370</p>
371<tt><pre>
372public void a() {
373  if ( LA(1)==A ) {
374    match(A);
375  }
376  else if (token_set1.member(LA(1))) {
377    b();
378  }
379}
380</tt>  </pre> 
381<p>
382        The prediction for the first production can be done with a simple integer comparison, but the second alternative uses a bit set membership test for speed, which you would probably not recognize as a test for <tt>LA(1) member {B,C,D,E,F}</tt>. The complexity threshold above which bit set tests are generated is user-definable.
383</p>
384<p>
385        We use arrays of <tt>long</tt>s (64 bits each) to hold bit sets. The ith element of a bit set is stored in word number <tt>i/64</tt> at bit position <tt>i % 64</tt>. The divide and modulo operations are relatively expensive but, fortunately, a strength reduction can be done: dividing by a power of two is the same as shifting right, and modulo a power of two is the same as masking with that power minus one. All of these details are hidden inside the implementation of the <tt>BitSet</tt> class in the package <tt>antlr.collections.impl</tt>.
386</p>
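<p>A minimal sketch of that shift-and-mask strength reduction (the real implementation is <tt>antlr.collections.impl.BitSet</tt>; this simplified class is for illustration only):</p>

```java
// Minimal sketch of the shift/mask strength reduction described above.
// The real implementation lives in antlr.collections.impl.BitSet; this is a
// simplified illustration, not ANTLR code.
class SimpleBitSet {
    private final long[] bits;                        // 64 set elements per word
    SimpleBitSet(int maxElement) { bits = new long[(maxElement >> 6) + 1]; }
    void add(int el) {
        bits[el >> 6] |= (1L << (el & 63));           // el/64 and el%64 via shift/mask
    }
    boolean member(int el) {
        int word = el >> 6;                           // divide by 64 == shift right by 6
        return word < bits.length
            && (bits[word] & (1L << (el & 63))) != 0; // modulo 64 == mask with 63
    }
}
```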
387<p>
388        The various bit sets needed by ANTLR are created and initialized in the generated parser (or lexer) class.
389</p>
390<p>
391        <b>Approximate LL(k) prediction.</b> This is an extension of LL(1): we do a series of up to k bit set tests rather than the single test used in LL(1) prediction. Each decision uses a different amount of lookahead, with LL(1) being the dominant decision type.
392</p>
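<p>For example, an approximate LL(2) decision can be sketched as two consecutive set tests. The token types and sets below are hypothetical illustrations, using <tt>java.util.BitSet</tt> rather than ANTLR's own bit set class:</p>

```java
// Sketch of an approximate LL(2) decision: a production is predicted only when
// LA(1) is in its depth-1 set AND LA(2) is in its depth-2 set.
// Token type values and set contents are hypothetical.
class LL2Decision {
    static final java.util.BitSet DEPTH1 = new java.util.BitSet();
    static final java.util.BitSet DEPTH2 = new java.util.BitSet();
    static {
        DEPTH1.set(4);   // e.g. KEY_IF at depth 1
        DEPTH2.set(7);   // e.g. LPAREN at depth 2
        DEPTH2.set(8);   // e.g. ID at depth 2
    }
    static boolean predicts(int la1, int la2) {
        // up to k membership tests, one per lookahead depth
        return DEPTH1.get(la1) && DEPTH2.get(la2);
    }
}
```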
393<h3><a name="_bb9"></a><a name="Production Element Recognition">Production Element Recognition</a> </h3> 
394<p>
395        <b>Token references.</b> Token references are translated to: <tt>
396</p>
397<pre>
398match(<i>token-type</i>);
399</tt>  </pre> 
400<p>
401        For example, a reference to token <tt>KEY_BEGIN</tt> results in: <tt>
402</p>
403<pre>
404match(KEY_BEGIN);
405</tt>  </pre> 
406<p>
407        where <tt>KEY_BEGIN</tt> will be an integer constant defined in the <tt><i>MyParser</i>TokenTypes</tt> interface generated by ANTLR.
408</p>
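<p>A simplified sketch of what <tt>match</tt> does, based on the description above (the real <tt>antlr.Parser.match</tt> does additional bookkeeping; the class below is illustrative only):</p>

```java
// Simplified sketch of a parser's match(token-type): test the lookahead
// against the expected type, then consume. Not actual ANTLR code.
class MiniParser {
    private final int[] tokens;            // token-type stream
    private int p = 0;
    MiniParser(int[] tokens) { this.tokens = tokens; }
    int LA(int i) { return tokens[p + i - 1]; }
    void consume() { p++; }
    void match(int t) {
        if (LA(1) != t) throw new IllegalStateException("mismatched token");
        consume();
    }
}
```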
409<p>
410        <b>String literal references.</b> String literal references are references to automatically generated tokens, to which ANTLR assigns a token type (one per unique string). String references are translated to:
411</p>
412<tt><pre>
413match(<i>T</i>);
414</tt>  </pre> 
415<p>
416        where <tt><i>T</i></tt> is the token type assigned by ANTLR to that token.
417</p>
418<p>
419        <b>Character literal references.</b> Referencing a character literal implies that the current rule is a lexical rule. Single characters, '<i>t</i>', are translated to:
420</p>
421<tt><pre>
422match('<i>t</i>');
423</tt>  </pre> 
424<p>
425        which can be manually inlined with: <tt>
426</p>
427<pre>
428if ( c=='<i>t</i>' ) consume();
429else throw new MismatchedCharException(
430               &quot;mismatched char: '&quot;+(char)c+&quot;'&quot;);
431 </tt>  </pre> 
432<p>
433        if the method call proves slow (at the cost of space).
434</p>
435<p>
436        <b>Wildcard references.</b> In lexical rules, the wildcard is translated to: <tt>
437</p>
438<pre>
439consume();
440</tt>  </pre> 
441<p>
442        which simply gets the next character of input without doing a test.
443</p>
444<p>
445        A reference to the wildcard in a parser rule results in the same thing, except that the <tt>consume</tt> call will be with respect to the parser.
446</p>
447<p>
448        <b>Not operator.</b> When operating on a token, <tt>~<i>T</i></tt> is translated to:
449</p>
450<tt><pre>
451matchNot(<i>T</i>);
452</tt> </pre> 
453<p>
454        When operating on a character literal, <tt>~'<i>t</i>'</tt> is translated to: <tt>
455</p>
456<pre>
457matchNot('<i>t</i>');
458</tt>  </pre> 
459<p>
460        <b>Range operator.</b> In parser rules, the range operator (<tt><i>T1</i>..<i>T2</i></tt>) is translated to: <tt>
461</p>
462<pre>
463matchRange(<i>T1</i>,<i>T2</i>);
464</tt>   </pre> 
465<p>
466        In a lexical rule, the range operator for characters <tt><i>c1..c2</i></tt> is translated to: <tt>
467</p>
468<pre>
469matchRange(<i>c1</i>,<i>c2</i>);
470</tt>  </pre> 
471<p>
472        <b>Labels.</b> Element labels on atom references become <tt>Token</tt> references in parser rules and <tt>int</tt>s in lexical rules. For example, the parser rule: <tt>
473</p>
474<pre>
475a : id:ID {System.out.println(&quot;id is &quot;+id);} ;
476</tt></pre>  would be translated to:  <tt><pre>
477public void a() {
478  Token id = null;
479  id = LT(1);
480  match(ID);
481  System.out.println(&quot;id is &quot;+id);
482}
483</tt></pre>  For lexical rules such as:  <tt><pre>
484ID : w:. {System.out.println(&quot;w is &quot;+(char)w);};
485</tt></pre>  the following code would result:  <tt><pre>
486public void ID() {
487  int w = 0;
488  w = c;
489  consume(); // match wildcard (anything)
490  System.out.println(&quot;w is &quot;+(char)w);
491}
492</tt>  </pre> 
493<p>
494        Labels on rule references result in <tt>AST</tt> references, when generating trees, of the form <tt><i>label</i>_ast</tt>.
495</p>
496<p>
497        <b>Rule references.</b> Rule references become method calls. Arguments to rules become arguments to the invoked methods. Return values are assigned like Java assignments. Consider rule reference <tt>i=list[1]</tt> to rule: <tt>
498</p>
499<pre>
500list[int scope] returns int
501    :   { return scope+3; }
502    ;
503</tt></pre>  The rule reference would be translated to:  <tt><pre>
504i = list(1);
505</tt>  </pre> 
506<p>
507        <b>Semantic actions.</b> Actions are translated verbatim to the output parser or lexer except for the <a href="trees.html#Action Translation">translations required for AST generation</a> and the following:
508
509<ul>
510<li><tt>$FOLLOW(r)</tt>: FOLLOW set name for rule r
511<li><tt>$FIRST(r)</tt>: FIRST set name for rule r
512</ul>
513
514<p>
515Omitting the rule argument implies you mean the current rule.  The result type is a <tt>BitSet</tt>, which you can test via <tt>$FIRST(a).member(LBRACK)</tt>, etc.
516
517<p>
518Here is a sample rule:
519
520<pre>
521a : A {System.out.println($FIRST(a));} B
522  exception
523    catch [RecognitionException e] {
524        if ( $FOLLOW.member(SEMICOLON) ) {
525            consumeUntil(SEMICOLON);
526        }
527        else {
528            consume();
529        }
530    }
531  ;
532</pre>
533
534This results in:
535
536<pre>
537public final void a() throws RecognitionException, TokenStreamException { 
538    try {
539        match(A);
540        System.out.println(_tokenSet_0);
541        match(B);
542    }
543    catch (RecognitionException e) {
544        if ( _tokenSet_1.member(SEMICOLON) ) {
545            consumeUntil(SEMICOLON);
546        }
547        else {
548            consume();
549        }
550    }
551}
552</pre>
553
554<p>
555        To add members to a lexer or parser class definition, add the class member definitions enclosed in {} immediately following the class specification, for example: <tt>
556</p>
557<pre>
558class MyParser;
559{
560   protected int i;
561   public MyParser(TokenStream lexer,
562        int aUsefulArgument) {
563      i = aUsefulArgument;
564   }
565}
566... rules ...
567</tt></pre> 
568<p>
569        ANTLR collects everything inside the {...} and inserts it in the class definition before the rule-method definitions. When generating C++, this may have to be extended to allow actions after the rules due to the wacky ordering restrictions of C++.
570</p>
571
572<h3><a name="_bb10">Standard Classes</a></h3> 
573<p>
574        ANTLR constructs parser classes that are subclasses of the <tt>antlr.LLkParser</tt> class, which is a subclass of the <tt>antlr.Parser</tt> class. We summarize the more important members of these classes here. See Parser.java and LLkParser.java for details of the implementation. <tt>
575</p>
576<pre>
577public abstract class Parser {
578   protected ParserSharedInputState inputState;
579   protected ASTFactory astFactory;
580   public abstract int LA(int i);
581   public abstract Token LT(int i);
582   public abstract void consume();
583   public void consumeUntil(BitSet set) { ... }
584   public void consumeUntil(int tokenType) { ... }
585   public void match(int t)
586      throws MismatchedTokenException { ... }
587   public void matchNot(int t)
588      throws MismatchedTokenException { ... }
589   ...
590}
591
592public class LLkParser extends Parser {
593   public LLkParser(TokenBuffer tokenBuf, int k_)
594     { ... }
595   public LLkParser(TokenStream lexer, int k_)
596     { ... }
597   public int LA(int i) { return input.LA(i); }
598   public Token LT(int i) { return input.LT(i); }
599   public void consume() { input.consume(); }
600   ...
601}
602</pre> </tt><h2><a name="_bb11">Lexer Implementation</a></h2> <h3><a name="_bb12">Lexer Form</a></h3> 
603<p>
604        The lexers produced by ANTLR are a lot like the parsers produced by ANTLR. The only major differences are that (a) scanners use characters instead of tokens, and (b) ANTLR generates a special <tt>nextToken</tt> rule for each scanner, which is a production containing each public lexer rule as an alternative. The name of the lexical grammar class provided by the programmer results in a subclass of <tt>CharScanner</tt>; for example: <tt>
605</p>
606<pre>
607public class MyLexer extends antlr.CharScanner
608  implements MyLexerTokenTypes, TokenStream
609{
610  public MyLexer(InputStream in) {
611          this(new ByteBuffer(in));
612  }
613  public MyLexer(Reader in) {
614          this(new CharBuffer(in));
615  }
616  public MyLexer(InputBuffer ib) {
617          this(new LexerSharedInputState(ib));
618  }
619  public MyLexer(LexerSharedInputState state) {
620          super(state);
621          caseSensitiveLiterals = true;
622          setCaseSensitive(true);
623          literals = new Hashtable();
624  }
625
626  public Token nextToken() throws TokenStreamException {
627     <i>scanning logic</i>
628    ...
629  }
630  <i>recursive and other non-inlined lexical methods</i>
631  ...
632}
633</tt>  </pre> 
634<p>
635        When an ANTLR-generated parser needs another token from its lexer, it calls a method called <tt>nextToken</tt>. The general form of the <tt>nextToken</tt> method is: <tt>
636</p>
637<pre>
638public Token nextToken()
639  throws TokenStreamException {
640  int tt;
641  for (;;) {
642     try {
643        resetText();
644        switch ( c ) {
645        <i>case for each char predicting lexical rule</i>
646           <i>call lexical rule gets token type -&gt;</i> tt
647        default :
648           throw new NoViableAltForCharException(
649               &quot;bad char: '&quot;+(char)c+&quot;'&quot;);
650        }
651        if ( tt!=Token.SKIP ) {
652           return makeToken(tt);
653        }
654     }
655     catch (RecognitionException ex) {
656        reportError(ex.toString());
657     }
658  }
659}
660</tt>  </pre> 
661<p>
662        For example, the lexical rules: <tt>
663</p>
664<pre>
665lexclass Lex;
666
667WS   : ('\t' | '\r' | ' ') {_ttype=Token.SKIP;} ;
668PLUS : '+';
669MINUS: '-';
670INT  : ( '0'..'9' )+ ;
671ID   : ( 'a'..'z' )+ ;
672UID  : ( 'A'..'Z' )+ ;
673</tt></pre>  would result in something like:  <tt><pre>
674public class Lex extends CharScanner
675  implements TTokenTypes {
676...
677public Token nextToken()
678    throws TokenStreamException {
679    int _tt = Token.EOF_TYPE;
680    for (;;) {
681    try {
682       resetText();
683       switch ( _c ) {
684       case '\t': case '\r': case ' ':
685           _tt=mWS();
686           break;
687       case '+':
688           _tt=mPLUS();
689           break;
690       case '-':
691           _tt=mMINUS();
692           break;
693       case '0': case '1': case '2': case '3':
694       case '4': case '5': case '6': case '7':
695       case '8': case '9':
696           _tt=mINT();
697           break;
698       case 'a': case 'b': case 'c': case 'd':
699       case 'e': case 'f': case 'g': case 'h':
700       case 'i': case 'j': case 'k': case 'l':
701       case 'm': case 'n': case 'o': case 'p':
702       case 'q': case 'r': case 's': case 't':
703       case 'u': case 'v': case 'w': case 'x':
704       case 'y': case 'z':
705           _tt=mID();
706           break;
707       case 'A': case 'B': case 'C': case 'D':
708       case 'E': case 'F': case 'G': case 'H':
709       case 'I': case 'J': case 'K': case 'L':
710       case 'M': case 'N': case 'O': case 'P':
711       case 'Q': case 'R': case 'S': case 'T':
712       case 'U': case 'V': case 'W': case 'X':
713       case 'Y': case 'Z':
714           _tt=mUID();
715           break;
716       case EOF_CHAR :
717           _tt = Token.EOF_TYPE;
718           break;
719       default :
720          throw new NoViableAltForCharException(
721               &quot;invalid char &quot;+_c);
722       }
723       if ( _tt!=Token.SKIP ) {
724           return makeToken(_tt);
725       }
726    }  // try
727        catch (RecognitionException ex) {
728          reportError(ex.toString());
729        }
730        }  // for
731}
732
733public int mWS()
734    throws RecognitionException,
735           CharStreamException,
736           TokenStreamException {
737    int _ttype = WS;
738    switch ( _c) {
739    case '\t':
740        match('\t');
741        break;
742    case '\r':
743        match('\r');
744        break;
745    case ' ':
746        match(' ');
747        break;
748    default :
749    {
750        throw new NoViableAltForCharException(
751               &quot;no viable alternative for char: &quot;+(char)_c);
752    }
753    }
754     _ttype = Token.SKIP;
755    return _ttype;
756}
757
758public int mPLUS()
759    throws RecognitionException,
760           CharStreamException,
761           TokenStreamException {
762    int _ttype = PLUS;
763    match('+');
764    return _ttype;
765}
766
767public int mMINUS()
768    throws RecognitionException,
769           CharStreamException,
770           TokenStreamException {
771
772    int _ttype = MINUS;
773    match('-');
774    return _ttype;
775}
776
777public int mINT()
778    throws RecognitionException,
779           CharStreamException,
780           TokenStreamException {
781
782    int _ttype = INT;
783    {
784    int _cnt=0;
785    _loop:
786    do {
787        if ( _c&gt;='0' &amp;&amp; _c&lt;='9')
788          { matchRange('0','9'); }
789        else
790        if ( _cnt&gt;=1 ) break _loop;
791        else {
792           throw new NoViableAltForCharException(
793              &quot;no viable alternative for char: &quot;+
794                (char)_c);
795        }
        _cnt++;
    } while (true);
    }
    return _ttype;
}

public int mID()
    throws RecognitionException,
           CharStreamException,
           TokenStreamException {
    int _ttype = ID;
    {
    int _cnt=0;
    _loop:
    do {
        if ( _c&gt;='a' &amp;&amp; _c&lt;='z')
        { matchRange('a','z'); }
        else
        if ( _cnt&gt;=1 ) break _loop;
        else {
            throw new NoViableAltForCharException(
               &quot;no viable alternative for char: &quot;+
                 (char)_c);
        }
        _cnt++;
    } while (true);
    }
    return _ttype;
}

public int mUID()
    throws RecognitionException,
           CharStreamException,
           TokenStreamException {

    int _ttype = UID;
    {
    int _cnt=0;
    _loop:
    do {
        if ( _c&gt;='A' &amp;&amp; _c&lt;='Z')
        { matchRange('A','Z'); }
        else
        if ( _cnt&gt;=1 ) break _loop;
        else {
            throw new NoViableAltForCharException(
               &quot;no viable alternative for char: &quot;+
                 (char)_c);
        }
        _cnt++;
    } while (true);
    }
    return _ttype;
}

}
</tt>  </pre> 
<p>
        ANTLR-generated lexers assume that you will be reading streams of characters. If this is not the case, you must create your own lexer.
</p>
<h3><a name="_bb13">Creating Your Own Lexer</a></h3> 
<p>
        To create your own lexer, the Java class that will do the lexing must implement interface <tt>TokenStream</tt>, which simply states that you must be able to return a stream of tokens via <tt>nextToken</tt>: <tt>
</p>
<pre>
/**This interface allows any object to
 * pretend it is a stream of tokens.
 * @author Terence Parr, MageLang Institute
 */
public interface TokenStream {
  public Token nextToken();
}
</tt>  </pre> 
<p>
        ANTLR will not generate a lexer if you do not specify a lexical class.
</p>
<p>
        Launching a parser with a non-ANTLR-generated lexer is the same as launching a parser with an ANTLR-generated lexer:
</p>
<tt><pre>HandBuiltLexer lex = new HandBuiltLexer(...);
MyParser p = new MyParser(lex);
p.<i>start-rule</i>();</tt></pre> 
<p>
        The parser does not care what kind of object you use for scanning as long as it can answer <tt>nextToken</tt>.
</p>
<p>
        If you build your own lexer, and the token values are also generated by that lexer, then you should inform the ANTLR-generated parsers about the token type values generated by that lexer. Use the <a href="options.html#importVocab">importVocab</a> option in the parsers that use the externally-generated token set, and create a token definition file following the requirements of the importVocab option.
</p>
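<p>
        For illustration, here is a minimal hand-built lexer of this kind. It is only a sketch: the <tt>SimpleToken</tt> and <tt>SimpleTokenStream</tt> classes below are simplified stand-ins for ANTLR's <tt>Token</tt> and <tt>TokenStream</tt>, and the <tt>WordLexer</tt> class and <tt>WORD</tt> token type are invented for this example:
</p>

```java
// Simplified stand-ins for ANTLR's Token and TokenStream types, so this
// sketch is self-contained; the real classes live in the antlr package.
class SimpleToken {
    public static final int EOF_TYPE = 1;
    private final int type;
    private final String text;
    SimpleToken(int type, String text) { this.type = type; this.text = text; }
    public int getType() { return type; }
    public String getText() { return text; }
}

interface SimpleTokenStream {
    SimpleToken nextToken();
}

// A hand-built lexer: splits its input on whitespace and returns one WORD
// token per chunk, then EOF tokens forever. WORD is a hypothetical type.
class WordLexer implements SimpleTokenStream {
    public static final int WORD = 4; // user types start at MIN_USER_TYPE (3)
    private final String[] words;
    private int i = 0;
    WordLexer(String input) {
        String s = input.trim();
        words = s.isEmpty() ? new String[0] : s.split("\\s+");
    }
    public SimpleToken nextToken() {
        if (i >= words.length) return new SimpleToken(SimpleToken.EOF_TYPE, "<EOF>");
        return new SimpleToken(WORD, words[i++]);
    }
}
```

<p>
        A parser would consume this object exactly as it would an ANTLR-generated lexer, calling <tt>nextToken</tt> until it sees an EOF-typed token.
</p>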
<h3><a name="_bb14"></a><a name="Lexical Rules">Lexical Rules</a> </h3> 
<p>
        Lexical rules are essentially the same as parser rules except that lexical rules apply a structure to a series of characters rather than a series of tokens. As with parser rules, each lexical rule results in a method in the output lexer class.
</p>
<p>
        <b>Alternative blocks.</b> Consider a simple series of alternatives within a block: <tt>
</p>
<pre>
FORMAT : 'x' | 'f' | 'd';
</tt>  </pre> 
<p>
        The lexer would contain the following method: <tt>
</p>
<pre>
public int mFORMAT() {
  if ( c=='x' ) {
    match('x');
  }
  else if ( c=='f' ) {
    match('f');
  }
  else if ( c=='d' ) {
    match('d');
  }
  else {
    throw new NoViableAltForCharException(
        &quot;no viable alternative: '&quot;+(char)c+&quot;'&quot;);
  }
  return FORMAT;
}
</tt>  </pre> 
<p>
        The only real differences between lexical methods and grammar methods are that lookahead prediction expressions do character comparisons rather than <tt>LA(i)</tt> comparisons, <tt>match</tt> matches characters instead of tokens, a <tt>return</tt> is added to the bottom of the rule, and lexical methods throw <tt>CharStreamException</tt> objects in addition to <font face="Courier New">TokenStreamException</font> and <font face="Courier New">RecognitionException</font> objects.
</p>
<p>
        <b>Optimization: Non-Recursive lexical rules.</b> Rules that do not directly or indirectly call themselves can be inlined into the lexer entry method: <tt>nextToken</tt>. For example, a simple integer rule would be placed directly into the <tt>nextToken</tt> method. That is, rule: <tt>
</p>
<pre>
INT :   ( '0'..'9' )+
    ;
</tt> </pre> 
<p>
        would not result in a method in your lexer class. This rule would become part of the resulting lexer, as it would probably be inlined by ANTLR: <tt>
</p>
<pre>
public Token nextToken() {
  switch ( c ) {
  <i>cases for operators and such here</i>
  case '0': // chars that predict INT token
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9':
    while ( c&gt;='0' &amp;&amp; c&lt;='9' ) {
      matchRange('0','9');
    }
    return makeToken(INT);
  default :
    <i>check harder stuff here like rules
      beginning with a..z</i>
  }
}
</tt>  </pre> 
<p>
        If not inlined, the method for scanning integers would look like: <tt>
</p>
<pre>
public int mINT() {
  while ( c&gt;='0' &amp;&amp; c&lt;='9' ) {
    matchRange('0','9');
  }
  return INT;
}
</tt>  </pre> 
<p>
        where token names are converted to method names by prefixing them with the letter <tt>m</tt>. The <tt>nextToken</tt> method would become: <tt>
</p>
<pre>
public Token nextToken() {
  switch ( c ) {
  <i>cases for operators and such here</i>
  case '0': // chars that predict INT token
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9':
    return makeToken(mINT());
  default :
    <i>check harder stuff here like rules
      beginning with a..z</i>
  }
}
</tt>  </pre> 
<p>
        Note that this type of range loop is so common that it should probably be optimized to: <tt>
</p>
<pre>
while ( c&gt;='0' &amp;&amp; c&lt;='9' ) {
  consume();
}
</tt>  </pre> 
<p>
        <b>Optimization: Recursive lexical rules.</b> Lexical rules that are directly or indirectly recursive are not inlined. For example, consider the following rule that matches nested actions: <tt>
</p>
<pre>
ACTION
    :   '{' ( ACTION | ~'}' )* '}'
    ;
</tt>  </pre> 
<p>
        <tt>ACTION</tt> would result in (assuming a character vocabulary of 'a'..'z', '{', '}'): <tt>
</p>
<pre>
public int mACTION()
    throws RecognitionException,
           CharStreamException,
           TokenStreamException {

    int _ttype = ACTION;
    match('{');
    {
    _loop:
    do {
        switch ( _c) {
        case '{':
            mACTION();
            break;
        case 'a': case 'b': case 'c': case 'd':
        case 'e': case 'f': case 'g': case 'h':
        case 'i': case 'j': case 'k': case 'l':
        case 'm': case 'n': case 'o': case 'p':
        case 'q': case 'r': case 's': case 't':
        case 'u': case 'v': case 'w': case 'x':
        case 'y': case 'z':
            matchNot('}');
            break;
        default :
            break _loop;
        }
    } while (true);
    }
    match('}');
    return _ttype;
}
</tt>       </pre> <h2><a name="_bb15">Token Objects</a></h2> 
<p>
        The basic token knows only about a token type:
</p>
<pre><tt>public class Token {
  // constants
  public static final int MIN_USER_TYPE = 3;
  public static final int INVALID_TYPE = 0;
  public static final int EOF_TYPE = 1;
  public static final int SKIP = -1;
 
  // each Token has at least a token type
  int type=INVALID_TYPE;
 
  // the illegal token object
  public static Token badToken =
    new Token(INVALID_TYPE, &quot;&lt;no text&gt;&quot;);
 
  public Token() {;}
  public Token(int t) { type = t; }
  public Token(int t, String txt) {
    type = t; setText(txt);
  }

  public void setType(int t) { type = t; }
  public void setLine(int l) {;}
  public void setColumn(int c) {;}
  public void setText(String t) {;}
 
  public int getType() { return type; }
  public int getLine() { return 0; }
  public int getColumn() { return 0; }
  public String getText() {...}
}
</tt></pre> 
        <p>
                The raw <tt>Token</tt> class is not very useful.&nbsp; ANTLR supplies a &quot;common&quot; token class that it uses by default, which contains the line number and text associated with the token:
        </p>
<pre><tt>public class CommonToken extends Token {
  // most tokens will want line, text information
  int line;
  String text = null;

  public CommonToken() {}
  public CommonToken(String s)  { text = s; }
  public CommonToken(int t, String txt) {
    type = t;
    setText(txt);
  }

  public void setLine(int l)    { line = l; }
  public int  getLine()         { return line; }
  public void setText(String s) { text = s; }
  public String getText()       { return text; }
}</tt></pre> 
        <p>
                ANTLR will generate an interface that defines the types of tokens in a token vocabulary. Parsers and lexers that share this token vocabulary are generated such that they implement the resulting token types interface: <tt>
        </p>
<pre>public interface MyLexerTokenTypes {
  public static final int ID = 2;
  public static final int BEGIN = 3;
  ...
}</tt></pre> 
        <p>
                ANTLR defines a token object for use with the <small><font face="Courier New">TokenStreamHiddenTokenFilter</font></small> object called <small><font face="Courier New">CommonHiddenStreamToken</font></small>:
        </p>
<pre>public class CommonHiddenStreamToken
  extends CommonToken {
  protected CommonHiddenStreamToken hiddenBefore;
  protected CommonHiddenStreamToken hiddenAfter;

  public CommonHiddenStreamToken
    <strong>getHiddenAfter</strong>() {...}
  public CommonHiddenStreamToken
    <strong>getHiddenBefore</strong>() {...}
}</pre> 
        <p>
                Hidden tokens are woven among the normal tokens.&nbsp; Note that, for garbage collection reasons, hidden tokens never point back to normal tokens (preventing a linked list of the entire token stream).
        </p>
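<p>
        As a sketch of how the <tt>hiddenBefore</tt> chain can be used (the <tt>HiddenToken</tt> class below is a simplified stand-in for <tt>CommonHiddenStreamToken</tt>, and the helper method is hypothetical), an action can walk backwards through the hidden tokens preceding a normal token to recover, say, the whitespace and comments in front of it:
</p>

```java
// Simplified stand-in for ANTLR's CommonHiddenStreamToken: just the text
// and the hiddenBefore link used in this sketch.
class HiddenToken {
    private final String text;
    HiddenToken hiddenBefore; // previous hidden token, or null
    HiddenToken(String text) { this.text = text; }
    public HiddenToken getHiddenBefore() { return hiddenBefore; }
    public String getText() { return text; }
}

class HiddenTokenDemo {
    // Hypothetical helper: concatenate the text of all hidden tokens that
    // immediately precede t, restoring their original left-to-right order.
    static String hiddenTextBefore(HiddenToken t) {
        StringBuilder sb = new StringBuilder();
        for (HiddenToken h = t.getHiddenBefore(); h != null; h = h.getHiddenBefore()) {
            sb.insert(0, h.getText()); // we walk backwards, so prepend
        }
        return sb.toString();
    }
}
```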
        <h2><a name="_bb16">Token Lookahead Buffer</a></h2> 
        <p>
                The parser must always have fast access to k symbols of lookahead. In a world without syntactic predicates, a simple buffer of k <tt>Token</tt> references would suffice. However, given that even LL(1) ANTLR parsers must be able to backtrack, an arbitrarily-large buffer of <tt>Token</tt> references must be maintained. <tt>LT(i)</tt> looks into the token buffer.
        </p>
        <p>
                Fortunately, the parser itself does not implement the token-buffering and lookahead algorithm. That is handled by the <tt>TokenBuffer</tt> object. We begin the discussion of lookahead by providing an LL(k) parser framework: <tt>
        </p>
<pre>
public class LLkParser extends Parser {
   TokenBuffer input;
   public int LA(int i) {
      return input.LA(i);
   }
   public Token LT(int i) {
      return input.LT(i);
   }
   public void consume() {
      input.consume();
   }
}
</tt>       </pre> 
        <p>
                All lookahead-related calls are simply forwarded to the <tt>TokenBuffer</tt> object. In the future, some simple caching may be performed in the parser itself to avoid the extra indirection, or ANTLR may generate the call to input.LT(i) directly.
        </p>
        <p>
                The <tt>TokenBuffer</tt> object caches the token stream emitted by the scanner. It supplies <tt>LT()</tt> and <tt>LA()</tt> methods for accessing the k<sup>th</sup> lookahead token or token type, as well as methods for consuming tokens, guessing, and backtracking. <tt>
        </p>
<pre>
public class TokenBuffer {
   ...
   /** Mark another token for
    *  deferred consumption */
   public final void consume() {...}

   /** Get a lookahead token */
   public final Token LT(int i) { ... }

   /** Get a lookahead token value */
   public final int LA(int i) { ... }

   /**Return an integer marker that can be used to
    * rewind the buffer to its current state. */
   public final int mark() { ... }

   /**Rewind the token buffer to a marker.*/
   public final void rewind(int mark) { ... }
}
</tt></pre> 
        <p>
                To begin backtracking, a <tt>mark</tt> is issued, which makes the <tt>TokenBuffer</tt> record the current position so that it can rewind the token stream. A subsequent <tt>rewind</tt> directive will reset the internal state to the point before the last <tt>mark</tt>.
        </p>
        <p>
                Consider the following rule that employs backtracking:
        </p>
        <tt><pre>
stat:   (list EQUAL) =&gt; list EQUAL list
    |   list
    ;
list:   LPAREN (ID)* RPAREN
    ;
</tt> </pre> 
<p>
        Something like the following code would be generated: <tt>
</p>
<pre>
public void stat()
  throws RecognitionException,
         TokenStreamException
{
  boolean synPredFailed = false;
  if ( LA(1)==LPAREN ) { // check lookahead
    int marker = tokenBuffer.mark();
    try {
      list();
      match(EQUAL);
    }
    catch (RecognitionException e) {
      synPredFailed = true;
    }
    // rewind whether or not the guess worked; the
    // chosen alternative reparses from the mark
    tokenBuffer.rewind(marker);
  }
  if ( LA(1)==LPAREN &amp;&amp; !synPredFailed ) {
    // test prediction of alt 1
    list();
    match(EQUAL);
    list();
  }
  else if ( LA(1)==LPAREN ) {
    list();
  }
}
</tt>      </pre> 
<p>
        The token lookahead buffer uses a circular token buffer to perform quick indexed access to the lookahead tokens. The circular buffer is expanded as necessary to calculate LT(i) for arbitrary i. <tt>TokenBuffer.consume()</tt> does not actually read more tokens. Instead, it defers the read by counting how many tokens have been consumed, and then adjusts the token buffer and/or reads new tokens when LA() or LT() is called.
</p>
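<p>
        The deferred-consumption scheme can be sketched as follows. This is a much-simplified, non-circular stand-in for <tt>TokenBuffer</tt> that tracks token types only; the real class buffers <tt>Token</tt> objects in a growable circular array:
</p>

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Much-simplified sketch of TokenBuffer's scheme: consume() merely advances
// an index; tokens are pulled from the source lazily, when LA() needs them.
class SimpleTokenBuffer {
    private static final int EOF_TYPE = 1;    // as in antlr.Token
    private final Iterator<Integer> source;   // stands in for the lexer
    private final List<Integer> buf = new ArrayList<>(); // types read so far
    private int p = 0;                        // index of next unconsumed token

    SimpleTokenBuffer(List<Integer> tokenTypes) { source = tokenTypes.iterator(); }

    /** i-th (1-based) lookahead token type; reads from the source on demand. */
    public int LA(int i) {
        while (buf.size() < p + i && source.hasNext()) buf.add(source.next());
        return (p + i <= buf.size()) ? buf.get(p + i - 1) : EOF_TYPE;
    }

    /** Deferred consumption: no read happens here, only bookkeeping. */
    public void consume() { p++; }

    /** Remember the current position so a failed guess can be undone. */
    public int mark() { return p; }

    /** Rewind to a marker, "un-consuming" everything since mark(). */
    public void rewind(int marker) { p = marker; }
}
```

<p>
        Because <tt>mark</tt> and <tt>rewind</tt> only save and restore an index, backtracking is cheap; the cost of buffering is paid once, when each token is first read.
</p>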
<p>
        <font face="Arial" size="2">Version: $Id: //depot/code/org.antlr/release/antlr-2.7.7/doc/runtime.html#2 $</font> 
</p>
</body>
</html>