1 | <html> |
---|
2 | <head> |
---|
3 | <title>ANTLR-centric Language Glossary</title> |
---|
4 | </head> |
---|
5 | <body bgcolor="#FFFFFF" text="#000000"> |
---|
6 | |
---|
7 | <font face=Arial> |
---|
8 | |
---|
9 | <h1><a id="ANTLR-centric_Language_Glossary" name="ANTLR-centric_Language_Glossary">ANTLR-centric Language Glossary</a></h1> |
---|
10 | |
---|
11 | <i>Terence Parr</i> |
---|
12 | <br><br> |
---|
13 | This glossary defines some of the more important terms used in the |
---|
14 | ANTLR documentation. I have tried to be very informal and provide |
---|
15 | references to other pages that are useful. For another great source |
---|
16 | of information about formal computer languages, see <a |
---|
17 | href="http://www.wikipedia.org/wiki/Context-free_grammar"><b>Wikipedia</b></a>. |
---|
18 | |
---|
19 | <font size=2> |
---|
20 | <!--index--> |
---|
21 | <ul> |
---|
22 | <li><a href="#Ambiguous">Ambiguous</a></li> |
---|
23 | <li><a href="#ANTLR">ANTLR</a></li> |
---|
24 | <li><a href="#AST">AST</a></li> |
---|
25 | <li><a href="#Bit_set">Bit set</a></li> |
---|
26 | <li><a href="#Child-sibling_Tree">Child-sibling Tree</a></li> |
---|
27 | <li><a href="#Context-free_grammar">Context-free grammar</a></li> |
---|
28 | <li><a href="#Context-sensitive">Context-sensitive</a></li> |
---|
29 | <li><a href="#DFA">DFA</a></li> |
---|
30 | <li><a href="#FIRST">FIRST</a></li> |
---|
31 | <li><a href="#FOLLOW">FOLLOW</a></li> |
---|
32 | <li><a href="#Grammar">Grammar</a></li> |
---|
33 | <li><a href="#Hoisting">Hoisting</a></li> |
---|
34 | <li><a href="#Inheritance,_grammar">Inheritance, grammar</a></li> |
---|
35 | <li><a href="#LA(n)">LA(n)</a></li> |
---|
36 | <li><a href="#Left-prefix,_left_factor">Left-prefix, left factor</a></li> |
---|
37 | <li><a href="#Literal">Literal</a></li> |
---|
38 | <li><a href="#Linear_approximate_lookahead">Linear approximate lookahead</a></li> |
---|
39 | <li><a href="#LL(k)">LL(k)</a></li> |
---|
40 | <li><a href="#LT(n)">LT(n)</a></li> |
---|
41 | <li><a href="#Language">Language</a></li> |
---|
42 | <li><a href="#Lexer">Lexer</a></li> |
---|
43 | <li><a href="#Lookahead">Lookahead</a></li> |
---|
44 | <li><a href="#nextToken">nextToken</a></li> |
---|
45 | <li><a href="#NFA">NFA</a></li> |
---|
46 | <li><a href="#Nondeterministic">Nondeterministic</a></li> |
---|
47 | <li><a href="#Parser">Parser</a></li> |
---|
48 | <li><a href="#Predicate,_semantic">Predicate, semantic</a></li> |
---|
49 | <li><a href="#Predicate,_syntactic">Predicate, syntactic</a></li> |
---|
50 | <li><a href="#Production">Production</a></li> |
---|
51 | <li><a href="#Protected">Protected</a></li> |
---|
52 | <li><a href="#Recursive-descent">Recursive-descent</a></li> |
---|
53 | <li><a href="#Regular">Regular</a></li> |
---|
54 | <li><a href="#Rule">Rule</a></li> |
---|
55 | <li><a href="#Scanner">Scanner</a></li> |
---|
56 | <li><a href="#Semantics">Semantics</a></li> |
---|
57 | <li><a href="#Subrule">Subrule</a></li> |
---|
58 | <li><a href="#Syntax">Syntax</a></li> |
---|
59 | <li><a href="#Token">Token</a></li> |
---|
60 | <li><a href="#Token_stream">Token stream</a></li> |
---|
61 | <li><a href="#Tree">Tree</a></li> |
---|
62 | <li><a href="#Tree_parser">Tree parser</a></li> |
---|
63 | <li><a href="#Vocabulary">Vocabulary</a></li> |
---|
64 | <li><a href="#Wow">Wow</a></li> |
---|
65 | </ul> |
---|
66 | <!--/index--> |
---|
67 | </font> |
---|
68 | |
---|
69 | <h3><a id="Ambiguous" name="Ambiguous">Ambiguous</a></h3> |
---|
70 | |
---|
71 | A language is ambiguous if the same sentence or phrase can be |
---|
72 | interpreted in more than a single way. For example, the following |
---|
73 | sentence by Groucho Marx is easily interpreted in two ways: "I once |
---|
74 | shot an elephant in my pajamas. How he got in my pajamas I'll never |
---|
75 | know!" In the computer world, a typical language ambiguity is the |
---|
76 | if-then-else ambiguity where the else-clause may be attached to either |
---|
77 | the most recent if-then or an older one. Reference manuals for |
---|
78 | computer languages resolve this ambiguity by stating that else-clauses |
---|
79 | always match up with the most recent if-then. |
---|
80 | |
---|
81 | <p> |
---|
82 | A grammar is ambiguous if the same input sequence can be derived in |
---|
83 | multiple ways. Ambiguous languages always yield ambiguous grammars |
---|
84 | unless you can find a way to encode semantics (actions or predicates |
---|
85 | etc...) that resolve the ambiguity. Most language tools like ANTLR |
---|
86 | resolve the if-then-else ambiguity by simply choosing to match |
---|
87 | greedily (i.e., as soon as possible). This matches the else with the |
---|
88 | most recent if-then. See nondeterministic. |
---|
89 | |
---|
90 | <h3><a id="ANTLR" name="ANTLR">ANTLR</a></h3> |
---|
91 | |
---|
92 | <b>AN</b>other <b>T</b>ool for <b>L</b>anguage <b>R</b>ecognition, a |
---|
93 | predicated-LL(k) parser generator that handles lexers, parsers, and |
---|
94 | tree parsers. ANTLR has been available since 1990 and led to a |
---|
95 | resurgence of recursive-descent parsing after decades dominated by LR |
---|
96 | and other DFA-based strategies. |
---|
97 | |
---|
98 | <h3><a id="AST" name="AST">AST</a></h3> |
---|
99 | |
---|
100 | <b>A</b>bstract <b>S</b>yntax <b>T</b>ree. ASTs are used as internal |
---|
101 | representations of an input stream, normally constructed during a |
---|
102 | parsing phase. Because AST are two-dimensional trees they |
---|
103 | can encode the structure (as determined by the parser) of the input as |
---|
104 | well as the input symbols. |
---|
105 | |
---|
106 | <p> |
---|
107 | A homogeneous AST is in one in which the physical objects are all of |
---|
108 | the same type; e.g., <tt>CommonAST</tt> in ANTLR. A heterogeneous |
---|
109 | tree may have multiple types such as <tt>PlusNode</tt>, |
---|
110 | <tt>MultNode</tt> etc... |
---|
111 | |
---|
112 | <p> An AST is not a parse tree, which encodes the sequence of rules |
---|
113 | used to match input symbols. See <a |
---|
114 | href="http://www.jguru.com/faq/view.jsp?EID=814505"><b>What's the |
---|
115 | difference between a parse tree and an abstract syntax tree (AST)? Why |
---|
116 | doesn't ANTLR generate trees with nodes for grammar rules like JJTree |
---|
117 | does?</b></a>. |
---|
118 | |
---|
119 | <p>An AST for input <tt>3+4</tt> might be represented as |
---|
120 | <pre> |
---|
121 | + |
---|
122 | / \ |
---|
123 | 3 4 |
---|
124 | </pre> |
---|
125 | |
---|
126 | or more typically (ala ANTLR) in child-sibling form: |
---|
127 | |
---|
128 | <pre> |
---|
129 | + |
---|
130 | | |
---|
131 | 3--4 |
---|
132 | </pre> |
---|
133 | |
---|
134 | Operators are usually subtree roots and operands are usually leaves. |
---|
135 | |
---|
136 | <h3><a id="Bit_set" name="Bit_set">Bit set</a></h3> |
---|
137 | |
---|
138 | Bit sets are an extremely efficient representation for dense integer |
---|
139 | sets. You can easily encode sets of strings also by mapping unique |
---|
140 | strings to unique integers. ANTLR uses bitsets for lookahead |
---|
141 | prediction in parsers and lexers. Simple bit set implementations do |
---|
142 | not work so well for sparse sets, particularly when the maximum |
---|
143 | integer stored in the set is large. |
---|
144 | |
---|
145 | <p> ANTLR's bit set represents membership with a bit for each possible |
---|
146 | integer value. For a maximum value of <i>n</i>, a bit set needs |
---|
147 | <i>n/64</i> long words or <i>n/8</i> bytes. For ASCII bit sets with a |
---|
148 | maximum value of 127, you only need 16 bytes or 2 long words. UNICODE |
---|
149 | has a max value of \uFFFF or 65535, requiring 8k bytes, and these sets |
---|
150 | are typically sparse. Fortunately most lexers only need a few of |
---|
151 | these space inefficient (but speedy) bitsets and so it's not really a |
---|
152 | problem. |
---|
153 | |
---|
154 | <h3><a id="Child-sibling_Tree" name="Child-sibling_Tree">Child-sibling Tree</a></h3> |
---|
155 | |
---|
156 | A particularly efficient data structure for representing trees. See AST. |
---|
157 | |
---|
158 | <h3><a id="Context-free_grammar" name="Context-free_grammar">Context-free grammar</a></h3> |
---|
159 | |
---|
160 | A grammar where recognition of a particular construct does not depend |
---|
161 | on whether it is in a particular syntactic context. A context-free |
---|
162 | grammar has a set of rules like |
---|
163 | |
---|
164 | <pre> |
---|
165 | stat : IF expr THEN stat |
---|
166 | | ... |
---|
167 | ; |
---|
168 | </pre> |
---|
169 | |
---|
170 | where there is no restriction on when the <tt>IF</tt> alternative may |
---|
171 | be applied--if you are in rule <tt>stat</tt>, you may apply the |
---|
172 | alternative. |
---|
173 | |
---|
174 | <h3><a id="Context-sensitive" name="Context-sensitive">Context-sensitive</a></h3> |
---|
175 | |
---|
176 | A grammar where recognition of a particular construct may depend on a |
---|
177 | syntactic context. You never see these grammars in practice because |
---|
178 | they are impractical (Note, an Earley parser is O(n^3) worst-case for |
---|
179 | context-<i>free</i> grammars). A context-free rule looks like: |
---|
180 | |
---|
181 | <pre> |
---|
182 | Α → γ |
---|
183 | </pre> |
---|
184 | |
---|
185 | but a context-<i>sensitive</i> rule may have context on the left-side: |
---|
186 | |
---|
187 | <pre> |
---|
188 | αΑβ → αγβ |
---|
189 | </pre> |
---|
190 | |
---|
191 | meaning that rule Α may only be applied (converted to γ) |
---|
192 | in between α and β. |
---|
193 | |
---|
194 | <p> |
---|
195 | In an ANTLR sense, you can recognize context-sensitive constructs with |
---|
196 | a semantic predicate. The action evaluates to true or false |
---|
197 | indicating the validity of applying the alternative. |
---|
198 | |
---|
199 | <p> |
---|
200 | See <a |
---|
201 | href="http://www.wikipedia.org/wiki/Context-sensitive"><b>Context-sensitive gramar</b></a>. |
---|
202 | |
---|
203 | <h3><a id="DFA" name="DFA">DFA</a></h3> |
---|
204 | |
---|
205 | <b>D</b>eterministic <b>F</b>inite <b>A</b>utomata. A state machine |
---|
206 | used typically to formally describe lexical analyzers. <tt>lex</tt> |
---|
207 | builds a DFA to recognize tokens whereas ANTLR builds a recursive |
---|
208 | descent lexer similar to what you would build by hand. See <a |
---|
209 | href="http://www.wikipedia.org/wiki/Finite_state_machine"><b>Finite |
---|
210 | state machine</b></a> and ANTLR's lexer documentation. |
---|
211 | |
---|
212 | <h3><a id="FIRST" name="FIRST">FIRST</a></h3> |
---|
213 | |
---|
214 | The set of symbols that may be matched on the left-edge of a rule. |
---|
215 | For example, the FIRST(decl) is set {ID, INT} for the following: |
---|
216 | |
---|
217 | <pre> |
---|
218 | decl : ID ID SEMICOLON |
---|
219 | | INT ID SEMICOLON |
---|
220 | ; |
---|
221 | </pre> |
---|
222 | |
---|
223 | The situation gets more complicated when you have optional |
---|
224 | constructs. The FIRST(a) below is {A,B,C} |
---|
225 | |
---|
226 | <pre> |
---|
227 | a : (A)? B |
---|
228 | | C |
---|
229 | ; |
---|
230 | </pre> |
---|
231 | |
---|
232 | because the A is optional and the B may be seen on the left-edge. |
---|
233 | |
---|
234 | <p> |
---|
235 | Naturally k>1 lookahead symbols makes this even more complicated. |
---|
236 | FIRST_k must track sets of k-sequences not just individual symbols. |
---|
237 | |
---|
238 | <h3><a id="FOLLOW" name="FOLLOW">FOLLOW</a></h3> |
---|
239 | |
---|
240 | The set of input symbols that may follow any reference to the |
---|
241 | specified rule. For example, FOLLOW(decl) is {RPAREN, SEMICOLON): |
---|
242 | |
---|
243 | <pre> |
---|
244 | methodHead : ID LPAREN decl RPAREN ; |
---|
245 | var : decl SEMICOLON ; |
---|
246 | decl : TYPENAME ID ; |
---|
247 | </pre> |
---|
248 | |
---|
249 | because RPAREN and SEMICOLON both follow references to rule decl. |
---|
250 | FIRST and FOLLOW computations are used to analyze grammars and |
---|
251 | generate parsing decisions. |
---|
252 | |
---|
253 | <p> |
---|
254 | This grammar analysis all gets very complicated when k>1. |
---|
255 | |
---|
256 | <h3><a id="Grammar" name="Grammar">Grammar</a></h3> |
---|
257 | |
---|
258 | A finite means of formally describing the structure of a possibly |
---|
259 | infinite language. Parser generators build parsers that recognize |
---|
260 | sentences in the language described by a grammar. Most parser |
---|
261 | generators allow you to add actions to be executed during the parse. |
---|
262 | |
---|
263 | <h3><a id="Hoisting" name="Hoisting">Hoisting</a></h3> |
---|
264 | |
---|
265 | Semantic predicates describe the semantic context in which a rule or |
---|
266 | alternative applies. The predicate is hoisted into a prediction |
---|
267 | expression. Hoisting typically refers to pulling a predicate out of |
---|
268 | its enclosing rule and into the prediction expression of another |
---|
269 | rule. For example, |
---|
270 | |
---|
271 | <pre> |
---|
272 | decl : typename ID SEMICOLON |
---|
273 | | ID ID SEMICOLON |
---|
274 | ; |
---|
275 | typename : {isType(LT(1))}? ID |
---|
276 | ; |
---|
277 | </pre> |
---|
278 | |
---|
279 | The predicate is not needed in typename as there is no decision, |
---|
280 | however, rule decl needs it to distinguish between its two |
---|
281 | alternatives. The first alternative would look like: |
---|
282 | |
---|
283 | <pre> |
---|
284 | if ( LA(1)==ID && isType(LT(1)) ) { |
---|
285 | typename(); |
---|
286 | match(ID); |
---|
287 | match(SEMICOLON); |
---|
288 | } |
---|
289 | </pre> |
---|
290 | |
---|
291 | PCCTS 1.33 did, but ANTLR currently does not hoist predicates into |
---|
292 | other rules. |
---|
293 | |
---|
294 | <h3><a id="Inheritance,_grammar" name="Inheritance,_grammar">Inheritance, grammar</a></h3> |
---|
295 | |
---|
296 | The ability of ANTLR to define a new grammar as it differs from an |
---|
297 | existing grammar. See the ANTLR documentation. |
---|
298 | |
---|
299 | <h3><a id="LA(n)" name="LA(n)">LA(n)</a></h3> |
---|
300 | The nth lookahead character, token type, or AST node type depending |
---|
301 | on the grammar type (lexer, parser, or tree parser respectively). |
---|
302 | |
---|
303 | <h3><a id="Left-prefix,_left_factor" name="Left-prefix,_left_factor">Left-prefix, left factor</a></h3> |
---|
304 | |
---|
305 | A common sequence of symbols on the left-edge of a set of alternatives |
---|
306 | such as: |
---|
307 | |
---|
308 | <pre> |
---|
309 | a : A B X |
---|
310 | | A B Y |
---|
311 | ; |
---|
312 | </pre> |
---|
313 | |
---|
314 | The left-prefix is A B, which you can remove by left-factoring: |
---|
315 | |
---|
316 | <pre> |
---|
317 | a : A B (X|Y) |
---|
318 | ; |
---|
319 | </pre> |
---|
320 | |
---|
321 | Left-factoring is done to reduce lookahead requirements. |
---|
322 | |
---|
323 | <h3><a id="Literal" name="Literal">Literal</a></h3> |
---|
324 | |
---|
325 | Generally a literal refers to a fixed string such as <tt>begin</tt> |
---|
326 | that you wish to match. When you reference a literal in an ANTLR |
---|
327 | grammar via <tt>"begin"</tt>, ANTLR assigns it a token type like any |
---|
328 | other token. If you have defined a lexer, ANTLR provides information |
---|
329 | about the literal (type and text) to the lexer so it may detect |
---|
330 | occurrences of the literal. |
---|
331 | |
---|
332 | <h3><a id="Linear_approximate_lookahead" name="Linear_approximate_lookahead">Linear approximate lookahead</a></h3> |
---|
333 | |
---|
334 | An approximation to full lookahead (that can be applied to both LL and LR |
---|
335 | parsers) for k>1 that reduces the complexity of storing and testing |
---|
336 | lookahead from O(n^k) to O(nk); exponential to linear reduction. When |
---|
337 | linear approximate lookahead is insufficient (results in a |
---|
338 | nondeterministic parser), you can use the approximate lookahead to |
---|
339 | attenuate the cost of building the full decision. |
---|
340 | |
---|
341 | <p>Here is a simple example illustrating the difference between full |
---|
342 | and approximate lookahead: |
---|
343 | |
---|
344 | <pre> |
---|
345 | a : (A B | C D) |
---|
346 | | A D |
---|
347 | ; |
---|
348 | </pre> |
---|
349 | |
---|
350 | This rule is LL(2) but not linear approximate LL(2). The real |
---|
351 | FIRST_2(a) is {AB,CD} for alternative 1 and {AD} for alternative 2. |
---|
352 | No intersection, so no problem. Linear approximate lookahead |
---|
353 | collapses all symbols at depth i yielding k sets instead of a possibly |
---|
354 | n^k k-sequences. The approximation (compressed) sets are |
---|
355 | {AB,AD,CD,CB} and {AD}. Note the introduction of the spurious |
---|
356 | k-sequences AD and CB. Unfortunately, this compression introduces a |
---|
357 | conflict upon AD between the alternatives. PCCTS did full LL(k) and |
---|
358 | ANTLR does linear approximate only as I found that linear approximate |
---|
359 | lookahead works for the vast majority of parsing decisions and is |
---|
360 | extremely fast. I find one or two problem spots in a large grammar |
---|
361 | usually with ANTLR, which forces me to reorganize my grammar in a |
---|
362 | slightly unnatural manner. Unfortunately, your brain does full LL(k) |
---|
363 | and ANTLR does a slightly weaker linear approximate lookahead--a |
---|
364 | source of many (invalid) bug reports ;) |
---|
365 | |
---|
366 | <p> |
---|
367 | This compression was the subject of <a href="http://www.antlr.org/papers/parr.phd.thesis.pdf"><b>my doctoral |
---|
368 | dissertation</b></a> (PDF 477k) at Purdue. |
---|
369 | |
---|
370 | <h3><a id="LL(k)" name="LL(k)">LL(k)</a></h3> |
---|
371 | |
---|
372 | Formally, LL(k) represents a class of parsers and grammars that parse |
---|
373 | symbols from left-to-right (beginning to end of input stream) using a |
---|
374 | leftmost derivation and using k symbols of lookahead. A leftmost |
---|
375 | derivation is one in which derivations (parses) proceed by attempting |
---|
376 | to replace rule references from left-to-right within a production. |
---|
377 | Given the following rule |
---|
378 | |
---|
379 | <pre> |
---|
380 | stat : IF expr THEN stat |
---|
381 | | ... |
---|
382 | ; |
---|
383 | </pre> |
---|
384 | |
---|
385 | an LL parser would match the IF then attempt to parse expr rather than |
---|
386 | a rightmost derivation, which would attempt to parse stat first. |
---|
387 | |
---|
388 | <p> |
---|
389 | LL(k) is synonymous with a "top-down" parser because the |
---|
390 | parser begins at the start symbol and works its way down the |
---|
391 | derivation/parse tree (tree here means the stack of method activations |
---|
392 | for recursive descent or symbol stack for a table-driven parser). A |
---|
393 | recursive-descent parser is particular implementation of an LL parser |
---|
394 | that uses functions or method calls to implement the parser rather |
---|
395 | than a table. |
---|
396 | |
---|
397 | <p> |
---|
398 | ANTLR generates predicate-LL(k) parsers that support syntactic and |
---|
399 | sematic predicates allowing you to specify many context-free and |
---|
400 | context-sensitive grammars (with a bit of work). |
---|
401 | |
---|
402 | |
---|
403 | <h3><a id="LT(n)" name="LT(n)">LT(n)</a></h3> |
---|
404 | In a parser, this is the nth lookahead Token object. |
---|
405 | |
---|
406 | <h3><a id="Language" name="Language">Language</a></h3> A possibly infinite set of valid sentences. The |
---|
407 | vocabulary symbols may be characters, tokens, and tree nodes in an |
---|
408 | ANTLR context. |
---|
409 | |
---|
410 | <h3><a id="Lexer" name="Lexer">Lexer</a></h3> |
---|
411 | A recognizer that breaks up a stream of characters into |
---|
412 | vocabulary symbols for a parser. The parser pulls vocabulary symbols |
---|
413 | from the lexer via a queue. |
---|
414 | |
---|
415 | <h3><a id="Lookahead" name="Lookahead">Lookahead</a></h3> |
---|
416 | |
---|
417 | When parsing a stream of input symbols, a parser has matched (and no |
---|
418 | longer needs to consider) a portion of the stream to the left of its |
---|
419 | read pointer. The next k symbols to the right of the read pointer are |
---|
420 | considered the fixed lookahead. This information is used to direct |
---|
421 | the parser to the next state. In an LL(k) parser this means to |
---|
422 | predict which path to take from the current state using the next k |
---|
423 | symbols of lookahead. |
---|
424 | |
---|
425 | <p> ANTLR supports syntactic predicates, a manually-specified form of |
---|
426 | backtracking that effectively gives you infinite lookahead. For |
---|
427 | example, consider the following rule that distinguishes between sets |
---|
428 | (comma-separated lists of words) and parallel assignments (one list |
---|
429 | assigned to another): |
---|
430 | |
---|
431 | <pre> |
---|
432 | stat: ( list "=" )=> list "=" list |
---|
433 | | list |
---|
434 | ; |
---|
435 | </pre> |
---|
436 | |
---|
437 | If a list followed by an assignment operator is found on the input |
---|
438 | stream, the first production is predicted. If not, the second |
---|
439 | alternative production is attempted. |
---|
440 | |
---|
441 | <h3><a id="nextToken" name="nextToken">nextToken</a></h3> |
---|
442 | |
---|
443 | A lexer method automatically generated by ANTLR that figures out which |
---|
444 | of the lexer rules to apply. For example, if you have two rules ID |
---|
445 | and INT in your lexer, ANTLR will generate a lexer with methods for ID |
---|
446 | and INT as well as a nextToken method that figures out which rule |
---|
447 | method to attempt given k input characters. |
---|
448 | |
---|
449 | <h3><a id="NFA" name="NFA">NFA</a></h3> |
---|
450 | |
---|
451 | <b>N</b>ondeterministic <b>F</b>inite <b>A</b>utomata. See <a |
---|
452 | href="http://www.wikipedia.org/wiki/Finite_state_machine"><b>Finite |
---|
453 | state machine</b></a>. |
---|
454 | |
---|
455 | <h3><a id="Nondeterministic" name="Nondeterministic">Nondeterministic</a></h3> |
---|
456 | |
---|
457 | A parser is nondeterministic if there is at least one decision point |
---|
458 | where the parser cannot resolve which path to take. Nondeterminisms |
---|
459 | arise because of parsing strategy weaknesses. |
---|
460 | |
---|
461 | <ul> |
---|
462 | |
---|
463 | <li>If your strategy works only for unambiguous grammars, then |
---|
464 | ambiguous grammars will yield nondeterministic parsers; this is true |
---|
465 | of the basic LL, LR strategies. Even unambiguous grammars can yield |
---|
466 | nondeterministic parsers though. Here is a nondeterministic LL(1) |
---|
467 | grammar: |
---|
468 | |
---|
469 | <pre> |
---|
470 | decl : ID ID SEMICOLON |
---|
471 | | ID SEMICOLON |
---|
472 | ; |
---|
473 | </pre> |
---|
474 | |
---|
475 | Rule <tt>decl</tt> is, however, LL(2) because the second lookahead |
---|
476 | symbol (either ID or SEMICOLON) uniquely determines which alternative |
---|
477 | to predict. You could also left-factor the rule to reduce the |
---|
478 | lookahead requirements.<br><br> |
---|
479 | |
---|
480 | <li> |
---|
481 | If you are willing to pay a |
---|
482 | performance hit or simply need to handle ambiguous grammars, you can |
---|
483 | use an Earley parser or a Tomita parser (LR-based) that match all |
---|
484 | possible interpretations of the input, thus, avoiding the idea of |
---|
485 | nondeterminism altogether. This does present problems when trying to |
---|
486 | execute actions, however, because multiple parses are, in effect, |
---|
487 | occurring in parallel. |
---|
488 | |
---|
489 | </ul> |
---|
490 | |
---|
491 | <p> |
---|
492 | Note that a parser may have multiple decision points that are |
---|
493 | nondeterministic. |
---|
494 | |
---|
495 | <h3><a id="Parser" name="Parser">Parser</a></h3> |
---|
496 | A recognizer that applies a grammatical structure to a stream |
---|
497 | of vocabulary symbols called tokens. |
---|
498 | |
---|
499 | <h3><a id="Predicate,_semantic" name="Predicate,_semantic">Predicate, semantic</a></h3> |
---|
500 | |
---|
501 | A semantic predicate is a boolean expression used to alter the parse |
---|
502 | based upon semantic information. This information is usually a |
---|
503 | function of the constructs/input that have already been matched, but |
---|
504 | can even be a flag that turns on and off subsets of the language (as |
---|
505 | you might do for a grammar handling both K&R and ANSI C). One of the |
---|
506 | most common semantic predicates uses a symbol table to help |
---|
507 | distinguish between syntactically, but semantically different |
---|
508 | productions. In FORTRAN, array references and function calls look the |
---|
509 | same, but may be distinguished by checking what the type of the |
---|
510 | identifier is. |
---|
511 | |
---|
512 | <pre> |
---|
513 | expr : {isVar(LT(1))}? ID LPAREN args RPAREN // array ref |
---|
514 | | {isFunction(LT(1))}? ID LPAREN args RPAREN // func call |
---|
515 | ; |
---|
516 | </pre> |
---|
517 | |
---|
518 | <h3><a id="Predicate,_syntactic" name="Predicate,_syntactic">Predicate, syntactic</a></h3> |
---|
519 | |
---|
520 | A selective form of backtracking used to recognize language constructs |
---|
521 | that cannot be distinguished without seeing all or most of the |
---|
522 | construct. For example, in C++ some declarations look exactly like |
---|
523 | expressions. You have to check to see if it is a declaration. If it |
---|
524 | parses like a declaration, assume it is a declaration--reparse it with |
---|
525 | "feeling" (execute your actions). If not, it must be an expression or |
---|
526 | an error: |
---|
527 | |
---|
528 | <pre> |
---|
529 | stat : (declaration) => declaration |
---|
530 | | expression |
---|
531 | ; |
---|
532 | </pre> |
---|
533 | |
---|
534 | <h3><a id="Production" name="Production">Production</a></h3> |
---|
535 | |
---|
536 | An alternative in a grammar rule. |
---|
537 | |
---|
538 | <h3><a id="Protected" name="Protected">Protected</a></h3> |
---|
539 | |
---|
540 | A protected lexer rule does not represent a complete token--it is a |
---|
541 | helper rule referenced by another lexer rule. This overloading of the |
---|
542 | access-visibility Java term occurs because if the rule is not visible, |
---|
543 | it cannot be "seen" by the parser (yes, this nomeclature sucks). |
---|
544 | |
---|
545 | <h3><a id="Recursive-descent" name="Recursive-descent">Recursive-descent</a></h3> |
---|
546 | |
---|
547 | See LL(k). |
---|
548 | |
---|
549 | <h3><a id="Regular" name="Regular">Regular</a></h3> |
---|
550 | |
---|
551 | A regular language is one that can be described by a regular grammar |
---|
552 | or regular expression or accepted by a DFA-based lexer such as those |
---|
553 | generated by <tt>lex</tt>. Regular languages are normally used to |
---|
554 | describe tokens. |
---|
555 | |
---|
556 | <p> In practice you can pick out a regular grammar by noticing that |
---|
557 | references to other rules are not allowed accept at the end of a |
---|
558 | production. The following grammar is regular because reference to |
---|
559 | <tt>B</tt> occurs at the right-edge of rule <tt>A</tt>. |
---|
560 | |
---|
561 | <pre> |
---|
562 | A : ('a')+ B ; |
---|
563 | B : 'b' ; |
---|
564 | </pre> |
---|
565 | |
---|
566 | Another way to look at it is, "what can I recognize without a stack |
---|
567 | (such as a method return address stack)?". |
---|
568 | |
---|
569 | <p> |
---|
570 | Regular grammars cannot describe context-free languages, hence, LL or |
---|
571 | LR based grammars are used to describe programming languages. ANTLR |
---|
572 | is not restricted to regular languages for tokens because it generates |
---|
573 | recursive-descent lexers. This makes it handy to recognize HTML tags |
---|
574 | and so on all in the lexer. |
---|
575 | |
---|
576 | <h3><a id="Rule" name="Rule">Rule</a></h3> |
---|
577 | |
---|
578 | A rule describes a partial sentence in a language such as a statement |
---|
579 | or expression in a programming language. Rules may have one or more |
---|
580 | alternative productions. |
---|
581 | |
---|
582 | <h3><a id="Scanner" name="Scanner">Scanner</a></h3> |
---|
583 | |
---|
584 | See Lexer. |
---|
585 | |
---|
586 | <h3><a id="Semantics" name="Semantics">Semantics</a></h3> See <a href="http://www.jguru.com/faq/view.jsp?EID=81"><b>What |
---|
587 | do "syntax" and "semantics" mean and how are they different?</b></a>. |
---|
588 | |
---|
589 | <h3><a id="Subrule" name="Subrule">Subrule</a></h3> |
---|
590 | |
---|
591 | Essentially a rule that has been expanded inline. Subrules are |
---|
592 | enclosed in parenthesis and may have suffixes like star, plus, and |
---|
593 | question mark that indicate zero-or-more, one-or-more, or optional. |
---|
594 | The following rule has 3 subrules: |
---|
595 | |
---|
596 | <pre> |
---|
597 | a : (A|B)+ (C)* (D)? |
---|
598 | ; |
---|
599 | </pre> |
---|
600 | |
---|
601 | <h3><a id="Syntax" name="Syntax">Syntax</a></h3> |
---|
602 | See <a href="http://www.jguru.com/faq/view.jsp?EID=81"><b>What |
---|
603 | do "syntax" and "semantics" mean and how are they different?</b></a>. |
---|
604 | |
---|
605 | <h3><a id="Token" name="Token">Token</a></h3> |
---|
606 | |
---|
607 | A vocabulary symbol for a language. This term typically refers to the |
---|
608 | vocabulary symbols of a parser. A token may represent a constant |
---|
609 | symbol such as a keyword like <tt>begin</tt> or a "class" of input |
---|
610 | symbols like <tt>ID</tt> or <tt>INTEGER_LITERAL</tt>. |
---|
611 | |
---|
612 | <h3><a id="Token_stream" name="Token_stream">Token stream</a></h3> |
---|
613 | |
---|
614 | See <a href="http://www.antlr.org/doc/streams.html"><b>Token |
---|
615 | Streams</b></a> in the ANTLR documentation. |
---|
616 | |
---|
617 | <h3><a id="Tree" name="Tree">Tree</a></h3> |
---|
618 | See AST and <a |
---|
619 | href="http://www.jguru.com/faq/view.jsp?EID=814505"><b>What's the |
---|
620 | difference between a parse tree and an abstract syntax tree (AST)? Why |
---|
621 | doesn't ANTLR generate trees with nodes for grammar rules like JJTree |
---|
622 | does?</b></a>. |
---|
623 | |
---|
624 | <h3><a id="Tree_parser" name="Tree_parser">Tree parser</a></h3> A recognizer that applies a grammatical structure to a |
---|
625 | two-dimensional input tree. Grammatical rules are like an "executable |
---|
626 | comment" that describe the tree structure. These parsers are useful |
---|
627 | during translation to (i) annotate trees with, for example, symbol |
---|
628 | table information, (2) perform tree rewrites, and (3) generate output. |
---|
629 | |
---|
630 | <h3><a id="Vocabulary" name="Vocabulary">Vocabulary</a></h3> |
---|
631 | |
---|
632 | The set of symbols used to construct sentences in a language. These |
---|
633 | symbols are usually called tokens or token types. For lexers, the |
---|
634 | vocabulary is a set of characters. |
---|
635 | |
---|
636 | <h3><a id="Wow" name="Wow">Wow</a></h3> See ANTLR. |
---|
637 | |
---|
638 | </font> |
---|
639 | </body> |
---|
640 | </html> |
---|