|
Grammars and Metagrammars
We can summarize the various
components that make up a grammar as
follows:
- A grammar is a
collection of rules that describe how a language is
structured;
- A rule is a
collection of productions that list the alternative
ways in which parts of the language can
appear;
- A production is a
list of non-terminals and/or terminals that must
appear in the text being parsed if the surrounding
rule is to match;
- A non-terminal is
a reference to some other rule that must be matched
as part of the matching of the surrounding
rule.
- A terminal is
something that actually can literally appear in a
text (a word in the grammar's
vocabulary);
What's interesting about this summary is how
similar it is itself to a grammar. Just as each
grammar rule is defined as a series of references to
other rules (non-terminals) and token names
(terminals), so the terms 'grammar', 'rule',
'production', 'non-terminal', and 'terminal' are
defined by reference to other terms (non-terminals)
or by self-contained explanations (terminals). That
leads to the somewhat recursive conclusion that you
could define a grammar using a grammar. Or maybe
even using the same grammar!
And so you can. Here is the entire grammar for
specifying Parse::RecDescent grammars, expressed
itself as a Parse::RecDescent grammar:
grammar: components(s)
component: rule | comment
rule: "\n" identifier ":" productions(?)
productions: production '|' productions | production
production: items(s)
item: lookahead(?) simpleitem | directive | comment
lookahead: '...' | '...!'
simpleitem: subrule args(?) | repetition | terminal | bracket args(?) | action
subrule: identifier
args: {extract_codeblock($text,'[]')}
repetition: subrule args(?) howoften
howoften: '(?)' | '(s?)' | '(s)' | /(\d+)?[.][.](/\d+)?
terminal: /[/]([\][/]|[^/])*[/][gimsoxe]*/ | /"([\]"|[^"])*"/ | /'([\]'|[^'])*'/
action: { extract_codeblock($text) }
bracket: '(' Item(s) production(s?) ')'
directive: '<commit>' | '<uncommit>' | '<resync>' | '<resync:' pattern '>'
| '<reject>' | '<reject:' condition '>' | '<error>'
| '<error:' string '>' | '<error?>' | '<error?:' string '>'
| '<rulevar:' /[^]+/ '>' | '<matchrule:' string '>'
identifier: /[a-z]\w*/i
comment: /#[^\n]*/
pattern: {extract_bracketed($text,'<')}
condition: {extract_codeblock($text,'{<')}
string: {extract_variable($text)} | {extract_quotelike($text)}
| {extract_bracketed($text,'<')}
The grammar is fairly straightforward, except
perhaps the actions which call the various extract_
functions. These functions are imported from the
Text::Balanced module and are useful shortcuts for
parsing matching nested parentheses, Perl variables,
quoted strings, Perl quotelike operators, and
complete blocks of Perl text. It would be possible to
dispense with all of these and encode the entire
grammar without any actions, but then the grammar
would be four times bigger.
But, of course, now you're asking yourself: "Does
Parse::RecDescent use a Parse::RecDescent grammar to
convert grammars into parsers?" The answer is no. If
Parse::RecDescent had to call Parse::RecDescent to
build a parser for the Parse::RecDescent grammar,
then that second version of Parse::RecDescent would
itself have to call Parse::RecDescent to build a
parser for the "Parse::RecDescent grammar" grammar,
and that version would in turn have to call
Parse::RecDescent to build a parser for the '
"Parse::RecDescent grammar" grammar' grammar, and so
on right down to Little Cat Z.
Hence, although it can automatically build parsers
for just about any other purpose, Parse::RecDescent
must content itself with a hand-built parser.
|