[cc]mc | .hd stacc "recursive descent parser generator" 08/27/84 stacc [ Ratfor | C | Pascal | Pl/1 | Plp ] [cc]mc .ds 'Stacc' (STill Another Compiler-Compiler) is a simple parser generator designed to reduce the effort involved in building recursive-descent parsers. Its development was motivated by two things: (1) the large number of ad-hoc recursive descent parsers constantly being written and re-written at the Georgia Tech installation; (2) the desire to quickly generate an SSPL compiler in SSPL for microprocessor software development work. Its design was inspired both by 'yacc' (Yet Another Compiler-Compiler) on Unix and the GTL Syntax Parser on the now defunct Georgia Tech Burroughs B5500. .sp 2 .bf Basic Theory .sp Given an LL(1) grammar written in an extended BNF, 'stacc' generates a very simple top-down, recursive-descent parser. Many excellent references are available on the subject of such parsers; the following will serve as starting points: .sp .in +10 .ti -5 Gries, David, .ul Compiler Construction for Digital Computers, John Wiley & Sons, Inc., New York, 1971 .br (See chapters 3, 4, 12, and 15) .sp .ti -5 Aho, Alfred V., and Jeffrey D. Ullman, .ul The Theory of Parsing, Translation, and Compiling, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1972 .br (See chapters 1, 3, and 5) .sp .ti -5 Georgia Institute of Technology School of Information and Computer Science, .ul GTL Programmer's Reference Manual for the Burroughs B5500, 1974 .br (See chapter 8) .in -10 .sp 2 .bf Principles of Operation .sp 'Stacc' generates a "parser," a program which converts a stream of "tokens" into a representation of a derivation tree which describes the production of the input stream from a given grammar. In practice, the derivation tree exists solely in the call structure of the subprograms called to parse the input, so the user must supply "action" routines to produce the output he desires. .sp A parser written with 'stacc' also requires a "lexical analyzer," a routine that converts the input stream of ASCII characters into the tokens handled by the parser. .sp The operation of a 'stacc' parser is roughly as follows. In the initialization phase, the lexical analyzer is called to pick up the first token from the stream of input characters. An integer assigned to the class of token found is then placed in a "current symbol" variable by the lexical analyzer. The parser is then called. The parser attempts to "match" the current symbol against all possibilities for the first input symbol; if a match is found, any actions supplied by the user are performed and the lexical analyzer is called to fetch the next input token. This procedure is repeated until either the entire input stream is recognized as a valid sentence in the input language or an error (a missing or illegal "current symbol") is detected. In the event of an error, the user must supply recovery actions so that the parse can proceed. .sp The function of 'stacc' is to convert an extended BNF grammar into code that checks the current input symbol, calls the lexical analyzer when appropriate, and performs actions specified by the user after certain constructs in the input stream are recognized, thus freeing the user from the bookkeeping details needed to build a parser. .sp 2 .bf Usage Information .sp 'Stacc' takes input (described in detail below) on its first standard input port and produces output on its first and second standard outputs. The first output is the code that implements the parser. This code is expressed in the language whose name is given as the first argument on the command line that invoked 'stacc'. The second output is a set of macro definitions that establishes mnemonics for the integer "current symbol" values supplied by the lexical analyzer. .sp .nh Conventionally, 'stacc' input files have the extension ".stacc", e.g. "hp.stacc," "sspl.stacc," "stacc.stacc". The first output of 'stacc' (the parser) is normally placed in a file with extension ".stacc.", e.g. "hp.stacc.r", "sspl.stacc.s", "stacc.stacc.r". The second output of 'stacc' (the macro definitions) is normally placed in a file with extension ".stacc.defs", e.g. "hp.stacc.defs", "sspl.stacc.defs", "stacc.stacc.defs". These files may then be "included" in a source file during compilation. (Note that slightly different naming conventions are used by the separate compilation handler 'sep'). .hy .sp 2 .bf Input Specifications .sp Input to 'stacc' consists of a series of "declarations" and "productions", separated by semicolons. There may be any number of either, and they may be mixed in any order. Input is free-form; whitespace may be inserted where desired to improve readability. Ratfor-style comments (beginning with a sharp (#), ending with a NEWLINE) may be used for documentation. .sp Declarations consist of a period (.) followed by a keyword and an argument list whose format depends on the keyword. There are seven types of declarations. .sp Four declarations are used to select the names of critical objects used by the parser: ".state " declares the parser state variable (named "state" by default), ".scanner " declares the name of the lexical analyzer subprogram ("getsym" by default), ".symbol " declares the "current symbol" variable (named "symbol" by default), and ".epsilon " declares the symbol to be used to match the null token (empty string of input symbols). .sp One declaration is used only by parsers written in Ratfor: ".common ''" specifies the name of an include file containing the declarations of the current symbol variable and any other variables used for communication between the parser and the lexical analyzer. .sp The final two types of declarations are used to list mnemonics for terminal symbols recognized by the lexical analyzer. The first consists of ".ext_term" followed by a list of identifiers used by the lexical analyzer to identify terminal symbols. This declaration merely informs 'stacc' that the given names represent terminal symbols; no macro definitions are generated. The second type consists of ".terminal" followed by a list of mnemonics. Each mnemonic is assigned an integer value, and output in a macro definition to make that value available to the lexical analyzer. A specific value may be assigned to a terminal symbol by preceding it with an integer; two terminal symbols may be equated by placing an equals sign (=) between them. Otherwise, increasing values (starting from zero) are assigned. For example, the following declaration .sp .nf .in +10 .cc # .terminal ALTSYM DECLSYM = NOADVANCESYM 100 LETTER_D LETTER_E ; #cc .in -10 .fi .sp produces the following Ratfor macro definitions: .nf .sp .in +10 define(ALTSYM,0) define(DECLSYM,1) define(NOADVANCESYM,1) define(LETTER_D,100) define(LETTER_E,101) .sp .in -10 .fi Productions are written in a language similar to the extended BNF used throughout the Subsystem. A production consists of a nonterminal symbol, followed by the "rewrites as" symbol (->), followed by a right-hand-side with imbedded semantic actions. .sp The right-hand-side allows the usual BNF operators: vertical bar (|) to indicate a choice, parentheses to nest right-hand-sides, square brackets ([]) to enclose optional constructs, and curly braces ({}) to enclose repeated constructs. .sp Items in the right-hand-side are nonterminal symbols (those that appear in the left-hand-side of some production), terminal symbols (those declared by the ".terminal" and ".ext_term" declarations), quoted single characters, or two terminal symbols or quoted characters separated by a colon (:) (which matches any terminal symbol or character within the given limits). In addition, a right-hand-side may be preceded by a dollar sign ($), indicating that it represents a particularly common form of production: a number of alternatives, each of which is distinguished by a single leading terminal symbol. (This happens, for example, in parsing statements in most algorithmic languages; each different type of statement is preceded by a unique key word.) Recognition of this common case allows much faster special-purpose code to be generated. .sp After any terminal, nonterminal, or special construct in the right-hand-side there may appear semantic actions. Actions to be performed after a symbol is successfully matched are preceded by an exclamation point (!); actions to be performed after a symbol fails to be matched are preceded by a question mark (?). Actions extend from their initial character to the end of the line on which they appear. Actions appearing after terminal symbols are executed .ul after a symbol is matched and .ul before the lexical analyzer is called; thus, they may perform some operation based on characteristics of the symbol matched. If the terminal symbol or range of terminal symbols being matched is followed by a period (.), the automatic call of the lexical analyzer is disabled, allowing the user to substitute his own scanning actions. .sp Actions may appear immediately after the "rewrites as" symbol (->), in which case they are executed before any code generated by 'stacc', or immediately after the production-terminating semicolon (;), in which case they are executed unconditionally before control leaves the production. .sp A sample production: .sp .nf .in +10 parser -> { ( declaration | production ) ';' ? call error ("missing semicolon.") ! numprods = numprods + 1 } EOF. ? call error ("EOF expected.") ! call analyze ; .fi .in -10 .sp 2 .bf Using 'Stacc' With Ratfor .sp Ratfor users of 'stacc' should note that they must declare a common block "include" file with the ".common" declaration, so that the lexical analyzer can communicate with the parser. .sp The form of a 'stacc' output routine in Ratfor is a subroutine with one integer argument, which exports the parser state upon return. The parser state will either be NOMATCH (1) if the first symbol failed to match any legal alternative, FAILURE (2) if some symbols matched but some did not (and no error recovery succeeded), or ACCEPT (3) if the input was a legal sentence in the language being processed. The name of the status argument is presently fixed at "gpst"; this variable is reserved for 'stacc' and should not be used for any other purpose. .sp To use a 'stacc'-generated parser, the Ratfor programmer should simply call the subroutine whose name corresponds to the start symbol of his grammar, passing one integer variable as an argument. That variable will contain the parse state upon completion of the parse. .sp If the user specifies a grammar that is recursive, 'stacc' will produce recursive output; it will then be necessary to use Virtual-mode Fortran with the local-variables-allocated-in-stack-frame option. (This is the default under the Subsystem.) .es .fi The following sample 'stacc' input will generate a program to convert infix arithmetic expressions to reverse-Polish. An expression consists of letters, digits, and operators arranged in the usual manner. Multiplication and division have priority over addition and subtraction. .sp .nf .in +5 .cc # .scanner "getchar"; .symbol "char"; .common "rpn.com"; expression -> ! integer op term { ($ '+' ! op = '+'c | '-' ! op = '-'c ) term ! call putch (op, STDOUT) } ; term -> ! integer op factor { ($ '*' ! op = '*'c | '/' ! op = '/'c ) factor ! call putch (op, STDOUT) } ; factor -> 'a':'z' ! call putch (char, STDOUT) | '0':'9' ! call putch (char, STDOUT) | '(' expression ')' ; #cc .fi .in -5 .sp 2 'Stacc' produced the following Ratfor output: .sp 2 .nf .in +5 define(NOMATCH,1) define(FAILURE,2) define(ACCEPT,3) subroutine expression (gpst) integer gpst include 'rpn.com' integer state integer op call term (state) select (state) when (FAILURE) { gpst = FAILURE return } if (state == ACCEPT) { repeat { state = NOMATCH select (char) when (171) { state = ACCEPT op = '+'c call getchar } when (173) { state = ACCEPT op = '-'c call getchar } if (state == ACCEPT) { call term (state) select (state) when (FAILURE) { gpst = FAILURE return } when (ACCEPT) { call putch (op, STDOUT) } if (state ~= ACCEPT) { gpst = FAILURE return } } } until (state ~= ACCEPT) select (state) when (NOMATCH) state = ACCEPT if (state ~= ACCEPT) { gpst = FAILURE return } } gpst = state return end subroutine term (gpst) integer gpst include 'rpn.com' integer state integer op call factor (state) select (state) when (FAILURE) { gpst = FAILURE return } if (state == ACCEPT) { repeat { state = NOMATCH select (char) when (170) { state = ACCEPT op = '*'c call getchar } when (175) { state = ACCEPT op = '/'c call getchar } if (state == ACCEPT) { call factor (state) select (state) when (FAILURE) { gpst = FAILURE return } when (ACCEPT) { call putch (op, STDOUT) } if (state ~= ACCEPT) { gpst = FAILURE return } } } until (state ~= ACCEPT) select (state) when (NOMATCH) state = ACCEPT if (state ~= ACCEPT) { gpst = FAILURE return } } gpst = state return end subroutine factor (gpst) integer gpst include 'rpn.com' integer state state = NOMATCH if (225 <= char && char <= 250) { state = ACCEPT call putch (char, STDOUT) call getchar } if (state == NOMATCH) { if (176 <= char && char <= 185) { state = ACCEPT call putch (char, STDOUT) call getchar } if (state == NOMATCH) { if (char == 168) { state = ACCEPT call getchar } if (state == ACCEPT) { call expression (state) select (state) when (FAILURE) { gpst = FAILURE return } if (state ~= ACCEPT) { gpst = FAILURE return } state = NOMATCH if (char == 169) { state = ACCEPT call getchar } if (state ~= ACCEPT) { gpst = FAILURE return } } } } gpst = state return end .sp 2 .fi .in -5 The following main program and common block include file were necessary to finish the implementation: .sp 2 .in +5 .nf # rpn --- convert to Reverse Polish include "rpn.stacc.defs" integer state include "rpn.com" call getchar call expression (state) call putch (NEWLINE, STDOUT) if (state ~= ACCEPT || char ~= NEWLINE) call error ("syntax error.") stop end # getchar --- get next character from standard input subroutine getchar include "rpn.com" character getch char = getch (char, STDIN) return end include "rpn.stacc.r" # common blocks for 'rpn' character char common /chcom/ char .fi .in -5 .me (Note that all error messages are preceded by the number of the line in the input stream being processed at the time of the detection of the error.) .sp .nf -> symbol is ill-formed '-' was seen, but '>' was missing EOF expected there is data after the last legal production actions are illegal here actions are not allowed immediately after '$' bad symbol input string could not be lexically analyzed error actions illegal here error actions not allowed after quick-select terminal identifier or string expected missing declaration parameter illegal declarator keyword after '.' was not recognizable illegal term/nonterm expected a terminal or nonterminal; didn't find one missing '->' should be a '->' after the left-hand-side of a production missing alternative missing or illegal alternative after '|' missing choice missing or illegal quick-select alternative after '|' missing declarator missing keyword after '.' missing optional rhs there should be a right-hand-side within square brackets missing quote obvious, hopefully missing quote or string too long strings have a maximum length of about 100 characters missing repeated rhs there should be a right-hand-side within curly braces missing rhs in parentheses there should be a right-hand-side within parentheses missing right brace missing right bracket missing right parenthesis missing right-hand-side missing entire right-hand-side of production missing semicolon missing upper bound missing second terminal in range of form 'lower:upper' not yet available the language specified cannot be used with 'stacc' yet production expected input stream contained neither declaration nor production too many action/erroraction lines there are too many lines of code to store internally too many characters pushed back internal error --- see your Subsystem manager too much action/erraction text there is too much code to store internally unsupported language 'stacc' doesn't recognize the language name specified .fi .sp (Other messages may occasionally arise from the dynamic storage routines. If these occur, see your Subsystem manager for assistance.) .bu [cc]mc | .# The only languages currently supported are Ratfor and C, .# although adding others should be relatively easy. .# .sp [cc]mc 'Stacc' does not optimize its Ratfor output as well as it could; some redundant code shows up occasionally. .sp Error recovery in 'stacc' is still somewhat primitive. .sp No check is made to see if the input grammar is LL(1). .sa rp (1)