Writing a Parser using (F)LEX and YACC/BISON

2018-06-19 1411 words 7 minutes views

Contents

1 Task Description

We will implement few statements found in C. We will call this language SubC. Given the grammar of SubC we will write the F(LEX) and YACC/BISON specification file that parses code written in SubC. We will generate a parser for only assignment statements. Following is the grammar for assignment statements for SubC.

1
2
3
4
5
6
7


Stmt_list → Stmt | Stmt_list ; Stmt
Stmt → Variable **assignop** Expression
Variable → **id** | **id** [Expression]
Expression → Simple_expression | Simple_expression **relop** Simple_expression
Simple_expression → Term | Simple_expression **addop** Term
Term → Factor | Term **mulop** Factor
Factor → **id** | **num** | ( Expression) | **id** [Expression] | **not** Factor

Tokens/Terminal symbols of this grammar are in BOLD font. Below are the specifications for the tokens

Token Name	Pattern/Description
Id	Starts with a letter followed by letters, digits, or underscore. Max length 5 characters. Cannot be any of the keywords: “if”, “else”, “while”, “for”, “repeat”, “until”, “return” , “main”
Num	Integers ‐ real numbers that include a decimal point and at least one digit to the left and right of the point
Relop	“<“, “>”, “<=”, “>=”, “==”, “!=”
Addop	“+”,”‐”,"\pipe\pipe"
Mulop	“*”, “/”, “%”, “&&”
Assignop	“=”
Not	“!”

If successful in parsing a list of statements the parser will print “success” otherwise will print appropriate error message.

2 Implementation

We have to write (F)LEX and YACC/BISON specification files (e.g., subc.l and subc.y). The parser that yacc/bison generates should be able to call the lexical analyzer generated by lex/flex and perform parsing of the code (i.e., list of assignment statements) generated by the given grammar. Your LEX specification should be able to remove any white space from the code. All ids should be stored in a symbol table and printed out as output.

2.1 Samples

2.1.1 Input 1

1
2


b= b + 1;
input[b+2]   = b; a = input [i+2]

2.1.2 Output 1

1
2
3
4


b
input
a
Success

2.2.1 Input 1

1

a[]=0.2

2.2.2 Output 2

1
2


a
missing expression

3 Structures of Lex (analyzer) and YACC (parser)

3.1 Lex (Lexical Analyzer):

Lex is a computer program that generates lexical analyzers (what we did manually in the assignment 1 by java, but installing lex in the computer will do it automatically). Lex reads an input stream specifying the lexical analyzer and outputs source code implementing the lexer in the C programming language.

3.2 Structure of a Lex file

The structure of a Lex file is intentionally similar to that of a yacc file; files are divided into three sections, separated by lines that contain only two percent signs, as follows:

1
2
3
4
5


Definition section
%%
Rules section
%%
C code section

The definition section defines macros and imports header files written in C. It is also possible to write any C code here, which will be copied verbatim into the generated source file.
The rules section associates regular expression patterns with C statements. When the lexer sees text in the input matching a given pattern, it will execute the associated C code.
The C code section contains C statements and functions that are copied verbatim to the generated source file. These statements presumably contain code called by the rules in the rules section. In large programs it is more convenient to place this code in a separate file linked in at compile time.

3.3 Example of a Lex file

The following is an example Lex file for the flex version of Lex. It recognizes strings of numbers (integers) in the input, and simply prints them out.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


/*** Definition section ***/

%{
/* C code to be copied verbatim */
#include <stdio.h>
%}

/* This tells flex to read only one input file */
%option noyywrap

%%
    /*** Rules section ***/

    /* [0-9]+ matches a string of one or more digits */
[0-9]+  {
            /* yytext is a string containing the matched text. */
            printf("Saw an integer: %s\n", yytext);
        }

.|\n    {   /* Ignore all other characters. */   }

%%
/*** C Code section ***/

int main(void)
{
    /* Call the lexer, then quit. */
    yylex();
    return 0;
}

If this input is given to flex, it will be converted into a C file, lex.yy.c. This can be compiled into an executable which matches and outputs strings of integers. For example, given the input:

1

abc123z.!&*2gj6

The program will print:

Saw an integer: 123 Saw an integer: 2 Saw an integer: 6

3.4 YACC-Yet Another Compiler Compiler (Parser)

It is a LALR parser generator, generating a parser, the part of a compiler that tries to make syntactic sense of the source code.

For example, a C program may contain something like:

1
2
3
4
5


{
    int int;
    int = 33;
    printf("int: %d\n",int);
}

In this case, the lexical analyzer would have broken the input stream into a series of “tokens”, like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


{
int
int
;
int
=
33
;
printf
(
"int: %d\n"
,
int
)
;
}

Note that the lexical analyser (lex) has already determined that where the keyword int appears within quotes, it is really just part of a literal string. It is up to the parser to decide if the token int is being used as a keyword or variable. Or it may choose to reject the use of the name int as a variable name. The parser also ensures that each statement ends with a ; and that the brackets balance.

More precisely lexical analyzer (i.e. lex) checks correctness of the stream of characters and parser (i.e. yacc) checks correctness of the stream of tokens i.e. which token will be followed by which one etc.

3.5 Structure of a YACC source program

A YACC source program is structurally similar to a LEX one.

1
2
3
4
5


declarations
%%
rules
%%
routines

The declaration section may be empty. Moreover, if the routines section is omitted, the second %% mark may be omitted also.
Blanks, tabs, and newlines are ignored except that they may not appear in names.

THE DECLARATIONS SECTION

This section may contain the following items -

Declarations of tokens. Yacc requires token names to be declared as such using the keyword %token.
Declaration of the start symbol using the keyword %start
C declarations: included files, global variables, types.
C code between %{ and %}.

RULES SECTION

A rule has the form -

1
2
3
4
5


nonterminal : sentential form
            | sentential form
            .................
            | sentential form
            ;

Actions may be associated with rules and are executed when the associated sentential form is matched.

LEX-YACC INTERACTION

yyparse() calls yylex() when it needs a new token.

LEX	YACC
`return(TOKEN)`	%token TOKEN
	TOKEN is used in production

The external variable yylval

is used in a LEX source program to return values of lexemes,
yylval is assumed to be integer if you take no other action.
Changes related to yylval must be made
- in the definitions section of YACC specification
  - by adding new types in the following way %union { (type fieldname) (type fieldname) ............... }
  - and defining which token and non-terminals will use these types
    1 2
    
    %token <fieldname> token %type <fieldname> non-terminal
- in LEX specification by using the fieldnames in the assignment as follows
  1
  
  yylval.fieldname = ...........

If you need a record type, then add it in the union. Example:

1
2
3
4
5
6


%union {
   struct s {
      double fvalue;
      int ivalue;
   } t;
}

in the LEX specification use the record name and record field in assignments:

1

yylval.t.ivalue = ......

in the YACC rules specification use the record field only in the assignment:

1

$1.ivalue = ......

assuming that $1 has the appropriate type, whatever it denotes.

4 My Source code: Simple Parser using Lex and Yacc

4.1 How to run my code

Keep the parser folder anywhere in your computer.
Keep the Makefile.mk file in home.
Edit the first line of the Makefile.mk from “cd parser” to “cd path_of_the_parser_folder”
Go to command prompt and write ./Makefile.mk

4.2 Input

Initially input.txt is given in the parser folder. You can customize the inputs.
There is an additional input file named tricky_input.txt that contains some tricky cases.