This manual describes flexc++, a tool for generating lexical scanners: programs recognizing patterns in text. Usually, scanners are used in combination with parsers, which can be generated by, e.g., bisonc++.
Flexc++ reads one or more input files (called `lexer' in this manual),
containing rules: regular expressions, optionally associated with C++
code. From this Flexc++ generates several files, containing the declaration and
implementation of a class (Scanner by default). The member function
lex is used to analyze input: it looks for text matching the regular
expressions. Whenever it finds a match, it executes the associated C++
code.
Flexc++ is highly comparable to the programs flex and flex++, written by Vern Paxson. Our goal was to create a similar program, completely implementing it in C++, and merely generating C++ code. Most flex / flex++ grammars should be usable with flexc++, with minor adjustments (see also `differences with flex/flex++ 2').
This edition of the manual documents version 2.15.00 and provides detailed information on flexc++'s use and inner workings. Some texts are adapted from the flex manual. The manual page flexc++(1) provides an overview of the command line options and option directives, flexc++api(3) provides an overview of the application programmer's interface, and flexc++input(7) describes the organization of flexc++'s input files.
The most recent version of both this manual and flexc++ itself can be found at https://fbb-git.gitlab.io/flexcpp/. If you find a bug in flexc++ or mistakes in the documentation, please report them to the authors.
Flexc++ was designed and written by Frank B. Brokken, Jean-Paul van Oosten, and (up to version 0.5.3) Richard Berendsen.
Contrary to flex and flex++, flexc++ generates code that is
explicitly intended for use by C++ programs. The well-known flex(1)
program generates C source-code and flex++(1) merely offers a
C++-like shell around the yylex function generated by flex(1) and
hardly supports present-day ideas about C++ software design.
Flexc++ creates a C++ class offering a predefined member function lex which matches input against regular expressions and possibly executes C++ code once regular expressions are matched. The code generated by flexc++ is pure C++, allowing its users to apply all of the features offered by that language.
Flexc++'s synopsis is:
flexc++ [OPTIONS] rules-file
Its options are covered in section 1.1.1; the format of its
rules-file is discussed in chapter 3.
Options accepting a 'pathname' may contain directory separators (/).
Some options may generate errors. This happens when an option conflicts with
the contents of an existing file which flexc++ cannot modify (e.g., a scanner
class header file exists but does not define a namespace, although a
--namespace option was provided). To resolve the error the offending option
can be omitted, the existing file can be removed, or the existing file
can be hand-edited according to the option's specification. Note that flexc++
currently does not handle the opposite error condition: if a previously used
option is omitted, then flexc++ does not detect the inconsistency. In those
cases you may encounter compilation errors.
filename (-b): use filename as the name of the file to contain the scanner
class's base class. Defaults to the name of the scanner class plus
base.h.
It is an error if this option is used and an already
existing scanner-class header file does not include
`filename'.
pathname (-H): pathname defines the path to the file preincluded in the
scanner's base-class header. This option (or the corresponding
directive) may be required when using a user-defined Input
class using types etc. which aren't yet provided by the scanner's
base class header. E.g., if a user-defined Input class uses a
stack, then #include <stack> must have been declared when
compiling that class (see the description of the %input-...
declarations in the flexc++input(7) man-page for details). By
specifying this option (or the corresponding directive) the
required header(s) are included when compiling the files generated
by flexc++. By default the pathname argument is surrounded by
double quotes (which can also explicitly be
provided). Alternatively, if pathname is surrounded by pointed
brackets then those are used.
pathname (-C): use pathname as the path to the file containing the skeleton of
the scanner class's base class. Its filename defaults to
flexc++base.h.
When this option is specified the resulting scanner does not distinguish between the following rules:
First // initial F is transformed to f
first
FIRST // all capitals are transformed to lower case chars
With a case-insensitive scanner only the first rule can be matched,
and flexc++ will issue warnings for the second and third rule about
rules that cannot be matched.
Input processed by a case-insensitive scanner is also handled case
insensitively. The above mentioned First rule is matched for
all of the following input words: first First FIRST firST.
Although the matching process proceeds case insensitively, the
matched text (as returned by the scanner's matched() member)
always contains the original, unmodified text. So, with the above
input matched() returns, respectively first, First, FIRST
and firST, while matching the rule First.
filename (-c): use filename as the name of the file to contain the scanner
class. Defaults to the name of the scanner class plus the suffix
.h.
className: use className (rather than Scanner) as the name of the
scanner class. Unless overridden by other options generated files
will be given the (transformed to lower case) className* name
instead of scanner*.
It is an error if this option is used and an already
existing scanner-class header file does not define class
`className'
pathname (-C): use pathname as the path to the file containing the skeleton of
the scanner class. Its filename defaults to flexc++.h.
`rules-file'.output. Details cover the used character ranges,
information about the regexes, the raw NFA states, and the final
DFAs.
lex and its support functions with debugging code,
showing the actual parsing process on the standard output
stream. When included, the debugging output is active by default,
but its activity may be controlled using the setDebug(bool
on-off) member. Note that #ifdef DEBUG macros are not used
anymore. By rerunning flexc++ without the --debug option an
equivalent scanner is generated not containing the debugging
code. This option does not provide debug information about flexc++
itself. For that use the options --own-parser and/or
--own-tokens (see below).
genericName (-f)lex-function source file, see the --lex-source option for
that). By default the header file names will be equal to the name
of the generated class.
filename (-i): use filename as the name of the file to contain the
implementation header. Defaults to the name of the generated
scanner class plus the suffix .ih. The implementation header
should contain all directives and declarations only used by
the implementations of the scanner's member functions. It is the
only header file that is included by the source file containing
lex()'s implementation. User defined implementation of other
class members may use the same convention, thus concentrating all
directives and declarations that are required for the compilation
of other source files belonging to the scanner class in one header
file.
It is an error if this option is used and an already existing
'filename' file does not include the scanner class header
file.
pathname (-I): use pathname as the path to the file containing the skeleton of
the implementation header. Its filename defaults to
flexc++.ih.
pathname (-L): use pathname as the path to the file containing the
lex() member function's skeleton. Its filename defaults to
flexc++.cc.
funname: use funname rather than lex as the name of the member
function performing the lexical scanning.
filename (-l): use filename as the name of the source file to contain the
scanner member function lex. Defaults to lex.cc.
--debug option.
Displaying the matched rules can be suppressed by calling the
generated scanner's member setDebug(false) (or, of course, by
re-generating the scanner without specifying
--matched-rules).
depth (-m): the maximum depth of nested specification files. By default
the maximum depth is set to 10. When more than depth specification files
are used the scanner throws a std::length_error exception (Max stream
stack size exceeded).
identifier: use identifier as the namespace. By default
no namespace is used. If this option is used the
implementation header is provided with a commented out using
namespace declaration for the requested namespace. In addition,
the scanner and scanner base class header files also use the
specified namespace to define their include guard directives.
It is an error if this option is used and an already existing
scanner-class header file does not define namespace
identifier.
lex function. By default #line directives
are entered at the beginning of the action statements in the
generated lex.cc file, allowing the compiler and debuggers
to associate errors with lines in your grammar specification
file, rather than with the source file containing the lex
function itself.
lex member function is
(re)written each time flexc++ is called. This option
should normally be avoided, as this file contains parsing
tables which are altered whenever the grammar definition is
modified.
This option does not result in the generated program optionally
displaying the actions of its lex function. If that is what
you want, use the --debug option.
This option does not result in the generated program displaying
returned tokens and matched text. If that is what you want, use
the --print-tokens option.
lex function are displayed on the standard
output stream, just before returning the token to lex's
caller. Displaying tokens and matched text is suppressed again
when the lex.cc file is generated without using this
option. The function showing the tokens (ScannerBase::print_)
is called from Scanner::printTokens, which is defined in-line
in Scanner.h. Calling ScannerBase::print_, therefore, can
also easily be controlled by an option controlled by the program
using the scanner object.
This option does not show the tokens returned and text matched
by flexc++ itself when reading its input files. If that is what
you want, use the --own-tokens option.
pathname (-S)-B -C, -H, and -I).
pathname
--construction and --show-filenames options.
%%
[_a-zA-Z][_a-zA-Z0-9]*     return 1;
The main() function below defines a Scanner object, and calls lex() as
long as it does not return 0. lex() returns 0 when the end of the input
stream is reached. (By default std::cin is used.)
#include <iostream>
#include "Scanner.h"
using namespace std;
int main()
{
    Scanner scanner;
    while (scanner.lex())
        cout << "[Identifier: " << scanner.matched() << "]";
}
Each identifier on the input stream is replaced by itself and some surrounding
text. By default, flexc++ echoes all characters it cannot match to cout. If
you do not want this, simply use the following pattern:
%%
[_a-zA-Z][_a-zA-Z0-9]*     return 1;
.|\n                       // ignore
The second pattern causes the scanner to ignore all characters on the input stream that are not matched by the first pattern. The first pattern still matches all identifiers, even those consisting of only one character. The second pattern has no associated action, and that is precisely what happens in lex when it is matched: nothing. The stream is simply scanned for more characters.
It is also possible to let the generated lexer do all the work. The simple lexer below shows all encountered identifiers.
%%
[_a-zA-Z][_a-zA-Z0-9]*     {
                               std::cout << "[Identifier: " << matched() << "]\n";
                           }
.|\n                       // ignore
Note how a compound statement may be used instead of a one-line statement at
the end of the line. The opening brace must appear on the same line as the
pattern, however. Also note that inside an action, we can use Scanner's
members. E.g., matched() returns the text of the token that was last
matched. The following main function can be used to activate the
generated scanner.
#include "Scanner.h"
int main()
{
    Scanner scanner;
    scanner.lex();
}
Note how simple this function is. Scanner::lex() does not
return until the entire input stream has been processed, because none of
the patterns has an associated action using a return statement.
Command-line editing and history are provided by the Gnu readline library. The
bobcat library offers a class
FBB::ReadLineStream encapsulating the Gnu readline library's facilities.
This class is used by the following example to implement the required
features.
The lexical scanner is a simple one. It recognizes C++ identifiers and
\n characters, and ignores all other characters. Here is its
specification:
%class-name Scanner
%interactive
%%
[[:alpha:]_][[:alnum:]_]* return 1;
\n return '\n';
.
Create the lexical scanner from this specification file:
flexc++ lexer
Assuming that the directory containing the specification file also
contains the file main.cc whose implementation is shown below, then
execute the following command to create the interactive scanner program:
g++ *.cc -lbobcat
This completes the construction of the interactive scanner. Here is
the file main.cc:
#include <iostream>
#include <bobcat/readlinestream>
#include "Scanner.h"
using namespace std;
using namespace FBB;
int main()
{
    ReadLineStream rls("? ");       // create the ReadLineStream, using "? "
                                    // as a prompt before each line
    Scanner scanner(rls);           // pass `rls' to the interactive scanner
                                    // process all the line's tokens
                                    // (the prompt is provided by `rls')
    while (int token = scanner.lex())
    {
        if (token == '\n')          // end of line: new prompt
            continue;
                                    // process other tokens
        cout << scanner.matched() << '\n';
        if (scanner.matched()[0] == 'q')
            return 0;
    }
}
An interactive session with the above program might look like this
(end-of-line comment is not entered, but was added by us for documentary
purposes):
$ a.out
? hello world // enter some words
hello
world // echoed after pressing Enter
? hello world // this is shown after pressing up-arrow
? hello world^H^H^Hman // do some editing and press Enter
hello // the tokens as edited are returned
woman
? q // end the program
$
The interactive scanner only supports one constructor, which by default
reads from std::cin and by default writes to std::cout:
explicit Scanner(std::istream &in = std::cin,
                 std::ostream &out = std::cout);
Interactive scanners only support switching output streams (through
switchOstream members).