This is an implementation of a PCRE2 (v 10.45) parser, using Text.ParserCombinators.ReadP. The syntactic rules are based on the man pages pcre2syntax and pcre2pattern.
It exports a function parsePCRE
that produces an abstract syntax
tree given a correct PCRE2 pattern.
See for instance https://docs.haskellstack.org/en/stable/install_and_upgrade/
git clone [email protected]:david-wahlstedt/parse-pcre2.git
cd parse-pcre2
make install
Running make install
will run the code generation script in
code_gen/, and then run stack install
. The reason we
generate some code automatically is due to the long lists of character
properties, with names as well as their shortcuts. This is based on
the output of pcre2test -LP
and pcre2test -LS
(this only works for
PCRE2 v 10.44 or higher); these outputs are in the
pcre2test/ directory. If you want to generate property
names for a different version of PCRE2, issue those commands and
replace corresponding files' contents with the new outputs.
There is a main program that can be called from the command-line, for testing purposes, that expects standard input. If the input has several lines, these are considered as part of the pattern to parse.
Examples:
echo -n 'a|b.'|parse-pcre2
Alt [ Seq [ Lit 'a' ] , Seq [ Lit 'b' , Chartype CTAny ] ]
or from a file (we omit the output of space reasons)
parse-pcre2 < examples/date-1.txt
or for files with several lines, one regex each:
while IFS= read -r line || [ -n "$line" ]; do printf "%s" "$line" | parse-pcre2; done < examples/quoting.txt
Most applications use regular expressions to match patterns in strings. However, there are cases where one might want to analyze the regular expression itself for various purposes.
In my case, I wanted to create a Haskell program that generates random strings matching a given PCRE2 expression—similar to regex-genex, but specifically for PCRE2. To do that, building a PCRE2 parser seemed like a natural first step.
The Haskell Regular Expressions page provides an overview of regex support, including C bindings for PCRE. However, the PCRE2 C library doesn't offer a function that returns an abstract syntax tree (AST) suitable for functional programming. As far as I know, no available tool or library can take a PCRE2 expression and return such an AST. While there is an ANTLR4 grammar for PCRE, it doesn’t account for option-sensitive parsing.
-
Almost Complete PCRE2 Syntax: The parser supports all of PCRE2’s syntax, with only a few exceptions related to option sensitivity (such as
ALT_BSUX
for ECMAScript compatibility) and non-UTF-8 character encoding handling. -
Dynamic option support: Options like
(?x)
and(?xx)
that affect parsing—such as ignoring whitespace and treating#.*\n
as comments—can be toggled on and off at any point in the expression. The parser also supports the(?n)
option to turn off capture group numbering, and correctly handles the(?| ... )
construction, which resets group counters for each alternative inside it. While most PCRE options affect matching rather than syntax, our parser treats these options syntactically, leaving it up to the user to handle their semantics. -
No matching functionality: This parser only deals with PCRE2 syntax. It produces a syntax tree without addressing the semantics of regex matching.
-
Lenient syntax handling: Some PCRE2 options—like those that control whether empty character classes are allowed—are always enabled in this parser (this behavior might change in future versions). In short, the parser allows some constructs that are semantically invalid but still generates a tree for them. For instance, the character class
[z-a]
is accepted and produces a range fromz
toa
, even though this is not allowed by PCRE2. It’s up to the user to handle such cases appropriately.
The examples in examples/ are not yet systematic, but
they cover many kinds of PCRE2 patterns. Some of them are taken from
RFCs and YANG specifications, and some are hand-made for this parser.
Some examples (under directories with names containing substring
fail
) should fail, but the others should succeed.
-
UTF-8 assumed: The parser accepts option settings that control which encoding the pattern is expected to match, but they have no effect. We only deal with UTF-8 code points so far.
-
Error handling: There is no error handling: the parser just answers "No parse" if parse is unsuccessful.
-
Testing: There are still no automatic tests. We have tested the examples in the aforementioned
examples
directory successfully, and more.
Our main objective was to develop a functional parser and make it resemble a grammar as closely as possible. The code largely follows the principles of the "Simple Haskell" initiative. There is certainly room for improvement!
-
Abstract Syntax: The current abstract syntax reflects my position between two goals, based on initial intuition: 1) retaining all the information necessary to reconstruct the exact input PCRE2 pattern, and 2) maintaining only the minimal information required to match the same set of strings as the original expression. We aim for the constructors to be self-documenting but not overly verbose. Some constructors mirror their PCRE2 counterparts (like Unicode script names), while others simplify their intent, such as Many for *, Many1 for +, and RepMin Int for the {,N} quantifier. Certain constructs, such as character classes or lookaround, could potentially benefit from being records, to better express properties like positive/negative and forward/behind.
-
No Lexical analysis: Currently, the parser operates directly on characters, but we believe it may be beneficial to refactor it to include an initial lexical analysis pass.
-
Module Headers: There are currently no export lists, and several warnings have been suppressed for convenience. These are practical decisions, though they could be revisited in future versions for cleaner and more robust module design.
Special thanks to @Thomas-H, for some nice simplifying contributions and discussion! Also thanks to @bengtj100 for earlier discussions and feedback!