Skip to content

david-wahlstedt/parse-pcre2

Repository files navigation

parse-pcre2

This is an implementation of a PCRE2 (v 10.45) parser, using Text.ParserCombinators.ReadP. The syntactic rules are based on the man pages pcre2syntax and pcre2pattern.

It exports a function parsePCRE that produces an abstract syntax tree given a correct PCRE2 pattern.

Installation

Install Haskell stack

See for instance https://docs.haskellstack.org/en/stable/install_and_upgrade/

Install parse-pcre2

git clone [email protected]:david-wahlstedt/parse-pcre2.git
cd parse-pcre2
make install

Running make install will run the code generation script in code_gen/, and then run stack install. The reason we generate some code automatically is due to the long lists of character properties, with names as well as their shortcuts. This is based on the output of pcre2test -LP and pcre2test -LS (this only works for PCRE2 v 10.44 or higher); these outputs are in the pcre2test/ directory. If you want to generate property names for a different version of PCRE2, issue those commands and replace corresponding files' contents with the new outputs.

Usage

There is a main program that can be called from the command-line, for testing purposes, that expects standard input. If the input has several lines, these are considered as part of the pattern to parse.

Examples:

echo -n 'a|b.'|parse-pcre2
Alt [ Seq [ Lit 'a' ] , Seq [ Lit 'b' , Chartype CTAny ] ]

or from a file (we omit the output of space reasons)

parse-pcre2 < examples/date-1.txt

or for files with several lines, one regex each:

while IFS= read -r line || [ -n "$line" ]; do printf "%s" "$line" | parse-pcre2; done < examples/quoting.txt

Why Build a PCRE2 Parser in Haskell?

Most applications use regular expressions to match patterns in strings. However, there are cases where one might want to analyze the regular expression itself for various purposes.

In my case, I wanted to create a Haskell program that generates random strings matching a given PCRE2 expression—similar to regex-genex, but specifically for PCRE2. To do that, building a PCRE2 parser seemed like a natural first step.

The Haskell Regular Expressions page provides an overview of regex support, including C bindings for PCRE. However, the PCRE2 C library doesn't offer a function that returns an abstract syntax tree (AST) suitable for functional programming. As far as I know, no available tool or library can take a PCRE2 expression and return such an AST. While there is an ANTLR4 grammar for PCRE, it doesn’t account for option-sensitive parsing.

Features

  • Almost Complete PCRE2 Syntax: The parser supports all of PCRE2’s syntax, with only a few exceptions related to option sensitivity (such as ALT_BSUX for ECMAScript compatibility) and non-UTF-8 character encoding handling.

  • Dynamic option support: Options like (?x) and (?xx) that affect parsing—such as ignoring whitespace and treating #.*\n as comments—can be toggled on and off at any point in the expression. The parser also supports the (?n) option to turn off capture group numbering, and correctly handles the (?| ... ) construction, which resets group counters for each alternative inside it. While most PCRE options affect matching rather than syntax, our parser treats these options syntactically, leaving it up to the user to handle their semantics.

  • No matching functionality: This parser only deals with PCRE2 syntax. It produces a syntax tree without addressing the semantics of regex matching.

  • Lenient syntax handling: Some PCRE2 options—like those that control whether empty character classes are allowed—are always enabled in this parser (this behavior might change in future versions). In short, the parser allows some constructs that are semantically invalid but still generates a tree for them. For instance, the character class [z-a] is accepted and produces a range from z to a, even though this is not allowed by PCRE2. It’s up to the user to handle such cases appropriately.

PCRE2 sample expressions

The examples in examples/ are not yet systematic, but they cover many kinds of PCRE2 patterns. Some of them are taken from RFCs and YANG specifications, and some are hand-made for this parser. Some examples (under directories with names containing substring fail) should fail, but the others should succeed.

Limitations

  • UTF-8 assumed: The parser accepts option settings that control which encoding the pattern is expected to match, but they have no effect. We only deal with UTF-8 code points so far.

  • Error handling: There is no error handling: the parser just answers "No parse" if parse is unsuccessful.

  • Testing: There are still no automatic tests. We have tested the examples in the aforementioned examples directory successfully, and more.

Some remarks on the code:

Our main objective was to develop a functional parser and make it resemble a grammar as closely as possible. The code largely follows the principles of the "Simple Haskell" initiative. There is certainly room for improvement!

  • Abstract Syntax: The current abstract syntax reflects my position between two goals, based on initial intuition: 1) retaining all the information necessary to reconstruct the exact input PCRE2 pattern, and 2) maintaining only the minimal information required to match the same set of strings as the original expression. We aim for the constructors to be self-documenting but not overly verbose. Some constructors mirror their PCRE2 counterparts (like Unicode script names), while others simplify their intent, such as Many for *, Many1 for +, and RepMin Int for the {,N} quantifier. Certain constructs, such as character classes or lookaround, could potentially benefit from being records, to better express properties like positive/negative and forward/behind.

  • No Lexical analysis: Currently, the parser operates directly on characters, but we believe it may be beneficial to refactor it to include an initial lexical analysis pass.

  • Module Headers: There are currently no export lists, and several warnings have been suppressed for convenience. These are practical decisions, though they could be revisited in future versions for cleaner and more robust module design.

Acknowledgments

Special thanks to @Thomas-H, for some nice simplifying contributions and discussion! Also thanks to @bengtj100 for earlier discussions and feedback!

About

A PCRE2 parser based on Haskell's ReadP parser combinators

Resources

License

Stars

Watchers

Forks

Packages

No packages published