parse-pcre2

This is an implementation of a PCRE2 (v 10.45) parser, using Text.ParserCombinators.ReadP. The syntactic rules are based on the man pages pcre2syntax and pcre2pattern.

It exports a function parsePCRE that produces an abstract syntax tree given a correct PCRE2 pattern.

Installation

Install Haskell stack

See for instance https://docs.haskellstack.org/en/stable/install_and_upgrade/

Install parse-pcre2

git clone [email protected]:david-wahlstedt/parse-pcre2.git
cd parse-pcre2
make install

Running make install will run the code generation script in code_gen/, and then run stack install. The reason we generate some code automatically is due to the long lists of character properties, with names as well as their shortcuts. This is based on the output of pcre2test -LP and pcre2test -LS (this only works for PCRE2 v 10.44 or higher); these outputs are in the pcre2test/ directory. If you want to generate property names for a different version of PCRE2, issue those commands and replace corresponding files' contents with the new outputs.

Usage

There is a main program that can be called from the command-line, for testing purposes, that expects standard input. If the input has several lines, these are considered as part of the pattern to parse.

Examples:

echo -n 'a|b.'|parse-pcre2
Alt [ Seq [ Lit 'a' ] , Seq [ Lit 'b' , Chartype CTAny ] ]

or from a file (we omit the output of space reasons)

parse-pcre2 < examples/date-1.txt

or for files with several lines, one regex each:

while IFS= read -r line || [ -n "$line" ]; do printf "%s" "$line" | parse-pcre2; done < examples/quoting.txt

Why Build a PCRE2 Parser in Haskell?

Most applications use regular expressions to match patterns in strings. However, there are cases where one might want to analyze the regular expression itself for various purposes.

In my case, I wanted to create a Haskell program that generates random strings matching a given PCRE2 expression—similar to regex-genex, but specifically for PCRE2. To do that, building a PCRE2 parser seemed like a natural first step.

The Haskell Regular Expressions page provides an overview of regex support, including C bindings for PCRE. However, the PCRE2 C library doesn't offer a function that returns an abstract syntax tree (AST) suitable for functional programming. As far as I know, no available tool or library can take a PCRE2 expression and return such an AST. While there is an ANTLR4 grammar for PCRE, it doesn’t account for option-sensitive parsing.

Features

Almost Complete PCRE2 Syntax: The parser supports all of PCRE2’s syntax, with only a few exceptions related to option sensitivity (such as ALT_BSUX for ECMAScript compatibility) and non-UTF-8 character encoding handling.
Dynamic option support: Options like (?x) and (?xx) that affect parsing—such as ignoring whitespace and treating #.*\n as comments—can be toggled on and off at any point in the expression. The parser also supports the (?n) option to turn off capture group numbering, and correctly handles the (?| ... ) construction, which resets group counters for each alternative inside it. While most PCRE options affect matching rather than syntax, our parser treats these options syntactically, leaving it up to the user to handle their semantics.
No matching functionality: This parser only deals with PCRE2 syntax. It produces a syntax tree without addressing the semantics of regex matching.
Lenient syntax handling: Some PCRE2 options—like those that control whether empty character classes are allowed—are always enabled in this parser (this behavior might change in future versions). In short, the parser allows some constructs that are semantically invalid but still generates a tree for them. For instance, the character class [z-a] is accepted and produces a range from z to a, even though this is not allowed by PCRE2. It’s up to the user to handle such cases appropriately.

PCRE2 sample expressions

The examples in examples/ are not yet systematic, but they cover many kinds of PCRE2 patterns. Some of them are taken from RFCs and YANG specifications, and some are hand-made for this parser. Some examples (under directories with names containing substring fail) should fail, but the others should succeed.

Limitations

UTF-8 assumed: The parser accepts option settings that control which encoding the pattern is expected to match, but they have no effect. We only deal with UTF-8 code points so far.
Error handling: There is no error handling: the parser just answers "No parse" if parse is unsuccessful.
Testing: There are still no automatic tests. We have tested the examples in the aforementioned examples directory successfully, and more.

Some remarks on the code:

Our main objective was to develop a functional parser and make it resemble a grammar as closely as possible. The code largely follows the principles of the "Simple Haskell" initiative. There is certainly room for improvement!

Abstract Syntax: The current abstract syntax reflects my position between two goals, based on initial intuition: 1) retaining all the information necessary to reconstruct the exact input PCRE2 pattern, and 2) maintaining only the minimal information required to match the same set of strings as the original expression. We aim for the constructors to be self-documenting but not overly verbose. Some constructors mirror their PCRE2 counterparts (like Unicode script names), while others simplify their intent, such as Many for *, Many1 for +, and RepMin Int for the {,N} quantifier. Certain constructs, such as character classes or lookaround, could potentially benefit from being records, to better express properties like positive/negative and forward/behind.
No Lexical analysis: Currently, the parser operates directly on characters, but we believe it may be beneficial to refactor it to include an initial lexical analysis pass.
Module Headers: There are currently no export lists, and several warnings have been suppressed for convenience. These are practical decisions, though they could be revisited in future versions for cleaner and more robust module design.

Acknowledgments

Special thanks to @Thomas-H, for some nice simplifying contributions and discussion! Also thanks to @bengtj100 for earlier discussions and feedback!

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
app		app
code_gen		code_gen
examples		examples
pcre2test		pcre2test
src		src
test		test
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
Setup.hs		Setup.hs
package.yaml		package.yaml
parse-pcre2.cabal		parse-pcre2.cabal
stack.yaml		stack.yaml
stack.yaml.lock		stack.yaml.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

parse-pcre2

Installation

Install Haskell stack

Install parse-pcre2

Usage

Why Build a PCRE2 Parser in Haskell?

Features

PCRE2 sample expressions

Limitations

Some remarks on the code:

Acknowledgments

About

Releases

Packages

Contributors 2

Languages

License

david-wahlstedt/parse-pcre2

Folders and files

Latest commit

History

Repository files navigation

parse-pcre2

Installation

Install Haskell stack

Install parse-pcre2

Usage

Why Build a PCRE2 Parser in Haskell?

Features

PCRE2 sample expressions

Limitations

Some remarks on the code:

Acknowledgments

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages