Regular Expressions in Java
I used Java regular expressions (java.util.regex
) for years uncritically. University taught me that regular expression are efficient.
Some day in my project I had the challenge to search a large amount of words (>1.000) in a large number of documents (>10.000). I remembered college and instead of searching each word in each document. I decided to search in each document for the regular expression <Word1>|<Word2>|...|<Word1000>
. That should have been far more efficient than searching each word in each document … However, regular expression is not regular expression and the fast automatons known to me from college are not the same that are provided by Java.
Before resolving my mistake and giving hints how it would have worked better, I want to give a small introduktion to regular expressions, finite automata and their special features. The understanding of finite automata is essiential for understanding why java provides regular expression that are not efficient as expected.
Finite Automata
The typical way from a regular expression to a matcher is transforming the expression to an automaton. From theoretical computer science we know different variants of automata for regular languages, all being either a nondeterministic finite automaton (NFA) or a deterministic finite automaton (DFA).
For better understanding I sketch a typical NFA and DFA for the pattern (ab?|ac*)c*d
, to point out the differences between both variants.
NFA
DFA
The DFA is more strict, having additional constraints to the structure of the automaton graph:
- no transition without input character (no ε-transitions)
- every node/state has at most one transition/edge for each known input character
The fact that there exist multiple variants of NFAs/DFAs is originating from different advantages and disadvantages of each variant. Let us have a closer look on following variants:
- Backtracking NFA
- Bitparallel NFA
- Thompson NFA
- Graph DFA
- Tabled DFA
Backtracking NFA
The most common automaton is the Backtracking NFA. The text is processed along the edges of the automaton with backtracking if there are no alternatives at some node. Overall this is comparable with a depth-first search in the automaton graph.
The runtime of such an algorithm is dependend on the number of paths in the graph, since it is possible that the best match is on the path that is selected as last alternative (> polynomial runtime). The Backtracking NFA is not fast, yet it provides several features that other automatons do not.
Backtracking NFAs support capturing groups (submatch extraction), i.e. tagged (usually with parentheses around) sub groups of a pattern could be captured while processing. Science (and the POSIX standard) favor the leftmost longest submatch extraction. I.e. all matching paths are considered, each giving a subgroup partition. The partition having a group at the most left position is preferred, if there are multiple partitions the partition with the longest group is preferred. The upper pattern applied to accccd
captures a sub group 1 acccc
, since the best partitioning is given by the path through the states 1-3-…-3-4-5 (1 and 3 belong to the group, 4 does not).
In practice (i.e. in Java, C#, Python, Perl, …) the greedy leftmost submatch extraction is dominant. This could be generated by just matching the pattern in the definition sequence (and capturing the groups on the fly, preferring longer groups over shorter ones). This strategy is faster thatn the POSIX variant, because the partitioning is available as the complete expression is matched (and not as all possible matchings were considered). The upper pattern applied to accccd
captures a sub group 1 a
, because the first matching path traverses the states 1-2-4-4-…-4-5 (1 and 2 belong to the group, 4 does not).
Another feature of Backtracking NFAs is lookaheads and lookbehinds (more generically lookarounds). They allow to evaluate pattern conditions at some position for future input (lookahead) or consumed input (lookbehind). Science claims that lookarounds do not add power to finite automaton (there is no language that is recognizable with lookarounds and not without). In practice translating an automaton with lookarounds to an DFA was never tried. The reason is probably, that compiling lookarounds is quite hard and memory intensive.
A feature of Backtracking-NFAs, that strictly adds power to an NFA (making it more than an NFA) is Backreferences. Backreferences allow to refer to once captured groups again in the pattern.
Backtracking-NFA regular expressions are provided by most programming languages, e.g. Java, .NET, Perl or Python.
Thompson-NFA
Epsilon-Transitions and Alternatives with equal input character may lead to many target nodes. The Thompson NFA traverses the graph in a way that it holds a set of active states (initially the start state) performing all possible transitions (on all active states) on receiving an input character (the targets of such a transition are the new active states). This corresponds to a breadth-first search in the automaton graph.
The runtime of this algorithm is more predictable that the runtime of a Backtracking NFA. The parallel-traversing and caching of the actives states needs runtime and memory, yet the worst case coverage stays polynomial.
There are solutions providing capturing groups for Thompson NFAs, yet such groups deviate from the groups of a Backtracking NFA which are the de facto standard.
More detailed information on Thompson NFAs could be found at Russ Cox’ Website. The Thompson NFA is not very popular, in Java there is one library re2j.
Bitparallel-NFA
Even more parallel is the bitparallel NFA (proposed by Baeza-Yates and Navarro). The idea is to assign each active state a position in a bitset. The whole state of the automaton is characterized by the set of active bits. A Transition to the next state is accomplished by a “multiplication” with a transition matrix. There are some variants of bitparallel automatons, we will focus on the bitparallel Glushkov Automaton.
Bitparallel NFAs need some preprocessing, but much less than the preprocessing for a DFA. The best performance is reached if the set of bits fit into a computer word (64).
The bitparallel NFA is more flexible than a DFA. A special strength of a bitparallel NFA is search, because search for patterns does need a reverse matching automaton and this is easily derived from a bitparallel NFA. Morevoer the memory consumption could be easily adapted to the provided memory (by trading it with runtime).
Literature on Bitparallel NFAs could be found here. An implementation in Java could be found at stringsearchalgorithms (BPGlushkov).
Graph-DFA
A DFA may be traversed like a NFA as graph. The DFA property ensures, that each state and each input character can trigger et most one transition (if no transition is qualified, that the automaton rejects). The question whether to use depth-first or breadth-first search needs not to be answered, because each input sequence has exactly one path through the graph.
The runtime of a Graph DFAs is linear. Yet the preprocessing is expensive, Advantages over an NFA can only be reached when the automaton is used repeatedly. The Graph DFA is slower than a Tabled DFA, but also slightly more flexible.
The representation as graph has the advantage that edges could be annotated with tags or events which could emit additional output. Such output allows capturing groups in DFAs. Yet these capturing groups are similar to the Thompson NFA capturing groups and thus not compilant with the de facto standard of a Backtracking NFA.
An Implementation in Java could be found at brics, Rexlex, patternsearchalgorithms.
Tabled-DFA
Even faster than the Graph DFA is the Tabled DFA, which is the result of transforming the graph into a state transition table.
The runtime of a Tabled DFA is linear, yet the operations per transitions are reduced and even cache hits could be optimized through higher locality of data.
The transition table makes the algorithms somewhat inflexible, it is not that easy to tag state transitions with output.
An Implementation in Java could be found at brics, Rexlex, patternsearchalgorithms.
Regular Expression Search
Let us return to our original problem: We search <Word1>|<Word2>|...|<Word1000>
in a large number of documents with java.util.regex
. Why is it that slow?
The reason is:
* java.util.regex
has a Backtracking NFA engine. This means that search in each document degrade to trying each alternative (1000 alternatives). The algorithm is in no way more efficient than the naive solution (searching each word in each document)
* the idea was to use java.util.regex
at each position in the document. This means a failed matchs leads to a shift of one character and restarting the match on the next character. Many string search algorithms do better and avoid reading characters already read.
* the number of documents is only critical, because the effects of a slow algorithm are only revealed in presence of large numbers.
Would could we have improved?
At first we could have improved the automaton - not to check 1000 alternatives explicitly but to check the input stream to contain any of these patterns - the Thompson-NFA, the Bitparallel-NFA, the Tabled-DFA are valid alternatives. The Thompson-NFA appears to be the most robust (slower in search, but complex patterns do not affect preprocessing). On Average the fastest searches are performed with Tabled-DFAs, yet these are very costly at preprocessing complex patterns (sometimes domating search time!). Between them lies the Bitparallel-NFA.
The next step is improving the search algorithm. The naive algorithms starts a matching at each position of the document. A faster algorithm would skip all characters that could not belong to the pattern, after a match it should resume after this match. We can do this by modifying the automaton to .*(<Word1>|<Word2>|...|<Word1000>)
(which searches the first occurence of the pattern, not necessarily starting at the current position). This searches runtime is linear dependent on the number of characters in the document (and independent from the complexity of the pattern). Unfortunately such a modified automaton is definitely more complex than the original automaton (with critical effects to the preprocessing time of Tabled-DFA).
After a solution approach for the search with automatons I researched some time how others solved this problem. Finally I recognized that I had invested time on optimizing a known solution, instead of finding a better solution. There seem to exist many efficient algorithms for the search of multiple words in a document. Sadly these algorithms did not have Java implementations, so this was the initial motivation to forge stringsearchalgorithms, which provides more than one algorithm for multi-string search.
Regular Expressions in Java
Search for alternatives is obviously not a task which should be solved by regular expressions. But there are situations in which we depend on efficient regular expressions. It would be interesting to know which automaton (respectively which library) is suitable for which problem. First we should determine the parameters:
- Are the patterns user input or literals?
- Are we interested in searching the pattern or matching the pattern?
- How long is the average pattern?
- How often do we use a single pattern?
- Do we really need the full power of regular expressions?
- Do we really need nondeterminism (capturing groups, lookarounds)?
If the patterns are build of user input, then there is few scope for optimizing manually. Consequently the next optimization step is the transformation to a faster automaton (Thompson NFA, Bitparallel NFA or DFA). Constant regular expressions may be optimized manually, often reaching a performance boost that allows (if other parameters are uncritical) to omit this transformation.
The decision whether to search or to match is important, because search needs much more complex automatons. The Backtracking NFA is really bad for search. Smaller expressions could be efficiently searched with Bit-Parallel NFAs (short preprocessing), larger will probably need a DFA.
The optimization effect for short simple patterns is much smaller than for long complex patterns. Malicious regular expressions contain alternatives nested in repeated sections (high performance boost on transformation). Quite simple are flat expressions (no alternatives), that do only contains wildcards, characterclasses and optional characters (low performance boost on transformation).
The more often a pattern is applied, the more important is the optimization to a DFA. Patterns without reuse may be excluded from further optimizations because the generation time of a DFA gets dominant in such cases.
Often we do not really need the full power of regular expressions. Sometimes we can restrict the problem to multi-string search (no Kleene-Star (*
) allowed) or wildcard search (no alternatives allowed), which are both efficient without regular expressions.
There are features of “nondeterministic” regular exprssion, that cannot be easily mimicked: capturing groups and backreferences. This means we need an NFA. In case of regular expression search it yet might be efficient to collect the matches with a more efficient equivalent DFA and then apply validation and submatch extraction with the NFA.
There are also features, that could be avoided: lookarounds. The may be convenient in daily work (we usually do not process billions of bytes). However, in most cases we can find an equivalent regular expression without lookarounds. This is not only more efficient (without regard of the used automaton), but often also easier to understand. Even if the regular expressions are computed from user input we should consider whether to grant the user this flexibility.