Exact String Matching in Practice
Initially I was interested in string search from an academic view point - just understand the algorithm and be able to implement it. After this passion culminated in the project (StringSearchAlgorithms), I began to compare it to others …
It became obvious that many algorithms worked in theory, but were slow or even failing in practice. Actually most academic algoirthms on string search are optimized for a certain situation dependent on:
- processors
- encoding
- programming languages and APIs
The following sections describe the challenges string search is confronted to in practice.
I shall compare the different aspects for 4 java libraries:
Chars or Bytes (IO)
Many algorithms were designed in the ages of C, where the standard character size was 8 bit. Depending on programming language and encoding these assumption is not realistic any more:
- bytes in Java have 8 bit, bytes are the basic unit for files, but strings prefer chars
- chars in Java have 16 bit, chars are the basic unit of strings, but files prefer bytes
- UTF-8 is the most popular encoding scheme used for files. Java does not even have a type mapping 1:1 on UTF-8
So the Decision whether search should be based on byte[]
or char[]
is not easy:
The char[]
-based search works best if both, pattern and text, are of type String
- or char[]
. Then there is no need to decode to bytes and the search processes 2 bytes per character. Yet the char[]
-based search is inappropriate, if applied to large files. The text (which could be der potentially very long) has to be encoded completely (from byte[]
to char[]
). Some algorithms need random access on each char, this makes it worse, since common char encodings (UTF-8) do not allow to determine the byte index of a given char index n if not n-1 chars before have been encoded already (the problem does not occur if we restrict our files to charsets of fixed bit length, but this is not realistic).
On the other hand byte[]
-based search does not perform optimal for String
s. The string has to be decoded to bytes. Files do not suffer from this disadvantage, java can access the bytes of a file with random-access or as stream. Yet bytes are not that convenient. The pattern to search (usually a string) has to be converted to byte[]
- and we need the encoding of the concrete file to do the conversion with the correct char set.
Applied to the libraries under test:
- ESMAJ is char-based.
- StringSearch may use bytes and chars.
- ByteSeek is byte-based and support primitive converts on char-based input
- StringSearchAlgorithms support both bytes and chars.
Chars or Bytes (Algorithms)
Inspired by Efficient Text Searching in Java I analyzed each library on unicode-compatibility. Actually the question whether to support char
or byte
is not only relevant for IO, but also for the algorithm itself. Many string-matching algorithms are based on lookup tables containing entries for each character. Most algorithms were designed with 8 bit characters (256 entries) in mind, but today a typical character has 16 bit (Unicode) which means 65,536 entries. This means:
- the preprocessing is slowed down by a factor of 256
- the amount of memory to store the lookup table increases by the factor 256
- the lookup table is highly redudant or sparse for typical european patterns and texts
There are different methods to handle this problem:
- the problem does not exist, if the search is done based on bytes, the pattern gets longer in this case, but the table may not grow above 256 entries
- an algorithm without lookup table is not affected by unicode
- the table may be compressed, there are different strategies to do this:
- In many situations the range of valid character codes is dense. One could cut of all entries below and above this range (with some luck, e.g. ASCII-characters the range is even smaller than 256). The lookup will be slower because the key has to be mapped to the range and checked against the range limits.
- Some algorithms use tables for heuristics. Such heuristics have the property that they should not overestimate the result. A possible compression will be to map all entries to a smaller range (colliding keys will adjust the heuristic value to the minimum of all colliding values). Ideally there are no colliding keys, otherwise the algorithm will be somewhat slower.
Applied to the libraries under test:
- ESMAJ does not adjust the algorithms to unicode: Consequently algorithms using tables have 65.536 entries. ESMAJ consumes large amounts of memory. The most critical algorithms probably are those that are dependent on automata, where each automaton has a lookup table of this size.
- StringSearch compresses tables in a lossless way (as far as I could determine). A disadvantage of this lossless compression is, that a single lookup affords multiple computations.
- ByteSeek does not need to adjust its algorithms, because it is already
byte[]
-based (and thus not affected by the unicode issue). - StringSearchAlgorithms provides lossy and lossless compression of lookup tables (dependent on the algorithms).
Strings, Arrays oder Streams
Another aspect is how chars/bytes are provided. Really large files should not be processed as char[]
/byte[]
, because this would consume large parts of the memory, that otherwise could be reclaimed. The more memory efficient way to process large files would be to search in streams, readers or buffers.
Applied to the libraries under test:
- ESMAJ was not designed to support more than search in
char[]
. It ischar[]
-based and does not support search in streams or readers. Scalability for larger files is not given. - StringSearch provides search in
char[]
andbyte[]
, yet no search in streams, readers or buffers. Scalability for larger files is not given. - ByteSeek is
byte[]
-based and optimized to search in buffers or streams. ByteSeek is clearly scalable for larger files. - StringSearchAlgorithms is
char[]
- orbyte[]
-based and support search in readers or streams.
Fazit
Considering all libraries, ESMAJ and StringSearch proved not to scale on large text, since both cannot search in readers, streams or buffers. ESMAJ has a crucial performance overhead for char tables or char automata (causing multiples of 65,536 entries) which gets dominant over the algorithms performance. StringSearch does also suffer from this overhead on alphabets that use a wide range of unicode code points.
More practical prove to be ByteSeek (byte[]
-based and stream-searchable) and StringSearchAlgorithms (char[]
/byte[]
-variants and reader/stream-searchable). ByteSeek (and the byte[]
-variant of StringSeachAlgorithms) are optimal on large alphabets (because implicitly mapping to bytes). The char[]
-variant of StringSearchAlgorithms suffers from performance overhead if lossless table compression is chosen.
A practical Comparison of thes libraries considering different benchmarks can be found at StringBench.