String Search in Java (using the JDK-API)
String search is a common problem, appearing every time we check that a string is contained in another. The JDK Standard API provides some basic functionality for this, yet there are more efficient algorithms for larger patterns and larger texts.
In this posting I present an overview over the string searching features of the Java API, starting with a short definition of string searching and ending with an overview of libraries that provide more features for string searching (more details will follow in later postings).
String-Searching
String searching (or String matching) tries to find a pattern in a text. The result is the set of matches in the text. Contrary to regular expression searching the pattern must be a pure string (not containing wildcards or operators). In the following examples we will call the pattern variable needle
, and the text haystack
.
String Searching with java.lang.String
The most familiar way to use string searching in Java is provided by the class java.lang.String
- as the method String.indexOf
. If the position in the pattern is not necessary the method String.contains
is a even simpler alternative. A typical string matching would look like this:
int find(String needle, String haystack) {
return haystack.indexOf(needle);
}
Java uses a naive string searching algorithm, to determine the position of a match. Naive means that the text is processed character by character, looking for a match with the first character of the pattern. After encountering the first character match a new subroutine is started, comparing each following character with the following characters in the pattern and breaking on a character mismatch. This algorithm is sufficient for small patterns and small texts, but performs quite inefficient if used for long texts with real words.
String Searching with java.util.Matcher
Another not so obvious way to use string searching is provided by java.util.Matcher
with the method java.util.Matcher.find
. Properly this method searches for regular expressions, so one would not expect string search here. Yet each string literal has its equivalent as regular expression and so it is technically possible to search strings with the regular expression API:
int find(String needle, String haystack) {
String escapedNeedle = Pattern.quote(needle);
Pattern pattern = Pattern.compile(escapedNeedle);
Matcher matcher = pattern.matcher(haystack);
if (matcher.find()) {
return matcher.start();
} else {
return -1;
}
}
The concrete implementation would probably look different, but the main question is: Why should one use the slow Matcher.find
method for text search:
The main reason is that java.util.Matcher.find
does not execute in regex mode if given a fixed string as argument. This method first checks if the given argument is free from regex operators and if so it uses a Boyer-Moore algorithm to process the search. The performance is quite ok - even for longer texts and patterns.
More about String Search …
Generally the Java Standard API covers only a small part of the subject of string search. There are may algorithms, each optimized for other purpose, e.g. in the dimensions of:
- the size of the alphabet (smaller alphabets vs larger alphabets)
- the length of the pattern (shorter patters vs longer patterns)
- the length of the texts (short texts vs long texts)
I will probably discuss some algorithms in my blog in future.
… and Multi-String Search
Besides there exists another related problem, that is not solved by the java API - the Multi-String Search. Instead of searching only one pattern in the text, here multiple patterns are searched - with the expectation that it performs more efficient than multiple single pattern searches in sequence. I wished that calling Matcher.find
with the expression (pattern1|pattern2|...|patternN)
would be optimized in a way as with pattern1
, but unfortunately the regex feature of java does optimize only constant string search but not the search for finite languages (regular languages with a finite number of words).
Libraries for String Search
I suppose that string search is a subject that is usually solved by implementing an algorithm from scratch. The few projects that provide string searching features do not resemble the maturity of common popular open source projects - some can be considered at most experimental. To rate such libraries I have developed the benchmark library StringBench with following results:Features | Algorithms | Benchmark passed? | |
Java-API |
|
|
Yes |
ByteSeek |
|
|
Yes |
StringSearch |
|
|
No |
AhoCorasick |
|
|
No |
StringSearchAlgorithms |
|
|
Yes |
It is obvious that many libraries do not pass the benchmark. Most of the failing libraries fail with timeouts (search takes longer than a minute). This is not acceptable with look on the naive algorithm of the Java API, that passes the benchmark (taking a few seconds). Generally these libraries do their job fine, but probably the ordering/post processing of matches is not done efficiently (feeling like quadratic effort related to the number of matches). This performance problem will hopefully be changed by updates in future.