Efficient Text Searching in Java: Finding the Right String in Any Language
By Laura Werner2003-05-25
Boyer-Moore and Unicode
But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible character values, which would make the table too large. Actually, it's worse, because we're concerned with collation elements, which are 32-bit integers, not with the Unicode values themselves. That's true, but (of course) there's another trick...
First, consider what happens when a letter occurs twice in the search pattern. There are two possible shift distances for that letter, one for each occurrence. To make the Boyer-Moore algorithm work, we always want to enter the smaller of the two shift distances in the table. If we used the larger one, we might shift the pattern too far and miss a match. In a sense, the shift table is not required to be perfectly accurate, and conservative estimates of shift distances are OK. As long as we don't shift the pattern too far, we're fine.
This realization leads to a simple technique for applying the algorithm to Java collation elements: simply map all possible elements down to a much smaller set of shift table indices (say, 256). If two or more elements in your pattern happen to collide and end up with the same index, it's not a problem as long as you enter the smaller of the shift distances in the table.
A simple way to map the collation elements to 256 values is to use the low byte of the element's primary component. There are other approaches that will lead to a better distribution of values throughout the shift table and will thus give slightly better performance, but in practice this approach is usually good enough.
To implement Boyer-Moore searching with JDK 1.2, we first need to construct a shift table that tells us how far to shift the pattern when a particular collation element is seen in the text. The hash function is used to map from a 32-bit collation order down to an index in the 256-element array.
|
Once we have the tables, the search routine is straightforward. It uses another new JDK 1.2 method: CollationElementIterator.previous . Also note that there is no longer an outer loop that calls a separate match method, since that only worked well when we were marching through the text one character at a time. Now that we can skip ahead an arbitrary distance through the text, it is easier to combine all of the logic into one method.
|
There you have it: a way to do fast, linear-time, international-friendly string searching in Java.
Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio
First published by IBM DeveloperWorks
