Efficient Text Searching in Java: Finding the Right String in Any Language
By Laura Werner2003-05-25
Text Searching in JDK 1.1
Now that you're familiar with the concepts behind collators and collation elements, we can put some of that knowledge to use. The same collation elements that
RuleBasedCollator uses to perform string comparisons can be used to do string searching as well. The basic concept is quite simple: instead of searching through the characters in the string, we'll search through its collation elements.Fast string-searching algorithms such as KMP and Boyer-Moore require the ability to back up or perform random access in the text being searched. Unfortunately, you can't do this with international text in JDK 1.1, because CollationElementIterator does not allow random access. It only has a next method and is lacking setOffset and previous . This means that Boyer-Moore searching cannot be implemented without a complicated buffering scheme that is very tricky to get right.
However, a traditional, brute-force string search is quite possible using the JDK 1.1 API. Essentially this comes down to comparing the search pattern against each individual character position in the text being searched. If the collation elements for the search pattern match the collation elements for that substring of text being searched, we've found a match. The outer loop looks like this:
|
Of course, I left out the hard part, the function match that decides whether two sequences of collation elements are equivalent. A simple, naive implementation would loop through both iterators and ensure that the elements they return are the same:
|
This will work, but only if you want to treat any difference, be it primary, secondary, or tertiary, as significant. In most applications, that is not enough; users will want an "Ignore Case" option, and possibly an "Ignore Case and Accents" option as well. Fortunately, this is not very hard to do. Collator provides the constants PRIMARY , SECONDARY , and TERTIARY that you can use to represent the level of comparison you want, and CollationElementIterator provides methods to break down a collation order into its three components.
All we need to do is create a variable, e.g. weight , that stores the desired level of comparison. If weight is PRIMARY , we check only the primary component of each collation element. If weight is SECONDARY , we check both the primary and secondary components, and if it is TERTIARY we check all three. Fortunately, these values of these constants are in ascending numerical order, so we can use simple comparisons such as: "if (weight > Collator.PRIMARY) ... "
To add this extra functionality we have to modify our match function a bit, but it is still fairly simple. Since the documentation for CollationElementIterator promises that "the first 16 bits of a collation order is its primary order; the next 8 bits is the secondary order and the last 8 bits is the tertiary order," we can simply mask away the portions of the collation element that we're not interested in.
|
Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio
First published by IBM DeveloperWorks
