Helping ordinary people create extraordinary websites!
HOME TUTORIALS SCRIPTS WEB HOSTING BLOG FORUM
Get Our Newsletter
Email:

Efficient Text Searching in Java: Finding the Right String in Any Language

By Laura Werner
2003-05-25


Text Searching in JDK 1.1

Now that you're familiar with the concepts behind collators and collation elements, we can put some of that knowledge to use. The same collation elements that RuleBasedCollator uses to perform string comparisons can be used to do string searching as well. The basic concept is quite simple: instead of searching through the characters in the string, we'll search through its collation elements.

Fast string-searching algorithms such as KMP and Boyer-Moore require the ability to back up or perform random access in the text being searched. Unfortunately, you can't do this with international text in JDK 1.1, because CollationElementIterator does not allow random access. It only has a next method and is lacking setOffset and previous . This means that Boyer-Moore searching cannot be implemented without a complicated buffering scheme that is very tricky to get right.

However, a traditional, brute-force string search is quite possible using the JDK 1.1 API. Essentially this comes down to comparing the search pattern against each individual character position in the text being searched. If the collation elements for the search pattern match the collation elements for that substring of text being searched, we've found a match. The outer loop looks like this:



String pattern = "for"; // What to search for
String text = "Now is the time for all good men"; // Text being searched

RuleBasedCollator c = (RuleBasedCollator)collator.getInstance();
CollationElementIterator patIter =
c.getCollationElementIterator(pattern);

for (int i = 0; i < text.length(); i++) {
String substr = text.substring(i);
CollationElementIterator textIter =
c.getCollationElementIterator(substr);
patIter.reset();
if (match(patIter, textIter)) {
// They matched! Do something.
}
}

Of course, I left out the hard part, the function match that decides whether two sequences of collation elements are equivalent. A simple, naive implementation would loop through both iterators and ensure that the elements they return are the same:



boolean match(CollationElementIterator text,
CollationElementIterator pattern)
{
while (true) {
int patternElement = pattern.next();
int targetElement = text.next();
if (patternElement = CollationElementIterator.NULLORDER) {
break; // End of the pattern
} else if (patternElement != targetElement) {
return false; // Mismatch
}
}
return true; // No mismatches
}

This will work, but only if you want to treat any difference, be it primary, secondary, or tertiary, as significant. In most applications, that is not enough; users will want an "Ignore Case" option, and possibly an "Ignore Case and Accents" option as well. Fortunately, this is not very hard to do. Collator provides the constants PRIMARY , SECONDARY , and TERTIARY that you can use to represent the level of comparison you want, and CollationElementIterator provides methods to break down a collation order into its three components.

All we need to do is create a variable, e.g. weight , that stores the desired level of comparison. If weight is PRIMARY , we check only the primary component of each collation element. If weight is SECONDARY , we check both the primary and secondary components, and if it is TERTIARY we check all three. Fortunately, these values of these constants are in ascending numerical order, so we can use simple comparisons such as: "if (weight > Collator.PRIMARY) ... "

To add this extra functionality we have to modify our match function a bit, but it is still fairly simple. Since the documentation for CollationElementIterator promises that "the first 16 bits of a collation order is its primary order; the next 8 bits is the secondary order and the last 8 bits is the tertiary order," we can simply mask away the portions of the collation element that we're not interested in.



// Return a mask for the part of the order we're interested in
static final int getMask(int weight) {
switch (weight) {
case Collator.PRIMARY:
return 0xFFFF0000;
case Collator.SECONDARY:
return 0xFFFFFF00;
default:
return 0xFFFFFFFF;
}
}

boolean match(CollationElementIterator text,
CollationElementIterator pattern)

{
int mask = getMask(weight);
int done = CollationElementIterator.NULLORDER &amp; mask;

while (true) {
int patternElement = pattern.next() &amp; mask;
int targetElement = text.next() &amp; mask;

if (patternElement == done) {
break; // End of pattern
} else if (patternElement != targetElement) {
return false; // Mismatch
}
}
return true; // No mismatches
}


Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio


First published by IBM DeveloperWorks


 | Bookmark
Related Tutorials:
» All about JAXP, Part 1
» Make Database Queries Without the Database
» Load List Values for Improved Efficiency
» 2 Ways To Implement Session Tracking
» A Simple Way to Read an XML File in Java
» Develop Aspect-Oriented Java Applications with Eclipse and AJDT