Helping ordinary people create extraordinary websites!
HOME TUTORIALS SCRIPTS WEB HOSTING BLOG FORUM
Get Our Newsletter
Email:

Efficient Text Searching in Java: Finding the Right String in Any Language

By Laura Werner
2003-05-25


Boyer-Moore and Unicode

But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible character values, which would make the table too large. Actually, it's worse, because we're concerned with collation elements, which are 32-bit integers, not with the Unicode values themselves. That's true, but (of course) there's another trick...

First, consider what happens when a letter occurs twice in the search pattern. There are two possible shift distances for that letter, one for each occurrence. To make the Boyer-Moore algorithm work, we always want to enter the smaller of the two shift distances in the table. If we used the larger one, we might shift the pattern too far and miss a match. In a sense, the shift table is not required to be perfectly accurate, and conservative estimates of shift distances are OK. As long as we don't shift the pattern too far, we're fine.

This realization leads to a simple technique for applying the algorithm to Java collation elements: simply map all possible elements down to a much smaller set of shift table indices (say, 256). If two or more elements in your pattern happen to collide and end up with the same index, it's not a problem as long as you enter the smaller of the shift distances in the table.

A simple way to map the collation elements to 256 values is to use the low byte of the element's primary component. There are other approaches that will lead to a better distribution of values throughout the shift table and will thus give slightly better performance, but in practice this approach is usually good enough.

To implement Boyer-Moore searching with JDK 1.2, we first need to construct a shift table that tells us how far to shift the pattern when a particular collation element is seen in the text. The hash function is used to map from a 32-bit collation order down to an index in the 256-element array.



// Member variables for storing precomputed pattern data
private int patLen;
private int[] patElem;
private int[] shifts;

// Map a collation element to an array index
int hash(int order) {
return CollationElementIterator.primaryOrder(order) % 256;
}

// Initialize the Boyer-Moore shift tables
void initialize(RuleBasedCollator c, String pat)
{
// First find out how many elements we're dealing with
patLen = 0;
CollationElementIterator iter = c.getCollationElementIterator(pat);
while (iter.next() != CollationElementIterator.NULLORDER)
patLen++;

// Allocate space to store the pattern elements and the shift tables
patElem = new int[patLen];
shifts = new int[256];

// Elements not found in the pattern get the maximum shift distance
for (int i = 0; i < 256; i++) {
shifts[i] = patLen;
}

// Now compute the shift distances for the elements in the pattern.
// While we're at it, save the elements themselves for quick access.
// The "-1" is in the calculation because Java indices are 0-based.
iter.reset();
for (int i = 0; i < patLen; i++) {
patElem[i] = iter.next();
int index = hash(patElem[i]);
shifts[index] = Math.min(shifts[index], patLen - i - 1);
}
}

Once we have the tables, the search routine is straightforward. It uses another new JDK 1.2 method: CollationElementIterator.previous . Also note that there is no longer an outer loop that calls a separate match method, since that only worked well when we were marching through the text one character at a time. Now that we can skip ahead an arbitrary distance through the text, it is easier to combine all of the logic into one method.



public int find(String text, String pattern)
{
RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator targIter =
coll.getCollationElementIterator(text);

// build the shift table and the constants we need
initialize(coll, pattern);
int mask = getMask(weight);
int done = CollationElementIterator.NULLORDER &amp; mask;

// Start at the text position corresponding to the end of the pattern
int textIndex = pattern.length();

while (textIndex <= text.length()) {
boolean getPattern = true, getTarget = true;
int targetElement=0, patternElement=0;

iter.setOffset(textIndex);
int patIndex = pattern.length();

// Iterate backward until we hit the beginning of the pattern
while (patIndex > 0)
{
if (getTarget) targetElement = targIter.previous() &amp; mask;
if (getPattern) patternElement = patElem[--patIndex] &amp; mask;
getTarget = getPattern = true;

if (targetElement == 0) {
getPattern = false; // skip targetElement
} else if (patternElement == 0) {
getTarget = false; // skip patternElement
} else if (targetElement != patternElement) {
// There is a mismatch at this position. Decide how far
// over to shift the pattern, then try again.
textIndex = iter.getOffset() +
shifts[hash(targetElement)];
break;
}
}
if (patIndex == 0) {
// We made it back to the beginning of the pattern,
// which means we matched it all. Return the location.
return targIter.getOffset();
}
// Otherwise, we're here because of a mismatch, so keep going....
}
return -1; // No match.
}

There you have it: a way to do fast, linear-time, international-friendly string searching in Java.



Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio


First published by IBM DeveloperWorks


 | Bookmark
Related Tutorials:
» All about JAXP, Part 1
» Make Database Queries Without the Database
» Load List Values for Improved Efficiency
» 2 Ways To Implement Session Tracking
» A Simple Way to Read an XML File in Java
» Develop Aspect-Oriented Java Applications with Eclipse and AJDT