Helping ordinary people create extraordinary websites!
HOME TUTORIALS SCRIPTS WEB HOSTING BLOG FORUM
Get Our Newsletter
Email:

Efficient Text Searching in Java: Finding the Right String in Any Language

By Laura Werner
2003-05-25


It's Better in 1.2

In JDK 1.2, we have made quite a few improvements to the international classes in java.text and java.util . Among them are enhancements to CollationElementIterator that make it possible to write faster and more powerful search routines.

There are two major problems with the searching mechanism outlined above. First, it uses an inefficient algorithm that can, at worst, compare every character of the pattern against every character of the target, requiring a number of comparisons that is proportional to the size of the text being searched multiplied by the size of the pattern. (In practice, it's usually not quite that bad, however.) In computer science terms, if the size of the text is T and the size of the pattern is P, the search time is proportional to T·P, or is O(TP). Modern searching algorithms can do much better.

The second, and more obvious problem is that there is an awful lot of object creation going on in the last few examples. Every time through the outer loop we call substring , which creates a new String object, and then we create a new CollationElementIterator. This happens at every single position in the target string, which is woefully inefficient given the cost of object creation in Java (and in most other languages, for that matter).

This second problem is solved by two new CollationElementIterator methods that we have added in JDK 1.2:



public void setOffset(int newOffset)
public int getOffset()

These methods allow you to change the text position to which an iterator points and to retrieve its current position. With this flexibility, we can avoid all of the calls to substring and all of the extra iterator objects that we were creating before. The outer searching loop now looks like this:



String pat = "for"; // What to search for
String text = "Now is the time for all good men"; // Text to search in

RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator patIter = c.getCollationElementIterator(pat);
CollationElementIterator targIter = c.getCollationElementIterator(text);

for (int i = 0; i < text.length(); i++) {
targIter.setOffset(i);
patIter.reset(); // Or setOffset(0)
if (match(patIter, targIter)) {
// They matched! Do something.
}
}

This will be much faster, because we're no longer creating new objects each time through the loop. The algorithm is still O(TP), but the overhead per iteration is considerably lower, so the running time will a lot better.



Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio


First published by IBM DeveloperWorks


 | Bookmark
Related Tutorials:
» All about JAXP, Part 1
» Make Database Queries Without the Database
» Load List Values for Improved Efficiency
» 2 Ways To Implement Session Tracking
» A Simple Way to Read an XML File in Java
» Develop Aspect-Oriented Java Applications with Eclipse and AJDT