Helping ordinary people create extraordinary websites!
HOME TUTORIALS SCRIPTS WEB HOSTING BLOG FORUM
Get Our Newsletter
Email:

Efficient Text Searching in Java: Finding the Right String in Any Language

By Laura Werner
2003-05-25


But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char

I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applications. It is not that hard, and the benefits can be enormous, because global markets are becoming increasingly more important to the computer industry. For example, according to IBM's first quarter 1998 financial statement over half of IBM's revenue came from outside North America. Using the international features of the JDK can help you begin to tap into this huge market.

The code examples in this article were intended primarily for their educational value, but they do work. However, for clarity I have ignored a few remaining issues to make the code easier to understand. Expanding ("ä" to "ae") characters in the pattern are the chief difficulty. If the shorter, non-expanded version of the character occurs in the text being searched, you can end up shifting too far and missing a possible match. However, it is not too hard to compensate for this by using the getMaxExpansion method (also new in JDK 1.2) of CollationElementIterator to decrease the shift distances when expanded characters are seen in the pattern.

The other major feature I have left out is the ability to tell how much of the text matched the pattern. All of the code examples search for location in the text where a match starts, but they do not return the length of the text that matched. You would need to know this if you were writing a search and replace function in an editor, for example. In JDK 1.2, it is easy to tell where the match ends: just call the iterator's getOffset method. With the JDK 1.1 API, it is harder; you basically have to resort to the brute-force technique of comparing longer and longer substrings of the text to the pattern and stopping when the collator says that they are equal.

If you want to see a real, production-quality package that solves these last few problems, visit www.alphaworks.ibm.com and download a trial copy of StringSearch, our fully-functional text searching class based on the JDK collation API. It supports case- and accent-insensitive searches, backward searches, and "whole word" searching, among other features. alphaWorks also contains several other Java internationalization utilities that you might find useful, as well as a large number of JavaBeans, utility classes, and even a free XML parser written in Java.



Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio


First published by IBM DeveloperWorks


 | Bookmark
Related Tutorials:
» All about JAXP, Part 1
» Make Database Queries Without the Database
» Load List Values for Improved Efficiency
» 2 Ways To Implement Session Tracking
» A Simple Way to Read an XML File in Java
» Develop Aspect-Oriented Java Applications with Eclipse and AJDT