Helping ordinary people create extraordinary websites!
HOME TUTORIALS SCRIPTS WEB HOSTING BLOG FORUM
Get Our Newsletter
Email:

Efficient Text Searching in Java: Finding the Right String in Any Language

By Laura Werner
2003-05-25


Optimized Searching

We've solved the easier of our two efficiency problems; now it's time for the hard one. As I explained above, the brute-force algorithm we're using is O(TP). String searching is a well-researched area, and there are algorithms that can do considerably better. Perhaps the best is the Boyer-Moore method, which is never worse than O(T+P) and in practice is often proportional to T/P. That's right: the size of the text divided by the size of the pattern. Rather than forcing us to examine characters in the text multiple times, this algorithm actually lets us skip characters.

Boyer-Moore can be a little bit tricky to explain, but once you "get" it, it seems almost too obvious. The trick is that instead of comparing the strings starting at the beginning of the pattern, you compare them starting at the end. If the characters don't match, we've still gleaned a bit of useful information: we now know what character occurs at that position in the text being searched. Often, we can take advantage of that information to skip several characters in the target text rather than simply sliding the pattern along by one position and trying again.

An example will make this more clear. Imagine that you're searching for the word "string" inside the phrase "silly_spring_string". To start off, you line up the beginning of the pattern with the beginning of the target, but you start comparing at the end, like so:



silly_spring_string
string

We compare the g in the pattern with a space character in the target, and there's no match. So far, there's nothing special. However, we know something else as well: the character at index 5 in the target is a space, and there are no space characters anywhere in the pattern we're searching for. Combining these two facts, we can slide the pattern over by six characters and try again:



silly_sp
ring_string
st
ring

This time, there is a match between the pattern and the text. Since we're going backwards, we now compare the n , i, and r and find that they match too. However, the p and the t do not. We know that there is not a p anywhere in the pattern, so we can slide it over again:



silly_spring_ string
string


silly_spring_string


                     
string

To implement this efficiently, you need to have a table that, for each possible character in the text, tells you how far from the end of the pattern that character first occurs. This is the distance by which the pattern can be shifted when that particular character is seen in the input. For the above example, the table would look like this:

Character:stringothers
Shift:5432106

This table can be computed once, before the search is started, by making a single pass through the pattern. After that, it can be used each time we search for that pattern, leading to a huge performance win.



Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio


First published by IBM DeveloperWorks


 | Bookmark
Related Tutorials:
» All about JAXP, Part 1
» Make Database Queries Without the Database
» Load List Values for Improved Efficiency
» 2 Ways To Implement Session Tracking
» A Simple Way to Read an XML File in Java
» Develop Aspect-Oriented Java Applications with Eclipse and AJDT