Efficient Text Searching in Java: Finding the Right String in Any Language
By Laura Werner2003-05-25
Ignore That Character!
As you've probably guessed, I left something out again. There's one last complication: ignorable characters. In Unicode, an accented character can be represented in two different ways. The single Unicode character \u00e1 represents "á", but the pair of Unicode values \u0061\u0301 also represents "á". The \u0061 is just a lowercase "a", but the \u0301 is special. It's a "combining acute accent" that combines with the value before it to create the single "user-level" character "á".
These combining characters need special processing during a comparison. Since \u0301 is only an accent, its collation element has a secondary component but no primary or tertiary component. In a comparison that does not involve accents, we must ignore this element entirely. If we did not, we might end up comparing an accent in one string to a base letter in another string, which would give invalid results. For example, when doing an accent-insensitive comparison "a\u0301b" and "ab", we want to skip the "\u0301" and go on to the next character; otherwise we'd compare "\u0301" and "b".
This logic is relatively straightforward, but it does make the code a bit more complicated. The boolean variables getTarget and getPattern are used to decide whether to fetch the next collation element in the text and the pattern each time through the loop. Normally both variables are true, but one or the other of them can be set to false if we want to skip an element. For example, setting getPattern to false and getTarget to true causes the current pattern element to be re-used and compared with the next text element, thus skipping the current text element.
|
This is about the best you can do with the JDK 1.1 API. You can add bells and whistles, such as searching backward though text by reversing the order of the outer loop that calls match , but you can't really implement a more efficient search without a lot of work.
Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio
First published by IBM DeveloperWorks
