Efficient Text Searching in Java: Finding the Right String in Any Language
By Laura Werner2003-05-25
Under the Hood
Internally,
RuleBasedCollator.compare has an awful lot of bookkeeping to do. A byte-by-byte string comparison function like C's strcmp can walk though strings one character at a time and compare the bytes, but if Collator did something that simple it would rapidly get out of sync the first time it saw a contracting character like the Spanish "ch" or an expanding character like "Æ".To keep track of this, RuleBasedCollator first translates strings into a series of collation elements, which correspond to single entities in the input string. In English, each character in the input maps to a collation element, but "Æ" produces two elements and the Spanish "ch" produces just one. This translation is done by the utility class CollationElementIterator, which uses a set of mapping tables built from the rules passed to the collator's constructor.
CollationElementIterator is a public class, and you can use it yourself to do searches, as we will see below. As an introduction, let's use it to iterate over the elements for a simple string:
|
As you can see from the above example, a collation element is a fairly simple creature; it's an int that describes where a character or group of characters falls in the sorting sequence. Higher-numbered elements are sorted after lower-numbered ones.
Of course, it's really a bit more complicated than that. Each collation element can be broken down into three components (also known as weights or orders): primary, secondary, and tertiary. The primary component of a collation element corresponds to which base letter the order represents, so "A" and "B" will have different primary weights. The secondary components typically correspond to accents, so "á" and "é" will have the same secondary weight, which is different from the secondary weight of "a". The tertiary components usually represent the case of a character, so "a" and "A" will have different tertiary weighting. (There is a fourth level, the normalized original string itself, which can be used for a final comparison, but you usually don't need to worry about this level.)
Character | Primary | Secondary | Tertiary |
a | 1 | 0 | 0 |
á | 1 | 1 | 0 |
A | 1 | 0 | 1 |
... | |||
B | 2 | 0 | 1 |
... | |||
é | 5 | 1 | 0 |
The above table uses simple, made-up numbers to illustrate the components of a collation element. In practice, the numbers are usually larger, since most collators have many more possible elements. To see what real collation orders look like, you can modify the last code example as follows. The following code will print out the collation elements for the string "Foo":
|
While this notion of three ordering levels seems complicated at first, it actually makes some tasks easier. If you want to do a case-insensitive comparison, you simply ignore the tertiary component of each collation element. If you also don't want to include accents, you can ignore the secondary component too. Collator handles this by allowing you to set different strength levels using the setStrength method and constants such as Collator.PRIMARY .
|
Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio
First published by IBM DeveloperWorks
