Helping ordinary people create extraordinary websites!
HOME TUTORIALS SCRIPTS WEB HOSTING BLOG FORUM
Get Our Newsletter
Email:

Efficient Text Searching in Java: Finding the Right String in Any Language

By Laura Werner
2003-05-25


Under the Hood

Internally, RuleBasedCollator.compare has an awful lot of bookkeeping to do. A byte-by-byte string comparison function like C's strcmp can walk though strings one character at a time and compare the bytes, but if Collator did something that simple it would rapidly get out of sync the first time it saw a contracting character like the Spanish "ch" or an expanding character like "Æ".

To keep track of this, RuleBasedCollator first translates strings into a series of collation elements, which correspond to single entities in the input string. In English, each character in the input maps to a collation element, but "Æ" produces two elements and the Spanish "ch" produces just one. This translation is done by the utility class CollationElementIterator, which uses a set of mapping tables built from the rules passed to the collator's constructor.

CollationElementIterator is a public class, and you can use it yourself to do searches, as we will see below. As an introduction, let's use it to iterate over the elements for a simple string:



RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator iter = c.getCollationElementIterator("Foo");

int element;
while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
System.out.println("Collation element is: " +
Integer.toString(e,16) );
}

As you can see from the above example, a collation element is a fairly simple creature; it's an int that describes where a character or group of characters falls in the sorting sequence. Higher-numbered elements are sorted after lower-numbered ones.

Of course, it's really a bit more complicated than that. Each collation element can be broken down into three components (also known as weights or orders): primary, secondary, and tertiary. The primary component of a collation element corresponds to which base letter the order represents, so "A" and "B" will have different primary weights. The secondary components typically correspond to accents, so "á" and "é" will have the same secondary weight, which is different from the secondary weight of "a". The tertiary components usually represent the case of a character, so "a" and "A" will have different tertiary weighting. (There is a fourth level, the normalized original string itself, which can be used for a final comparison, but you usually don't need to worry about this level.)

Character

Primary

Secondary

Tertiary

a

1

0

0

á

1

1

0

A

1

0

1

...

B

2

0

1

...

é

5

1

0

The above table uses simple, made-up numbers to illustrate the components of a collation element. In practice, the numbers are usually larger, since most collators have many more possible elements. To see what real collation orders look like, you can modify the last code example as follows. The following code will print out the collation elements for the string "Foo":



RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator iter = c.getCollationElementIterator("Foo");

int element;
while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
System.out.println("Collation element is: " + element);
System.out.println(" primary: " + Integer.toString(
CollationElementIterator.primaryOrder(element), 16) );
System.out.println(" secondary: " + Integer.toString(
CollationElementIterator.secondaryOrder(element), 16) );
System.out.println(" tertiary: " + Integer.toString(
CollationElementIterator.tertiaryOrder(element), 16) );
}

While this notion of three ordering levels seems complicated at first, it actually makes some tasks easier. If you want to do a case-insensitive comparison, you simply ignore the tertiary component of each collation element. If you also don't want to include accents, you can ignore the secondary component too. Collator handles this by allowing you to set different strength levels using the setStrength method and constants such as Collator.PRIMARY .



Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY); // Ignore case and accents
if (c.compare(string1, string2) == 0) {
... // Strings matched
}


Tutorial Pages:
» Finding the right string in any language
» Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly
» Under the Hood
» Text Searching in JDK 1.1
» Ignore That Character!
» It's Better in 1.2
» Optimized Searching
» Boyer-Moore and Unicode
» But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible char
» I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applicatio


First published by IBM DeveloperWorks


 | Bookmark
Related Tutorials:
» All about JAXP, Part 1
» Make Database Queries Without the Database
» Load List Values for Improved Efficiency
» 2 Ways To Implement Session Tracking
» A Simple Way to Read an XML File in Java
» Develop Aspect-Oriented Java Applications with Eclipse and AJDT