Retrieve Syndicated Content, Transform It, & Display The Result
In this article, Nick shows you how to retrieve syndicated content and convert it into headlines for your site. Since no official format for such feeds exists, aggregators are often faced with the difficulty of supporting multiple formats, so Nick also explains how to use XSL transformations to more easily deal with multiple syndication file formats.
With the popularization of weblogging, information overload is worse than ever. Readers now have more sites than ever to keep up with, and visiting all of them on a regular basis is next to impossible. Part of the problem can be solved through the syndication of content, in which a site makes its headlines and basic information available in a separate feed. Today, most of these feeds use an XML format called RSS, though there are variations in its use and even a potential competing format.
This article explains how to use Java technology to retrieve the content of a syndicated feed, determine its type, and then transform it into HTML and display it on a Web site. This process involves five steps:
1. Retrieve the XML feed
2. Analyze the feed
3. Determine the proper transformation
4. Perform the transformation
5. Display the result
This article chronicles the creation of a JavaServer Pages (JSP) page that retrieves a remote feed, transforms it using a Java bean and XSLT, and incorporates the transformed information into its output. The concepts, however, apply to virtually any Web environment.
The Source File
Depending on whom you ask, RSS stands for RDF Site Summary, Rich Site Summary, or other acronyms that are less tactful. In any case, no fewer than four versions of RSS are in common usage, from the fairly simple 0.91, which doesn’t include namespaces and imposes some strict limits on content, to version 2.0, which encompasses versions back to 0.91 (so a valid 0.91 file is also a valid 2.0 file) but also allows the use of namespaces. By allowing namespaces, version 2.0 makes it possible for a syndicator to add elements to the feed, as long as they’re in a different namespace. Some syndicators use this capability to add information using the Resource Description Framework (RDF).
A simple RSS 2.0 file might look like this feed from Adam Curry’s weblog (see Resources):
Listing 1. A sample RSS 2.0 message
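The original listing is not reproduced here. As a stand-in, the following is a minimal, hypothetical RSS 2.0 document held in a Java string and parsed to confirm that it is well formed. It is not the actual weblog feed: every title and URL is a placeholder, and the class name SampleFeed is invented for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class SampleFeed {
    // Hypothetical stand-in feed: all titles and URLs are placeholders.
    static final String RSS_20 =
        "<?xml version=\"1.0\"?>\n"
      + "<rss version=\"2.0\">\n"
      + "  <channel>\n"
      + "    <title>Example Weblog</title>\n"
      + "    <link>http://example.com/</link>\n"
      + "    <description>Placeholder channel description</description>\n"
      + "    <item>\n"
      + "      <title>First post</title>\n"
      + "      <link>http://example.com/1</link>\n"
      + "      <description>Text with &lt;a href=\"http://example.com/\"&gt;encoded markup&lt;/a&gt;</description>\n"
      + "    </item>\n"
      + "    <item>\n"
      + "      <link>http://example.com/2</link>\n"
      + "      <description>An item without a title</description>\n"
      + "    </item>\n"
      + "  </channel>\n"
      + "</rss>\n";

    // Parses the feed text into a DOM Document (namespace-aware, so that
    // namespaced extension elements can be told apart later).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }
}
```

Note that the root element carries the version attribute, that the title of an item is optional, and that any markup inside a description is escaped rather than embedded as elements.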
To turn this feed into HTML, you can process it using XSL transformations.
The Primary Stylesheet
The ultimate goal is to generate HTML text that shows the information in an organized way, such as a list of links, included in the body of another page of information. The actual HTML output would be something like:
To create this HTML out of the XML, you’ll need an XSLT stylesheet:
The actual form of the page is entirely up to you, as is the data that you choose to include. In this case, you’re simply creating a bulleted list of entries, with a title (if there is one) that links back to the original post and the description for each post.
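As a rough illustration of such a stylesheet, here is a minimal sketch that embeds one in a Java string and applies it through the standard javax.xml.transform API. The stylesheet is an assumption about what final.xsl could contain, not the article's original listing, and the class and method names (HeadlineStylesheet, toHtml) are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class HeadlineStylesheet {
    // Assumed contents for final.xsl: a channel heading plus a bulleted list.
    static final String FINAL_XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/\"><xsl:apply-templates select=\"rss/channel\"/></xsl:template>"
      + "<xsl:template match=\"channel\">"
      + "<h3><xsl:value-of select=\"title\"/></h3>"
      + "<ul><xsl:apply-templates select=\"item\"/></ul>"
      + "</xsl:template>"
      + "<xsl:template match=\"item\">"
      + "<li>"
      // The linked title is emitted only when the item has a title element.
      + "<xsl:if test=\"title\">"
      + "<a href=\"{link}\"><xsl:value-of select=\"title\"/></a><xsl:text>: </xsl:text>"
      + "</xsl:if>"
      + "<xsl:value-of select=\"description\"/>"
      + "</li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the feed text and returns the HTML fragment.
    static String toHtml(String feedXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(FINAL_XSL)));
        StringWriter html = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(feedXml)), new StreamResult(html));
        return html.toString();
    }
}
```

In this sketch, markup that a feed has escaped inside its description appears as literal text in the output. Whether to pass such markup through unescaped is a design decision that depends on how much you trust the feed.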
To actually perform the transformation, you need to create a JSP page.
The Basic JSP Page
Any number of ways of transforming XML data exist. In this article, I’ll show you how to create a JSP page that passes a feed to a Java bean for transformation. That bean creates a static file, and the JSP page incorporates it into the body of the page. (The reason for the static file will become clearer in the caching section below.)
The page itself is fairly straightforward:
Here you’re simply creating an instance of the RSSProcessor class. Because you’ve included it in the useBean element, the setRSSFile() method executes when the object is created. This method creates the headlines.html page that the JSP page then incorporates into the output.
Next, create the bean to do the transformation.
Transforming The File
The Java bean is nothing more than a Java class that has get and set methods. In this case, the set method, setRSSFile(), also includes code that performs a transformation on that file:
This method simply takes an input source, which happens to be a remote RSS feed, and transforms it, using the final.xsl stylesheet, to the headlines.html file.
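A bean along these lines might look like the following minimal sketch. The names RSSProcessor, setRSSFile(), final.xsl, and headlines.html come from the article. The stylesheet and output properties, their setters, and the error handling are assumptions added so that the file locations can be changed.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Declare this class public, in its own file, when deploying it as a JSP bean.
class RSSProcessor {
    private String rssFile;
    // Assumed defaults: the article names these files but not their directories.
    private String stylesheet = "final.xsl";
    private String output = "headlines.html";

    public String getRSSFile() { return rssFile; }
    public void setStylesheet(String stylesheet) { this.stylesheet = stylesheet; }
    public void setOutput(String output) { this.output = output; }

    // Setting the property triggers the transformation: the feed at the given
    // URL is run through the stylesheet and written to the static output file.
    public void setRSSFile(String rssFile) {
        this.rssFile = rssFile;
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File(stylesheet)));
            transformer.transform(new StreamSource(rssFile), new StreamResult(new File(output)));
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not transform " + rssFile, e);
        }
    }
}
```

Because StreamSource accepts a system identifier, the same method works for a remote http: URL or a local file: URL, which is convenient for testing without a network connection.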
In the grand scheme of things, that’s it: Retrieve the file, transform it, and display the results. In reality, there are other issues to consider.
Adjusting For Multiple Formats
If all RSS files were like this sample, you wouldn’t need to do anything else. Unfortunately, this is not the case. Different vendors and toolkits can produce additional information, or can replace core information with RDF information or other namespaced modules, leading to complaints that supporting RSS is complex because of all the variations. But with the use of XSL transformations, it doesn’t have to be that way.
For example, an RSS 2.0 feed might also contain RDF information, like this feed from Typographica:
Notice that this feed actually contains two different descriptions of the content. The first is in the description element, and the second is in the encoded element, which is part of the http://purl.org/rss/1.0/modules/content/ namespace. Here you see the difference in how different feeds handle information. Adam Curry’s blog simply encodes information such as links and drops them into the description element, whereas Typographica (or rather the toolkit that produces Typographica’s feed) provides a non-markup version in the description element and a full version in the encoded element using a CDATA construct.
Although it is preferable to create a custom presentation for each feed type in order to take advantage of any extra information, this is not always practical from an application development standpoint. But that doesn’t mean you have to give up. Instead, you can create a transformation that simply takes different feeds and converts them to a standard structure, which you can then feed to the final transformation.
For example, you can create a stylesheet that takes an RSS 2.0 feed and, if it finds an encoded element, uses it to replace any description element:
This stylesheet makes copies of the elements that the final stylesheet will need, such as the channel’s title and description, and makes a copy of the item with the appropriate description information.
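One way such an interim stylesheet could be written is sketched below, again embedded in a Java string so that it can be run directly. The stylesheet body is an assumption rather than the original listing. In particular, labeling the output as version 0.91 is a guess at the intended target structure, and the names InterimStylesheet and normalize are invented.

```java
import java.io.StringReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

final class InterimStylesheet {
    // Assumed contents for 2.0.xsl: keep what the final stylesheet needs and
    // prefer content:encoded over description when both are present.
    static final String V20_XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:content=\"http://purl.org/rss/1.0/modules/content/\""
      + " exclude-result-prefixes=\"content\">"
      + "<xsl:template match=\"/rss\">"
      + "<rss version=\"0.91\"><channel>"
      + "<xsl:copy-of select=\"channel/title | channel/link | channel/description\"/>"
      + "<xsl:apply-templates select=\"channel/item\"/>"
      + "</channel></rss>"
      + "</xsl:template>"
      + "<xsl:template match=\"item\">"
      + "<item>"
      + "<xsl:copy-of select=\"title | link\"/>"
      + "<description><xsl:choose>"
      + "<xsl:when test=\"content:encoded\"><xsl:value-of select=\"content:encoded\"/></xsl:when>"
      + "<xsl:otherwise><xsl:value-of select=\"description\"/></xsl:otherwise>"
      + "</xsl:choose></description>"
      + "</item>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Converts an RSS 2.0 feed into the simpler standard structure as a DOM tree.
    static Document normalize(String feedXml) throws Exception {
        DOMResult result = new DOMResult();
        TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(V20_XSL)))
            .transform(new StreamSource(new StringReader(feedXml)), result);
        return (Document) result.getNode();
    }
}
```

Items that carry only a description pass through unchanged, so the same stylesheet can be applied to RSS 2.0 feeds with or without the content module.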
Now you just have to weave that new document into the final transformation:
Take a look at this one step at a time. First of all, you’re creating an interim transformation that takes the initial feed and transforms it according to the interim stylesheet in Listing 7, named 2.0.xsl. The result of this first transformation goes not to a file, but to a DOM Document object, which then gets passed as the source for the second transformation.
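The two-stage process just described can be sketched with the standard DOMResult and DOMSource classes. This is a minimal illustration, not the original listing: the class name ChainedTransform and the method signature are assumptions, and the stylesheets are passed in as parameters rather than loaded from 2.0.xsl and final.xsl.

```java
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

final class ChainedTransform {
    // Runs the feed through the interim stylesheet into an in-memory DOM tree,
    // then feeds that tree to the final stylesheet and returns the output text.
    static String transform(Source feed, Source interimXsl, Source finalXsl)
            throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();

        // Stage 1: the result goes to a DOM node, not to a file.
        DOMResult interim = new DOMResult();
        factory.newTransformer(interimXsl).transform(feed, interim);

        // Stage 2: the DOM node becomes the source of the final transformation.
        StringWriter html = new StringWriter();
        factory.newTransformer(finalXsl)
            .transform(new DOMSource(interim.getNode()), new StreamResult(html));
        return html.toString();
    }
}
```

Keeping the intermediate document in memory avoids writing and re-parsing a temporary file between the two stages.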
The name of the interim stylesheet, 2.0.xsl, was deliberate. By naming it after the version, you can create a more flexible system.
Choosing A Version
As long as you’re allowing for different formats, you can actually create a system that checks for the feed version before processing it. After all, only RSS 1.0 and 2.0 feeds can have RDF elements, so there’s no need to process other feeds. But how can you tell what version to apply?
To solve this problem, you can load the actual feed, analyze it, and use the information to set the proper stylesheet.
In this case, you’re loading the feed and checking it for the RSS version, and then using the version number as the file name. The advantage here is that should a new version of RSS be released, you can extend the application by simply adding a new stylesheet. Notice that I’ve added a check for Echo, or Atom, or whatever RSS’s competitor might eventually be called, and that you can also adjust support for it as it changes by simply changing the echo.xsl stylesheet.
The advantage here is that this interim stylesheet is completely generic. A “2.0 to 0.91” stylesheet will work for anyone, anywhere, and you can make changes to the final output by simply editing final.xsl, whether you support one version or a hundred.
The final.xsl stylesheet is designed for a simple 0.91-style feed, so if you’re dealing with one, you’ll omit the stylesheet on the interim transformation. This creates an identity transform, in which the document is simply passed along as-is.
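The selection logic described in this section could be sketched as follows. Several details are assumptions rather than facts from the article: the class and method names, treating a root element named feed as the Echo/Atom case, and mapping an RDF root element (which RSS 1.0 uses in place of an rss element with a version attribute) to a file named 1.0.xsl.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class StylesheetChooser {
    // Returns the interim stylesheet file name for a feed, or null when the
    // feed is already in the simple 0.91 form and needs no interim stylesheet.
    static String stylesheetFor(Document feed) {
        Element root = feed.getDocumentElement();
        // getLocalName() is null for documents parsed without namespace support.
        String name = root.getLocalName() != null ? root.getLocalName() : root.getNodeName();
        if ("rss".equals(name)) {
            String version = root.getAttribute("version");
            return "0.91".equals(version) ? null : version + ".xsl";
        }
        if ("RDF".equals(name)) {
            return "1.0.xsl";   // assumed file name for RSS 1.0 feeds
        }
        if ("feed".equals(name)) {
            return "echo.xsl";  // assumed test for the Echo/Atom format
        }
        throw new IllegalArgumentException("Unrecognized feed root element: " + name);
    }

    // Builds the interim transformer. With no stylesheet, newTransformer()
    // returns an identity transformer that passes the document along as-is.
    static Transformer interimTransformer(Document feed, File stylesheetDir)
            throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        String sheet = stylesheetFor(feed);
        return sheet == null
            ? factory.newTransformer()
            : factory.newTransformer(new StreamSource(new File(stylesheetDir, sheet)));
    }
}
```

With this arrangement, supporting a new RSS version means dropping a new stylesheet file into the stylesheet directory. No Java code needs to change unless the new format uses a different root element.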
That takes care of the problem of multiple versions, but you have one more issue to deal with: many readers requesting the same feed.
Caching The Feed
This system would work fine on a personal server where you’re the only one accessing it, but in the real world, it would be impractical (and rude) to pull the feed every time someone wants to read it. Instead, you need to build the system with some sort of time delay, so if the feed’s been pulled recently, the existing headlines.html file is used.
To do that, you can take advantage of a Java application’s nature. A static variable that represents the last time the feed was pulled would be constant for all instances of the RSSProcessor class, so you can check the current time against it before actually pulling the feed:
The first time the server instantiates RSSProcessor, _LastUpdated gets initialized with the current date. At (essentially) the same time, the server executes the setRSSFile() method, and because the difference between the current time and the _LastUpdated time is zero, the transformation takes place.
The next time someone calls the page, a new instance of RSSProcessor is created, but because _LastUpdated is static, the new instance sees the existing value of _LastUpdated rather than initializing it. The interval is measured in minutes, with the difference between _LastUpdated and the current time measured in milliseconds. If the amount of time that has elapsed is less than the interval, nothing else happens. The headlines.html file isn’t updated, so the server uses the old one instead.
If, on the other hand, the interval has passed, _LastUpdated gets the current time, which is passed on to any subsequent RSSProcessor objects, and the bean pulls a new copy of the feed to transform.
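The interval check can be sketched as below. This is a variation on the approach described above, not a reproduction of it: instead of initializing _LastUpdated with the current date and relying on a zero difference for the first request, the sketch uses -1 to mean that the feed has never been pulled. The class name, the 15-minute interval, and the injected clock value are assumptions made to keep the example small and testable.

```java
import java.util.concurrent.TimeUnit;

final class FeedRefreshGate {
    // Hypothetical interval, in minutes.
    static final long INTERVAL_MINUTES = 15;

    // Static, so every instance in the JVM shares it. -1 means "never pulled".
    private static long _LastUpdated = -1;

    // Runs pullAndTransform only when the feed has never been pulled or the
    // interval has elapsed. Returns true when a refresh actually took place.
    // The method is synchronized so that two simultaneous requests cannot both
    // pull the feed and write the output file at the same time.
    static synchronized boolean refreshIfDue(long nowMillis, Runnable pullAndTransform) {
        long intervalMillis = TimeUnit.MINUTES.toMillis(INTERVAL_MINUTES);
        boolean due = _LastUpdated < 0 || nowMillis - _LastUpdated >= intervalMillis;
        if (due) {
            pullAndTransform.run();
            // Recorded only after a successful pull, so a failure is retried
            // on the next request instead of being cached for the interval.
            _LastUpdated = nowMillis;
        }
        return due;
    }
}
```

Inside setRSSFile(), the bean would pass System.currentTimeMillis() as the first argument and wrap its transformation code in the Runnable.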
Summary
In this article, I’ve shown you how to create a syndicated feed reader that retrieves a single remote feed, transforms it using XSLT, and displays it as part of a Web page. The system can also adapt to multiple feed types through the use of XSLT stylesheets.
The application uses a DOM Document to analyze the feed and determine the appropriate stylesheet, but you can further extend it by moving some of that logic into an external stylesheet. You can also adapt the system so that it can pull more than one feed, perhaps based on a user selection, with each one creating its own cached file. Similarly, you can enable the user to determine the interval between feed retrievals.
Resources
- Check out Syndic8, where you’ll find thousands of RSS feeds, searchable by type and toolkit. It also includes a good reference section with spec documents.
- Read James Lewin’s “An introduction to RSS feeds” (developerWorks, November 2000).
- For another perspective, check out “The Python Web services developer: RSS for Python” by Mike Olson and Uche Ogbuji (developerWorks, November 2002).
- Read Michael Kay’s article explaining “What kind of language is XSLT?” (developerWorks, February 2001).
- Responsibility for RSS 2.0 was recently transferred to the Berkman Center at Harvard. This may or may not have an effect on the (Not)Echo/(Not)Atom/WhateverTheyVoteToCallIt project.
- Visit Adam Curry’s Weblog.
- Read the XSLT 1.0 Recommendation, and get a heads-up on XSLT 2.0 at the World Wide Web Consortium’s XSL page.
- Find more resources on the developerWorks XML and Web Services zones.
- IBM’s DB2 database provides not only relational database storage, but also XML-related tools such as the DB2 XML Extender, which provides a bridge between XML and relational systems. Visit the DB2 Developer Domain to learn more about DB2.
- Find out how you can become an IBM Certified Developer in XML and related technologies.