Retrieve Syndicated Content, Transform It, & Display The Result
In this article, Nick shows you how to retrieve syndicated content and convert it into headlines for your site. Since no official format for such feeds exists, aggregators are often faced with the difficulty of supporting multiple formats, so Nick also explains how to use XSL transformations to more easily deal with multiple syndication file formats.
With the popularization of weblogging, information overload is worse than ever. Readers now have more sites than ever to keep up with, and visiting all of them on a regular basis is next to impossible. Part of the problem can be solved through the syndication of content, in which a site makes its headlines and basic information available in a separate feed. Today, most of these feeds use an XML format called RSS, though there are variations in its use and even a potential competing format.
This article explains how to use Java technology to retrieve the content of a syndicated feed, determine its type, and then transform it into HTML and display it on a Web site. This process involves five steps:
1. Retrieve the XML feed
2. Analyze the feed
3. Determine the proper transformation
4. Perform the transformation
5. Display the result
This article chronicles the creation of a JavaServer Pages (JSP) page that hands a remote feed to a Java bean, which retrieves the feed and transforms it using XSLT. The page then incorporates the newly transformed information into its own output. The concepts, however, apply to virtually any Web environment.
The Source File
Depending on whom you ask, RSS stands for RDF Site Summary, Rich Site Summary, or other acronyms that are less tactful. In any case, no fewer than four versions of RSS are in common usage, from the fairly simple 0.91, which doesn’t include namespaces and imposes some strict limits on content, to version 2.0, which encompasses versions back to 0.91 (so a valid 0.91 file is also a valid 2.0 file) but also allows the use of namespaces. By allowing namespaces, version 2.0 makes it possible for a syndicator to add elements to the feed, as long as they’re in a different namespace. Some syndicators use this capability to add information using the Resource Description Framework (RDF).
A simple RSS 2.0 file might look like this feed from Adam Curry’s weblog (see Resources):
Listing 1. A sample RSS 2.0 message
<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Adam Curry: Adam Curry's Weblog</title>
<link>http://www.blognewsnetwork.com/members/0000001/</link>
<description>News and Views from Adam Curry</description>
<language>en-us</language>
<copyright>Copyright 2003 Adam Curry</copyright>
<lastBuildDate>Thu, 24 Jul 2003 09:26:48 GMT</lastBuildDate>
<docs>http://backend.userland.com/rss</docs>
<generator>Radio UserLand v8.0.9b2</generator>
<managingEditor>adam@curry.com</managingEditor>
<webMaster>adam@curry.com</webMaster>
<item>
<title>weblog at work again</title>
<link>
http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158
</link>
<description>&lt;a href="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg"&gt;
&lt;img src="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg"
width="250" height="187.5" border="0" align="right" hspace="15" vspace="5"
alt="A picture named adamwheely.jpg"&gt;&lt;/a&gt;A few days ago I asked if anyone had taken
pictures of me at the annual ...</description>
<guid>
http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158
</guid>
<pubDate>Thu, 24 Jul 2003 09:21:25 GMT</pubDate>
</item>
<item>
<title>teens trouble with web</title>
<link>
http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156
</link>
<description>According to a report from Northumbria University, most teenagers
lack the &lt;a href="http://www.web-user.co.uk/news/news.php?id=33621"&gt;information
gathering skills&lt;/a&gt; needed for using the internet efficiently. This
sounds like it shouldn't be happening in ...</description>
<guid>
http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156
</guid>
<pubDate>Wed, 23 Jul 2003 17:36:23 GMT</pubDate>
</item>
...
</channel>
</rss>
To turn this feed into HTML, you can process it using XSL transformations.
The Primary Stylesheet
The ultimate goal is to generate HTML text that shows the information in an organized way, such as a list of links, included in the body of another page of information. The actual HTML output would be something like:
Listing 2. The output HTML
<h2>Adam Curry: Adam Curry's Weblog</h2>
<h3>News and Views from Adam Curry</h3>
<ul>
<li>
<a href=
"http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158">weblog
at work again</a>
<p>
<a href="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg">
<img src="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg"
width="250" height="187.5" border="0" align="right" hspace="15" vspace="5" alt="A
picture named adamwheely.jpg"></a>A few days ago I asked if anyone had taken
pictures of me at the annual ...</p>
</li>
<li>
<a
href="http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156">
teens trouble with web</a>
<p>According to a report from Northumbria University, most teenagers lack the
<a href="http://www.web-user.co.uk/news/news.php?id=33621">information gathering
skills</a> needed for using the internet efficiently. This sounds like it
shouldn't be happening in ...</p>
</li>
...
</ul>
To create this HTML out of the XML, you’ll need an XSLT stylesheet:
Listing 3. The simple stylesheet
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html"/>
<xsl:template match="/">
<xsl:apply-templates select="//channel"/>
<ul>
<xsl:apply-templates select="//item"/>
</ul>
</xsl:template>
<xsl:template match="channel">
<xsl:apply-templates select="../image"/>
<h2><xsl:value-of select="title"/></h2>
<h3><xsl:value-of select="description"/></h3>
</xsl:template>
<xsl:template match="item">
<li>
<xsl:element name="a">
<xsl:attribute name="href"><xsl:value-of select="link"/></xsl:attribute>
<xsl:value-of select="title" />
</xsl:element>
<p><xsl:value-of disable-output-escaping="yes" select="description" /></p>
</li>
</xsl:template>
<xsl:template match="image">
<xsl:element name="img">
<xsl:attribute name="src"><xsl:value-of select="url"/></xsl:attribute>
<xsl:attribute name="style">float:left; padding: 10px;</xsl:attribute>
</xsl:element>
</xsl:template>
<xsl:template match="language">
</xsl:template>
</xsl:stylesheet>
The actual form of the page is entirely up to you, as is the data that you choose to include. In this case, you’re simply creating a bulleted list of entries, with a title (if there is one) that links back to the original post and the description for each post.
To actually perform the transformation, you need to create a JSP page.
The Basic JSP Page
XML data can be transformed in any number of ways. In this article, I’ll show you how to create a JSP page that passes a feed to a Java bean for transformation. That bean creates a static file, and the JSP page incorporates it into the body of the page. (The reason for the static file will become clearer in the caching section below.)
The page itself is fairly straightforward:
Listing 4. The JSP page
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<jsp:useBean id="rssBean" scope="request" class="RSSProcessor">
<%
rssBean.setRSSFile(
"http://wolk.datashed.net/users/adam@curry.com/curryCom.xml");
%>
</jsp:useBean>
<html>
<head>
<title>Syndicated Feeds</title>
</head>
<body>
<jsp:include page="headlines.html" flush="true"/>
</body>
</html>
Here you’re simply creating an instance of the RSSProcessor class. Because the call to setRSSFile() sits in the body of the useBean element, it executes only when the bean object is created. This method creates the headlines.html page that the JSP page then incorporates into the output.
Next, create the bean to do the transformation.
Transforming The File
The Java bean is nothing more than a Java class that has get and set methods. In this case, the set method, setRSSFile(), also includes code that performs a transformation on that file:
Listing 5. Transforming the feed
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RSSProcessor {

    private String _RSSFile;

    public RSSProcessor() { }

    public String getRSSFile() {
        return _RSSFile;
    }

    public void setRSSFile(String fileName) {
        // Remember the feed location so that getRSSFile() can report it
        _RSSFile = fileName;
        String outputURL = "headlines.html";
        // try-with-resources closes the output file even if the transformation fails
        try (OutputStream out = new FileOutputStream(outputURL)) {
            // The remote feed and the stylesheet are both read as stream sources
            StreamSource source = new StreamSource(fileName);
            StreamSource finalStyle = new StreamSource("final.xsl");
            StreamResult result = new StreamResult(out);

            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer transformer = transFactory.newTransformer(finalStyle);
            transformer.transform(source, result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
This method simply takes an input source, which happens to be a remote RSS feed, and transforms it, using the final.xsl stylesheet, to the headlines.html file.
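To experiment with a stylesheet before deploying the JSP page, the same JAXP calls can be run against in-memory strings. The following is a minimal sketch rather than part of the bean: the class name HeadlineSketch, the method toHtml(), and the cut-down feed and stylesheet are all hypothetical stand-ins, and the stylesheet keeps only a simplified version of the item template from Listing 3.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for trying out a stylesheet; not part of RSSProcessor.
class HeadlineSketch {

    // Cut-down stand-in for final.xsl: one linked list entry per item.
    static final String STYLE =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='html'/>"
      + "<xsl:template match='/'>"
      + "<ul><xsl:apply-templates select='//item'/></ul>"
      + "</xsl:template>"
      + "<xsl:template match='item'>"
      + "<li><a href='{link}'><xsl:value-of select='title'/></a></li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Made-up feed used only for illustration.
    static final String FEED =
        "<rss version='2.0'><channel><title>Demo</title>"
      + "<item><title>First post</title><link>http://example.com/1</link></item>"
      + "</channel></rss>";

    // Applies the stylesheet to the feed and returns the generated HTML.
    static String toHtml(String feedXml, String styleXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(styleXml)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(feedXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

Calling HeadlineSketch.toHtml(HeadlineSketch.FEED, HeadlineSketch.STYLE) returns the list markup as a string, which makes it easy to check a stylesheet change without touching headlines.html.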
In the grand scheme of things, that’s it: Retrieve the file, transform it, and display the results. In reality, there are other issues to consider.
Adjusting For Multiple Formats
If all RSS files were like this sample, you wouldn’t need to do anything else. Unfortunately, this is not the case. Different vendors and toolkits can produce additional information, or can replace core information with RDF information or other namespaced modules, leading to complaints that supporting RSS is complex because of all the variations. But with the use of XSL transformations, it doesn’t have to be that way.
For example, an RSS 2.0 feed might also contain RDF information, like this feed from Typographica:
Listing 6. Excerpt from sample RSS 2.0 message with RDF
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Typographica</title>
<link>http://typographi.ca/</link>
<description>A daily journal of typography featuring news, observations,
and open commentary on fonts and typographic design.</description>
<dc:language>en-us</dc:language>
<dc:creator>Stephen Coles</dc:creator>
<dc:rights>Copyright 2003</dc:rights>
<dc:date>2003-07-24T00:00:52-08:00</dc:date>
<admin:generatorAgent rdf:resource="http://www.movabletype.org/?v=2.63" />
<admin:errorReportsTo rdf:resource="mailto:scoles@gomakecontact.com" />
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
<item>
<title>Hot and Cold Fonts</title>
<link>http://typographi.ca/000643.php</link>
<description>LettError have developed a multiple master font
for the Design Institute of the University of Minnesota that varies
along three...</description>
<guid isPermaLink="false">643@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p><a href="http://www.letterror.com/">
LettError</a> have developed a multiple master font for the
<a href="http://design.umn.edu/">Design Institute</a> of the University of
Minnesota that varies along three dimensions: formality, informality, and
"weirdness." (It's apparently possible to be 100% formal and 100% informal at
the same time.) As the New York Times...]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-24T00:00:52-08:00</dc:date>
</item>
<item>
<title>Textura Digita</title>
<link>http://typographi.ca/000642.php</link>
<description>CNN reports that the Gutenberg Bible is now available
on the web via the Ransom Center at the University of...</description>
<guid isPermaLink="false">642@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p><a href=
"http://www.cnn.com/2003/TECH/internet/07/23/digital.scripture.ap/index.html">
CNN reports</a> that the Gutenberg Bible is now available on the web via the
<a href="http://www.hrc.utexas.edu/exhibitions/permanent/gutenberg/">Ransom
Center</a> at the University of Texas.</p>
...]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-23T13:16:15-08:00</dc:date>
</item>
<item>
<title>Fight! Fight! Fight!</title>
<link>http://typographi.ca/000640.php</link>
<description>Angry because you had to miss TypeCon &#8217;03?
Work out that aggression with Helvetica vs. Arial....</description>
<guid isPermaLink="false">640@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p>Angry because you had to miss
<a href="http://www.typecon2003.com/">TypeCon ’03</a>? Work out that
aggression with <a href="http://www.engagestudio.com/helvetica/">Helvetica vs.
Arial</a>.</p>]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-22T08:52:36-08:00</dc:date>
</item>
...
</channel>
</rss>
Notice that this feed actually contains two different descriptions of the content. The first is in the description element, and the second is in the encoded element, which is part of the http://purl.org/rss/1.0/modules/content/ namespace. Here you see the difference in how different feeds handle information. Adam Curry’s blog simply encodes information such as links and drops them into the description element, whereas Typographica (or rather the toolkit that produces Typographica’s feed) provides a non-markup version in the description element and a full version in the encoded element using a CDATA construct.
Although it is preferable to create a custom presentation for each feed type in order to take advantage of any extra information, this is not always practical from an application development standpoint. But that doesn’t mean you have to give up. Instead, you can create a transformation that simply takes different feeds and converts them to a standard structure, which you can then feed to the final transformation.
For example, you can create a stylesheet that takes an RSS 2.0 feed and, if it finds an encoded element, uses it to replace any description element:
Listing 7. Transforming RDF information
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    exclude-result-prefixes="content">
  <xsl:template match="/">
    <rss>
      <channel>
        <xsl:apply-templates select="rss/channel" />
      </channel>
    </rss>
  </xsl:template>
  <!-- Copy the channel-level elements that final.xsl needs -->
  <xsl:template match="title|link|/rss/channel/description|image|text()">
    <xsl:copy-of select="." />
  </xsl:template>
  <!-- Default: keep the plain description -->
  <xsl:template match="item" >
    <item>
      <title><xsl:value-of select="title" /></title>
      <link><xsl:value-of select="link" /></link>
      <description><xsl:value-of select="description" /></description>
    </item>
  </xsl:template>
  <!-- encoded lives in the content namespace, so the match must use the content: prefix;
       an unprefixed name matches only elements that are in no namespace -->
  <xsl:template match="item[content:encoded]" >
    <item>
      <title><xsl:value-of select="title" /></title>
      <link><xsl:value-of select="link" /></link>
      <description><xsl:value-of select="content:encoded" /></description>
    </item>
  </xsl:template>
</xsl:stylesheet>
This stylesheet makes copies of the elements that the final stylesheet will need, such as the channel’s title and description, and makes a copy of the item with the appropriate description information.
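To see the replacement at work, the normalization can be run in memory and the resulting descriptions read back from a DOM Document. This is only a sketch under stated assumptions: the class NormalizeSketch, its descriptions() method, and the two-item feed are hypothetical, and the embedded stylesheet is a trimmed variant that handles items only. It declares the content namespace so that the match applies to content:encoded.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical demo class; the names are illustrative only.
class NormalizeSketch {

    // Trimmed interim stylesheet: items only, namespace-qualified match.
    static final String STYLE =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:content='http://purl.org/rss/1.0/modules/content/'"
      + " exclude-result-prefixes='content'>"
      + "<xsl:template match='/'>"
      + "<rss><channel><xsl:apply-templates select='rss/channel/item'/></channel></rss>"
      + "</xsl:template>"
      + "<xsl:template match='item'>"
      + "<item><description><xsl:value-of select='description'/></description></item>"
      + "</xsl:template>"
      + "<xsl:template match='item[content:encoded]'>"
      + "<item><description><xsl:value-of select='content:encoded'/></description></item>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Made-up feed: the first item has content:encoded, the second does not.
    static final String FEED =
        "<rss version='2.0' xmlns:content='http://purl.org/rss/1.0/modules/content/'>"
      + "<channel>"
      + "<item><description>Short summary</description>"
      + "<content:encoded><![CDATA[<p>Full text</p>]]></content:encoded></item>"
      + "<item><description>Plain summary</description></item>"
      + "</channel></rss>";

    // Returns the text of each description element in the normalized feed.
    static String[] descriptions(String feedXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLE)));
            DOMResult result = new DOMResult();   // the transformer creates the Document
            transformer.transform(new StreamSource(new StringReader(feedXml)), result);
            Document doc = (Document) result.getNode();
            NodeList nodes = doc.getElementsByTagName("description");
            String[] texts = new String[nodes.getLength()];
            for (int i = 0; i < texts.length; i++) {
                texts[i] = nodes.item(i).getTextContent();
            }
            return texts;
        } catch (TransformerException e) {
            throw new IllegalStateException("Normalization failed", e);
        }
    }
}
```

For the sample feed, the first description comes back as the markup from content:encoded and the second keeps its plain summary, which is the standard structure the final transformation expects.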
Now you just have to weave that new document into the final transformation:
Listing 8. Chaining the transformation
...
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.dom.DOMResult;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
public class RSSProcessor {
...
public void setRSSFile(String fileName){
try {
StreamSource interimSource = new StreamSource(fileName);
String XSLSheetName = "2.0.xsl";
StreamSource style = new StreamSource(XSLSheetName);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document interimDoc = db.newDocument();
DOMResult interimResult = new DOMResult(interimDoc);
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer interimTransformer = null;
interimTransformer = transFactory.newTransformer(style);
interimTransformer.transform(interimSource, interimResult);
DOMSource source = new DOMSource(interimDoc);
StreamSource finalStyle = new StreamSource("final.xsl");
String outputURL = "headlines.html";
StreamResult result = new StreamResult(new
FileOutputStream(outputURL));
Transformer transformer = transFactory.newTransformer(finalStyle);
transformer.transform(source, result);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Take a look at this one step at a time. First of all, you’re creating an interim transformation that takes the initial feed and transforms it according to the interim stylesheet in Listing 7, named 2.0.xsl. The result of this first transformation goes not to a file, but to a DOM Document object, which then gets passed as the source for the second transformation.
The name of the interim stylesheet, 2.0.xsl, was deliberate. By naming it after the version, you can create a more flexible system.
Choosing A Version
As long as you’re allowing for different formats, you can actually create a system that checks for the feed version before processing it. After all, only RSS 1.0 and 2.0 feeds can have RDF elements, so there’s no need to process other feeds. But how can you tell what version to apply?
To solve this problem, you can load the actual feed, analyze it, and use the information to set the proper stylesheet.
Listing 9. Choosing a stylesheet
...
import org.xml.sax.InputSource;
import org.w3c.dom.Element;
public class RSSProcessor {
...
public void setRSSFile(String fileName){
try {
InputSource docFile = new InputSource (fileName);
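DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
// XSLT needs a namespace-aware DOM to match prefixed elements such as content:encoded
dbf.setNamespaceAware(true);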
DocumentBuilder db = dbf.newDocumentBuilder();
Document inputDoc = db.parse(docFile);
Element rss = inputDoc.getDocumentElement();
String version = null;
if (rss.getNodeName().equals("rss")){
version = rss.getAttribute("version");
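// getAttribute() returns an empty string, not null, when the attribute is missing
if (version == null || version.length() == 0) {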
version = "0.91";
}
} else if (rss.getNodeName().equals("feed")){
version = "echo";
}
String XSLSheetName = version+".xsl";
StreamSource style = new StreamSource(XSLSheetName);
DOMSource interimSource = new DOMSource(inputDoc);
Document interimDoc = db.newDocument();
DOMResult interimResult = new DOMResult(interimDoc);
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer interimTransformer = null;
if (version.equals("0.91")){
interimTransformer = transFactory.newTransformer();
} else {
interimTransformer = transFactory.newTransformer(style);
}
interimTransformer.transform(interimSource, interimResult);
DOMSource source = new DOMSource(interimDoc);
StreamSource finalStyle = new StreamSource("final.xsl");
String outputURL = "headlines.html";
StreamResult result = new StreamResult(new
FileOutputStream(outputURL));
Transformer transformer = transFactory.newTransformer(finalStyle);
transformer.transform(source, result);
} catch (Exception e) {
e.printStackTrace();
}
}
}
In this case, you’re loading the feed and checking it for the RSS version, and then using the version number as the file name. The advantage here is that should a new version of RSS be released, you can extend the application by simply adding a new stylesheet. Notice that I’ve added a check for Echo, or Atom, or whatever RSS’s competitor might eventually be called, and that you can also adjust support for it as it changes by simply changing the echo.xsl stylesheet.
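The version check can also be isolated into a small function that is easy to exercise on its own. The sketch below is an assumption-laden illustration, not the bean's actual code: the class VersionSketch and the method stylesheetFor() are hypothetical, it parses the feed from a string rather than a URL, and it chooses to reject unrecognized root elements outright instead of building a file name from a missing version.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; maps a feed to the name of its interim stylesheet.
class VersionSketch {

    static String stylesheetFor(String feedXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Element root = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(feedXml)))
                .getDocumentElement();
            String name = root.getLocalName();
            if ("rss".equals(name)) {
                // An absent attribute comes back as "", so treat that as 0.91
                String version = root.getAttribute("version");
                return (version.isEmpty() ? "0.91" : version) + ".xsl";
            }
            if ("feed".equals(name)) {
                return "echo.xsl";
            }
            // Assumption: fail loudly for roots this sketch does not know about
            throw new IllegalArgumentException(
                "Unrecognized feed root: " + root.getNodeName());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }
}
```

With this shape, supporting another format means adding one branch and one stylesheet, and the rest of the pipeline stays unchanged.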
Another advantage is that each interim stylesheet is completely generic. A “2.0 to 0.91” stylesheet will work for anyone, anywhere, and you can make changes to the final output by simply editing final.xsl, whether you support one version or a hundred.
The final.xsl stylesheet is designed for a simple 0.91-style feed, so if you’re dealing with one, you’ll omit the stylesheet on the interim transformation. This creates an identity transform, in which the document is simply passed along as-is.
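The identity transform is easy to confirm in isolation. In this small sketch (the class IdentitySketch and its copy() method are hypothetical names), a Transformer created without a stylesheet copies a feed into a new DOM Document unchanged:

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical demo of the no-argument newTransformer() call.
class IdentitySketch {

    // Copies the XML into a fresh Document without applying any stylesheet.
    static Document copy(String xml) {
        try {
            // No stylesheet argument: the transformer passes the input through as-is
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            identity.transform(new StreamSource(new StringReader(xml)), result);
            return (Document) result.getNode();
        } catch (TransformerException e) {
            throw new IllegalStateException("Identity transform failed", e);
        }
    }
}
```

Because the result has the same structure as the input, a 0.91 feed can flow through the interim step and into final.xsl without a dedicated 0.91.xsl file.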
That takes care of the problem of multiple versions, but you have one more issue to deal with: how often the feed is retrieved.
Caching The Feed
This system would work fine on a personal server where you’re the only one accessing it, but in the real world, it would be impractical (and rude) to pull the feed every time someone wants to read it. Instead, you need to build the system with some sort of time delay, so if the feed’s been pulled recently, the existing headlines.html file is used.
To do that, you can take advantage of a Java application’s nature. A static variable that represents the last time the feed was pulled would be constant for all instances of the RSSProcessor class, so you can check the current time against it before actually pulling the feed:
Listing 10. Caching the feed
import java.util.Date;
public class RSSProcessor {
...
static Date _LastUpdated = new Date();
public Date getLastUpdated(){
return _LastUpdated;
}
public void setRSSFile(String fileName){
Date now = new Date();
long diff = now.getTime() - _LastUpdated.getTime();
double interval = .5;
if ((diff == 0) || (diff > (interval * 60 * 1000))){
_LastUpdated = now;
try {
InputSource docFile = new InputSource (fileName);
...
Transformer transformer = transFactory.newTransformer(finalStyle);
transformer.transform(source, result);
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
The first time the server instantiates RSSProcessor, _LastUpdated gets initialized with the current date. At (essentially) the same time, the server executes the setRSSFile() method, and because the difference between the current time and the _LastUpdated time is zero, the transformation takes place.
The next time someone calls the page, a new instance of RSSProcessor is created, but because _LastUpdated is static, the new instance sees the existing value of _LastUpdated rather than initializing it. The interval is measured in minutes, with the difference between _LastUpdated and the current time measured in milliseconds. If the amount of time that has elapsed is less than the interval, nothing else happens. The headlines.html file isn’t updated, so the server uses the old one instead.
If, on the other hand, the interval has passed, _LastUpdated gets the current time, which is passed on to any subsequent RSSProcessor objects, and the bean pulls a new copy of the feed to transform.
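One caveat: the diff == 0 test relies on the static initializer and the first call to setRSSFile() landing in the same millisecond, which may not always hold. A possible variation, offered as a sketch rather than as the bean's own logic, uses a sentinel value for "never fetched" and synchronizes the check so that two simultaneous requests do not both pull the feed. The class FeedCacheSketch and the method shouldRefresh() are hypothetical, and the current time is passed in as a parameter purely to make the logic easy to test.

```java
// Hypothetical freshness check; names and structure are illustrative only.
class FeedCacheSketch {

    private static final long NEVER = -1L;       // sentinel: feed not fetched yet
    private static long lastUpdated = NEVER;     // shared by all callers, like _LastUpdated

    // Returns true (and records the time) when the feed should be pulled again.
    static synchronized boolean shouldRefresh(long nowMillis, double intervalMinutes) {
        long intervalMillis = (long) (intervalMinutes * 60 * 1000);
        if (lastUpdated == NEVER || nowMillis - lastUpdated > intervalMillis) {
            lastUpdated = nowMillis;
            return true;
        }
        return false;
    }
}
```

In the bean, setRSSFile() could call FeedCacheSketch.shouldRefresh(System.currentTimeMillis(), .5) and run the transformation only when it returns true. With the half-minute interval from Listing 10, the first call refreshes, a call 19 seconds later does not, and a call just over 30 seconds after the first refreshes again.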
Conclusion
In this article, I’ve shown you how to create a syndicated feed reader that retrieves a single remote feed, transforms it using XSLT, and displays it as part of a Web page. The system can also adapt to multiple feed types through the use of XSLT stylesheets.
The application uses a DOM Document to analyze the feed and determine the appropriate stylesheet, but you can further extend it by moving some of that logic into an external stylesheet. You can also adapt the system so that it can pull more than one feed, perhaps based on a user selection, with each one creating its own cached file. Similarly, you can enable the user to determine the interval between feed retrievals.
Resources
About the author
Written by Nicholas Chase.
Nicholas Chase, a Studio B author, has been involved in Web site development for companies such as Lucent Technologies, Sun Microsystems, Oracle, and the Tampa Bay Buccaneers. Nick has been a high school physics teacher, a low-level radioactive waste facility manager, an online science fiction magazine editor, a multimedia engineer, and an Oracle instructor. More recently, he was the Chief Technology Officer of Site Dynamics Interactive Communications in Clearwater, Florida, USA, and is the author of four books on Web development, including XML Primer Plus (Sams). He loves to hear from readers and can be reached at nicholas@nicholaschase.com.