|
Helping ordinary people create extraordinary websites! |
Make Database Queries Without the DatabaseBy Brian Goetz2005-07-14
No database needed The "obvious" approach to the problem was to put everything in a SQL database -- pages, links, metrics, HTTP result codes, timing results, and other metadata. The problem lent itself well to a relational representation, especially as it was not necessary to store the contents of the visited pages, just their structure and metadata. So far, this project looked like a typical database application, and there was no shortage of candidate persistence strategies. But maybe it was possible to avoid the complexity of using a database for persistence -- the crawler was only going to visit a few tens of thousand of pages. This number was few enough that it was possible to keep the whole database in memory, and persistence, if needed, could be achieved through serialization. (Yes, the load and save operations would take a long time, but they did not have to be performed frequently.) Laziness had yielded a dividend -- not having to deal with persistence can greatly reduce the amount of time to develop an application, resulting in a significant savings in terms of development effort. Building and manipulating an in-memory data structure is just a lot easier than having to go to the database for every operation that adds, fetches, or analyzes data. No matter which persistence model you choose, it always constrains how you structure any code that touches the data. The in-memory data structure had a sort of tree structure, as shown in Listing 1, rooted at the home pages of the various sites being crawled, so the Visitor pattern was ideal for searching it or extracting data from it. (It was not too hard to build a base Visitor class that prevented getting stuck in cycles of links -- A links to B, B links to C, C links back to A.) Listing 1. Simplified schema for Web crawler
The crawler application had a dozen or so Visitors, which did things like selecting pages for further analysis, selecting pages that had unfollowed links, selecting pages with errors, tabulating the "most linked" pages, and so on. Because all of these operations were so simple, the Visitor pattern, shown in Listing 2, worked quite well, and because the data structure could fit into memory, even exhaustive searches were cheap enough: Listing 2. Visitor pattern for Web crawler database
Oops, forgot about reporting So, the in-memory data structure, which was great for adding new results, finding a specific result, and all sorts of ad-hoc traversals, had become a liability when it came to reporting. For any report whose structure was not similar to that of the database, a Visitor would have to create a whole new data structure to hold the report data. Therefore, each type of report would require its own report-specific intermediate data structure for holding the results, a visitor for populating the intermediate data structure, and then post-processing code for turning the intermediate data structure into the final report. This approach seemed like a lot of work, especially when most of the prototyped reports were going to be thrown away. For example, let's say you wanted a report that listed all the pages from other sites linked to from a given site, and for each of those external pages, a list of pages on the site that link to it, and then sort the report by number of links, so the most-linked pages appear first. This plan basically turns the data structure inside-out. To accomplish this sort of data transformation with a Visitor, you need to take the list of external page links reachable from a given site and sort them by the linked-to page, as shown in Listing 3: Listing 3. Visitor tabulates most-linked pages, along with pages that link to them
The Visitor in Listing 3 produces a map that associates each external page with a list of internal pages that link to it. To prepare the report, you would then have to sort the entries by the size of the associated page set, and then create the report. While none of these steps is all that difficult, the amount of report-specific code for each report is significant, and given that rapid report prototyping was an important goal (because there were no stated reporting requirements), the overhead of trying out a new report was higher than ideal. Many reports required multiple passes on the data to select, aggregate, and sort the data. Tutorial Pages: » Borrowing a data model can simplify development and enhance performance » No database needed » My kingdom for a data model » XQuery to the rescue » Summary » Resources First published by IBM developerWorks |
|