What is Wera?

Sverre Bang

$Id: what-is-wera.xml,v 1.5 2005/10/26 09:21:02 sverreb Exp $


Table of Contents

1. Introduction
2. Wera simple setup
3. Practical use
4. The future of WERA

1. Introduction

WERA (WEb ARchive Access) is an archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections. It also offers full-text search and easy navigation between different versions of a web page.

The Wera search interface and result list are shown below.

Figure 1. Wera Search


When the user clicks the Timeline link of a specific hit, the Timeline View appears (shown below). Each version (timestamp) of the given URL is marked along the timeline. The user may navigate between the different versions by clicking directly in the timeline or by using the first, previous, next and last arrows.

Figure 2. Wera Timeline View

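The arrow navigation described above can be sketched in a few lines of Python. This is an illustration only, not Wera's own code (Wera's web application is written in PHP). The function name timeline_links and the 14-digit timestamp format (YYYYMMDDhhmmss) are assumptions made for the example.

```python
from bisect import bisect_left


def timeline_links(timestamps, current):
    """Return the first/previous/next/last targets for the version `current`.

    Assumption: timestamps are 14-digit strings (YYYYMMDDhhmmss), so that
    plain string sorting puts them in chronological order.
    """
    versions = sorted(set(timestamps))
    i = bisect_left(versions, current)
    if i == len(versions) or versions[i] != current:
        raise ValueError("current is not one of the archived versions")
    return {
        "first": versions[0],
        # No previous version exists when we are already at the first one.
        "previous": versions[i - 1] if i > 0 else None,
        # No next version exists when we are already at the last one.
        "next": versions[i + 1] if i + 1 < len(versions) else None,
        "last": versions[-1],
    }
```

Returning None for a missing neighbour lets the caller grey out the corresponding arrow at either end of the timeline.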

2. Wera simple setup

In the simplest setup, Wera, NutchWax, the web archive (ARC files) and the interface to the web archive (arcretriever) are all installed on the same machine. This is illustrated in the figure below.

Figure 3. Wera overview


Explanation of the figure above:

  • The user submits a query in the Wera search WUI.

  • Based on the submitted query, Wera constructs a search request and sends it (1) to NutchWax as an HTTP GET request, e.g. http://localhost:8082/nutchwax/opensearch?query=lux&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl

  • NutchWax constructs a result set formatted as A9 OpenSearch RSS (XML) and sends it as a reply to Wera (2).

  • Wera formats the result set for output to the user. For each hit, Wera sends two new queries to NutchWax (1, 2) to determine the number of versions matching the query and the total number of versions (this functionality may be disabled in the Wera configuration to reduce the query load on NutchWax).

  • When the user clicks the Timeline link of a hit, two things happen:

    • Wera executes a search on exacturl in order to display the timeline with all the available versions (timestamps) of the given URL marked along the line.

    • Wera executes searches on exacturl to find the version closest to the timestamp submitted as a parameter to the timeline view script (1, 2). For that version, Wera constructs a request to the arcretriever containing the name of the ARC file where the version resides and the offset within that file where the version is stored (the ARC name and offset are stored in the index). Wera then requests and receives the archived resource (3, 4) from the arcretriever (request example: http://localhost:8082/arcretriever/arcretriever?reqtype=getfile&aid=5902508/IAH-20051004171809-00000-test). If the resource is of type text/html (as reported in the result set from NutchWax), a JavaScript link rewriter is inserted in the resource to ensure that links point to Wera rather than out to the Internet. Before Wera delivers the resource to the user's browser, it sets the content type and encoding headers according to the values received in the NutchWax result set, so that the browser renders the resource correctly.

      Note

      A resource of type text/html will often contain inline references to images and other resources. Provided the JavaScript link rewriter does its job on these, the step above is repeated for each of them.
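The request flow above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not Wera's actual (PHP) implementation. The two endpoint URLs and the query parameters are taken from the examples above. Everything else is assumed for the sketch: the function names, the dictionary keys timestamp, arcname and offset, the 14-digit timestamp format, and the reading of aid as "<offset>/<ARC name>".

```python
from datetime import datetime
from urllib.parse import urlencode

# Endpoints as they appear in the request examples above.
NUTCHWAX = "http://localhost:8082/nutchwax/opensearch"
ARCRETRIEVER = "http://localhost:8082/arcretriever/arcretriever"
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit timestamps


def search_url(query, start=0, hits_per_page=10):
    """Build the HTTP GET request that is sent to NutchWax (step 1)."""
    params = {
        "query": query,
        "start": start,
        "hitsPerPage": hits_per_page,
        "hitsPerDup": 1,
        "dedupField": "exacturl",
    }
    return NUTCHWAX + "?" + urlencode(params)


def closest_version(versions, wanted):
    """Pick the version whose timestamp lies nearest to `wanted`.

    `versions` is a list of dicts with the hypothetical keys
    'timestamp', 'arcname' and 'offset'.
    """
    target = datetime.strptime(wanted, TIMESTAMP_FORMAT)
    return min(
        versions,
        key=lambda v: abs(datetime.strptime(v["timestamp"], TIMESTAMP_FORMAT) - target),
    )


def retrieve_url(version):
    """Build the arcretriever request for one version (step 3).

    Assumption: the aid is "<offset>/<ARC name>".
    """
    aid = f"{version['offset']}/{version['arcname']}"
    # safe="/" keeps the slash inside the aid unescaped, as in the example.
    return ARCRETRIEVER + "?" + urlencode({"reqtype": "getfile", "aid": aid}, safe="/")
```

A caller would first fetch search_url(...) to obtain the versions of a URL, pass them to closest_version together with the requested timestamp, and then fetch retrieve_url(...) for the chosen version.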

3. Practical use

The original vision for the NwaToolset (the predecessor of Wera) was to enable search across the different Nordic Web Archives and provide seamless navigation within the different archives. The ability to search across the different indexes was provided by Fast Search & Transfer's multi-node architecture. To enable Wera to retrieve a particular document with a given aid (Archive ID) from the right archive, the collection field was introduced in the index (it is also present in the NutchWax index). The Wera config file holds the mapping from collection to archive (or rather, to Wera installation).

Another reason to include the collection field was to ensure that the actual link rewriting was done by the owner of the document. Each archive holder would have to set up their own Wera installation. When one Wera requested a document from a remote archive, the remote Wera would make the necessary changes to the document before delivering it to the calling Wera. This gave the owner full control over what was delivered to the calling site, so that the document could be treated in accordance with local policies rather than the policies of the calling site. The figure below illustrates the currently supported use of mapping between collection and archive nodes.

Figure 4. Wera interfacing several archive nodes


In the Wera installation of W1, the different collections indexed in NutchWax are mapped to the corresponding Wera installations of W2-Wn. When the timeline view on W1 encounters a resource located on a different node (e.g. the collection mapping points to the Wera installation of W2), it requests that resource from the Wera installation at W2. Wera at W2 fetches the resource from its retriever and makes the necessary changes to the file before delivering it to Wera at W1 (e.g. it inserts the JavaScript link rewriter or rewrites the file server side). When Wera at W1 receives this file, it performs an additional rewrite so that the links point to itself rather than to W2's Wera.
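The collection lookup and the second rewrite at W1 can be sketched as follows. This is a deliberately naive Python illustration, not Wera's code. The host names, the mapping contents, the function names and the /view?url=x path are all hypothetical, and a plain string replacement stands in for whatever rewriting Wera really performs.

```python
def owner_of(collection, mapping, local_base):
    """Return the base URL of the Wera installation serving `collection`.

    Collections missing from the mapping are assumed to be served locally.
    """
    return mapping.get(collection, local_base)


def repoint_links(html, remote_base, local_base):
    """Second rewrite at W1: links that point at the remote Wera are
    redirected to the local one (naive string replacement)."""
    return html.replace(remote_base, local_base)


# Hypothetical installations and collection mapping.
W1 = "http://w1.example.org/wera"
W2 = "http://w2.example.org/wera"
COLLECTIONS = {"collection-a": W1, "collection-b": W2}

# W1 looks up which installation holds the resource ...
source = owner_of("collection-b", COLLECTIONS, W1)
# ... and, after W2 has delivered its rewritten document, re-points the links.
delivered = '<a href="http://w2.example.org/wera/view?url=x">x</a>'
shown_to_user = repoint_links(delivered, source, W1)
```

Keeping the first rewrite at W2 and only re-pointing links at W1 matches the intent stated above: the owner decides what is delivered, and the calling site only changes where the links lead.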

In a real-life, large-scale web archive where the ARC files are distributed across tens or hundreds of hosts, it will not be practical to set up one Wera installation for each of them. A better solution will be to introduce communication between the different retrievers, or to have one front-end retriever interfacing all the other retrievers within one archive. This has to be added in a later release of Wera.

4. The future of WERA

As long as there are institutions using WERA, and these institutions see a need for fixing bugs and adding functionality, WERA will evolve. Of course, the actual work put into it will depend on the resources available at these institutions. We also hope that future enhancements of WERA will be funded, or partly funded, by IIPC, as was the case with the work done to enable release 0.4.0 of WERA (and NutchWax).

The most important requirement for a future release of WERA will be to support retrieval from several web archive hosts through one single ARC retriever interface. In addition, we need to address the remaining bugs whose fixes did not make it into the 0.4.0 release (handling of redirects and better handling of frames). There are also a few registered requests for enhancements that need attention, one of them being the advanced search interface.

One of the main complaints from users has been that WERA required the user to install and set up Tomcat, Apache + PHP and Perl + a number of CPAN modules. The dependency on Perl was removed long ago, but WERA still requires Tomcat (for the Java ARC retriever) and Apache (for the PHP web applications for searching and navigating). Over time, we would like WERA to move completely to Java, both to simplify installation, setup and maintenance and to improve the chances of getting users involved in the further development of WERA. Fortunately, the move to Java may be done gradually because WERA is modular and HTTP is used to communicate between the different modules. The work of porting WERA to Java should be coordinated with the work done on wayback, to avoid implementing the same functionality twice.

We strongly encourage users of WERA/NutchWax to contribute by submitting bugs and RFEs, as well as by providing feedback on the usefulness of the tools.