WERA (WEb ARchive Access) is an archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections, as well as full-text search and easy navigation between different versions of a web page.
The Wera search interface and result list are shown below.
When the user clicks the Timeline link of a specific hit, the Timeline View shows up (shown below). Each version (timestamp) of the given URL is marked along the timeline. The user may navigate between the different versions by clicking directly in the timeline or by clicking the first, previous, next and last arrows.
In the simplest setup, Wera, NutchWax, the web archive (ARC files) and the interface to the web archive (arcretriever) are all installed on the same machine. This is illustrated in the figure below.
Explanation of the figure above:
The user submits a query in the Wera search WUI.
Based on the submitted query, Wera constructs a search request and sends it (1) to NutchWax as an HTTP GET request.
NutchWax constructs an A9 OpenSearch RSS (XML) formatted result set and sends this as a reply to Wera (2).
Wera formats the result set for output to the user. For each hit, Wera sends two new queries to NutchWax (1, 2) to determine the number of versions matching the query and the total number of versions (this functionality may be disabled in the Wera configuration in order to reduce the query load on NutchWax).
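The search round trip described above can be sketched roughly as follows. This is an illustration only, written in Python rather than in WERA's own PHP. The endpoint URL, the `query` and `hitsPerPage` parameter names and the sample RSS reply are all assumptions made for the example; they are not taken from the WERA or NutchWax documentation.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; the real NutchWax search URL depends on the installation.
SEARCH_ENDPOINT = "http://localhost:8080/nutchwax/opensearch"


def build_search_url(query, hits_per_page=10):
    """Build the HTTP GET URL for a search request (parameter names are assumed)."""
    params = {"query": query, "hitsPerPage": hits_per_page}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def parse_result_set(rss_xml):
    """Extract the title and link of every <item> in an RSS-formatted result set."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        hits.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return hits


# Made-up reply standing in for what NutchWax would send back in step (2).
SAMPLE_REPLY = """<rss version="2.0">
  <channel>
    <title>Search results</title>
    <item><title>Example page</title><link>http://example.org/</link></item>
    <item><title>Another page</title><link>http://example.org/a</link></item>
  </channel>
</rss>"""

if __name__ == "__main__":
    print(build_search_url("web archive"))
    for hit in parse_result_set(SAMPLE_REPLY):
        print(hit["title"], hit["link"])
```

A real result set would carry more fields per hit (for instance the ARC name, offset and collection mentioned elsewhere in this text); the sketch only shows the request and reply shape.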
When the user clicks the Timeline link of a hit, two things happen:
Wera executes a search on exacturl in order to display the timeline with all the available versions (timestamps) of the given URL marked along the line.
Wera executes searches on exacturl to find the version closest to the timestamp submitted as a parameter to the timeline view script (1, 2). For that particular version, Wera constructs a request to the arcretriever containing the name of the ARC file where the version resides as well as the offset within that file where the version is stored (the ARC name and offset are stored in the index). Wera then requests, and receives, the archived resource (3, 4) from the arcretriever. If the resource is of type text/html, a link rewriter is inserted in the resource to ensure that links point to Wera rather than out to the internet. Before Wera delivers the resource to the user's browser, header information on content type and encoding is set according to the values received in the NutchWax result set. This ensures that the user's browser renders the resource correctly.
A resource of type text/html will often refer to further resources. When the link rewriter does its job on these, the step above is repeated for each of them.
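The timeline lookup described above (pick the version closest to a requested timestamp, then ask the arcretriever for it by ARC name and offset) might look roughly like this. It is a minimal sketch under stated assumptions: the 14-digit timestamp format, the dictionary keys, the retriever base URL and the `arcname`/`offset` parameter names are hypothetical and chosen only for illustration.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed timestamp layout: 14 digits, e.g. 20050601000000.
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def closest_version(versions, requested):
    """Return the version whose timestamp is nearest to the requested one."""
    target = datetime.strptime(requested, TIMESTAMP_FORMAT)

    def distance(version):
        stamp = datetime.strptime(version["timestamp"], TIMESTAMP_FORMAT)
        return abs(stamp - target)

    return min(versions, key=distance)


def retriever_url(base, version):
    """Build a retriever request from ARC name and offset (parameter names assumed)."""
    params = {"arcname": version["arcname"], "offset": version["offset"]}
    return base + "?" + urlencode(params)


def response_headers(version):
    """Derive the Content-Type header from values stored with the hit."""
    return {"Content-Type": f'{version["mime"]}; charset={version["encoding"]}'}


# Made-up index entries for one URL.
VERSIONS = [
    {"timestamp": "20040101000000", "arcname": "A.arc.gz", "offset": 100,
     "mime": "text/html", "encoding": "iso-8859-1"},
    {"timestamp": "20050601000000", "arcname": "B.arc.gz", "offset": 2048,
     "mime": "text/html", "encoding": "utf-8"},
]

if __name__ == "__main__":
    version = closest_version(VERSIONS, "20050501000000")
    print(retriever_url("http://localhost:8080/arcretriever", version))
    print(response_headers(version))
```

When two versions are equally close, `min` returns the first one in the list; whether WERA prefers the earlier or later version in that case is not stated here.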
The original vision for the NwaToolset (the predecessor of Wera) was to enable search across the different Nordic Web Archives and provide seamless navigation within the different archives. Searching across the different indexes was solved by using Fast Search & Transfer's multi-node architecture. To enable Wera to retrieve a particular document with an aid (Archive ID) from the right archive, the collection field was introduced in the index (it is also present in the NutchWax index). The Wera config file holds the mapping from collection to archive (or rather, Wera installation).
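As a rough illustration of the collection-to-installation mapping, the lookup could be as simple as the sketch below. The collection names, URLs and the `wera_for_hit` helper are hypothetical placeholders; the actual syntax of the Wera config file is not shown in this text.

```python
# Hypothetical mapping; collection names and URLs are placeholders.
COLLECTION_TO_WERA = {
    "collection-a": "http://wera.archive-a.example/",
    "collection-b": "http://wera.archive-b.example/",
}


def wera_for_hit(hit, mapping=COLLECTION_TO_WERA):
    """Return the Wera installation responsible for the hit's collection."""
    collection = hit["collection"]
    try:
        return mapping[collection]
    except KeyError:
        raise LookupError(
            f"no Wera installation configured for collection {collection!r}"
        ) from None


if __name__ == "__main__":
    hit = {"aid": "example-aid", "collection": "collection-b"}
    print(wera_for_hit(hit))
```

Failing loudly on an unmapped collection is a design choice of this sketch; a real installation might instead fall back to the local archive.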
Another reason to include the collection field was to ensure that the actual link rewriting was done by the owner of the document. Each archive holder would have to set up their own Wera installation. When one Wera requested a document from a remote archive, the remote Wera would make the necessary changes to the document before delivering it to the calling Wera. This gave the owner full control over what was delivered to the calling site, so the document could be treated in accordance with local policies rather than the policies of the calling site. The figure below illustrates the currently supported mapping between collections and archive nodes.
In the Wera installation of W1, the different collections indexed in NutchWax are mapped to the corresponding Wera installations W2 to Wn. When the timeline view on W1 encounters a resource located on a different node (e.g. the collection mapping points to the Wera installation of W2), it requests that resource from the Wera installation at W2. Wera at W2 fetches the resource from its Retriever and makes the necessary changes to the file before delivering it to Wera at W1 (it inserts the link rewriter or rewrites the file server side). When Wera at W1 receives this file, it does an additional rewrite so that the links point to itself rather than to W2.
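The two-stage rewriting between nodes can be pictured with the sketch below. It is not WERA's rewriter: the regular expression, the `?url=` scheme, the function names and the node URLs are assumptions made for illustration, and matching HTML attributes with a regular expression is a deliberate simplification.

```python
import re
from urllib.parse import quote

# Simplified matcher for double-quoted href/src attributes.
LINK_ATTR = re.compile(r'(href|src)="([^"]*)"')


def rewrite_links(html, wera_base):
    """Owner-side rewrite: route every link through the given Wera installation."""
    def replace(match):
        attr, url = match.group(1), match.group(2)
        encoded = quote(url, safe="")
        return f'{attr}="{wera_base}?url={encoded}"'

    return LINK_ATTR.sub(replace, html)


def repoint_links(html, from_base, to_base):
    """Caller-side rewrite: make links that point at from_base point at to_base."""
    def replace(match):
        attr, url = match.group(1), match.group(2)
        if url.startswith(from_base):
            url = to_base + url[len(from_base):]
        return f'{attr}="{url}"'

    return LINK_ATTR.sub(replace, html)


# Hypothetical node addresses standing in for W1 and W2.
W1 = "http://w1.example/wera"
W2 = "http://w2.example/wera"

if __name__ == "__main__":
    original = '<a href="http://example.org/page">x</a>'
    from_w2 = rewrite_links(original, W2)     # done by the owner, W2
    at_w1 = repoint_links(from_w2, W2, W1)    # additional rewrite at W1
    print(from_w2)
    print(at_w1)
```

The point of the split is that the owning node decides how the document is altered, while the calling node only redirects the already rewritten links to itself.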
In a real-life, large-scale web archive where the ARC files are distributed across tens or hundreds of hosts, it will not be practical to set up one Wera installation for each host. A better solution would be to introduce communication between the different retrievers, or to have one front-end retriever interfacing with all the other retrievers within one archive. This has to be added in a later release of Wera.
As long as there are institutions using WERA, and these institutions see a need for fixing bugs and adding functionality, WERA will evolve. Of course, the actual work put into it will depend on the resources available at these institutions. We also hope that future enhancements of WERA will be funded, or partly funded, by IIPC, as was the case with the work done to enable release 0.4.0 of WERA (and NutchWax).
The most important requirement for a future release of WERA will be to support retrieval from several web archive hosts through one single ARC retriever interface. In addition, we need to address the remaining bugs that didn't make it into the 0.4.0 release (handling of redirects and better handling of frames). There are also a few registered requests for enhancements that need attention, one of them being the advanced search interface.
One of the main complaints from users has been that WERA required the user to install and set up Tomcat, Apache + PHP and Perl + a number of CPAN modules. The dependency on Perl was removed long ago, but WERA still requires Tomcat (for the Java ARC retriever) and Apache (for the PHP web applications for searching and navigating). Over time, we would like WERA to move completely to Java, both to simplify installation, setup and maintenance and to improve the chances of getting users involved in the further development of WERA. Fortunately, the move to Java may be done gradually because WERA is modular and HTTP is used to communicate between the different modules. The work of porting WERA to Java should be coordinated with the work done on wayback, to avoid implementing the same functionality twice.
We strongly encourage users of WERA/NutchWax to contribute by submitting bugs and RFEs, as well as providing feedback on the usefulness of the tools.