WERA Manual

Sverre Bang

$Date: 2006-03-20 14:51:53 +0000 (Mon, 20 Mar 2006) $


Table of Contents

1. Overview
2. Using WERA
2.1. Searching
2.2. Navigating
3. Installation
3.1. Obtaining WERA
3.2. System Requirements
3.3. Installing WERA using installer
3.4. Installing WERA manually
3.5. Configuring
3.6. Proxy support (EXPERIMENTAL)
4. Troubleshooting
4.1. Testing the Retriever

1. Overview

WERA (Web ARchive Access) is a freely available archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections. It also offers full-text search and easy navigation between different versions of a web page.

WERA is based on, and replaces, the NwaToolset. It was built using PHP and Java and makes extensive use of open standards such as HTTP and XML for communication between the different parts of the system.

A web archive may consist of a large number of web documents, and also of several versions of the same web document (i.e. documents that were downloaded from the same URL at different points in time). Potential users of WERA include anyone who has a web archive. Examples of such users are:

  • National Libraries or other organisations collecting parts of the internet for long term preservation.

  • Companies, organisations or private persons keeping a historical collection of their own web site and/or intranet.

Note that in the following text an archived file and a web page are not necessarily the same thing. What the user experiences as one web document may consist of several archived files (e.g. a web page that comprises an HTML file and its inline images).

To use WERA for searching, browsing and navigating your archived web documents you will need, in addition to a web browser with JavaScript enabled, the following components:

  • A Search Engine which holds a full-text index of the archived web documents. Currently the Jakarta Nutch/Lucene based NutchWAX search engine is supported.

  • A Document Retriever which serves as the interface between the Access module and the web archive. The Document Retriever delivers archived files and associated metadata to WERA upon request.

Figure 1. Wera and friends

A key requirement for the web archive is that the contents of the web documents are stored unaltered and that a metadata set consisting of at least the original URL, timestamp and MIME type of each archived file is available.

Note

For more on the workings and architecture of WERA, see the article "What is Wera?".

2. Using WERA

WERA provides the user with interfaces for searching, browsing and navigating the archived web pages.

When the user submits a query, WERA uses the search engine to find the archived files whose text satisfies the query. When the user asks for a specific URL, WERA returns the archived file with that particular URL (e.g. the archived file originally downloaded from the URL http://www.nb.no/index.html). Before the file is delivered to the user's browser, a piece of JavaScript is inserted into it so that the browser rewrites the inline links and references to point into the archive rather than out to the Internet.

The resulting web page is presented with a timeline at the top and the web document below it. The timeline queries the index for all archived versions of the web page and displays the timestamps graphically along the line.

2.1. Searching

Searching a web archive through WERA resembles using an Internet search engine like Google. An example of the WERA search interface and result list is shown in figure 2.

Figure 2. Search Result

The results are grouped by site. To view the results from a specific site click the link More from this site. The resulting output is shown in figure 3.

To display all hits (no grouping) click the link Show All (bottom, left - figure 2). The resulting output is shown in figure 4.

Figure 3. Search Result - More from this site

Figure 4. Search Result - Show all

Please note that in both Show all and More from this site the different versions of a given URL are shown as one hit.

Clicking the Timeline link in the Search result page will take the user to the Timeline page with the most recent versions displayed.

Clicking the Overview link of a specific hit will display the dates of all versions found for the chosen URL (the overview does not indicate which of these versions actually matched the original query).

Figure 5. Overview

Clicking one of the links in the Overview page will take the user to the Timeline view with the chosen version displayed.

2.2. Navigating

Figure 6. Timeline

When navigating from the Overview or the result list of a search interface, the URL of the chosen version is passed along and shown in the URL field. A URL may also be entered manually.

Navigation between the different versions is done by directly clicking a specific point on the timeline, or by using the arrows first, previous, next and last.

When entering the timeline view the resolution is set to auto. This means that the timeline automatically drills down to the resolution needed to display single versions along the line. The Auto checkbox may be unchecked in order to manually choose the resolution (choosing a different resolution when in auto also disables auto resolution).
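The idea behind auto resolution can be sketched in a few lines of Python. This is only an illustration and not WERA's own code: it assumes 14-digit timestamps of the form YYYYMMDDhhmmss (the format of the archival_time field shown in the Troubleshooting section), and the resolution names and the function auto_resolution are hypothetical.

```python
# Illustrative sketch only; the resolution names below are assumptions,
# not the labels used by the WERA timeline.
# Each resolution corresponds to a prefix of a YYYYMMDDhhmmss timestamp.
RESOLUTIONS = [
    ("years", 4),
    ("months", 6),
    ("days", 8),
    ("hours", 10),
    ("minutes", 12),
    ("seconds", 14),
]

def auto_resolution(timestamps):
    """Return the coarsest resolution at which no two versions share a slot."""
    unique = set(timestamps)
    for name, prefix_len in RESOLUTIONS:
        slots = {ts[:prefix_len] for ts in unique}
        if len(slots) == len(unique):
            return name
    # Unreachable for 14-digit timestamps, kept as a safe fallback.
    return RESOLUTIONS[-1][0]

if __name__ == "__main__":
    print(auto_resolution(["20041102080756", "20041215080000"]))
```

Two versions harvested in different months of the same year collide at the year level, so the sketch drills down one step further before each version gets a slot of its own.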

Checking the Metadata checkbox of the timeline will bring up metadata for the chosen URL/Time in the lower part of the timeline view (see figure below).

Figure 7. Timeline - Metadata Viewer

It is also possible to perform a search from the Timeline view by typing in a query term and pressing Go.

3. Installation

3.1. Obtaining WERA

The latest version of WERA may be downloaded from the archive-access files pages at SourceForge.

3.2. System Requirements

WERA has been tested on different builds of RedHat, Fedora and Suse Linux. There is no reason to believe that the system will not work on other Linux/Unix distributions. The following components are required:

  • A JVM

  • Apache HTTP server with PHP 4.3 or 4.4 (make sure that XML support is enabled; see below for details). WERA will NOT work properly with PHP 5, because of the new object model in PHP 5.

    If PHP is not installed, the quickest solution may be to install XAMPP, which also has XML support enabled.

    PHP XML support:

    XML support is needed by WERA to handle the search results returned from the NutchWAX search engine. To verify that XML support is enabled in PHP, put the following text in a PHP file (e.g. info.php) and save it in the Apache DocumentRoot directory:

    <?php phpinfo(); ?>

    Open http://<yourhost>/info.php in a browser and check that PHP has not been compiled with --disable-xml.

  • Tomcat servlet container (http://jakarta.apache.org/tomcat/index.html). The arcretriever web app has been tested on v.5.0.27 and 5.0.28 as well as in 5.5.9.

  • NutchWAX, a bundling of Nutch and extensions for searching Web Archive Collections (WACs). For how to install NutchWAX, see NutchWAX Getting Started.

    Note

    When indexing, make sure you invoke the NutchWAX indexer (indexarcs.sh) with the -n option. Otherwise NutchWAX will remove all duplicate URLs from the index, and using WERA against such an index will give only one version per URL on the WERA timeline.

3.3. Installing WERA using installer

The Java-based installer will install and configure the WERA PHP webapp and the arcretriever application.

  • Download wera-x-y-z-installer.tar.gz from sourceforge.

  • Unpack the gzipped tarball in a temporary directory on the host where you want wera installed.

  • Invoke the installer using java -jar wera-x-y-z-installer.jar.

  • Follow the on-screen instructions.

The installer will configure WERA (and the arcretriever) in accordance with the input you provide during the installation process. See the section on manual installation to view and change these settings manually (e.g. if NutchWAX and/or your ARC file collection reside on different hosts than WERA).

If the machine you are installing on does not have X installed, or if you are invoking the installer over ssh and X port forwarding is not working properly, the installer should fall back to text mode. If this fails, try the manual install procedure described in the following section.

3.4. Installing WERA manually

To install WERA manually do the following:

  • Download wera-x-y-z.tar.gz from sourceforge. Untar and gunzip the bundle. Let the resultant directory be WERA_HOME (e.g. wera-x-y-z).

  • Move $WERA_HOME/webapps/wera into the Apache document root directory -- HTDOCS -- on the host where you want the WERA application to run.

  • Move the file arcretriever.war from $WERA_HOME/webapps/wera to the webapps directory of the tomcat installation of the host where your ARC-files reside (i.e. $TOMCAT_HOME/webapps).

    You must next configure the arcretriever, telling it where the directory of ARCs that it is to retrieve from resides. The configuration is in the WEB-INF/web.xml file. Depending on your Tomcat configuration, Tomcat will usually unpack the arcretriever.war file once the webapp has been deployed. If so, shut down Tomcat, remove arcretriever.war (leaving the arcretriever directory in place), edit the arcretriever/WEB-INF/web.xml file to set the full path to the arcdir, and then restart Tomcat. If Tomcat does not unpack the WAR file, you will have to do it yourself: move the WAR file out from under Tomcat and use the Java jar command to unpack it.

        % cd /tmp
        % mkdir arcretriever
        % cd arcretriever
        % cp $WERA_HOME/webapps/arcretriever.war .
        % $JAVA_HOME/bin/jar xf arcretriever.war
        % rm arcretriever.war
        (EDIT WEB-INF/web.xml. Set 'arcdir' param-value to full path to arcs.)
        % cd ../
        % mv arcretriever $TOMCAT_HOME/webapps
        % $TOMCAT_HOME/bin/shutdown.sh
        % $TOMCAT_HOME/bin/startup.sh
        

  • Edit the file HTDOCS/wera/lib/config.inc (see section 3.5, Configuring, for details).

3.5. Configuring

Settings for WERA can be found in the file HTDOCS/wera/lib/config.inc. Edit this file in order to configure WERA for your environment. Parameters to adapt:

Table 1. Settings in config.inc

$conf_debug

In order to have WERA produce some debug output, set to 1

In the Search and Overview pages the query request to NutchWax and the result set returned will be displayed.

In the Timeline view the timeline data is printed as an HTML comment (view source to see it).

$conf_rootpath = "/opt/lampp/htdocs/wera";

Change this so that it corresponds with your environment, i.e. HTDOCS/wera (you may of course rename the extracted wera directory, or place it further down in the directory structure).

$conf_searchengine_url = "http://localhost:8080/nutchwax/opensearch";

Open the URL http://<nutchwaxhost>:<port>/nutchwax/ and click the RSS icon. The URL of the resulting page is the value to enter as $conf_searchengine_url (do not include the query part, i.e. the ? and everything following it). If NutchWAX is installed on the same host as WERA and Tomcat is serving on port 8080, the default setting should work.

$document_retriever = "http://localhost:8080/arcretriever/arcretriever";

Change the host name and port to point to the Tomcat installation on the host where your ARC files reside.

$conf_http_host = "http://localhost/wera";

Change localhost to the host name of the machine where you are installing WERA. Add the port number if different from 80 (<hostname>:<port>). If you renamed the wera directory or unpacked it further down relative to HTDOCS, update this parameter accordingly.

$conf_url_canonize_rules_immediate = "removefragment|userinfo|sessionids|querystrprefix";
$conf_url_canonize_rules_try = "addwww|lowercase|stripwww";

URI canonicalization rules

Change these according to the canonicalization rules used during harvesting. A rule may be applied immediately, i.e. before the initial exacturl query, or as a try-rule. A try-rule is applied to the URL and the exacturl query is then repeated. If the query does not return a valid result, the next try-rule is applied and the query repeated, until the rule list is exhausted. The following rules are implemented:

  • lowercase - If uppercase characters in url, lowercase the url

  • stripindex - Strip trailing index.* from url

  • userinfo - Strip userinfo from url (e.g. http://user:pass@nwa.nb.no/ becomes http://nwa.nb.no/)

  • addwww - If no 'www.' as first part of host and no hits, insert 'www.'

  • addwwwaddslash - Same as addwww but combines addslash

  • stripwww - If 'www.' as first part of host and no hits, strip 'www.'

  • stripwwwaddslash - Same as stripwww but combines addslash

  • querystrprefix - Fix up the question mark that leads off the query string

  • sessionids - If any known session id's present in url, strip out the session id's. Removes JSESSIONID, ASPSESSIONID, PHPSESSID, and sid session ids.

  • removefragment - Removes fragment identifiers used to link to anchor within a page. Currently Wera does not support fragment ids, so this should always be enabled (rules_immediate).

  • addslash - Add trailing slash to the url (if not already present).

Please note that every try rule potentially adds one extra query request to the search engine.

The addwww and stripwww rules are mutually exclusive (if one is applied, the other is not). The addwwwaddslash and stripwwwaddslash are also mutually exclusive.

The rules are applied in the order specified in the config. To remove a rule, simply delete it from the relevant $conf_url_canonize_rules_* setting. To disable all rules, replace the list with an empty string, e.g. $conf_url_canonize_rules_try = "". If you suspect that a specific immediate rule doesn't fit your archive, consider deleting it or moving it to the try-rule list.
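The interplay between immediate rules and try-rules can be sketched as follows. This is a minimal Python illustration and not WERA's PHP implementation: only a subset of the rules is covered, the lookup callback stands in for the exacturl query against the search engine, and the function names resolve and split_rules are hypothetical. It also assumes that try-rules accumulate (each one is applied on top of the previous result), which the text above does not state explicitly.

```python
from urllib.parse import urlsplit, urlunsplit

def removefragment(url):
    # Drop the "#anchor" part of the URL.
    return urlunsplit(urlsplit(url)._replace(fragment=""))

def userinfo(url):
    # Drop "user:pass@" from the host part.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.rpartition("@")[2]))

def lowercase(url):
    return url.lower()

def addwww(url):
    parts = urlsplit(url)
    if parts.netloc.lower().startswith("www."):
        return url
    return urlunsplit(parts._replace(netloc="www." + parts.netloc))

def stripwww(url):
    parts = urlsplit(url)
    if not parts.netloc.lower().startswith("www."):
        return url
    return urlunsplit(parts._replace(netloc=parts.netloc[4:]))

RULES = {f.__name__: f
         for f in (removefragment, userinfo, lowercase, addwww, stripwww)}
# addwww and stripwww are mutually exclusive.
EXCLUSIVE = {"addwww": "stripwww", "stripwww": "addwww"}

def split_rules(rule_list):
    # "a|b|c" -> ["a", "b", "c"]; an empty string disables all rules.
    return [name for name in rule_list.split("|") if name]

def resolve(url, immediate, try_rules, lookup):
    """Return the first URL variant accepted by lookup(), or None."""
    for name in split_rules(immediate):
        url = RULES[name](url)
    if lookup(url):  # the initial exact-URL query
        return url
    applied = set()
    for name in split_rules(try_rules):
        if EXCLUSIVE.get(name) in applied:
            continue  # the opposite rule has already been applied
        candidate = RULES[name](url)
        if candidate == url:
            continue  # rule changed nothing, so no extra query is needed
        applied.add(name)
        url = candidate
        if lookup(url):  # one extra query per try-rule that changed the URL
            return url
    return None

if __name__ == "__main__":
    archive = {"http://www.nb.no/index.html"}  # stand-in for the index
    print(resolve("http://user:pass@NB.no/index.html#top",
                  "removefragment|userinfo", "addwww|lowercase|stripwww",
                  archive.__contains__))
```

In this sketch every try-rule that changes the URL costs one additional lookup, which is why a long try-rule list can slow down requests for URLs that are not in the archive.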

$conf_url_canonize_debug_on

If set to true, the rules applied will be displayed below the timeline when no version is found, and when the metadata viewer is enabled.

There are other parameters to tweak as well, but for a simple setup of WERA the above settings should do. Information on setting other parameters and how to distribute the different components of WERA on different hosts will be provided in a later release.

3.6. Proxy support (EXPERIMENTAL)

The JavaScript inserted by WERA before the HTML page is delivered to the user's browser does not catch all links. To work around this, the web server hosting WERA can be set up as a proxy server, so that all requests for hosts other than the WERA host are redirected back to WERA. The user will, of course, have to change the browser's proxy settings so that all requests go to the WERA host.

To enable this functionality for an XAMPP Apache installation save the text below in the file <$APACHE_INSTALLDIR>/etc/extra/httpd-wera.conf and add the line Include etc/extra/httpd-wera.conf to the Apache configuration file (httpd.conf). Make sure you change the hostname in httpd-wera.conf from example.com to the hostname of your server.

# WERA Proxy test (experimental)
ProxyRequests On
ProxyVia On
<Proxy *>
Order deny,allow
Allow from all
</Proxy>

RewriteEngine on
#RewriteLog logs/rewrite_log
#RewriteLogLevel 9

RewriteCond   %{HTTP_HOST}    !^(example\.com) 
RewriteRule   (.*)            http://example.com/wera/urlProxyRedirect.php?url=$1

4. Troubleshooting

4.1. Testing the Retriever

In order to test the Retriever try accessing the following urls in a browser (or use wget [URL] from the command line):

  • http://<hostname>.<domainname>[:port]/<retriever>?reqtype=<reqtype>&aid=<aid>

Where retriever is the retriever script doing the retrieval, reqtype is the request type and the aid is the unique identifier (within the archive) for a harvested file. The getmeta request will return archived technical metadata for the file in question and the getfile request will return the archived file itself.

To find the aid of one particular document in your archive, open the URL http://<nutchwaxhost>:<port>/nutchwax/ and execute a query. Scroll down to the RSS icon and click it. For one particular result, copy the values of nutch:arcoffset and nutch:arcname and build the aid: <arcoffset>/<arcname><conf_aid_suffix>
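Assembling the aid and the retriever request can be sketched in Python as follows. The helper names build_aid and retriever_url are hypothetical. Splitting the example aid into an arcname and a ".arc.gz" suffix is an assumption made for illustration, so use the suffix from your own configuration.

```python
def build_aid(arcoffset, arcname, aid_suffix=""):
    """Combine the values copied from a search result into an aid.

    Follows the pattern <arcoffset>/<arcname><conf_aid_suffix>.
    """
    return f"{arcoffset}/{arcname}{aid_suffix}"

def retriever_url(base, aid, reqtype):
    """Build a retriever request; reqtype is e.g. "getmeta" or "getfile"."""
    return f"{base}?aid={aid}&reqtype={reqtype}"

if __name__ == "__main__":
    # Assumption: the suffix is ".arc.gz" and the arcname excludes it.
    aid = build_aid(5160509, "IAH-20041102080031-00007-test1.nb.no", ".arc.gz")
    print(retriever_url("http://localhost:8080/arcretriever/arcretriever",
                        aid, "getmeta"))
```

The resulting URL can be pasted into a browser or passed to wget.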

An example of the result of the getmeta request http://localhost:8080/arcretriever/arcretriever?aid=5160509/IAH-20041102080031-00007-test1.nb.no.arc.gz&reqtype=getmeta is given below.

<?xml version="1.0" encoding="UTF-8"?>
  <retrievermessage>
  <head>
  <reqtype>getmeta</reqtype>
  <aid>5160509/IAH-20041102080031-00007-test1.nb.no.arc.gz</aid>
  </head>
  <body>
    <metadata>
      <url>http://www.nla.gov.au/raam/</url>
      <archival_time>20041102080756</archival_time>
      <last_modified_time>20041102080756</last_modified_time>
      <content_length></content_length>
      <contenttype>
        <type>text/html</type>
        <charset></charset>
      </contenttype>
      <filestatus>online</filestatus>
      <filestatus_long></filestatus_long>
      <content_checksum>ZBYZIFD6PK5ZHCUQGTKZSZ2LJMZUD554</content_checksum>
      <http-header>HTTP/1.1 200 OK
       Date: Tue, 02 Nov 2004 08:07:57 GMT
       Server: Apache/1.3.29 (Unix) PHP/4.1.2 mod_perl/1.27 mod_jk/1.2.0 mod_ssl/2.8.16 OpenSSL/0.9.6l
       X-Powered-By: PHP/4.1.2
       Connection: close
       Content-Type: text/html</http-header>
    </metadata>
  </body>
</retrievermessage>
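When testing from a script rather than a browser, a getmeta response like the one above can be checked with Python's standard XML parser. This is a minimal sketch: the function name parse_getmeta and the choice of fields are illustrative, and the element names are taken from the example response.

```python
import xml.etree.ElementTree as ET

def parse_getmeta(xml_bytes):
    """Extract a few fields from a getmeta response (see example above)."""
    root = ET.fromstring(xml_bytes)
    meta = root.find("body/metadata")
    if meta is None:
        raise ValueError("response contains no <metadata> element")
    return {
        "aid": root.findtext("head/aid"),
        "url": meta.findtext("url"),
        "archival_time": meta.findtext("archival_time"),
        "mime_type": meta.findtext("contenttype/type"),
        "filestatus": meta.findtext("filestatus"),
    }

if __name__ == "__main__":
    # Shortened copy of the example response, for demonstration only.
    sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<retrievermessage>
  <head><reqtype>getmeta</reqtype><aid>5160509/IAH-20041102080031-00007-test1.nb.no.arc.gz</aid></head>
  <body><metadata>
    <url>http://www.nla.gov.au/raam/</url>
    <archival_time>20041102080756</archival_time>
    <contenttype><type>text/html</type><charset></charset></contenttype>
    <filestatus>online</filestatus>
  </metadata></body>
</retrievermessage>"""
    print(parse_getmeta(sample))
```

The original URL, timestamp and MIME type are the minimum metadata set that WERA requires (see the Overview), so a response in which any of them is missing or empty suggests a problem on the retriever or archive side.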

To retrieve the actual archived resource (in this case http://www.nla.gov.au/raam/) from the arcretriever change getmeta to getfile in the request described above.