HOWTO: Configure Wayback to use a NutchWAX index

Here is how to configure the open source wayback so that it uses a NutchWAX index and can be used to render NutchWAX search results. The wayback-NutchWAX bridge is currently experimental.

The instructions below require NutchWAX 0.8.0 and a wayback of at least version 0.7.0-200612132241.

Thanks to Maximilian Schoefmann for his contributions to getting this bridge working.

NutchWAX

The main change you need to make in NutchWAX is to have it go to your wayback install to render pages. This usually takes some combination of setting wax.host in your hadoop-site.xml to point at your wayback install and editing search.jsp so that the collection name is no longer included in the rendering URL. For example, to have NutchWAX use a wayback that is deployed in the same servlet container (i.e. both the NutchWAX and wayback WAR files are running in the same Tomcat, Jetty, JBoss, etc.) with wayback deployed to the context /wayback, make the following changes:

  • Add to your hadoop-site.xml the following property:
            <property>
              <name>wax.host</name>
              <value>localhost:8080</value>
              <description>
              Used at search time by the nutchwax webapp.
             
              The name of the server hosting collections.
              Used by the webapp when building URLs that point to the page
              renderer (e.g. wayback).
              </description>
            </property>
            
  • Make the following change to search.jsp:
    $ diff ~/workspace/nutchwax/src/web/search.jsp search.jsp
    211c213
    <     String archiveCollection = detail.getValue("collection");
    ---
    >     String archiveCollection = "wayback"; // detail.getValue("collection");

After making the above changes, redeploy NutchWAX. Check that the URLs you click on in the search results look something like http://localhost:8080/wayback/200612121212/http://archive.org (the date and URL will be different in your case, but the prefix should match).

If the wayback is running elsewhere, adjust the wax.host in the above accordingly.
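
As a minimal sketch of that adjustment, the property below points NutchWAX at a wayback on another machine. The host and port wayback.example.org:8080 are hypothetical placeholders, not values from this HOWTO; substitute the address of your own wayback install.

```xml
<!-- Sketch only: wayback.example.org:8080 is a hypothetical host:port.
     Replace it with the address of your own wayback install. -->
<property>
  <name>wax.host</name>
  <value>wayback.example.org:8080</value>
</property>
```

With the search.jsp change above still in place, the links in search results should then presumably begin with http://wayback.example.org:8080/wayback/ rather than http://localhost:8080/wayback/.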

You might also consider changing the property wax.index.redirects from its default value of false to true at indexing time. This makes NutchWAX index redirects. The wayback can follow redirects automatically.
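
A sketch of that override follows. It assumes the property is set in the same hadoop-site.xml used for wax.host above, which this HOWTO does not state explicitly; check where your NutchWAX install reads its indexing configuration.

```xml
<!-- Assumption: the override lives in hadoop-site.xml alongside wax.host.
     The default value is false. -->
<property>
  <name>wax.index.redirects</name>
  <value>true</value>
</property>
```

Because the setting applies at indexing time, it should be in place before the indexing job runs.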

wayback

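Below are the changes made to the wayback web.xml to make it use a NutchWAX index. The diff disables wayback indexing of ARCs (resourcestore.autoindex is set to 0), extends the comment around the Local-BDB ResourceIndex options so that the PipelineFilter is commented out as well, and enables the Remote-Nutch ResourceIndex options by removing the comment markers around them: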

--- /home/stack/workspace/wayback/src/webapp/WEB-INF/web.xml	2006-12-12 14:05:28.000000000 -0800
+++ wayback/WEB-INF/web.xml	2006-12-12 15:03:30.000000000 -0800
@@ -415,7 +415,7 @@
 
     <context-param>
         <param-name>resourcestore.autoindex</param-name>
-        <param-value>1</param-value>
+        <param-value>0</param-value>
         <description>
             If this is set to '1', then a background thread is launched that
             detects new ARC files appearing in arcpath. New ARCs are indexed,
@@ -582,7 +582,6 @@
 an optional index update thread, which will scan a directory for new index data,
 in CDX format, and will automatically add new index records to the index.This
 is the default index storage implementation.
--->
 	
     <filter>
         <filter-name>PipelineFilter</filter-name>
@@ -680,6 +679,7 @@
         <description>Maximum number of results to return from the resourceindex</description>
     </context-param>
 
+-->
 <!-- END OF  Local-BDB ResourceIndex  OPTIONS -->
 
 <!-- START OF  Local-CDX ResourceIndex  OPTIONS
@@ -726,7 +726,6 @@
 
 These options are not used by default.
 -->
-<!--
 	
     <context-param>
         <param-name>resourceindex.classname</param-name>
@@ -745,7 +744,7 @@
         <param-value>1000</param-value>
         <description>Maximum number of results to return from the resourceindex</description>
     </context-param>
-    -->
+
 <!-- END OF  Remote-Nutch ResourceIndex  OPTIONS -->
 
 <!-- START OF  Remote-BDB ResourceIndex  OPTIONS

After making the above changes, redeploy your wayback.