Here is how to configure the open source wayback so that it uses a NutchWAX index and can be used to render NutchWAX search results.
The wayback-NutchWAX bridge is currently experimental.
The instructions below require NutchWAX 0.8.0 and a wayback of at least version 0.7.0-200612132241.
Thanks to Maximilian Schoefmann for his contributions getting this bridge working.
The main change you need to make in NutchWAX is to point it at your wayback install for rendering pages. This will usually be some combination of setting wax.host in your hadoop-site.xml to point at your wayback install and editing search.jsp so that the collection name is no longer included in the rendering URL. For example, to have NutchWAX use a wayback that is deployed in the same servlet container (i.e. both the NutchWAX and wayback WAR files are running in the same Tomcat, Jetty, JBoss, etc.) with wayback deployed to the context /wayback, add the following property to hadoop-site.xml:
<property>
  <name>wax.host</name>
  <value>localhost:8080</value>
  <description>
    Used at search time by the nutchwax webapp. The name of the server
    hosting collections. Used by the webapp conjuring URLs that point to
    page renderor (e.g. wayback).
  </description>
</property>
Then make the following edit to search.jsp:

$ diff ~/workspace/nutchwax/src/web/search.jsp search.jsp
211c213
< String archiveCollection = detail.getValue("collection");
---
> String archiveCollection = "wayback"; // detail.getValue("collection");
With these changes in place, search result links should point at URLs of the following form:

http://localhost:8080/wayback/200612121212/http://archive.org

(The date and URL will be different in your case, but the prefix should match.)
If the wayback is running elsewhere, adjust wax.host in the above accordingly.
You might also consider changing the value of the property wax.index.redirects from its default of false to true at indexing time. This makes NutchWAX index redirects, which the wayback can follow automatically.
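As a rough sketch, the redirect setting could be overridden in the same way as wax.host. This assumes that wax.index.redirects is read from hadoop-site.xml like the other NutchWAX properties; check your NutchWAX configuration before relying on it. Only the property name and value below come from the text above.

```xml
<!-- Assumed to go in hadoop-site.xml, next to the wax.host property. -->
<property>
  <name>wax.index.redirects</name>
  <!-- The default is false; true asks NutchWAX to index redirects. -->
  <value>true</value>
</property>
```

Because the setting applies at indexing time, it would need to be in place before the index is built; changing it afterwards should not affect an existing index.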
Below are the changes made to the wayback web.xml to make it use a NutchWAX index. The diff disables wayback indexing of ARCs, comments out the Local-BDB ResourceIndex options (including the PipelineFilter), and enables the Remote-Nutch ResourceIndex options:
--- /home/stack/workspace/wayback/src/webapp/WEB-INF/web.xml 2006-12-12 14:05:28.000000000 -0800
+++ wayback/WEB-INF/web.xml 2006-12-12 15:03:30.000000000 -0800
@@ -415,7 +415,7 @@
 <context-param>
 <param-name>resourcestore.autoindex</param-name>
- <param-value>1</param-value>
+ <param-value>0</param-value>
 <description>
 If this is set to '1', then a background thread is launched that detects new ARC files appearing in arcpath. New ARCs are indexed,
@@ -582,7 +582,6 @@
 an optional index update thread, which will scan a directory for new index data, in CDX format, and will automatically add new index records to the index.This is the default index storage implementation.
--->
 <filter>
 <filter-name>PipelineFilter</filter-name>
@@ -680,6 +679,7 @@
 <description>Maximum number of results to return from the resourceindex</description>
 </context-param>
+-->
 <!-- END OF Local-BDB ResourceIndex OPTIONS -->
 <!-- START OF Local-CDX ResourceIndex OPTIONS
@@ -726,7 +726,6 @@
 These options are not used by default.
 -->
-<!--
 <context-param>
 <param-name>resourceindex.classname</param-name>
@@ -745,7 +744,7 @@
 <param-value>1000</param-value>
 <description>Maximum number of results to return from the resourceindex</description>
 </context-param>
- -->
+
 <!-- END OF Remote-Nutch ResourceIndex OPTIONS -->
 <!-- START OF Remote-BDB ResourceIndex OPTIONS
After making the above changes, redeploy your wayback.