NutchWAX 0.11.0-SNAPSHOT API

NutchWAX ("Nutch + Web Archive eXtensions") searches web archive collections.

Packages
org.archive.access.nutch Provides mapreduce jobs to import ARCs and plugins that add 'collection' and ARC repository location fields to the nutch index.
org.archive.access.nutch.jobs  
org.archive.access.nutch.mapred  

 

NutchWAX ("Nutch + Web Archive eXtensions") searches web archive collections. The Web Archive eXtensions (WAX) include adaptation of the Nutch fetcher step to go against web archives rather than crawl the open net -- adaptation currently does Internet Archive ARC files only -- and plugins to add extra fields to the index that return an Archive Records' location in the repository, its collection name, etc.

Getting Started

NutchWAX runs on hadoop. The general pattern is that you ask the hadoop platform to run one of the bundled mapreduce jobs from the NutchWAX jar. Whether the job runs on one machine or many, and whether it uses local disk or the hadoop distributed file system, is a matter of hadoop configuration.

Requirements

  1. Linux: NutchWAX may run on systems other than Linux, but Linux is the only OS we've tested it on.
  2. Java: NutchWAX requires a 1.5.x or later JDK.
  3. Servlet Container: The NutchWAX search war has been tested working in versions 5.0.28 and 5.5.x of tomcat.
  4. Hadoop: hadoop is the platform atop which we run indexing jobs. Hadoop is an open source implementation of Google mapreduce and Google GFS. NutchWAX 0.10.0 requires Hadoop 0.9.2; it will not work with later versions. Hadoop has its own set of requirements: see Requirements about midway down the Hadoop API page. Hadoop binaries are available for download off the apache site. The NutchWAX README.txt lists patches applied to Hadoop for indexing on Internet Archive hardware.
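
A quick sanity check of the Java requirement (a minimal sketch; exact paths will vary with your install):

  %  java -version      # should report a 1.5.x or later JDK
  %  echo $JAVA_HOME    # hadoop-env.sh will also want this set (see NutchWAX Configuration below)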

Running NutchWAX in non-distributed, Standalone mode

In this section we'll index a few ARCs and then set up the NutchWAX war file to run queries against the produced index. We will run NutchWAX on a hadoop platform set to run non-distributed, all on a single box, with all hadoop functions performed in a single process using the local filesystem: i.e. Standalone operation.

Before you can begin indexing, you must first get your hadoop platform up and running. See the hadoop API Getting Started section for how to install and configure your hadoop platform. Review Standalone operation. Run the Grep example. Ensure your setup is error free by inspecting emissions on STDOUT and the contents of the hadoop logs subdirectory. Let your Hadoop install be located at ${HADOOP_HOME}.
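
If you have not yet run the Grep example, the sequence from the Hadoop Getting Started documentation looks roughly like the following (a sketch; the examples jar name varies with your Hadoop version, and the output directory must not already exist):

  %  cd ${HADOOP_HOME}
  %  mkdir input
  %  cp conf/*.xml input
  %  bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
  %  cat output/*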

Download the NutchWAX binary from sourceforge (or from under Build Artifacts on the archive's continuous build server). Unpack the NutchWAX tar.gz bundle. Let the unbundled NutchWAX binary be at ${NUTCHWAX_HOME}.
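
For example (a sketch; the bundle file name here is hypothetical and will differ per release):

  %  cd /home/stack/tmp/nwtesting
  %  tar xzf nutchwax-0.11.0-SNAPSHOT.tar.gz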

The NutchWAX binary contains a NutchWAX jar and a NutchWAX war file. The jar is used at indexing time. The war is run in a servlet container to field search queries. The jar contains all code and supporting libraries needed to run six distinct mapreduce jobs: import, update, invert, index, dedup, and merge (plus a composite 'all' job and a 'class' utility, listed in the usage output below).

These jobs are permutations on Nutch operations, only in our case we've amended the nutch fetch step to instead import ARCs, and to the other steps we've added plugins that supply extra NutchWAX facility such as extra fields in the index.

Indexing

First, let's set up our environment variables for NutchWAX and HADOOP and list the NutchWAX usage (you run the NutchWAX jar in the same manner used above when running Grep out of the hadoop-*-examples.jar). After listing the general usage, we list the particular usage of the NutchWAX 'all' job.

  %  export HADOOP_HOME=/home/stack/tmp/nwtesting/hadoop-nightly/
  %  export NUTCHWAX_HOME=/home/stack/tmp/nwtesting/nutchwax
  %  ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar
        Usage: hadoop jar nutchwax.jar <job> [args]
        Launch NutchWAX job(s) on a hadoop platform.
        Type 'hadoop jar nutchwax.jar help <job>' for help on a specific job.
        Jobs (usually) must be run in the order listed below.
        Available jobs:
         import  Import ARCs.
         update  Update dbs with recent imports.
         invert  Invert links.
         index   Index segments.
         dedup   Deduplicate by URL or content MD5.
         merge   Merge segment indices into one.
         all     Runs all above jobs in order.
         class   Run the passed class's main.
  %  ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar help all
        Usage: hadoop jar nutchwax.jar all <input> <output> <collection>
        Arguments:
         input       Directory of files listing ARC URLs to import.
         output      Directory write indexing product to.
         collection  Collection name. Added to each resource.

Review the output. Now, let's index a couple of ARCs. Of note, inputs for mapreduce tasks are always directories, not files. To specify the ARCs to index, we pass a path to a directory containing a file that lists the ARCs to index.

Let's assume we want to index the ARCs 1.arc.gz, 2.arc.gz, and 3.arc.gz. First we make a file that lists the full path to each ARC to index. Here's how the file content might look:

/tmp/1.arc.gz
/tmp/2.arc.gz
/tmp/3.arc.gz
...assuming the ARCs are in /tmp. Let this file be /tmp/inputs/arcs.txt.
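
One way to generate such a listing (a sketch assuming the ARCs all sit in /tmp):

  %  mkdir -p /tmp/inputs
  %  ls /tmp/*.arc.gz > /tmp/inputs/arcs.txt
  %  cat /tmp/inputs/arcs.txt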

Note, you could also point to the ARCs using URLs, as in http://localhost/~stack/1.arc.gz, etc., assuming the ARC was in stack's published web directory on localhost. NutchWAX includes URL handlers for an rsync scheme that does a Runtime.exec of rsync, and a pseudo-S3 scheme for pulling ARCs out of Amazon's S3 simple storage service.
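
For illustration only, entries mixing schemes might look like the following (hypothetical hosts and paths; consult the handler documentation for the exact rsync and pseudo-S3 URL forms):

  http://localhost/~stack/1.arc.gz
  rsync://archivehost.example.org/arcs/2.arc.gz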

Now let's run all of the indexing steps in one go by passing the all directive to NutchWAX. Have the indexing steps store their output to a directory /tmp/outputs and let the collection name for this test indexing be test.

  % ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test

It will run for a while, stepping through each of the indexing stages (you might want to redirect the output to a log file). When done, you can inspect the /tmp/outputs directory. It should look like the following:

 % ls /tmp/outputs
        crawldb index indexes linkdb segments

Searching

Deploy the NutchWAX war file under a Servlet Container such as tomcat by copying it to the container's webapps folder. The Servlet Container will (usually -- unless configured not to) expand the deployed war file. After expansion, go into the expanded war directory -- it will be named for the war file absent the .war suffix -- and edit the hadoop-site.xml file in the WEB-INF/classes subdirectory. Add the following properties between the configuration elements:

<property>
  <name>searcher.dir</name>
  <value>${FULLPATH-OUTPUT-DIR-ON-LOCAL-FILESYSTEM}</value>
  <description>
  Used at search time by the nutchwax webapp.

  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.

  Set to an absolute path.  The alternative is having to start the
  container -- e.g. tomcat -- so its current working directory contains
  a subdirectory named 'searcher.dir'.
  </description>
</property>

<property>
  <name>wax.host</name>
  <value>${COLLECTIONS_HOST}</value>
  <description>
  Used at search time by the nutchwax webapp.
 
  The name of the server hosting collections.
  Used by the webapp conjuring URLs that point to page renderor
  (e.g. wayback).

  URLs are conjured in this fashion:

    ${wax.host}/COLLECTION/DATE/URL

  To override the COLLECTION obtained from the search result,
  add a path to wax.host: e.g. localhost:8080/web.
  </description>
</property>

...replacing ${FULLPATH-OUTPUT-DIR-ON-LOCAL-FILESYSTEM} with the full path to the output directory, e.g. /tmp/outputs, and ${COLLECTIONS_HOST} with the name of the host running an application that can render found web pages, such as the open-source wayback or WERA. If you do not want to edit your war in-situ under webapps, set the configuration before deploying by unpacking the war, making your changes, and then repacking the war file. See the tomcat-deployer application for a tool that helps do this and works with the tomcat container (recommended!).
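
A minimal tomcat deploy might look like the following (a sketch; it assumes the war is named nutchwax.war and tomcat lives at ${TOMCAT_HOME}):

  %  cp ${NUTCHWAX_HOME}/nutchwax.war ${TOMCAT_HOME}/webapps/
  ...wait for the container to expand the war, then edit the search-time configuration...
  %  vi ${TOMCAT_HOME}/webapps/nutchwax/WEB-INF/classes/hadoop-site.xml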

Now, browse to where your container is running -- usually on port 8080 -- and add the name of the webapp to the path: e.g. if the deployed webapp was named nutchwax, then browse to http://container-host:8080/nutchwax. You should see the NutchWAX query box. Try some queries with terms you know to be present in the indexed ARCs. If using tomcat, see your ${TOMCAT_HOME}/logs, particularly catalina.out, if the webapp does not deploy successfully or if no search results are returned.

Pseudo-distributed Mode

Now let's try running your indexing job in distributed mode using the Hadoop Distributed File System (HDFS or DFS) rather than local store. The most basic Distributed Operation mode is that described in the Pseudo-distributed Configuration section of the hadoop documentation, where all daemons -- the controlling job daemon, JobTracker; the controlling file system daemon, NameNode; the job slave, TaskTracker; and the data slave, DataNode -- run on a single machine. Ensure ssh is set up according to the hadoop instructions (i.e. passwordless ssh login works), that you've configured your hadoop-site.xml with the locations of the mapreduce and DFS head nodes, and that you've bootstrapped your DFS using the -format command.
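
A hedged sketch of the relevant hadoop-site.xml properties (standard Hadoop property names of this vintage; adjust hosts and ports to suit) and of the DFS format step:

<property>
  <name>fs.default.name</name>
  <value>localhost:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

  %  ${HADOOP_HOME}/bin/hadoop namenode -format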

To access DFS, use the DFS client: ${HADOOP_HOME}/bin/hadoop dfs. Try it. You will get a list of all commands for the DFS file system (They generally work like their UNIX file system counterparts).

Now, upload our file of ARCs to index, /tmp/inputs/arcs.txt, to DFS, into a DFS directory named inputs.

  % ${HADOOP_HOME}/bin/hadoop dfs -mkdir inputs
  % ${HADOOP_HOME}/bin/hadoop dfs -put /tmp/inputs/arcs.txt inputs
  % ${HADOOP_HOME}/bin/hadoop dfs -ls inputs

As we did in standalone mode above, run all of the indexing steps in one go by passing the 'all' directive to NutchWAX, but this time we'll be working against the distributed filesystem. The 'inputs' and 'outputs' directories in the command below refer to locations up on the distributed filesystem (because above we configured hadoop to use DFS instead of the local filesystem).

  % ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all inputs outputs test

Be aware that NutchWAX runs sub-optimally when all daemons are hosted on a single computer, but this mode is good for familiarizing yourself with distributed operation without the headache of multiple machines.

When done, you can inspect the DFS outputs directory. It should look like the following:

 % ${HADOOP_HOME}/bin/hadoop dfs -ls outputs
    060425 183147 parsing file:/home/stack/tmp/nwtesting/hadoop-nightly/conf/hadoop-default.xml
    060425 183147 parsing file:/home/stack/tmp/nwtesting/hadoop-nightly/conf/hadoop-site.xml
    060425 183147 No FS indicated, using default:localhost:9000
    060425 183147 Client connection to 127.0.0.1:9000: starting
    Found 5 items
    /user/stack/outputs/crawldb     <dir>
    /user/stack/outputs/index       <dir>
    /user/stack/outputs/indexes     <dir>
    /user/stack/outputs/linkdb      <dir>
    /user/stack/outputs/segments    <dir> 

While it's possible to run the war file against the content in DFS, the latency will annoy. It's better to copy the content needed to service the war application to local disk. Do the following:

  %  ${HADOOP_HOME}/bin/hadoop dfs -get outputs .

Set your webapp to point at the local directory as you did for standalone mode and restart.
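
For instance (a sketch; it assumes the copy landed at /home/stack/outputs and tomcat lives at ${TOMCAT_HOME}):

  ...point searcher.dir in the webapp's hadoop-site.xml at /home/stack/outputs...
  %  vi ${TOMCAT_HOME}/webapps/nutchwax/WEB-INF/classes/hadoop-site.xml
  %  ${TOMCAT_HOME}/bin/shutdown.sh
  %  ${TOMCAT_HOME}/bin/startup.sh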

Where to go next

See Fully-distributed Operation in the hadoop documentation for how to set Hadoop running on a cluster of more than a single host. Also, visit the tutorials and wiki pages up on the nutch and hadoop sites. In particular, a skim of the NutchHadoopTutorial should prove fruitful. While it is nutch-centric -- in particular it pivots on the ${NUTCH_HOME}/bin/nutch command, and its discussion of crawling does not apply -- the same general concepts are at the core of NutchWAX operation, and it presents them at a level of detail beyond that given in this overview.

NutchWAX Configuration

To configure your Hadoop install or to override the default configuration built into the NutchWAX jar or war, add your settings to the hadoop file hadoop-site.xml. If your configuration is an index-time setting, add it to the file at ${HADOOP_CONF_DIR}/hadoop-site.xml on all slaves (see ${NUTCHWAX_HOME}/bin for a primitive script to disperse the content of the ${HADOOP_CONF_DIR} directory across the cluster: amend it to suit your environment). If it's a search-time configuration, edit the war file and add your configuration to WEB-INF/classes/hadoop-site.xml. The content of this hadoop-site.xml file will always win out over settings in any other file. Also, be aware of ${HADOOP_CONF_DIR}/hadoop-env.sh. The contents of this file are sourced before any hadoop command is run. It defines the location of the logging directory, ${HADOOP_LOGS_DIR}, ${JAVA_HOME}, etc.

As for what the configuration options are, they are myriad. There is configuration applicable to nutch, to nutchwax, and to hadoop. For the hadoop set, see hadoop-default.xml; for the nutch set, see nutch-default.xml; and for nutchwax, see wax-default.xml. Regarding nutchwax, the template file hadoop-site.xml.template includes a short list of the more important properties. Out of the box, the NutchWAX configuration should work for indexing. The hadoop platform will need to be configured with the mapreduce and HDFS head nodes, the number of children per slave, the number of mappers and reducers to run, etc. For search time, the NutchWAX war file will need to know where to find the created index and how to write the search result URLs (see above under Searching).
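
For example, index-time overrides such as task counts belong in ${HADOOP_CONF_DIR}/hadoop-site.xml (a sketch using standard Hadoop property names; sensible values depend entirely on your cluster):

<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>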

Building NutchWAX from source

Requirements

To build from source, you will need to satisfy the above runtime requirements plus the following.

  1. Ant: Tested working with version 1.6.2.
  2. Maven: If you want to build distributions and the website, you'll need Maven 1.0.2.
  3. Nutch: For the NutchWAX 0.10.0 BRANCH, Nutch revision 492357 (see the NutchWAX README for details).

Check out NutchWAX (see Source Repository for how). As the checkout runs, subversion will fetch the version of nutch the NutchWAX trunk is pegged against into the ${NUTCHWAX_HOME}/third-party directory, using the svn:externals mechanism.
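
A checkout looks roughly like the following, where ${REPOSITORY_URL} stands in for the URL given on the Source Repository page:

   % svn checkout ${REPOSITORY_URL} nutchwax
   % ls nutchwax/third-party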

To build NutchWAX and its nutch dependency, run the default 'all' target:

   % cd ${NUTCHWAX_HOME}
   % ant all

This will generate the NutchWAX jar and war.

To build the NutchWAX site or distribution, run maven:

   % ${MAVEN_HOME}/bin/maven dist site:generate

Eclipse

See Eclipse for notes on how to set up your eclipse environment.


$Id: overview.html 1505 2007-02-20 21:13:09Z stack-sf $



Copyright © 2005-2007 Internet Archive. All Rights Reserved.