Regression Tests

Nutchwax has a suite of regression tests in ${NUTCHWAX}/src/regression. When run, they fetch the latest Nutchwax build from the continuous build server and download versions of Hadoop and Tomcat deployer. Hadoop then runs Nutchwax in standalone mode and indexes some test arcs, and Tomcat deployer uploads the Nutchwax war to a specified Tomcat server. Finally, a series of regression checks is run against the deployed Nutchwax.


  • 0.8 Nutchwax source or a later svn checkout
  • Java 1.5.x
  • A running Tomcat 5.5 or greater
  • Standard UNIX environment: sh, which, echo, basename, awk, gnu tar, getopt, lynx, wget, cut, dirname, zegrep.
NOTE: Later Debian releases (unstable), including Ubuntu 6.10, have /bin/sh linked to dash. The regression test scripts will not work with dash, and the Hadoop start scripts use source, which dash does not implement.
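
A quick pre-flight check along the following lines can catch a missing tool or a dash-linked /bin/sh before a run. This is only a sketch and not part of the regression suite: the function names missing_tools and sh_is_dash are made up for illustration, GNU tar is checked only under the name tar, and the readlink command is assumed to be available.

```shell
#!/bin/sh
# Pre-flight sketch; missing_tools and sh_is_dash are illustrative names,
# not scripts shipped in src/regression.

# Print each named tool that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed if the given path (default /bin/sh) is a symlink whose
# target name ends in "dash".
sh_is_dash() {
    case "$(readlink "${1:-/bin/sh}" 2>/dev/null)" in
        *dash) return 0 ;;
        *)     return 1 ;;
    esac
}

# Check the tools named in the requirements list.
missing=$(missing_tools sh which echo basename awk tar getopt lynx wget cut dirname zegrep)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing >&2
fi

if sh_is_dash; then
    echo "/bin/sh is linked to dash; the regression scripts will not work" >&2
fi
```

Both checks only print warnings to STDERR, so the sketch can be run on its own without touching the test setup.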


Get a copy of the Nutchwax source. Edit ${NUTCHWAX}/src/regression/nutchwax_test_config to match your local configuration. Run ${NUTCHWAX}/src/regression/ ${NUTCHWAX}/src/regression/nutchwax_test_config. For more verbose output, use the -v switch (recommended).


Parameters for the regression tests are read from a config file such as ${NUTCHWAX}/src/regression/nutchwax_test_config. This file is a fragment of a shell script. All the scripts in ${NUTCHWAX}/src/regression take the location of the config file as an argument. If it is omitted, the current directory is searched for nutchwax_test_config.
  • WORKING_DIR - This specifies the path all files will be written to.
  • NUTCHN - URL to cruise control continuous build server, no need to change.
  • HADOOP_VER, HADOOP_URL - Version of hadoop to use, may change between releases.
  • TOMCAT_VER, TOMCAT_DURL - Version of Tomcat deployer to use. It should be the same as the version of Tomcat you intend to deploy to.
  • cat > $WORKING_DIR/ - This is an inline copy of the file used for Tomcat. It is read by Tomcat deployer and parsed by some regression scripts. username and password should be changed to the username and password used to log into the management interface of your Tomcat server. url is the URL used to access the management interface of your Tomcat server. path and webapp can be set to anything. Any extra parameters specified will be read by Tomcat deployer.
  • ARCS - List of arcs to index. Extra arcs may be added separated by spaces. Extra arcs will be run through the same set of regression tests.
  • HADOOP_HEAPSIZE - Tuning var used for machines with low memory.
  • TAR - Used to specify a different version of tar (e.g. gtar).
  • ANT - Used to specify which ant to run.
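
For orientation, a config fragment covering some of the parameters above might look like the following sketch. Only the variable names come from the list. Every value (paths, version number, URLs, arc names, heap size) is a placeholder assumption and must be replaced with the values from your own checkout and servers. The Tomcat deployer settings and the inline properties block written under $WORKING_DIR are left out here.

```shell
# Sketch of a nutchwax_test_config fragment.
# All values below are hypothetical placeholders, not shipped defaults.

# Directory that every downloaded and generated file is written to.
WORKING_DIR=/tmp/nutchwax-regression

# Continuous build server URL (placeholder; the shipped value needs no change).
NUTCHN=http://builds.example.org/nutchwax/latest

# Hadoop release to download (placeholder version and mirror).
HADOOP_VER=0.9.2
HADOOP_URL=http://mirror.example.org/hadoop/hadoop-${HADOOP_VER}.tar.gz

# Space-separated list of arcs to index (placeholder names).
ARCS="first-test.arc.gz second-test.arc.gz"

# Lower the Hadoop heap on low-memory machines; Hadoop reads this value in MB.
HADOOP_HEAPSIZE=256

# Alternative tool names, e.g. GNU tar installed as gtar.
TAR=gtar
ANT=ant
```

Because the file is a shell fragment, the scripts can pick up these settings simply by sourcing it.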


  • - Run then nutchwax_check_*. The -t (test only) switch omits running
  • - Get and deploy the latest build of nutchwax to a tomcat server.
  • - Test that all documents in the specified arcs are searchable (rss and html) and return only a single match via the exacturl: keyword.
  • - Check queries of the form 'exacturl:someurl arcname:name_of_arc_containing_someurl' work.
  • - Check queries of the form 'exacturl:someurl collection:test-collection' work.
  • - Check queries of the form 'exacturl:someurl date:arcdate_of_url' and 'exacturl:someurl date:arc_daterange_of_entire_arc' work.
  • - Test that all content types in the specified arcs are searchable (rss and html) and return the correct number of matches via the type: keyword and .
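
The query forms named in the checks above can be written out for a single document as in the sketch below. The function name regression_queries and the sample values in the usage note are hypothetical, and the date value is passed through untouched because its exact format is not described here.

```shell
#!/bin/sh
# Sketch only: regression_queries is an illustrative helper, not part of
# the regression suite. It prints one query string per line.
regression_queries() {
    url=$1
    arc=$2
    collection=$3
    date=$4
    printf '%s\n' \
        "exacturl:$url" \
        "exacturl:$url arcname:$arc" \
        "exacturl:$url collection:$collection" \
        "exacturl:$url date:$date"
}
```

For example, regression_queries http://example.org/ some-arc test-collection SOME_DATE prints the plain exacturl: query followed by the arcname:, collection: and date: variants, each of which is expected to match the one document.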
All scripts take the same arguments: the location of the config file and an optional verbose switch. With -v a script is verbose; without it, scripts only output errors to STDERR.
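
The shared calling convention could be handled roughly as below. This is an assumed sketch, not the scripts' actual source: it uses the POSIX getopts builtin, expects -v to come before the config path, and the names parse_args, log, VERBOSE and CONFIG are invented here.

```shell
#!/bin/sh
# Hypothetical argument handling for a regression script.

VERBOSE=0
CONFIG=

# Parse an optional -v switch followed by an optional config file path.
# When no path is given, fall back to nutchwax_test_config in the
# current directory.
parse_args() {
    VERBOSE=0
    OPTIND=1
    while getopts v opt; do
        case "$opt" in
            v) VERBOSE=1 ;;
            *) echo "usage: [-v] [config_file]" >&2; return 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    CONFIG=${1:-./nutchwax_test_config}
}

# Print progress messages only in verbose mode.
log() {
    if [ "$VERBOSE" -eq 1 ]; then
        echo "$@"
    fi
}

# Errors always go to STDERR, verbose or not.
err() {
    echo "$@" >&2
}
```

A script built this way would call parse_args "$@" first and then read its settings with . "$CONFIG", since the config file is itself a shell fragment.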


The following still needs to be built into the regression tests:
  • Check all documents in an arc are searchable via exacturl:
  • Check type: returns the correct number of documents (currently doesn't work with more than one arc; combine with arcname:)
  • Check arcname: returns the number of documents in the arc
  • Other keywords specified in the Nutchwax FAQ
  • Ranking tests.