Nutchwax has a suite of regression tests, which reside in
${NUTCHWAX}/src/regression. When run, they fetch the latest
Nutchwax build from the continuous build server and download
versions of Hadoop and Tomcat deployer. Hadoop then runs Nutchwax in
standalone mode and indexes some test arcs, and Tomcat deployer
uploads the Nutchwax war to a specified Tomcat server. Finally, a
series of regression checks is run against the deployed Nutchwax.

The following commands are required: sh, which, echo, basename,
awk, gnu tar, getopt, lynx, wget, cut, dirname, zegrep.

Beware of systems that have /bin/sh linked to dash. The
regression test scripts will not work with dash, and the hadoop
start scripts use source, which is unimplemented in dash.
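
A quick preflight check can catch both problems before a run. The sketch below is not part of the Nutchwax scripts: missing_commands and sh_is_dash are hypothetical helper names, gnu tar is looked up simply as tar, and sh_is_dash assumes a readlink that supports -f (as GNU coreutils does).

```shell
#!/bin/sh
# Preflight sketch (hypothetical helpers, not shipped with Nutchwax).

# Print the name of every argument that is not an available command.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Succeed if the given path (default /bin/sh) resolves to a file named dash.
# Assumption: readlink supports -f.
sh_is_dash() {
    target=$(readlink -f "${1:-/bin/sh}" 2>/dev/null) || return 1
    [ "$(basename "$target")" = "dash" ]
}

# Command list taken from the requirements above.
missing=$(missing_commands sh which echo basename awk tar getopt lynx wget cut dirname zegrep)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing commands:" $missing >&2
fi
if sh_is_dash /bin/sh; then
    echo "/bin/sh is linked to dash; the regression scripts will not work" >&2
fi
```

The check only reports problems on STDERR and does not stop anything, so it can be run on its own before starting the tests.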

Edit ${NUTCHWAX}/src/regression/nutchwax_test_config to match
your local config. Then run
${NUTCHWAX}/src/regression/nutchwax_regress.sh, passing
${NUTCHWAX}/src/regression/nutchwax_test_config as its argument.
For more verbose output, use the -v switch (recommended).
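
The run step can be wrapped in a small helper. This is a sketch only: run_regression is a hypothetical function, and placing -v before the config file argument is an assumption based on common getopt usage, not something this document states.

```shell
#!/bin/sh
# Hypothetical wrapper around nutchwax_regress.sh (not part of Nutchwax).
# $1: root of the Nutchwax checkout (the ${NUTCHWAX} used in this document).
run_regression() {
    regress_dir="$1/src/regression"
    config="$regress_dir/nutchwax_test_config"

    if [ ! -x "$regress_dir/nutchwax_regress.sh" ]; then
        echo "nutchwax_regress.sh not found or not executable in $regress_dir" >&2
        return 1
    fi
    if [ ! -f "$config" ]; then
        echo "config file not found: $config" >&2
        return 1
    fi

    # Assumption: -v goes before the config file argument.
    "$regress_dir/nutchwax_regress.sh" -v "$config"
}
```

With NUTCHWAX set in the environment, the call would be run_regression "$NUTCHWAX".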

The config file is ${NUTCHWAX}/src/regression/nutchwax_test_config.
It is a fragment of a shell script. All the scripts in
${NUTCHWAX}/src/regression take the location of the
config file as an argument. If it is omitted, the current directory
is searched for nutchwax_test_config.
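
For orientation, a config fragment of this shape might look like the sketch below. Every value is a placeholder chosen for illustration (paths, version string, example.org URLs, credentials, arc names), and only a subset of the documented variables is shown. The mkdir line is an assumption added so that the fragment runs on its own, and the heap size is assumed to be in megabytes.

```shell
# Sketch of a nutchwax_test_config fragment. All values are placeholders.
WORKING_DIR="${TMPDIR:-/tmp}/nutchwax_regression"
mkdir -p "$WORKING_DIR"   # assumption: added so this sketch runs standalone

HADOOP_VER="x.y.z"                                          # placeholder
HADOOP_URL="http://example.org/hadoop-$HADOOP_VER.tar.gz"   # placeholder
TOMCAT_DURL="http://example.org/tomcat-deployer.tar.gz"     # placeholder

# Inline copy of deployer.properties, written into the working directory.
cat > "$WORKING_DIR/deployer.properties" <<EOF
username=admin
password=changeme
url=http://localhost:8080/manager
path=/nutchwax-test
webapp=nutchwax-test
EOF

ARCS="first-test.arc.gz second-test.arc.gz"   # space-separated, placeholder names
HADOOP_HEAPSIZE=256                            # assumed to be in MB
TAR=gtar
ANT=ant
```

Because the file is plain shell, a script can load it with the . (dot) command and then read the variables directly.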

WORKING_DIR - The path all files will be written to.

NUTCHN - URL of the cruise control continuous build server; no need
to change.

HADOOP_VER, HADOOP_URL - Version of hadoop to use; may change between
releases.

TOMCAT_VER, TOMCAT_DURL - Version of tomcat deployer to use; should be
the same as the version of tomcat you intend to deploy to.

cat > $WORKING_DIR/deployer.properties - An inline copy of the
deployer.properties used for tomcat. It is read by tomcat deployer and
parsed by some regression scripts. username and password should be
changed to the username and password used to log into the management
interface of your tomcat server. url is the URL used to access the
management interface of your tomcat server. path and webapp can be set
to anything. Any extra parameters specified will be read by tomcat
deployer.

ARCS - List of arcs to index. Extra arcs may be added, separated by
spaces; they will be run through the same set of regression tests.

HADOOP_HEAPSIZE - Tuning variable used for machines with low memory.

TAR - Used to specify a different version of tar (e.g. gtar).

ANT - Used to specify ant.

nutchwax_regress.sh - Run nutchwax_deploy.sh, then nutchwax_check_*.
-t (test only) omits running nutchwax_deploy.sh.

nutchwax_deploy.sh - Get and deploy the latest build of nutchwax to a
tomcat server.

nutchwax_check_all_urls.sh - Test that all documents in the specified
arcs are searchable (rss and html) and return only a single match via
the exacturl: keyword.

nutchwax_check_arcname.sh - Check that queries of the form
'exacturl:someurl arcname:name_of_arc_containing_someurl' work.

nutchwax_check_collection.sh - Check that queries of the form
'exacturl:someurl collection:test-collection' work.

nutchwax_check_date.sh - Check that queries of the form
'exacturl:someurl date:arcdate_of_url' and
'exacturl:someurl date:arc_daterange_of_entire_arc' work.

nutchwax_check_types.sh - Test that all content types in the specified
arcs are searchable (rss and html) and return the correct number of
matches via the type: keyword.

-v makes a script verbose; without -v, scripts only output errors to
STDERR.
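
The conventions described in this document (a -v switch, an optional config file argument that falls back to nutchwax_test_config in the current directory, and errors on STDERR) can be sketched as a script skeleton. This is an illustration, not the code of the actual scripts: parse_args, log, and err are hypothetical names, and the sketch uses the POSIX getopts builtin rather than the external getopt listed in the requirements.

```shell
#!/bin/sh
# Skeleton sketch; the real Nutchwax scripts may be structured differently.
VERBOSE=0
CONFIG=

# Progress messages: printed only when -v was given.
log() {
    if [ "$VERBOSE" -eq 1 ]; then echo "$*"; fi
}

# Errors: always printed, always on STDERR.
err() {
    echo "ERROR: $*" >&2
}

# Parse "[-v] [config_file]" and set VERBOSE and CONFIG.
parse_args() {
    VERBOSE=0
    OPTIND=1
    while getopts v opt; do
        case "$opt" in
            v) VERBOSE=1 ;;
            *) err "usage: [-v] [config_file]"; return 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    # Without an argument, look for the config in the current directory.
    CONFIG="${1:-./nutchwax_test_config}"
    if [ ! -f "$CONFIG" ]; then
        err "config file not found: $CONFIG"
        return 1
    fi
}
```

A script built on this skeleton would call parse_args "$@", source the file named by CONFIG, and use log for anything that should appear only with -v.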