Nutch TREC tools

This project contains extensions to nutch that allow indexing of collections used in the TREC conference (primarily the .GOV2 collection and the Terabyte track) and searching of those collections, with results in a format compatible with trec_eval and other tools.

Building

To build you need a JDK, ant and the nutch sources. The ant buildfile assumes the nutch sources are in the subdirectory nutch/. So if you have a checkout of the nutch subversion repository:

	$ cd ${ARCHIVE_ACCESS}/projects/nutch-trec
	$ ln -s ${NUTCH_SVN}/trunk nutch
	$ ant

The parser for the TREC format is generated from a JavaCC grammar. If you wish to rebuild the generated sources from the .jj grammar file, you need a copy of JavaCC in JavaCC/, e.g.:

	$ cd ${ARCHIVE_ACCESS}/projects/nutch-trec
	$ ln -s ${JAVACC_HOME} JavaCC
	$ ant javacc

The default ant build target (jar) will build a nutch-trec.jar in the build/ subdirectory. As part of the build, a symlink will also be added under ${NUTCH_HOME}/build to the generated nutch-trec.jar -- named nutch-trec.job -- so we can get TREC querying classes onto the nutch CLASSPATH.
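To confirm that the symlink step worked, a quick check along the following lines may help. This is only a sketch: the function name check_job_link is made up for illustration and is not part of the project, and it assumes the link sits at ${NUTCH_HOME}/build/nutch-trec.job as described above.

```shell
# Hypothetical helper (not part of nutch-trec): verify that
# <nutch home>/build/nutch-trec.job is a symlink to an existing file.
check_job_link() {
  link="$1/build/nutch-trec.job"
  if [ -L "$link" ] && [ -e "$link" ]; then
    # -L: it is a symlink; -e: the link target exists.
    echo "ok: $link -> $(readlink "$link")"
  else
    echo "missing or dangling: $link" >&2
    return 1
  fi
}
```

Usage:

	$ check_job_link "${NUTCH_HOME}"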

Indexing

To index a collection you need to patch the bin/nutch script to accept Hadoop job jars (see https://issues.apache.org/jira/browse/NUTCH-352). If you wish to run with a distributed hadoop configuration, you'll have to change the config files appropriately. To index in standalone mode:

	$ ${NUTCH_HOME}/bin/nutch jar ${NUTCH_TREC_HOME}/build/nutch-trec.jar \
	  /input/directory /output/directory

Where /input/directory is an existing directory containing text files that list the locations of the collection files to be indexed (one location per line). The referenced collection files can be local (/path/to/file.gz) or remote over http (http://domain.com/path/to/file.gz), and either uncompressed or gzipped (denoted by a .gz suffix). The collection files are assumed to be in the general format used in .GOV(2).
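As an illustration, an input directory could be prepared roughly like this. The INPUT_DIR variable and the file name locations.txt are placeholders chosen for this sketch, and the two locations are the example paths from the paragraph above, not real files.

```shell
# Placeholder directory; substitute your real /input/directory.
INPUT_DIR="${INPUT_DIR:-./trec-input}"
mkdir -p "$INPUT_DIR"

# One collection file location per line: a local path or an http URL,
# gzipped (.gz suffix) or uncompressed.
cat > "$INPUT_DIR/locations.txt" <<'EOF'
/path/to/file.gz
http://domain.com/path/to/file.gz
EOF
```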

The above step results in at least one /output/directory/segments/timestamp/ being created. Nutch itself is then used to build an index:

	$ ${NUTCH_HOME}/bin/nutch updatedb /output/directory/crawldb \
	  /output/directory/segments/timestamp
	$ ${NUTCH_HOME}/bin/nutch invertlinks /output/directory/linkdb \
	  /output/directory/segments/timestamp
	$ ${NUTCH_HOME}/bin/nutch index /output/directory/indexes \
	  /output/directory/crawldb /output/directory/linkdb \
	  /output/directory/segments/timestamp
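
Since the timestamp part of the segment path differs on every run, it can be convenient to look it up rather than type it. The sketch below is not part of the project; latest_segment is a made-up helper name, and it assumes that segment directory names are timestamps that sort lexically in creation order.

```shell
# Hypothetical helper: print the lexically last directory under
# <output dir>/segments, or nothing if there is none.
# Assumption: segment names are timestamps that sort in creation order.
latest_segment() {
  for d in "$1"/segments/*/; do
    # An unmatched glob stays literal and fails the -d test.
    if [ -d "$d" ]; then printf '%s\n' "${d%/}"; fi
  done | sort | tail -n 1
}
```

Usage:

	$ SEGMENT=$(latest_segment /output/directory)
	$ ${NUTCH_HOME}/bin/nutch updatedb /output/directory/crawldb "$SEGMENT"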

In our experience, indexing the GOV2 collection took a few days on a rack of 20-odd dual-core 2 GHz Athlon machines. Also, as of 09/2006, the parser fails on some of the GOV2 collection documents (to be fixed), and the GOV2 redirects are not yet considered.

Querying

To test querying the index, decompress the file http://trec.nist.gov/data/terabyte/05/05.efficiency_topics.gz and pass it to TRECBean (a subclass of NutchBean). This file has 50k lines of queries in the format queryid:query. Pass the file as follows:

	$ ln -s /output/directory crawl
	$ ${NUTCH_HOME}/bin/nutch org.archive.nutch.trec.TRECBean \
	  query.txt runid limit
       
...where query.txt is the decompressed query file, runid is a string describing the run, and limit is the maximum number of documents to return (defaults to 20). TRECBean is added to the nutch CLASSPATH by symlinking nutch-trec.jar as nutch-trec.job under ${NUTCH_HOME}/build (see the end of the Building section above). Each line is run serially. It's slow since we start up nutch every time, but it is good for confirming that the system is basically working.
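Before starting a long run, it may be worth checking that the query file really is in queryid:query form. The following is a minimal sketch; malformed_queries is a made-up helper name, and the check only assumes a non-empty id, a colon, and a non-empty query on each line (so blank lines count as malformed).

```shell
# Hypothetical helper: print the number of lines in file $1 that are not
# of the form "queryid:query" (non-empty id, a colon, non-empty query).
malformed_queries() {
  # grep -c exits non-zero when the count is 0; "|| true" hides that.
  grep -c -v -E '^[^:]+:.+$' "$1" || true
}
```

Usage:

	$ gunzip -c 05.efficiency_topics.gz > query.txt
	$ malformed_queries query.txt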

TODO: Relevancy and efficiency tests (the 'np' items from here: terabyte 05).

Misc

If you wish to use this project in Eclipse, import it from the ant build.xml and make sure your compiler compliance level is set to 5.0.

To run the JUnit test from ant, you need junit.jar in your ${ANT_HOME}/lib, as outlined in http://ant.apache.org/manual/OptionalTasks/junit.html