This project contains extensions to nutch that allow indexing of collections used in the TREC conference (primarily the .GOV2 collection and the Terabyte track) and searching of those collections in a format compatible with trec_eval and other tools.
To build you need a JDK, ant and the nutch sources. The ant buildfile
assumes the nutch sources are in the subdirectory nutch/. So if
you have a checkout of the nutch subversion repository:
$ cd ${ARCHIVE_ACCESS}/projects/nutch-trec
$ ln -s ${NUTCH_SVN}/trunk nutch
$ ant
The parser for the TREC format is generated from a JavaCC grammar. If you wish to rebuild the JavaCC-generated sources from the .jj grammar file, you need a copy of JavaCC in JavaCC/, e.g.:
$ cd ${ARCHIVE_ACCESS}/projects/nutch-trec
$ ln -s ${JAVACC_HOME} JavaCC
$ ant javacc
The default ant build target (jar) will build a nutch-trec.jar
in the build/ subdirectory. As part of the build, a symlink
named nutch-trec.job is also created under ${NUTCH_HOME}/build,
pointing at the generated nutch-trec.jar, so the TREC querying
classes end up on the nutch CLASSPATH.
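After a successful build, you can sanity-check that symlink; the listing should look roughly like this (paths illustrative):

$ ls -l ${NUTCH_HOME}/build/nutch-trec.job
lrwxrwxrwx ... nutch-trec.job -> /path/to/nutch-trec/build/nutch-trec.jar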
To index a collection you need to patch the bin/nutch script
to accept Hadoop job jars
(see: https://issues.apache.org/jira/browse/NUTCH-352).
If you wish to run with a distributed Hadoop configuration, you'll
have to change the config files appropriately; a sketch follows.
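For example, a minimal conf/hadoop-site.xml pointing nutch at a remote namenode and jobtracker might look like this (host names and ports are placeholders; check the property names against your Hadoop version):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode.example.org:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.org:9001</value>
  </property>
</configuration>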
To index in standalone mode:
$ ${NUTCH_HOME}/bin/nutch jar ${NUTCH_TREC_HOME}/build/nutch-trec.jar \
/input/directory /output/directory
Where /input/directory is an existing directory containing
text files that list the locations of the collection files to be
indexed (one location per line). The referenced collection files
can be local (/path/to/file.gz) or remote over http
(http://domain.com/path/to/file.gz), and uncompressed or
gzipped (denoted by a .gz suffix). The collection files
are assumed to be in the general format used by the
.GOV and .GOV2 collections.
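For instance, a single input file might contain lines like these (file names hypothetical):

$ cat /input/directory/files.txt
/data/gov2/GX000/00.gz
http://domain.com/gov2/GX001/01.gz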
The above step creates at least one
/output/directory/segments/timestamp/ directory. Nutch
itself is then used to build an index:
$ ${NUTCH_HOME}/bin/nutch updatedb /output/directory/crawldb \
/output/directory/segments/timestamp
$ ${NUTCH_HOME}/bin/nutch invertlinks /output/directory/linkdb \
/output/directory/segments/timestamp
$ ${NUTCH_HOME}/bin/nutch index /output/directory/indexes \
/output/directory/crawldb /output/directory/linkdb \
/output/directory/segments/timestamp
In our experience, indexing the GOV2 collection took a few days on a rack of 20-odd dual-core 2GHz Athlon machines. Also, as of 09/2006, the parser fails on some GOV2 documents (to be fixed), and GOV2 redirects are not yet handled.
To test querying the index, decompress the file
http://trec.nist.gov/data/terabyte/05/05.efficiency_topics.gz and
pass it to the TRECBean (a subclass of NutchBean). This file has
50k lines of queries in the format queryid:query.
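For example, the lines look something like this (query text invented for illustration):

1001:hurricane insurance claims
1002:history of the oil industry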
Pass the file as follows:
$ ln -s /output/directory crawl
$ ${NUTCH_HOME}/bin/nutch org.archive.nutch.trec.TRECBean \
query.txt runid limit
...where runid is a string describing the run and limit is the maximum
number of documents to return (defaults to 20). TRECBean is a subclass
of NutchBean, added to the nutch CLASSPATH by symlinking nutch-trec.jar
as nutch-trec.job under ${NUTCH_HOME}/build (see the tail of the
Building section above). Each query line is run serially. It's slow,
since we start up nutch every time, but it is a good way to confirm
that the system is basically working.
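Since the goal is trec_eval compatibility, each hit should come out as a standard TREC run line: queryid, the literal Q0, document id, rank, score, and runid. Something like the following (document id and score invented for illustration):

1001 Q0 GX000-00-0000000 1 1.5202 myrun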
TODO: Relevancy and efficiency tests (the 'np' items from here: terabyte 05).
If you wish to use this project in Eclipse, import the project from the ant build.xml and make sure your compiler compliance level is set to 5.0.
To run the JUnit tests from ant, you need junit.jar added
to your ${ANT_HOME}/lib, as outlined in http://ant.apache.org/manual/OptionalTasks/junit.html
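A minimal setup might look like this (the jar path and the test target name are assumptions; check build.xml for the actual target):

$ cp /path/to/junit.jar ${ANT_HOME}/lib/
$ ant test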