NutchWAX ("Nutch + Web Archive eXtensions") searches web archive collections. The Web Archive eXtensions (WAX) include an adaptation of the Nutch fetcher step that reads from web archives rather than crawling the open net (the adaptation currently handles Internet Archive ARC files only), and plugins that add extra fields to the index, such as an archive record's location in the repository and its collection name.
This release includes bug fixes and improvements in the quality of search results, but the main benefit of NutchWAX 0.10.0 is the move from Hadoop 0.5.0 to Hadoop 0.9.2. The upgraded Hadoop platform makes indexing much more robust and noticeably faster. See the release notes for details and notes on significant changes.
NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to build an index of more than 400 million documents. See the release notes for details on new features and fixes.
With this release, NutchWAX moves on to a MapReduce Nutch base (Nutch 0.8-dev+). Be aware that 0.6.0 bears little resemblance to previous releases, both in how it goes about its work and in how it is run by the user. Be prepared to leave aside all old NutchWAX assumptions. See Getting Started for an introduction. Also see the release notes.
Bug fix release. See the release notes for details. This time, for sure, it's the last release before the move to the MapReduce Nutch platform.
Minor fixes. Built for Java 1.4.x, with Google-like paging added. This is the last release against Nutch-0.7 before the move to MapReduce.
NutchWAX 0.4.0 is built against Nutch-0.7. Lots of bug fixes (see the release notes). This release has been coordinated with a new release of WERA, a web archive collection viewer application.
Announcing the initial coordinated alpha release of NutchWAX and WERA. WERA is an archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections.