Introduction

NutchWAX ("Nutch + Web Archive eXtensions") searches web archive collections. The Web Archive eXtensions (WAX) include an adaptation of the Nutch fetcher step that reads from web archives rather than crawling the open net (the adaptation currently handles Internet Archive ARC files only), and plugins that add extra fields to the index, such as an archive record's location in the repository and its collection name.

Project Sponsors

The International Internet Preservation Consortium (IIPC) is a consortium of twelve National Libraries and the Internet Archive. The mission of the IIPC is to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations.
The Nordic Web Archive (NWA) is the Nordic National Libraries' forum for co-ordination and exchange of experience in the fields of harvesting and archiving web documents.
The Internet Archive (IA) is a 501(c)(3) non-profit organization whose mission is to build a public Internet digital library.

News

Release 0.10.0 - 01/17/2007

This release includes bug fixes and improvements in the quality of search results, but the main benefit of NutchWAX 0.10.0 is the move from Hadoop 0.5.0 to Hadoop 0.9.2. The upgraded Hadoop platform makes indexing much more robust and noticeably faster. See release notes for details and notes on significant changes.

Release 0.8.0 - 12/12/2006

NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to build an index of more than 400 million documents. See release notes for details on new features and fixes.

Release 0.6.0 - 05/01/2006

With this release, NutchWAX moves on to a mapreduce Nutch base (Nutch 0.8-dev+). Be aware that 0.6.0 bears little resemblance to previous releases, both in how it goes about its work and in how it is run by the user. Be prepared to leave aside all old NutchWAX assumptions. See Getting Started for an introduction. Also see release notes.

Release 0.4.3 - 03/20/2006

Bug fix release. See release notes for details. This time, for sure, it's the last release before the move to the mapreduce Nutch platform.

Release 0.4.2 - 11/28/2005

Minor fixes. Built for Java 1.4.x and added Google-like paging. Last release against Nutch-0.7 before the move to mapreduce.

Release 0.4.1 - 11/03/2005

Bug fix for double encoding issue in NutchWAX 0.4.0.

Release 0.4.0 - 10/21/2005

NutchWAX 0.4.0 is built against Nutch-0.7. Lots of bug fixes (see release notes). This release has been coordinated with a new release of WERA, a web archive collection viewer application.

Initial alpha release 0.2.1 - 07/27/2005

Announcing the initial coordinated alpha release of NutchWAX and WERA. WERA is an archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections.