Introduction

NutchWAX ("Nutch + Web Archive eXtensions") searches web archive collections. The Web Archive eXtensions (WAX) adapt the Nutch fetcher step to read from web archives rather than crawl the open net (the adaptation currently supports Internet Archive ARC files only) and provide plugins that add extra fields to the index, such as an archive record's location in the repository and its collection name.

Project Sponsors

The International Internet Preservation Consortium (IIPC) is a consortium of twelve National Libraries and the Internet Archive. The mission of the IIPC is to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations.
The Nordic Web Archive (NWA) is the Nordic National Libraries' forum for co-ordination and exchange of experience in the fields of harvesting and archiving web documents.
The Internet Archive (IA) is a 501(c)(3) non-profit organization whose mission is to build a public Internet digital library.

News

Release 0.13 - 03/19/2010

NutchWAX 0.13 is now available.

For more information on NutchWAX 0.13, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.13.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.9 - 01/13/2010

NutchWAX 0.12.9 is now available.

For more information on NutchWAX 0.12.9, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.9.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.8 - 09/21/2009

NutchWAX 0.12.8 is now available.

For more information on NutchWAX 0.12.8, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.8.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.7 - 07/24/2009

NutchWAX 0.12.7 is now available.

For more information on NutchWAX 0.12.7, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.7.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.6 - 07/09/2009

NutchWAX 0.12.6 is now available.

For more information on NutchWAX 0.12.6, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.6.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.5 - 06/26/2009

NutchWAX 0.12.5 is now available.

For more information on NutchWAX 0.12.5, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.5.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.4 - 05/05/2009

NutchWAX 0.12.4 is now available.

For more information on NutchWAX 0.12.4, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.4.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.3b - 03/08/2009

NutchWAX 0.12.3b is now available.

This is a documentation patch to the 0.12.3 release. It adds a section to the INSTALL.txt file describing the start/stop scripts for the NutchWAX search services.

For more information on NutchWAX 0.12.3b, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.3b.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.3 - 12/18/2008

NutchWAX 0.12.3 is now available.

For more information on NutchWAX 0.12.3, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release notes summarizing the changes from release to release can be found here:

A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.3.tar.gz

In addition, a document describing best practices for using NutchWAX to full-text index very large collections (more than 500 million documents) is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).

Release 0.12.2 - 10/13/2008

NutchWAX 0.12.2 is now available.

For more information on NutchWAX 0.12.2, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release 0.12.1 - 07/28/2008

NutchWAX 0.12.1 is now available. This release addresses a few minor issues related to integration with the Wayback Machine. It also adds support for server-side XSL transforms of the OpenSearch XML results.

For more information on NutchWAX 0.12.1, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release 0.12 - 07/03/2008

NutchWAX 0.12 is now available. During the beta test period, various bugs were found and fixed, and support for de-duplication during import and indexing was added.

For more information on NutchWAX 0.12, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Release 0.12-beta-1 - 06/02/2008

NutchWAX has been refactored to leverage the Nutch plugin system and to decouple it from many Nutch internal classes. NutchWAX 0.12 beta-1 is now available for beta testing. Any and all feedback is appreciated.

For more information on NutchWAX 0.12, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX

Upcoming Release 0.12.0 - 05/22/2008

With this upcoming release, NutchWAX 0.12 will catch up to Nutch 1.0-dev (which uses Hadoop 0.16), thereby benefiting from the numerous bug fixes and enhancements it contains.

We are on target for releasing a public beta on June 2nd. Watch this space for further announcements.

Release 0.10.0 - 01/17/2007

This release includes bug fixes and improvements in the quality of search results, but the main benefit of NutchWAX 0.10.0 is the move from Hadoop 0.5.0 to Hadoop 0.9.2. The upgraded Hadoop platform makes indexing much more robust and noticeably faster. See the release notes for details and notes on significant changes.

Release 0.8.0 - 12/12/2006

NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to build an index of more than 400 million documents. See the release notes for details on new features and fixes.

Release 0.6.0 - 05/01/2006

With this release, NutchWAX moves on to a MapReduce Nutch base (Nutch 0.8-dev+). Be aware that 0.6.0 bears little resemblance to previous releases, both in how it goes about its work and in how it is run by the user. Be prepared to leave aside all old NutchWAX assumptions. See Getting Started for an introduction. Also see the release notes.

Release 0.4.3 - 03/20/2006

Bug fix release. See the release notes for details. This time, for sure, it's the last release before the move to the MapReduce Nutch platform.

Release 0.4.2 - 11/28/2005

Minor fixes. Built for Java 1.4.x and added Google-like paging. This is the last release against Nutch-0.7 before the move to MapReduce.

Release 0.4.1 - 11/03/2005

Bug fix for double encoding issue in NutchWAX 0.4.0.

Release 0.4.0 - 10/21/2005

NutchWAX 0.4.0 is built against Nutch-0.7. Lots of bug fixes (see the Release Notes). This release has been coordinated with a new release of WERA, a web archive collection viewer application.

Initial alpha release 0.2.1 - 07/27/2005

Announcing the initial coordinated alpha release of NutchWAX and WERA. WERA is an archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections.