NutchWAX ("Nutch + Web Archive eXtensions") searches web archive collections. The Web Archive eXtensions (WAX) include an adaptation of the Nutch fetcher step that reads from web archives rather than crawling the open net (the adaptation currently supports Internet Archive ARC files only) and plugins that add extra fields to the index, such as an archive record's location in the repository and its collection name.
NutchWAX 0.13 is now available.
For more information on NutchWAX 0.13, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.13.tar.gz
In addition, a document describing best practices for using NutchWAX to full-text index very large collections of 500 million documents or more is available: NutchWAX Best Practices: Indexing Very Large Collections (pdf).
NutchWAX 0.12.9 is now available.
For more information on NutchWAX 0.12.9, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.9.tar.gz
NutchWAX 0.12.8 is now available.
For more information on NutchWAX 0.12.8, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.8.tar.gz
NutchWAX 0.12.7 is now available.
For more information on NutchWAX 0.12.7, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.7.tar.gz
NutchWAX 0.12.6 is now available.
For more information on NutchWAX 0.12.6, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.6.tar.gz
NutchWAX 0.12.5 is now available.
For more information on NutchWAX 0.12.5, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.5.tar.gz
NutchWAX 0.12.4 is now available.
For more information on NutchWAX 0.12.4, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.4.tar.gz
NutchWAX 0.12.3b is now available.
This is a documentation patch to the 0.12.3 release. It adds a section to the INSTALL.txt file describing the start/stop scripts for NutchWAX search services.
For more information on NutchWAX 0.12.3b, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.3b.tar.gz
NutchWAX 0.12.3 is now available.
For more information on NutchWAX 0.12.3, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
Release notes summarizing the changes from release to release can be found here:
A binary release package can be downloaded from the Archive-Access project page on SourceForge: nutchwax-0.12.3.tar.gz
NutchWAX 0.12.2 is now available.
For more information on NutchWAX 0.12.2, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
NutchWAX 0.12.1 is now available. This release addresses a few minor issues related to integration with the Wayback Machine. It also adds support for server-side XSL transforms of the OpenSearch XML results.
For more information on NutchWAX 0.12.1, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
NutchWAX 0.12 is now available. During the beta test period, various bugs were found and fixed, and support for de-duplication during import and indexing was added.
For more information on NutchWAX 0.12, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
NutchWAX 0.12 beta-1 is now available for beta testing. It has been re-factored to leverage the Nutch plugin system and is de-coupled from many Nutch internal classes. Any and all feedback is appreciated.
For more information on NutchWAX 0.12, including instructions on downloading, building, installing and running, please see the project wiki page at: http://webteam.archive.org/confluence/display/search/NutchWAX
With this upcoming release, NutchWAX 0.12 will catch up to Nutch 1.0-dev (which uses Hadoop 0.16), thereby benefiting from the numerous bug fixes and enhancements it contains.
We are on target for releasing a public beta on June 2nd. Watch this space for further announcements.
NutchWAX 0.10.0 includes bug fixes and improvements in the quality of search results, but its main benefit is the move from Hadoop 0.5.0 to Hadoop 0.9.2. The upgraded Hadoop platform makes indexing much more robust and noticeably faster. See the release notes for details and notes on significant changes.
NutchWAX 0.8.0 is built against Nutch 0.8.1, released 09/24/2006. A version of this software was recently used to build an index of more than 400 million documents. See the release notes for details on new features and fixes.
With this release, NutchWAX moves to a MapReduce Nutch base (Nutch 0.8-dev+). Be aware that 0.6.0 bears little resemblance to previous releases, both in how it goes about its work and in how it is run by the user. Be prepared to leave aside all old NutchWAX assumptions. See Getting Started for an introduction. Also see the release notes.
Bug fix release. See the release notes for details. This time, for sure, it's the last release before the move to the MapReduce Nutch platform.
Minor fixes. Built for Java 1.4.x, with Google-like paging added. This is the last release against Nutch-0.7 before the move to MapReduce.
NutchWAX 0.4.0 is built against Nutch-0.7. Lots of bug fixes (see the release notes). This release has been coordinated with a new release of WERA, a web archive collection viewer application.
Announcing the initial coordinated alpha release of NutchWAX and WERA. WERA is an archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections.