ARC Tools

This is the home for Internet Archive ARC access tools. Tools are maintained as autonomous subprojects of this archive-access parent project.

Subprojects

Active

  • NutchWAX is Web Archive Collection Search based on Nutch.
  • wayback is an open-source version of the Internet Archive Wayback Machine.
  • WAXToolbar is a Firefox extension for browsing web archives.

Not-so-active

  • Tom Emerson's libarc, "A C++ library for processing Internet Archive ARC, CDX, and DAT files." This project used to reside at the libarc home page but was moved here on 09/14/2004. See the README.
  • Hedaern, an ARC 'access' tool, puts up a WebUI that allows URL+timestamp lookups and full-text searching of ARCs. Hedaern is currently 'alpha' and is licensed under the LGPL. It is written in Python -- it includes Python ARC readers/writers -- and was donated by Mark Williamson of the British Library. To learn more about Hedaern, start with the guide.
  • Nutch TREC tools has a parser for the TREC format.
  • wera is an archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections. Wera is a PHP 5 application based on -- and replaces -- the NwaToolset. Currently wera uses NutchWAX as its search engine core and the ARCRetriever webapp (included) to fetch records from ARCs.
  • infiniteurl is an infinite source of pages used for testing crawlers.