For more up-to-date information, please visit http://netpreserve.org/openwayback

Introduction

Wayback is an open source Java implementation of the Internet Archive Wayback Machine.

The current production version of the Wayback Machine is implemented in Perl and lacks maintainability and extensibility. Also, the code is not open source. The primary motivation for the new version is to address these three issues, enabling public distribution of the application and easy experimentation with new features and access technologies.

The current Java version of the Wayback Machine supports three access, or Replay, modes of operation: "Archival URL" mode, "Proxy" mode, and "Domain Prefix" mode.

Archival URL mode provides a user experience very close to the current production Wayback Machine. All query and replay access requests can be expressed as URLs. In Archival URL replay mode, archived content is modified as it is returned to users, so that links and embedded content refer back to the Wayback Machine by being rewritten as Archival URLs.
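As a rough illustration of the rewriting described above, the sketch below builds an Archival URL from a capture timestamp and an original URL, and rewrites a relative link found in an archived page. The access point prefix http://wayback.example.org/wayback/, the 14-digit timestamp layout, and the function names are assumptions made for illustration; they are not taken from the Wayback code.

```python
from urllib.parse import urljoin

# Hypothetical access point prefix; a real deployment defines its own.
WAYBACK_PREFIX = "http://wayback.example.org/wayback/"


def to_archival_url(timestamp: str, original_url: str) -> str:
    """Combine a capture timestamp and an original URL into one replay URL."""
    return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"


def rewrite_link(timestamp: str, page_url: str, link: str) -> str:
    """Resolve a link against the page it was found on, then point it
    back at the (assumed) Wayback access point."""
    absolute = urljoin(page_url, link)
    return to_archival_url(timestamp, absolute)


rewritten = rewrite_link(
    "20080820000000",
    "http://example.com/a/index.html",
    "../img/logo.png",
)
print(rewritten)
```

Resolving the link against the page URL first matters: a relative link only makes sense in the context of the page that contained it, so it has to become absolute before the Wayback prefix is added.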

Proxy URL mode allows replaying of archived documents within a client browser by configuring the browser to proxy all HTTP requests through the Wayback Machine. This has the strong advantage that no Javascript or server-side page markup is required to coerce the client browser to request additional URLs and embedded content from the Wayback Machine -- content just works as-is. When used with the Firefox extension, client browsers can navigate between versions of the current document, and the Wayback Machine server will attempt to display images from the same time period as the pages being viewed. Proxy URL mode requires special configuration of the client web browser to access the Wayback service. This browser configuration is not complex, but it means that content cannot be accessed as a global URL.
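The same client-side configuration can be sketched for a script instead of a browser. The example below uses Python's standard urllib to send plain HTTP requests through a proxy. The address http://localhost:8090 is a hypothetical location for a Wayback access point running in Proxy mode, and fetch_archived is an illustrative helper, not part of Wayback.

```python
import urllib.request

# Hypothetical host and port of a Wayback access point running in Proxy mode.
WAYBACK_PROXY = "http://localhost:8090"

# Route plain HTTP requests through the proxy; the requested URL is not rewritten.
proxy_handler = urllib.request.ProxyHandler({"http": WAYBACK_PROXY})
opener = urllib.request.build_opener(proxy_handler)


def fetch_archived(url: str) -> bytes:
    """Fetch a URL through the proxy (requires a running proxy at WAYBACK_PROXY)."""
    with opener.open(url, timeout=10) as response:
        return response.read()
```

Note that the client asks for the original URL, such as http://example.com/, which is why the archived content needs no rewriting in this mode.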

DomainPrefix mode is similar to ArchivalUrl mode, but uses a wildcard DNS scheme to rewrite URLs, allowing all URL substitution to occur on the server. This mode is considered experimental.
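One way to picture a wildcard DNS scheme is sketched below: the original host name becomes a subdomain of a Wayback domain, so every such host resolves to the Wayback server while paths stay untouched. The domain wayback.example.org, the exact host name layout, and both function names are assumptions for illustration only; the layout actually used by DomainPrefix mode may differ. Ports are ignored to keep the sketch short.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical domain with a wildcard DNS record (*.wayback.example.org).
WAYBACK_SUFFIX = "wayback.example.org"


def to_domain_prefix_url(original_url: str) -> str:
    """Append the Wayback domain to the original host, leaving the path alone."""
    parts = urlsplit(original_url)
    new_host = f"{parts.hostname}.{WAYBACK_SUFFIX}"
    return urlunsplit((parts.scheme, new_host, parts.path, parts.query, parts.fragment))


def to_original_url(replay_url: str) -> str:
    """Recover the original URL on the server by stripping the Wayback domain."""
    parts = urlsplit(replay_url)
    suffix = "." + WAYBACK_SUFFIX
    host = parts.hostname
    if not host.endswith(suffix):
        raise ValueError("not a replay URL for this Wayback domain")
    original_host = host[: -len(suffix)]
    return urlunsplit((parts.scheme, original_host, parts.path, parts.query, parts.fragment))


replay = to_domain_prefix_url("http://www.example.com/about?x=1")
print(replay)
print(to_original_url(replay))
```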

See the Administrator Manual to learn more about access modes.

The current Java version can operate in several deployment modes, ranging from a standalone application on a single host holding all archived documents and indexes, up to a highly distributed system where indexes and archived content are spread across hundreds of machines.

In the local, standalone mode, this software includes the capability to scan for new archived content in a specified location, and to automatically index and serve the new content as it appears. Directing the Wayback to look for W/ARC files in the directory where an instance of the Heritrix web crawler is writing W/ARC output should provide the capability to browse content archived by Heritrix as it is crawled.
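The scanning behaviour can be sketched as a function that is called repeatedly and reports only files it has not seen before. This is a minimal illustration, not the Wayback implementation: the function name, the list of file name endings, and the in-memory record of seen files are all assumptions, and files that are still being written would need extra handling.

```python
import tempfile
from pathlib import Path

# File name endings treated as archive files in this sketch.
ARCHIVE_SUFFIXES = (".arc", ".arc.gz", ".warc", ".warc.gz")


def find_new_archives(directory: Path, seen: set) -> list:
    """Return archive files not reported before and record them in `seen`."""
    new_files = []
    for path in sorted(directory.iterdir()):
        is_archive = path.is_file() and path.name.endswith(ARCHIVE_SUFFIXES)
        if is_archive and path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


# Demonstration with a temporary directory standing in for the crawler output.
with tempfile.TemporaryDirectory() as tmp:
    watch_dir = Path(tmp)
    seen = set()
    (watch_dir / "crawl-0001.warc.gz").touch()
    (watch_dir / "notes.txt").touch()
    first_pass = [p.name for p in find_new_archives(watch_dir, seen)]
    (watch_dir / "crawl-0002.arc.gz").touch()
    second_pass = [p.name for p in find_new_archives(watch_dir, seen)]

print(first_pass)
print(second_pass)
```

Each pass would hand its new files to the indexer, so content becomes browsable shortly after it appears in the directory.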

News

New Release - 1.6.0, 1/3/2011

The 1.6.0 release is now available, with improved server-side rewriting of HTML, CSS, Javascript, and SWF content. This version includes other new features and bug fixes, which are detailed on the release notes page. Upgrading to this version will require changes to Wayback Spring XML configuration.

Maintenance Release - 1.4.2, 7/17/2009

Release 1.4.2 fixes several problems discovered in the 1.4.1 release. Please see the release notes for a detailed list of changes.

Maintenance Release - 1.4.1, 11/10/2008

Release 1.4.1 fixes several problems discovered in the 1.4.0 release and, most notably, disables by default the AnchorDate and AnchorWindow features, which generated some confusion. Please see the release notes for a detailed list of changes.

New Release - 1.4.0, 8/20/2008

Release 1.4.0 has several new features, as well as several bug-fixes. This version includes substantial code changes and refactoring to consolidate and expand current functionality as well as enable several future planned features. Notable new features include the ability to insert content in replayed HTML pages within Proxy and DomainPrefix replay modes, the ability to scan multiple local directories recursively for ARC/WARC files, and a feature within Archival URL replay mode to prevent time drift during a replay session.

Please see the Release Notes for specific features and bug fixes.

New Release - 1.2.0, 1/30/2008

Release 1.2.0 has several new features, as well as several bug-fixes. Wayback now supports compressed and uncompressed ARC and WARC formats. Previously there was only support for compressed ARC files. This version also includes a new Archival URL replay mechanism, where all URL rewriting occurs on the server, obviating the need for client-side Javascript, and preventing some request leakage. This version also includes the capability to replace the default URL canonicalization scheme (currently there is still only one implementation available, but the groundwork for using different schemes is now in place). This version also includes support for de-duplicated WARC records.

Please see the Release Notes for specific features and bug fixes.

New Release - 1.0.0, 10/12/2007

Release 1.0.0 has several significant changes, most notably a completely new configuration mechanism using Spring IOC. This new configuration system introduces some deployment concepts:

  • WaybackCollections define a set of documents via the previously existing ResourceStore and ResourceIndex implementations.
  • AccessPoints define a method by which users can access and interact with a WaybackCollection. A single WaybackCollection may be exposed to users through several AccessPoints simultaneously. Each AccessPoint specifies an access URL, a Query interface, a Replay interface, and several optional access restrictions, including limiting who can connect to the AccessPoint, and which documents in the WaybackCollection are available through the AccessPoint.
This new configuration framework allows hosting of hundreds of individual collections within a single wayback installation, each with potentially multiple AccessPoints.
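The relationship between the two concepts can be sketched as a small data model. This is a conceptual illustration only, not the Spring XML configuration and not the Wayback classes: all field names, paths, URLs, and the two helper methods are hypothetical, and the Query interface is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class WaybackCollection:
    name: str
    resource_store: str  # placeholder for where archived documents live
    resource_index: str  # placeholder for the index over those documents


@dataclass
class AccessPoint:
    access_url: str
    collection: WaybackCollection
    replay_mode: str
    # Optional restrictions; an empty list means "no restriction".
    allowed_clients: list = field(default_factory=list)
    excluded_url_prefixes: list = field(default_factory=list)

    def client_allowed(self, client_ip: str) -> bool:
        return not self.allowed_clients or client_ip in self.allowed_clients

    def document_available(self, url: str) -> bool:
        return not any(url.startswith(p) for p in self.excluded_url_prefixes)


# One collection exposed through two access points with different restrictions.
archive = WaybackCollection("news-2007", "/data/arcs", "/data/index.cdx")
public = AccessPoint(
    "http://wayback.example.org/news/", archive, "archival-url",
    excluded_url_prefixes=["http://example.com/private/"],
)
staff = AccessPoint(
    "http://wayback.example.org/news-staff/", archive, "proxy",
    allowed_clients=["192.0.2.10"],
)

print(public.document_available("http://example.com/private/report.html"))
print(staff.client_allowed("198.51.100.7"))
```

Keeping the restrictions on the AccessPoint rather than on the collection is what lets the same documents be served under different rules at the same time.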

This version also includes a major refactoring of the Replay User Interface framework, simplifying extension and the creation of novel replay modes. Specifically, one or more external .jsp files can be used to generate additional HTML content within replayed HTML pages. The Timeline Replay mode has been completely replaced by one of these external .jsp files, which inserts the Timeline banner inside replayed HTML pages.

This version includes a very experimental new Replay mode, domain-prefix replay mode, which performs all markup and recontextualization of replayed HTML documents on the server-side, eliminating the need for client-side Javascript execution. Please ask on the discussion list for assistance in using this Replay mode.

Lastly, this version has some internal improvements which should reduce memory consumption, and the software is now built using maven2.

New Release - 0.8.0, 01/11/2007

Release 0.8.0 offers several new features, most notably a CDX format flat file ResourceIndex implementation, improved character set detection, and many smaller improvements, bug-fixes, and optimizations.

Major Features:
  • Added Sorted CDX flat file ResourceIndex implementation, allowing for much larger data sets.
  • Improved character set detection so pages are not mangled when server side modification occurs.
  • Several new command-line tools, for generating and updating each ResourceIndex type.
  • Bug-fixes to allow integration with NutchWax full-text searching.
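A sorted flat file scales to large data sets because a lookup can use binary search instead of reading every line. The sketch below shows the idea on a small in-memory list. The three-field line layout is a simplified, made-up stand-in (real CDX lines carry more fields), and the lookup function is illustrative rather than the Wayback implementation, which would search within the file itself.

```python
from bisect import bisect_left

# Simplified, made-up index lines: "<url key> <timestamp> <archive file>".
# Only the sort order matters for this sketch.
lines = sorted([
    "example.org/ 20061231000000 c.arc.gz",
    "example.com/about 20060601000000 a.arc.gz",
    "example.com/ 20070101000000 b.arc.gz",
    "example.com/ 20060101000000 a.arc.gz",
])


def lookup(sorted_lines, url_key):
    """Return all captures for url_key using binary search on sorted lines."""
    prefix = url_key + " "
    # Jump to the first line that could start with the prefix.
    start = bisect_left(sorted_lines, prefix)
    matches = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break  # sorted order: no later line can match
        matches.append(line)
    return matches


for capture in lookup(lines, "example.com/"):
    print(capture)
```

Because matching lines are adjacent in sorted order, the captures for one URL also come back ordered by timestamp.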

New Release - 0.6.0, 07/14/2006

Release 0.6.0 offers:

  • Timeline Mode - comparable with WERA user interface.
  • Manual Exclusions - allows for blocking sites and paths from the index for specific ranges of time.

New Release - 0.4.0, 03/28/2006

Release 0.4.0 offers many new features and improvements, including:

  • Distributed ARC storage.
  • Improved Javascript and document rewriting for Archival URL replay mode.
  • Several new ResourceIndex implementations: Remote BDB, NutchWax.
  • Live web robots.txt caching and retroactive compliance.
  • "Classic" Wayback Machine query User Interface.

First Release - 0.2.0, 12/09/2005

First public release of the open source wayback.