News

New Release - 1.0.0, 10/12/2007

Release 1.0.0 has several significant changes, most notably a completely new configuration mechanism using Spring IOC. This new configuration system introduces some deployment concepts:

  • WaybackCollections define a set of documents via the previously existing ResourceStore and ResourceIndex implementations.
  • AccessPoints define a method by which users can access and interact with a WaybackCollection. A single WaybackCollection may be exposed to users through several AccessPoints simultaneously. Each AccessPoint specifies an access URL, a Query interface, a Replay interface, and several optional access restrictions, including limiting who can connect to the AccessPoint, and which documents in the WaybackCollection are available through the AccessPoint.
This new configuration framework allows hosting of hundreds of individual collections within a single wayback installation, each with potentially multiple AccessPoints. This version also includes a major refactoring of the Replay User Interface framework, simplifying extension and the creation of novel replay modes. Specifically, one or more external .jsp files can be used to generate additional HTML content within replayed HTML pages. The Timeline Replay mode has been completely replaced by one of these external .jsp files, which inserts the Timeline banner inside replayed HTML pages.
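
The relationship between WaybackCollections and AccessPoints can be pictured with a small data model. The following is a minimal sketch in Python, not the Spring configuration that wayback actually uses: the class fields, the `permits` method, the string placeholders, and the example URLs and network ranges are all hypothetical and chosen only to illustrate one collection being exposed through two AccessPoints with different restrictions.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WaybackCollection:
    """A set of documents: a store of archived files plus an index over them."""
    name: str
    resource_store: str   # placeholder for a ResourceStore implementation
    resource_index: str   # placeholder for a ResourceIndex implementation


@dataclass(frozen=True)
class AccessPoint:
    """One way of exposing a collection to users (all fields are illustrative)."""
    access_url: str
    collection: WaybackCollection
    query_ui: str
    replay_ui: str
    allowed_networks: Optional[Sequence[str]] = None  # None means anyone may connect
    allowed_url_prefix: Optional[str] = None          # None means all documents

    def permits(self, client_ip: str, document_url: str) -> bool:
        """Apply the optional client and document restrictions."""
        if self.allowed_networks is not None:
            client = ipaddress.ip_address(client_ip)
            if not any(client in ipaddress.ip_network(net)
                       for net in self.allowed_networks):
                return False
        if self.allowed_url_prefix is not None:
            return document_url.startswith(self.allowed_url_prefix)
        return True


# One collection, exposed through two AccessPoints at the same time.
crawl = WaybackCollection("crawl-2007", "arc-directory", "cdx-index")
public = AccessPoint("http://wayback.example.org/public/", crawl,
                     "classic-query", "archival-url",
                     allowed_url_prefix="http://example.org/")
internal = AccessPoint("http://wayback.example.org/internal/", crawl,
                       "classic-query", "proxy",
                       allowed_networks=("10.0.0.0/8",))
```

In this sketch the public AccessPoint serves only documents under one site to any client, while the internal AccessPoint serves the whole collection but only to clients on a private network.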

This version includes a very experimental new Replay mode, domain-prefix replay mode, which performs all markup and recontextualization of replayed HTML documents on the server side, eliminating the need for client-side Javascript execution. Please ask on the discussion list for assistance in using this Replay mode.

Lastly, this version has some internal improvements which should reduce memory consumption, and the software is now built using Maven 2.

New Release - 0.8.0, 01/11/2007

Release 0.8.0 offers several new features, most notably a CDX format flat file ResourceIndex implementation, improved character set detection, and many smaller improvements, bug-fixes, and optimizations.

    Major Features:
  • Added Sorted CDX flat file ResourceIndex implementation, allowing for much larger data sets.
  • Improved character set detection so pages are not mangled when server side modification occurs.
  • Several new command-line tools, for generating and updating each ResourceIndex type.
  • Bug-fixes to allow integration with NutchWax full-text searching.
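
A sorted flat file scales to large data sets because a lookup can use binary search instead of scanning every record. The sketch below illustrates the idea on an in-memory list of lines. It assumes a simplified line layout of URL key, timestamp, then remaining fields, separated by spaces; that layout, the sample records, and the function name `find_captures` are assumptions for illustration and not the ResourceIndex implementation itself. A file-based version would seek within the file rather than hold all lines in memory.

```python
import bisect


def find_captures(sorted_lines, url_key):
    """Return all lines whose first field equals url_key.

    sorted_lines must be sorted lexicographically, as a sorted CDX file is.
    """
    prefix = url_key + " "
    # Binary search for the first line that is >= the prefix.
    start = bisect.bisect_left(sorted_lines, prefix)
    matches = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break  # sorted order guarantees no further matches
        matches.append(line)
    return matches


# Hypothetical records: url-key, 14-digit timestamp, ARC file, offset.
cdx_lines = sorted([
    "example.org/ 20050301120000 crawl-2.arc.gz 2048",
    "example.com/about 20060102030405 crawl-1.arc.gz 9000",
    "example.com/ 20060102030405 crawl-1.arc.gz 512",
    "example.com/ 20050101000000 crawl-0.arc.gz 100",
])

captures = find_captures(cdx_lines, "example.com/")
```

Because the timestamp follows the URL key on each line, the matches for one URL come back already ordered by capture date.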

New Release - 0.6.0, 07/14/2006

Release 0.6.0 offers:
  • Timeline Mode - comparable to the WERA user interface.
  • Manual Exclusions - allows for blocking sites and paths from the index for specific ranges of time.
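
As a rough illustration of how a time-ranged exclusion might be evaluated, the sketch below blocks a capture when its URL falls under an excluded prefix and its timestamp lies inside the excluded range. The `Exclusion` record, the prefix matching rule, and the use of 14-digit timestamp strings are assumptions made for this example; the actual exclusion format is described in the User Manual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exclusion:
    """Hypothetical exclusion rule: a URL prefix plus an inclusive time range."""
    url_prefix: str
    start: str  # 14-digit timestamp, YYYYMMDDhhmmss
    end: str    # 14-digit timestamp, YYYYMMDDhhmmss


def is_excluded(exclusions, url, timestamp):
    """True if any rule covers both the URL and the capture timestamp.

    Fixed-width timestamp strings compare correctly as plain strings.
    """
    return any(
        url.startswith(rule.url_prefix) and rule.start <= timestamp <= rule.end
        for rule in exclusions
    )


rules = [
    Exclusion("http://example.com/private/", "20040101000000", "20051231235959"),
]
```

With these rules, a 2005 capture under `http://example.com/private/` is blocked, while a 2006 capture of the same path, or any capture elsewhere on the site, remains available.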

New Release - 0.4.0, 03/28/2006

Release 0.4.0 offers many new features and improvements, including:
  • Distributed ARC storage.
  • Improved Javascript and document rewriting for Archival URL replay mode.
  • Several new ResourceIndex implementations: Remote BDB, NutchWax.
  • Live web robots.txt caching and retroactive compliance.
  • "Classic" Wayback Machine query User Interface.

First Release - 0.2.0, 12/09/2005

First public release of the open source wayback. See below in the Introduction section for a listing of initial features.

Introduction

wayback is an open source Java implementation of the Internet Archive Wayback Machine. The current production version of the Wayback Machine is implemented in Perl and lacks maintainability and extensibility. Also, its code is not open source. The primary motivation for the new version is to address these three issues, enabling public distribution of the application and easy experimentation with new features and access technologies.

The current Java version of the Wayback Machine supports two access, or replay, modes of operation: "Archival URL" mode and "Proxy" mode.

Archival URL mode provides a user experience very close to the current production Wayback Machine. All query and replay access requests can be expressed as URLs. In Archival URL replay mode, HTML documents are delivered with additional Javascript embedded in the page. This Javascript alters the document within the browser, attempting to make links and embedded content refer back to the Wayback Machine by rewriting them as Archival URLs.

Proxy mode allows replaying of archived documents within a client browser by configuring the browser to proxy all HTTP requests through the Wayback Machine. This has the strong advantage that no Javascript page markup is required to coerce the client browser to request additional URLs and embedded content from the Wayback Machine; content just works as-is. When used with the Firefox extension, client browsers can navigate between versions of the current document, and the Wayback Machine server will attempt to display images from the same time period as the pages being viewed. Proxy mode requires special configuration of the client web browser to access the Wayback service. This browser configuration is not complex, but it means that content cannot be accessed as a global URL.

Timeline Mode allows navigation between the different capture dates of the current page, similar to the WERA application, using framesets.
See the User Manual to learn more about access modes.
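
To make the link rewriting in Archival URL mode more concrete, here is a minimal sketch of turning a link found in an archived page into a URL that points back at the Wayback Machine. In wayback this rewriting is done by Javascript in the browser; the Python below only illustrates the transformation. The URL layout (service prefix, then capture timestamp, then the original URL), the function name `to_archival_url`, and the example hosts are assumptions for illustration.

```python
from urllib.parse import urljoin


def to_archival_url(wayback_prefix, timestamp, page_url, link):
    """Rewrite a link from an archived page so it refers back to the archive.

    wayback_prefix: base URL of the (hypothetical) wayback service
    timestamp:      capture date of the page being replayed
    page_url:       original URL of the page that contains the link
    link:           the href/src value as it appears in the page
    """
    # Resolve relative links against the page's original location first.
    absolute = urljoin(page_url, link)
    return f"{wayback_prefix.rstrip('/')}/{timestamp}/{absolute}"


rewritten = to_archival_url(
    "http://wayback.example.org/wayback/",
    "20060714000000",
    "http://example.com/dir/page.html",
    "../img/logo.png",
)
```

Carrying the page's timestamp into each rewritten link is what lets embedded content be requested from roughly the same time period as the page itself.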

The current Java version is intended to operate as a standalone webapp, maintaining an index on the machine hosting the webapp. This index contains records of the resources within a set of ARC files, which are also assumed to be stored on the same machine hosting the webapp.

This software includes the capability to scan for ARC files in a specified location, and to automatically index and serve content in newly discovered ARC files as they appear. Directing the Wayback Machine to look for ARC files in the directory where an instance of the Heritrix web crawler is writing ARC output should provide the capability to browse content archived by Heritrix as it is crawled.
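
The scanning behaviour described above can be sketched as a function that lists a directory and reports ARC files that have not yet been indexed. This is an illustrative sketch only: the function name `discover_new_arcs`, the `.arc` and `.arc.gz` suffix filter, and the use of a plain set to remember indexed file names are assumptions, not the mechanism wayback uses internally.

```python
from pathlib import Path

ARC_SUFFIXES = (".arc", ".arc.gz")  # assumed file name endings for ARC files


def discover_new_arcs(directory, already_indexed):
    """Return ARC files in directory whose names are not in already_indexed.

    Files with any other suffix are ignored.
    """
    found = []
    for path in sorted(Path(directory).iterdir()):
        if (path.is_file()
                and path.name.endswith(ARC_SUFFIXES)
                and path.name not in already_indexed):
            found.append(path)
    return found
```

A caller would run this periodically, index each returned file, and then add its name to `already_indexed` so the next scan reports only files that appeared in the meantime.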

The 0.4.0 version includes the capability to retrieve documents from ARC files stored on remote hosts using HTTP 1.1. Please see the User Manual for more information about using this and other new features.

Future versions of this software may integrate more tightly with the Heritrix web crawler application.