News
New Release - 1.0.0, 10/12/2007
Release 1.0.0 has several significant changes, most notably a
completely new configuration mechanism using Spring IOC. This new
configuration system introduces some deployment concepts:
-
WaybackCollections define a set of documents via the
previously existing ResourceStore and ResourceIndex
implementations.
-
AccessPoints define a method by which users can access
and interact with a WaybackCollection. A single
WaybackCollection may be exposed to users through several
AccessPoints simultaneously. Each AccessPoint specifies an
access URL, a Query interface, a Replay interface, and
several optional access restrictions, including limiting who
can connect to the AccessPoint, and which documents in the
WaybackCollection are available through the AccessPoint.
This new configuration framework allows hosting of hundreds of
individual collections within a single wayback installation, each
with potentially multiple AccessPoints.
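The relationship between the two concepts can be pictured with a small Java model. This is only an illustrative sketch: the class names, fields, and URLs below are hypothetical stand-ins, not wayback's real classes or its Spring bean definitions. It shows one collection exposed through two access points.

```java
import java.util.List;

// Illustrative model only: these class names are hypothetical and are
// not wayback's real classes or its Spring configuration.
class DeploymentSketch {

    // Stands in for a WaybackCollection: a named set of documents.
    static final class DocCollection {
        final String name;
        DocCollection(String name) { this.name = name; }
    }

    // Stands in for an AccessPoint: one URL and replay mode over one collection.
    static final class AccessPoint {
        final String accessUrl;
        final String replayMode;
        final DocCollection collection;
        AccessPoint(String accessUrl, String replayMode, DocCollection collection) {
            this.accessUrl = accessUrl;
            this.replayMode = replayMode;
            this.collection = collection;
        }
    }

    // Count the access points that expose the given collection.
    static long accessPointsFor(DocCollection c, List<AccessPoint> points) {
        return points.stream().filter(p -> p.collection == c).count();
    }

    public static void main(String[] args) {
        DocCollection crawl = new DocCollection("example-crawl");
        List<AccessPoint> points = List.of(
            new AccessPoint("http://localhost:8080/wayback/", "archival-url", crawl),
            new AccessPoint("http://localhost:8090/", "proxy", crawl));
        System.out.println(accessPointsFor(crawl, points));
    }
}
```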
This version also includes a major refactoring of the Replay User
Interface framework, simplifying extension and the creation of novel
replay modes. Specifically, one or more external .jsp files can
be used to generate additional HTML content within replayed HTML
pages. The Timeline Replay mode has been completely replaced by one
of these external .jsp files, which inserts the Timeline banner
inside replayed HTML pages.
This version includes a very experimental new Replay mode,
domain-prefix replay mode, which performs all markup and
recontextualization of replayed HTML documents on the server-side,
eliminating the need for client-side Javascript execution. Please
ask on the discussion list for assistance in using this Replay mode.
Lastly, this version has some internal improvements which should
reduce memory consumption, and the software is now built using
Maven 2.
New Release - 0.8.0, 01/11/2007
Release 0.8.0 offers several new features, most notably a CDX
format flat file ResourceIndex implementation, improved
character set detection, and many smaller improvements,
bug-fixes, and optimizations.
Major Features:
-
Added Sorted CDX flat file ResourceIndex implementation,
allowing for much larger data sets.
-
Improved character set detection so pages are not
mangled when server side modification occurs.
-
Several new command-line tools, for generating and
updating each ResourceIndex type.
-
Bug-fixes to allow integration with NutchWax full-text
searching.
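One reason a sorted flat file can serve much larger data sets is that a lookup can use binary search instead of holding the whole index in a database. The Java sketch below illustrates the idea on an in-memory list of sorted lines. It is an assumption-laden illustration, not wayback's implementation: the class name is hypothetical, and the three-field line format is a simplification rather than the actual CDX layout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: binary search over sorted, space-separated index lines.
class SortedIndexSketch {

    // Return every line whose first field equals key.
    // Assumes sortedLines is in lexicographic (String.compareTo) order.
    static List<String> lookup(List<String> sortedLines, String key) {
        String prefix = key + " ";
        int lo = 0;
        int hi = sortedLines.size();
        // Find the first line that is not less than the prefix.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedLines.get(mid).compareTo(prefix) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // Matching lines are contiguous, so scan forward until the prefix changes.
        List<String> matches = new ArrayList<>();
        for (int i = lo; i < sortedLines.size() && sortedLines.get(i).startsWith(prefix); i++) {
            matches.add(sortedLines.get(i));
        }
        return matches;
    }

    public static void main(String[] args) {
        // Simplified, made-up lines: url, capture timestamp, ARC file name.
        List<String> lines = List.of(
            "example.com/ 20060101000000 a.arc.gz",
            "example.com/ 20070101000000 b.arc.gz",
            "example.org/ 20060601000000 a.arc.gz");
        System.out.println(lookup(lines, "example.com/"));
    }
}
```

A file-backed version would seek within the file rather than index into a list, but the search logic stays the same.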
New Release - 0.6.0, 07/14/2006
Release 0.6.0 offers:
-
Timeline Mode - comparable to the WERA user interface.
-
Manual Exclusions - allows for blocking sites and paths
from the index for specific ranges of time.
New Release - 0.4.0, 03/28/2006
Release 0.4.0 offers many new features and improvements,
including:
-
Distributed ARC storage.
-
Improved Javascript and document rewriting for Archival
URL replay mode.
-
Several new ResourceIndex implementations: Remote BDB,
NutchWax.
-
Live web robots.txt caching and retroactive compliance.
-
"Classic" Wayback Machine query User Interface.
First Release - 0.2.0, 12/09/2005
First public release of the open source wayback.
See the Introduction section below for a listing of initial features.
Introduction
wayback is an open source Java implementation of the Internet Archive
Wayback Machine.
The current production version of the Wayback Machine is implemented in
Perl and is difficult to maintain and extend. Also, the code is
not open source. The primary motivation for the new version is to address
these three issues, enabling public distribution of the application and
easy experimentation with new features and access technologies.
The current Java version of the Wayback Machine supports two access, or
replay, modes of operation: "Archival URL" mode and "Proxy" mode.
Archival URL mode provides a user experience very close to the current
production Wayback Machine. All query and replay access requests can be
expressed as URLs. In Archival URL replay mode, HTML documents are
delivered with additional Javascript embedded in the page. This
Javascript alters the document within the browser, attempting to make
links and embedded content refer back to the Wayback Machine by
rewriting them as Archival URLs.
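The rewriting step is essentially a string transformation. The sketch below shows it in Java rather than in the in-page Javascript, purely to illustrate the idea. It assumes an Archival URL has the form prefix + 14-digit timestamp + "/" + original URL, as seen on the public Wayback Machine; the class name, method name, and the localhost prefix are hypothetical.

```java
import java.net.URI;

// Hypothetical sketch of rewriting a link into an Archival URL.
class ArchivalUrlSketch {

    // Resolve a (possibly relative) link against the URL of the page it
    // appears on, then wrap the absolute result as an Archival URL.
    static String toArchivalUrl(String waybackPrefix, String timestamp,
                                String pageUrl, String link) {
        String absolute = URI.create(pageUrl).resolve(link).toString();
        return waybackPrefix + timestamp + "/" + absolute;
    }

    public static void main(String[] args) {
        String prefix = "http://localhost:8080/wayback/"; // assumed install location
        String page = "http://example.com/dir/page.html";
        System.out.println(toArchivalUrl(prefix, "20071012000000", page, "img/logo.png"));
        System.out.println(toArchivalUrl(prefix, "20071012000000", page, "/about.html"));
    }
}
```

Relative links have to be resolved against the original page URL first; otherwise the browser would resolve them against the Wayback Machine's own URL and request the wrong resource.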
Proxy URL mode allows replaying of archived documents within a client
browser by configuring the browser to proxy all HTTP requests through
the Wayback Machine. This has the strong advantage that no Javascript
page markup is required to coerce the client browser to request
additional URLs and embedded content from the Wayback Machine -- content
just works as-is. When used with the Firefox plugin extension, available
here, client browsers can navigate between versions of the current
document, and the Wayback Machine server will attempt to display images
from the same time period as pages being viewed. The Proxy URL mode
requires special configuration of the client web browser to access the
Wayback Service. This browser configuration is not complex, but it
means that content cannot be accessed as a global URL.
Timeline Mode allows navigation between the different captured dates
of the current page, similar to the WERA application, using framesets.
See the User Manual to learn more about access modes.
The current Java version is intended to operate as a standalone webapp,
maintaining an index on the machine hosting the webapp. This index
contains records of the resources within a set of ARC files, which are
also assumed to be stored on the same machine hosting the webapp.
This software includes the capability to scan for ARC files in a
specified location, and to automatically index and serve content in
newly discovered ARC files as they appear. Directing the Wayback
Machine to look for ARC files in the directory where an instance of the
Heritrix web crawler is writing ARC output should provide the
capability to browse content archived by Heritrix as it is crawled.
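The scanning behaviour described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not wayback's actual code: the class and method names are hypothetical, and it assumes ARC files can be recognised by a ".arc" or ".arc.gz" file name suffix. Each call returns only the files that earlier calls have not already reported, which is the point at which a real system would index them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical sketch of discovering newly written ARC files in a directory.
class ArcScannerSketch {

    // Paths already reported by a previous scan.
    private final Set<Path> seen = new HashSet<>();

    // Return ARC files in dir that no earlier call has returned.
    List<Path> findNewArcs(Path dir) throws IOException {
        List<Path> fresh = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(p -> {
                    String name = p.getFileName().toString();
                    return name.endsWith(".arc") || name.endsWith(".arc.gz");
                })
                .sorted()
                .forEach(p -> {
                    if (seen.add(p)) {
                        fresh.add(p); // first sighting: would be indexed here
                    }
                });
        }
        return fresh;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        ArcScannerSketch scanner = new ArcScannerSketch();
        System.out.println("new ARC files: " + scanner.findNewArcs(dir));
    }
}
```

Calling findNewArcs periodically on the crawler's output directory would pick up files as they appear.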
The 0.4.0 version includes the capability to retrieve documents from ARC
files stored on remote hosts using HTTP 1.1. Please see the User Manual
for more information about using this and other new features.
Future versions of this software may integrate more tightly with the
Heritrix web crawler application.