Please see the System Requirements.
Please see the Software Downloads page.
Once you have downloaded the .tar.gz file from SourceForge, unpack it to access the webapp file, wayback-webapp-1.6.0.war.
Installation and configuration of this software involves the following steps:
The wayback software provides Query and Replay access to archived documents. Query access allows users to locate particular documents within the collection by URL and date. Replay access allows users to view archived pages within their web browsers. Some Replay modes require altering the original pages and resources, so embedded and referenced content is also loaded from the Wayback service, and not from the live web.
A WaybackCollection defines a set of archived documents and an index which allows documents to be quickly located within the collection. A WaybackCollection may be exposed to end users through one or more AccessPoints, which define:
Wayback is configured using Spring IoC, which is used to specify and configure concrete implementations of several basic modules. Please see the Spring website for more information on configuring beans using Spring XML.
An AccessPoint's configuration must specify the following implementations:
An AccessPoint's configuration may optionally specify the following, but must specify at least one of replayPrefix, queryPrefix, or staticPrefix:
AccessPoints can be used to provide different levels and types of access to the same collection for different users. For example, you can provide both Proxy and Archival URL mode access to a single collection by defining two AccessPoints with different Replay User Interfaces but the same WaybackCollection. Using AccessPoints, you can also provide different levels of access to a collection. For example, users within a particular subnet may be able to access all documents within a collection via one AccessPoint, while users outside that subnet may be restricted to viewing documents allowed by a web site's current robots.txt file.
Please refer to wayback.xml within the wayback .war file for detailed example AccessPoint configurations.
A WaybackCollection's configuration must specify the following implementations:
A WaybackCollection's configuration may optionally specify the following:
For more information on WaybackCollection configuration options and automatic indexing, please refer to the following documentation pages and to the example Spring .xml configuration files within the wayback .war:
There are presently three Replay modes supported by the Wayback software: Archival URL mode, Proxy mode, and an experimental DomainPrefix mode.
Archival URL Replay mode uses a modified URL to designate
documents stored in ARC/WARC files. The general form of an
Archival URL is:
Following the date portion of a timestamp, the following flags can be appended:
The properties parser and uriConverter for Archival URL Access Points must be set to the following implementations:
<property name="parser">
  <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser" init-method="init">
    <property name="maxRecords" value="1000" />
    <property name="earliestTimestamp" value="1996" />
  </bean>
</property>
<property name="uriConverter">
  <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
    <property name="replayURIPrefix" value="http://wayback.example.org:8080/collection/" />
  </bean>
</property>
| maxRecords | optional | Sets the default maximum number of records requested for Archival URL query requests. |
| earliestTimestamp | optional | Sets the default start date of requested records for Archival URL query requests. |
| replayURIPrefix | required | Points to the Archival URL prefix of the Access Point, as illustrated in the Access Point Naming document. |
For additional configuration examples and information about Archival URL Replay mode, please see the file ArchivalUrlReplay.xml.
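As a rough illustration of how an Archival URL is put together, the following Python sketch joins an Access Point prefix, a 14-digit timestamp, and the original URL. It assumes the commonly seen prefix/timestamp/original-URL layout and ignores the optional flags. The function name `archival_url` is hypothetical and is not part of Wayback.

```python
from datetime import datetime, timezone


def archival_url(prefix, capture_time, original_url):
    """Build an Archival URL from a prefix, a capture time, and a URL.

    Assumption: the URL has the form PREFIX/TIMESTAMP/ORIGINAL_URL, with a
    14-digit UTC timestamp (YYYYMMDDhhmmss) and no flags appended.
    """
    timestamp = capture_time.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{prefix.rstrip('/')}/{timestamp}/{original_url}"


url = archival_url(
    "http://wayback.example.org:8080/collection/",
    datetime(1999, 12, 31, 12, 0, 0, tzinfo=timezone.utc),
    "http://www.yahoo.com/foo.gif",
)
print(url)
```

The prefix used here matches the replayURIPrefix value from the example configuration above; a real deployment would substitute its own Access Point prefix.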
Wayback can be configured to act as an HTTP proxy server. To utilize this mode, the wayback webapp must be deployed as the ROOT context, no other AccessPoints can use the port dedicated to the Proxy AccessPoint, and client browsers must be configured to proxy all HTTP requests through the Wayback Machine application. Instead of retrieving documents from the live web, the Wayback Machine will retrieve documents from the configured WaybackCollection.
Additionally, there is an experimental Firefox-specific plugin developed by Oskar Grenholm, which provides a novel interface to navigate between different captured versions of a page within Proxy mode. It also sends a special HTTP header which allows Wayback to uniquely associate the correct date with browsers, even those behind a NAT system. You can find out more about this plugin and download it here.
The following is an example Proxy Replay Access Point definition. It assumes that Wayback is running on the host wayback.somehost.org, that a Tomcat Connector has been added for port 8090, that the Wayback webapp has been deployed at the ROOT context, and that another Archival URL Access Point named "8080:wayback" has been configured.
<bean name="8090" parent="8080:wayback">
  <property name="queryPrefix" value="http://wayback.somehost.org/" />
  <property name="replay" ref="proxyreplay" />
  <property name="uriConverter">
    <bean class="org.archive.wayback.proxy.RedirectResultURIConverter">
      <property name="redirectURI" value="http://wayback.somehost.org/jsp/Redirect.jsp" />
    </bean>
  </property>
  <property name="parser">
    <bean class="org.archive.wayback.proxy.ProxyRequestParser">
      <property name="localhostNames">
        <list>
          <value>wayback.somehost.org</value>
        </list>
      </property>
      <property name="maxRecords" value="1000" />
    </bean>
  </property>
</bean>
redirectURI is required, and must be set to the name of the host where the Wayback application is running. If this is not the primary name of the machine running the Wayback application, then you may need to also specify the hostname used for the Wayback application in the localhostNames configuration list.
For additional configuration examples and information about Proxy Replay mode, please see the file ProxyReplay.xml.
Wayback includes an additional, experimental Replay mode which is similar to Archival URL mode, in that any document can be referenced as a global URL, without any browser configuration requirements. This mode requires deploying the Wayback webapp in the ROOT context, and a special DNS wildcard alias, so that all hostnames with a common suffix are directed to the host running Wayback.
The general form of a DomainPrefix URL is:
Here is an example DomainPrefix URL, on an assumed host
wayback.somehost.org, with a wayback webapp deployed as
ROOT, via the Access Point named 8081 (which indicates the
port Wayback requests will be received on) for the
page http://www.yahoo.com/foo.gif on Dec 31, 1999 at 12:00:00 UTC.
Wayback provides several opportunities for customizing the user interface presented to users, which can be grouped into 4 categories:
All content returned by Wayback in response to Query requests is generated by .jsp files, which are executed and provided access to the results found within the ResourceIndex. Wayback is distributed with several sample implementations.
To alter the default behavior, you may either provide your own .jsp files, and configure the Renderer to use them instead of the default .jsp files, or the default .jsp files may be modified directly.
Wayback allows for embedding additional content within replayed HTML pages in all Replay modes. This is accomplished by executing one or more .jsp files with access to context information about the request, the results, and the actual Resource being returned. The output of each .jsp file is included within the returned page.
Wayback is distributed with several example .jsp insert files that can be used as is, modified to suit installation requirements, or used as examples for more elaborate customizations:
Wayback is distributed with a default ExceptionRenderer that allows customization of several types of anticipated exceptions that can occur during normal operation. The BaseExceptionRenderer allows installations to provide alternate .jsp files which are executed, and the output of these .jsp files is returned to end users. To alter the default behavior, you may either provide your own .jsp files and configure the BaseExceptionRenderer to use them instead of the default .jsp files, or modify the default .jsp files directly.
Wayback is packaged with a set of reference implementation .jsp files for generating Query, Replay, and Exception user interface pages. References to actual user-visible text are abstracted within these .jsp files, so the specific text to display in various pages is read from a .properties file. Wayback will automatically search for a Locale-specific .properties file from which these text values should be loaded, allowing the language presented to users to be changed.
By default, Wayback will use the language preference indicated by the user's web browser to find an appropriate .properties file, defaulting to the standard English text if the user's preferred language is not available. Particular AccessPoints can be forced to a particular Locale using the AccessPoint.locale property.
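The fallback behavior described above can be sketched roughly as follows. This is only an illustration of the idea, not Wayback's actual lookup code: the function name `pick_locale`, the set of available languages, and the step that falls back from a regional tag to its base language are all assumptions.

```python
def pick_locale(preferred, available, default="en"):
    """Return the first preferred language that has a .properties file.

    preferred: language tags in the order sent by the browser (assumed input).
    available: languages for which a .properties file exists (assumed input).
    """
    for tag in preferred:
        # Try the full tag first ("pt-BR"), then its base language ("pt").
        for candidate in (tag, tag.split("-")[0]):
            if candidate in available:
                return candidate
    # Nothing matched: fall back to the standard English text.
    return default


available_languages = {"en", "fr", "ja"}
print(pick_locale(["pt-BR", "fr"], available_languages))
print(pick_locale(["de"], available_languages))
```

Forcing an AccessPoint to a single Locale corresponds to skipping this negotiation entirely and always using the configured value.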
Several language customization .properties files have already been contributed by users in the community and are now included with the standard Wayback distribution. We plan a completely new and improved UI implementation for version 1.6, and a more active outreach program to create customizations in as many languages as possible once this new UI is completed and the required text elements are determined.
<property name="exclusionFactory" ref="excluder-factory-robot" />
<bean id="static-exclusion" class="org.archive.wayback.accesscontrol.staticmap.StaticMapExclusionFilterFactory" init-method="init">
  <property name="file" value="/tmp/exclude.txt" />
  <property name="checkInterval" value="600" />
</bean>
<property name="exclusionFactory" ref="static-exclusion" />
<property name="authentication">
  <bean class="org.archive.wayback.authenticationcontrol.IPMatchesBooleanOperator">
    <property name="allowedRanges">
      <list>
        <value>192.168.1.16/24</value>
      </list>
    </property>
  </bean>
</property>
This configuration would have the effect of blocking users outside the 192.168.1.16/24 network.
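To see which client addresses a range such as 192.168.1.16/24 covers, the check can be approximated with Python's standard `ipaddress` module. This is a sketch of CIDR matching in general, not of IPMatchesBooleanOperator itself. The function name `is_allowed` is hypothetical, and how Wayback treats the host bits in 192.168.1.16/24 is an assumption (here they are ignored, giving 192.168.1.0 through 192.168.1.255).

```python
import ipaddress


def is_allowed(client_ip, allowed_ranges):
    """Return True if client_ip falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    for cidr in allowed_ranges:
        # strict=False accepts ranges written with host bits set, such as
        # 192.168.1.16/24, and treats them as the enclosing network.
        if address in ipaddress.ip_network(cidr, strict=False):
            return True
    return False


ranges = ["192.168.1.16/24"]
print(is_allowed("192.168.1.200", ranges))
print(is_allowed("192.168.2.5", ranges))
```

Under these assumptions the first address is inside the range and the second is not.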
<security-role>
  <description>Secured-Wayback</description>
  <role-name>wayback</role-name>
</security-role>
<security-constraint>
  <web-resource-collection>
    <web-resource-name>Secured-Wayback</web-resource-name>
    <url-pattern>/usersecure/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>wayback</role-name>
  </auth-constraint>
</security-constraint>
<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Secured-Wayback</realm-name>
</login-config>
<role rolename="wayback"/>
<user password="changeM3" roles="wayback" name="brad"/>
The following configuration can be added to an Access Point:
<property name="configs">
  <props>
    <prop key="inst">Acrobatic Association</prop>
    <prop key="logo">http://images.somehost.com/logos/acro.jpg</prop>
  </props>
</property>
These configurations are then accessible in the common .jsp rendering pages, allowing Collection or Access Point specific text to be relayed to shared .jsp files, which can then retrieve the Access Point specific configuration with the following code:
UIResults results = UIResults.getGeneric(request);
String instString = results.getContextConfig("inst");
String logoString = results.getContextConfig("logo");
...
The wayback distribution includes several command-line tools that assist in creating and testing index files, and populating the ArcProxy location db.
All the command-line tools can be found underneath the directory where you unpacked your distribution, at bin/* (example: bin/location-client).
This tool allows several maintenance operations to be performed on BDB files. There are two primary modes, read and write.
bin/bdb-client -r BDB_DIR BDB_NAME [PREFIX]
bin/bdb-client -w BDB_DIR BDB_NAME
This tool allows binary searching against large sorted text files. It will output lines prefixed with a particular key on STDOUT.
bin/bin-search KEY FILE [FILE2 ...]
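The idea behind a prefix lookup in sorted data can be sketched in Python with the standard `bisect` module. This is an in-memory illustration only: the real tool works against large files on disk, and the function name `lines_with_prefix` and the sample index lines are made up for the example.

```python
import bisect


def lines_with_prefix(sorted_lines, key):
    """Return every line that starts with key from a sorted list of lines."""
    # Binary search for the first line that is >= key. Any line prefixed
    # with key sorts at or after this position.
    start = bisect.bisect_left(sorted_lines, key)
    matches = []
    for line in sorted_lines[start:]:
        # Matching lines are contiguous, so stop at the first non-match.
        if not line.startswith(key):
            break
        matches.append(line)
    return matches


index = sorted([
    "example.org/ 20010101000000",
    "example.com/a 20000101000000",
    "example.com/ 19991231120000",
])
for line in lines_with_prefix(index, "example.com/"):
    print(line)
```

The lookup only works if the input is sorted in the same order the search compares in, which is why the sort order of index files matters.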
These tools create a CDX format index for the ARC/WARC file at PATH, either on STDOUT, or at the path specified by CDX_PATH. The resulting file can be sorted and merged with other CDX format index files to generate a CDX format ResourceIndex.
bin/cdx-indexer [-identity] PATH [CDX_PATH]
Note that when manually constructing CDX files using these tools, you must set the environment variable LC_ALL=C when using the standard UNIX sort command line tool.
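The reason for LC_ALL=C is that the index must be ordered by raw byte value rather than by locale-specific collation rules. The Python sketch below shows that ordering on a few made-up lines, and how already sorted runs can be merged. It illustrates the ordering only and is not a replacement for the sort tool; the function names and sample lines are hypothetical.

```python
import heapq


def sort_cdx_lines(lines):
    """Sort lines by raw byte value, the ordering LC_ALL=C sort produces."""
    # Python compares bytes objects byte by byte, independent of locale.
    return sorted(line.encode("utf-8") for line in lines)


def merge_sorted(*sorted_runs):
    """Merge several byte-sorted runs into one sorted list."""
    return list(heapq.merge(*sorted_runs))


run_a = sort_cdx_lines(["b.com/ 2001", "B.com/ 2000", "a.com/ 1999"])
run_b = sort_cdx_lines(["c.com/ 2002", "A.com/ 1998"])
for line in merge_sorted(run_a, run_b):
    print(line.decode("utf-8"))
```

In byte order every uppercase ASCII letter sorts before every lowercase one, which many locale collations do not do.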
The -identity option causes the tools to skip canonicalization of URLs. When using this option, you will need to pass the CDX records through the url-client tool before sorting them into a production CDX index. See the documentation for the url-client tool, and the URL Canonicalization section for more information.
If you have already populated your ResourceIndex and just need to inform the ArcProxy LocationDB of where ARC files are located, this script will allow you to synchronize the ArcProxy LocationDB with the directories holding your ARC files.
Execute the script once for each directory containing ARC files (on each machine containing ARC files). Again, this script will not index the content of the ARC files, but will only populate the ArcProxy LocationDB with the locations of ARC files.
bin/location-client sync LOCATION_URL ARC_DIR ARC_URL_PREFIX
URLs stored in BDB and CDX format ResourceIndexes are canonicalized to a more generic form. Before performing a lookup operation on the ResourceIndex, the same canonicalization function is applied to requested URLs. This tool will read space-delimited lines from STDIN, and output the same lines on STDOUT, but with one column altered. The column that is changed is assumed to be a URL, and the version output is the canonicalized form of the input URL.
This tool is required when using the cdx-indexer tool with the -identity option. Typical usage involves generating an identity CDX index, then passing the lines in that index through this tool to canonicalize the record URL key for queries. If the identity CDX files are kept, then canonicalization schemes can be swapped without reindexing the original ARC/WARC content. This tool can also be useful for debugging the canonicalization function. See the section URL Canonicalization for more information.
bin/url-client [-cdx] [-d DELIMITER] [-f FIELD] [-f FIELD2] ...
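The column rewriting that this tool performs can be pictured with the following Python sketch. The canonicalization function is passed in as a parameter, and the example uses plain lowercasing purely as a stand-in: it is not Wayback's canonicalization scheme. The function name `rewrite_column` and the zero-based field index are also assumptions made for the illustration.

```python
def rewrite_column(lines, field, canonicalize, delimiter=" "):
    """Yield each line with one delimited column passed through canonicalize.

    field is a zero-based column index in this sketch (an assumption).
    """
    for line in lines:
        columns = line.rstrip("\n").split(delimiter)
        columns[field] = canonicalize(columns[field])
        yield delimiter.join(columns)


identity_lines = ["HTTP://Example.COM/Page 19991231120000"]
# str.lower is only a placeholder for a real canonicalization function.
for line in rewrite_column(identity_lines, 0, str.lower):
    print(line)
```

Because the original identity lines are left untouched, the same input can be passed through a different canonicalization function later without reindexing.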
The Wayback software includes an additional application, the FileProxy, which can simplify some distributed ResourceStore implementations. The FileProxy application exposes two external services: one used to configure the underlying database mapping ARC/WARC filenames to the actual, fully qualified HTTP 1.1 URL or local path, and a second which reverse proxies incoming HTTP 1.1 range requests to the appropriate back-end storage nodes.
The fileproxy reverse proxy service allows one or more SimpleResourceStore instances to configure a single URL prefix where all ARC/WARC files are assumed to be located. This reverse proxy then uses a BDB JE database to find the actual current location of the ARC/WARC file, and forwards the request to the host holding that file.
The locationdb service allows population and management of the BDB JE database (the locationDB) used by the fileproxy service. There is also a command-line tool, location-client, described elsewhere in this document, which provides command-line access to the management of the locationDB.
Adding the following configuration to wayback.xml will expose the fileproxy and locationdb services:
<bean id="filelocationdb" class="org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB" init-method="init">
  <property name="bdbPath" value="/tmp/wayback/file-db/db/" />
  <property name="bdbName" value="DB1" />
  <property name="logPath" value="/tmp/wayback/file-db/db.log" />
</bean>
<bean name="8080:fileproxy" class="org.archive.wayback.resourcestore.locationdb.FileProxyServlet">
  <property name="locationDB" ref="filelocationdb" />
</bean>
<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.locationdb.ResourceFileLocationDBServlet">
  <property name="locationDB" ref="filelocationdb" />
</bean>