Requirements

Third Party Packages

Please see the System Requirements.

Wayback Software

Please see the Software Downloads page.

Installing

Installing Tomcat

Please refer to the README file included with your Tomcat distribution.

Installing Wayback

Once you have downloaded the .tar.gz file from SourceForge, unpack it to access the webapp file, wayback.war.

Installation and configuration of this software involves the following steps:

  1. Placing the .war file in the appropriate location.
  2. Waiting for Tomcat to unpack the .war file.
  3. Customizing the base wayback.xml file.
  4. Restarting Tomcat.

Wayback Configuration Overview

The Wayback software provides Search and Replay access to documents contained in a WaybackCollection. Search access allows users to query a collection to locate documents, and is presently limited to URL-based queries. Replay access allows users to view archived content in collections within a web browser.

A WaybackCollection is a combination of a ResourceStore, which contains the actual archived documents, and a ResourceIndex, which provides URL-based search of the documents in the ResourceStore.

The Wayback Machine is configured using Spring IoC, which is used to specify and configure concrete implementations of several basic modules. For information about using Spring, please see this page.

Defining WaybackCollections

The XML configuration template for a Wayback collection follows:

<bean id="localbdbcollection"
	class="org.archive.wayback.webapp.WaybackCollection">
	<property name="resourceStore" ... />
	<property name="resourceIndex" ... />
</bean>

        

The resourceStore property refers to a bean implementing org.archive.wayback.ResourceStore.

The resourceIndex property refers to a bean implementing org.archive.wayback.ResourceIndex.
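
As an illustration, the two properties might be filled in as in the following sketch. It is assembled from the LocalARCResourceStore, LocalResourceIndex and BDBIndex templates described later in this document. The /tmp/wayback paths are placeholders, and the options for automatic indexing are left out.

```xml
<!-- Sketch: a collection backed by local ARC files and a local BDB index. -->
<bean id="localbdbcollection"
  class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.LocalARCResourceStore"
      init-method="init">
      <!-- placeholder path: directory holding the ARC files -->
      <property name="arcDir" value="/tmp/wayback/arcs/" />
    </bean>
  </property>
  <property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
      <property name="source">
        <bean class="org.archive.wayback.resourceindex.bdb.BDBIndex"
          init-method="init">
          <property name="bdbName" value="DB1" />
          <!-- placeholder path: directory holding the BDB files -->
          <property name="bdbPath" value="/tmp/wayback/index/" />
        </bean>
      </property>
      <property name="maxRecords" value="10000" />
    </bean>
  </property>
</bean>
```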

org.archive.wayback.ResourceStore implementations

LocalARCResourceStore

This implementation works well for small collections, where all the ARC files can be placed in a single directory on the same computer running the wayback application. Using NFS or another network filesystem technology and symbolic links can allow this implementation to deal with ARC files in multiple directories, or across multiple storage nodes. This implementation also includes the capability to run a background thread to automatically notice new ARC files appearing, index those ARC files, and hand off the index data for merging with a BDBResourceIndex.

The XML configuration template for a LocalARCResourceStore follows:


<property name="resourceStore">
  <bean class="org.archive.wayback.resourcestore.LocalARCResourceStore"
    init-method="init">
    <property name="arcDir" value="/tmp/wayback/arcs/" />
    <property name="queuedDir" value="/tmp/wayback/arc-indexer/queued" />
    <property name="workDir" value="/tmp/wayback/arc-indexer/work" />
    <property name="runInterval" value="10000" />
    <property name="indexClient">
      <bean class="org.archive.wayback.resourceindex.indexer.IndexClient">
        <property name="tmpDir" value="/tmp/wayback/arc-indexer/tmp" />
        <property name="target" value="/tmp/wayback/index-data/incoming" />
      </bean>
    </property>
  </bean>
</property>

		      

Required configuration:

  • arcDir is the local directory where ARC files will be located.

Optional configuration (only needed for automatic indexing):

  • queuedDir names a local directory where the indexer will maintain state about ARC files that have already been indexed.
  • workDir names a local directory where the indexer will maintain state about ARC files that are about to be indexed.
  • runInterval indicates the number of milliseconds between polling arcDir for newly created ARC files. Default is 10000.
  • tmpDir names a local directory where index data will be stored temporarily before handing off to target.
  • target names either:
    1. a local directory where a BDBIndexUpdater is configured to look for new index data to be merged with a BDBIndex, or
    2. a remote http:// URL where index data should be PUT, for merging with a remote BDBIndex.
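
The hand-off between the automatic indexer and a local index can be sketched as follows. The point of the example is that the IndexClient target and the incoming property of the BDBIndexUpdater (described under BDBIndex in this document) name the same directory. The bean id autoindexcollection is hypothetical and all paths are placeholders.

```xml
<!-- Sketch: local ARC store with automatic indexing feeding a BDB index.
     The bean id is hypothetical; all paths are placeholders. -->
<bean id="autoindexcollection"
  class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.LocalARCResourceStore"
      init-method="init">
      <property name="arcDir" value="/tmp/wayback/arcs/" />
      <property name="queuedDir" value="/tmp/wayback/arc-indexer/queued" />
      <property name="workDir" value="/tmp/wayback/arc-indexer/work" />
      <property name="runInterval" value="10000" />
      <property name="indexClient">
        <bean class="org.archive.wayback.resourceindex.indexer.IndexClient">
          <property name="tmpDir" value="/tmp/wayback/arc-indexer/tmp" />
          <!-- hand-off directory: same directory as "incoming" below -->
          <property name="target" value="/tmp/wayback/index-data/incoming" />
        </bean>
      </property>
    </bean>
  </property>
  <property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
      <property name="source">
        <bean class="org.archive.wayback.resourceindex.bdb.BDBIndex"
          init-method="init">
          <property name="bdbName" value="DB1" />
          <property name="bdbPath" value="/tmp/wayback/index/" />
          <property name="updater">
            <bean class="org.archive.wayback.resourceindex.bdb.BDBIndexUpdater">
              <!-- the directory the IndexClient above writes to -->
              <property name="incoming" value="/tmp/wayback/index-data/incoming/" />
              <property name="runInterval" value="10000" />
            </bean>
          </property>
        </bean>
      </property>
      <property name="maxRecords" value="10000" />
    </bean>
  </property>
</bean>
```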

HttpARCResourceStore

This implementation allows the wayback application to access documents in remote ARC files via HTTP 1.1, and scales to millions of ARC files. The XML configuration template for an HttpARCResourceStore follows:

<property name="resourceStore">
  <bean class="org.archive.wayback.resourcestore.HttpARCResourceStore">
    <property name="urlPrefix" value="http://localhost:8080/arcproxy/" />
  </bean>
</property>

	        
Required configuration:
  • urlPrefix is the http:// URL prefix under which ARC files are exported by an ArcProxy installation. See the section "ArcProxy and LocationDB application" for information about setting up an ArcProxy.

org.archive.wayback.ResourceIndex implementations

LocalResourceIndex

This ResourceIndex implementation allows wayback to search one of several index formats hosted on the same machine as the wayback application. See below for details on which specific index formats are available. The XML configuration template for a LocalResourceIndex follows:

<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
    <property name="source" ... />
    <property name="maxRecords" value="10000" />
  </bean>
</property>

          
maxRecords specifies the maximum number of records to process, and thus that can be returned, during a single query.

source defines the format to be used for storing and searching records in the ResourceIndex. There are several possible implementations available:
  • BDBIndex This implementation is good for smaller-scale installations, up to tens of millions of documents, and allows for fast incremental updates to the index. It also allows for automated index updating.
    
    <bean class="org.archive.wayback.resourceindex.bdb.BDBIndex"
      init-method="init">
      <property name="bdbName" value="DB1" />
      <property name="bdbPath" value="/tmp/wayback/index/" />
      <property name="updater">
        <bean class="org.archive.wayback.resourceindex.bdb.BDBIndexUpdater">
          <property name="incoming" value="/tmp/wayback/index-data/incoming/" />
          <property name="failed" value="/tmp/wayback/index-data/failed/" />
          <property name="merged" value="/tmp/wayback/index-data/merged/" />
          <property name="runInterval" value="10000" />
        </bean>
      </property>
    </bean>
    
                  
    The updater property is optional. If used, a background index merging thread will be started. Every runInterval milliseconds, the thread looks for new files in the incoming directory. Any files present are assumed to be in CDX file format; they are merged into the index and become immediately available for access. Files that are not successfully merged are left in place (or moved to the failed directory, if one is specified). Files that are successfully merged are deleted (or moved to the merged directory, if one is specified).

  • CDXIndex This implementation is good for larger-scale installations, bounded mostly by the size of the index you can first create, and later store, on a single machine. Using the command-line tool arc-indexer and the standard UNIX sort tool (see note below on LC_ALL), you create a sorted flat text file that is searched on each request. Building these sorted files and updating the index are presently manual operations.
    
    <bean id="cdxsearchresultsource" class="org.archive.wayback.resourceindex.cdx.CDXIndex">
      <property name="path" value="/tmp/wayback/cdx-index/index.cdx" />
    </bean>
    
                  
  • CompositeSearchResultSource This implementation allows for searching multiple CDXIndex text files for each request. For optimal search efficiency, multiple index files should be merged (sort -mu) prior to production use, but this implementation lets you trade some search performance for simpler index management.
    
    <bean id="compositecdxresultsource" class="org.archive.wayback.resourceindex.CompositeSearchResultSource">
      <property name="CDXSources">
        <list>
          <value>/tmp/wayback/cdx-index/index.cdx.1</value>
          <value>/tmp/wayback/cdx-index/index.cdx.2</value>
        </list>
      </property>
    </bean>
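
To show how one of these source beans plugs into a LocalResourceIndex, here is a minimal sketch using the CDXIndex bean from above. It relies on the standard Spring ref attribute to point at the bean by id. The top-level bean and the property fragment belong in different places in wayback.xml, as the comments note, and the path is a placeholder.

```xml
<!-- Top-level bean in wayback.xml: the CDX source, as defined above. -->
<bean id="cdxsearchresultsource" class="org.archive.wayback.resourceindex.cdx.CDXIndex">
  <property name="path" value="/tmp/wayback/cdx-index/index.cdx" />
</bean>

<!-- Inside a WaybackCollection bean: point the index at that source. -->
<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
    <property name="source" ref="cdxsearchresultsource" />
    <property name="maxRecords" value="10000" />
  </bean>
</property>
```

Swapping ref="cdxsearchresultsource" for ref="compositecdxresultsource" would, under the same assumptions, search the multiple CDX files listed in the CompositeSearchResultSource bean instead.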
    
                  

RemoteResourceIndex configuration

This ResourceIndex option allows hosting of a ResourceIndex on a machine other than the machine hosting the Wayback webapp. The XML configuration template for a RemoteResourceIndex follows:

<bean id="remoteindex" class="org.archive.wayback.resourceindex.RemoteResourceIndex" init-method="init">
  <property name="searchUrlBase" value="http://wayback-index.archive.org:8080/wayback/xmlquery" />
</bean>

          
searchUrlBase indicates the URL prefix to which OpenSearchQuery parameters are appended to access a Wayback AccessPoint that runs a LocalResourceIndex on a host remote from the Wayback application.
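
As a sketch of a fully distributed setup, the following collection keeps neither ARC files nor index data on the machine running the Wayback webapp: documents come from an HttpARCResourceStore and searches go to the remoteindex bean. The bean id remotecollection is hypothetical, and the URLs are the example values used elsewhere in this document.

```xml
<bean id="remoteindex" class="org.archive.wayback.resourceindex.RemoteResourceIndex" init-method="init">
  <property name="searchUrlBase" value="http://wayback-index.archive.org:8080/wayback/xmlquery" />
</bean>

<!-- Hypothetical bean id: a collection with remote store and remote index. -->
<bean id="remotecollection" class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.HttpARCResourceStore">
      <!-- prefix exported by an ArcProxy installation -->
      <property name="urlPrefix" value="http://localhost:8080/arcproxy/" />
    </bean>
  </property>
  <property name="resourceIndex" ref="remoteindex" />
</bean>
```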

NutchResourceIndex configuration

This ResourceIndex option allows Wayback to query a Nutch full-text search engine. It is highly experimental. For help setting up a NutchResourceIndex, please see this page.

The XML configuration template for a NutchResourceIndex follows:


        <property name="remotenutchindex">
          <bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init">
            <property name="searchUrlBase" value="http://webteam-ws.us.archive.org:8080/katrina/opensearch" />
            <property name="maxRecords" value="100" />
          </bean>
        </property>

          
searchUrlBase indicates the URL prefix to which OpenSearchQuery parameters are appended to access a Nutch server's XML query interface.

Defining AccessPoints for WaybackCollections

Once you have defined one or more WaybackCollections, you need to specify how those collections are exposed to end users. Collections are exposed by defining an AccessPoint for that collection.

An AccessPoint is a combination of a WaybackCollection, a Query User Interface, a Replay User Interface, and a URL by which users interact with that AccessPoint. AccessPoints can also describe mechanisms for excluding documents, and for limiting what users are allowed to interact with the AccessPoint.

AccessPoints can be used to provide different levels and types of access to the same collection for different users. For example, you can provide both Proxy and Archival URL mode access to a single collection by defining two AccessPoints with different Replay User Interfaces but the same WaybackCollection. Using AccessPoints, you can also provide different levels of access to a collection. For example, users within a particular subnet may be able to access all documents within a collection via one AccessPoint, while users outside that subnet may be restricted to viewing only documents currently allowed by a web site's current robots.txt file.

The XML configuration template for an AccessPoint follows:


<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint">
  <property name="collection" ... />
  <property name="query" ... />
  <property name="replay" ... />
  <property name="parser" ... />
  <property name="uriConverter" ... />
  <property name="exclusionFactory" ... />
  <property name="authentication" ... />
  <property name="configs" ... />
</bean>

        

Required property configurations:

  • collection is a reference to the WaybackCollection for this AccessPoint.
  • query defines what .jsp files to use to render results for queries to this AccessPoint. See the section "Query .jsp configuration" for more information.
  • replay defines what Replay User Interface to use for this AccessPoint. See the section "Setting up the Replay User Interface within an AccessPoint" for more information.
  • parser defines how incoming requests are parsed and subsequently processed, and is usually dependent on the Replay User Interface being used with this AccessPoint. See the section "Setting up the Replay User Interface within an AccessPoint" for more information.
  • uriConverter defines how public URLs are constructed to provide Replay access to this AccessPoint. This is usually dependent on the Replay User Interface used with this AccessPoint. See the section "Setting up the Replay User Interface within an AccessPoint" for more information.

Optional property configurations:

  • exclusionFactory defines how documents are excluded within this AccessPoint. See the section "Excluding Documents within an AccessPoint" for more information.
  • authentication defines who is allowed to interact with this AccessPoint. See the section "Limiting Access to an AccessPoint" for more information.
  • configs allows additional customizations within this AccessPoint. See the section "Adding Additional Configurations to an AccessPoint" for more information.

Query .jsp configuration

Wayback provides query results to a .jsp handler page, which is responsible for rendering final output to users. The actual .jsp file invoked for the various response types can be configured as described below. Included with the Wayback package are several reference .jsp implementations, including one which outputs XML. This XML interface is used by the Wayback software in distributed index configurations, but can also be used as an extension point for further user interface customizations.

The XML configuration template for the query Renderer follows below, including the default configuration for each value. The values indicate the path to the .jsp file that will be executed to generate the output for each class of query.

<bean class="org.archive.wayback.query.Renderer">
  <property name="errorJsp" value="/jsp/HTMLError.jsp" />
  <property name="xmlErrorJsp" value="/jsp/XMLError.jsp" />
  <property name="captureJsp" value="/jsp/HTMLResults.jsp" />
  <property name="urlJsp" value="/jsp/HTMLResults.jsp" />
  <property name="xmlJsp" value="/jsp/XMLResults.jsp" />
</bean>

        
The following list indicates when each .jsp is executed:
  • errorJsp will be executed when any type of expected error condition occurs during handling of a request.
  • xmlErrorJsp will be executed when any type of expected error condition occurs during handling of a request that indicates XML response data is desired.
  • captureJsp will be executed when results listing captures for a specific, single URL are requested in HTML format.
  • urlJsp will be executed when results listing captures for multiple URLs, each URL having one or more captures, are requested in HTML format.
  • xmlJsp will be executed when results are requested in XML format.
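
As a sketch of a customization, the following Renderer sends HTML result listings to custom pages while leaving the error and XML handlers alone. The two /jsp/custom/ paths are hypothetical files you would supply yourself. The sketch also assumes that properties left unset keep the defaults shown above, and that the Renderer bean is supplied as the value of an AccessPoint's query property.

```xml
<!-- Inside an AccessPoint bean. Assumption: unset properties keep
     their default values. -->
<property name="query">
  <bean class="org.archive.wayback.query.Renderer">
    <!-- hypothetical custom page for captures of a single URL -->
    <property name="captureJsp" value="/jsp/custom/CaptureResults.jsp" />
    <!-- hypothetical custom page for results spanning multiple URLs -->
    <property name="urlJsp" value="/jsp/custom/UrlResults.jsp" />
  </bean>
</property>
```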

Setting up the Replay User Interface within an AccessPoint

There are presently two Replay modes supported by the Wayback software: Archival URL mode and Proxy mode.

Archival URL

Archival URL Replay mode uses a modified URL to designate documents stored in ARC files. The general form of an Archival URL is:

http://HOSTNAME:PORT/CONTEXT/TIMESTAMP/URL


where
  • HOSTNAME is the host where the Wayback Machine is running.
  • PORT is the port where Tomcat is listening for incoming HTTP requests, which also refers to part of the name of the Access Point. See below for example CONTEXT mappings.
  • CONTEXT is the context where the Wayback Machine webapp has been deployed, plus the name of the Access Point. See below for example CONTEXT mappings.
  • TIMESTAMP is 0 to 14 digits of a date, possibly followed by an asterisk ('*'). The format of a TIMESTAMP is:
    YYYYMMDDHHmmss
    where
    • YYYY represents a 4-digit year
    • MM represents a 2-digit, 1-based month (01 = Jan through 12 = Dec)
    • DD represents a 2-digit day of the month (01-31)
    • HH represents a 2-digit hour (00-23)
    • mm represents a 2-digit minute (00-59)
    • ss represents a 2-digit second (00-59)
    The following are example dates expressed as 14-digit Timestamps:

    Jan 13, 1999 03:34:35 UTC - 19990113033435
    Dec 31, 2004 23:01:00 UTC - 20041231230100


  • URL represents the actual URL that should be replayed.


Here is an example Archival URL for the page http://www.yahoo.com/ on Dec 31, 1999 at 12:00:00 UTC, on an assumed host wayback.somehost.org, with the Wayback webapp deployed as ROOT, via the Access Point named 80:archive:

http://wayback.somehost.org/archive/19991231120000/http://www.yahoo.com/




Archival URL mode allows replay of all versions captured of a particular URL, by modifying the Timestamp. When an Archival URL Replay request is received for a URL, the Wayback Machine will replay the closest version in time to the Timestamp requested of the particular URL.


HTML documents returned in Archival URL Replay mode are modified from the original version to provide a replay experience more consistent with viewing the original content. This is accomplished by the insertion of JavaScript, which executes in the client browser after the page has loaded. This JavaScript modifies most URLs within the HTML page, both anchors (links) and embedded content (images, applets, etc.), so that they become appropriate Archival URL requests back to the Wayback application.


This JavaScript is imperfect: sometimes requests "leak" to the live web temporarily, before the JavaScript has executed. Also, not all URLs are rewritten correctly, especially URLs that are created by JavaScript that was in the original page, and URLs in specialized file types that contain links, such as Flash and PDF documents.


The name of the Access Point bean in the Spring configuration file determines the CONTEXT and PORT used in Archival URLs within that Access Point. The Servlet context name where the Wayback application is deployed also factors into the CONTEXT used within Archival URLs for each Access Point.


The following examples show the Archival URL prefix for two Access Points when the Wayback webapp is deployed in each of two different contexts, "ROOT" and "wb-webapp".


If the following Access Point definitions are present in the wayback.xml:

<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint">
  <property name="collection" ref="localcollection" />
  ...
</bean>

<bean name="8080:wayback2" class="org.archive.wayback.webapp.AccessPoint">
  <property name="collection" ref="localcollection" />
  ...
</bean>

            
then the following table shows the Archival URL prefixes to access each collection on the host "wayback.somehost.org", assuming a Tomcat Connector listening on port 8080:

webapp deployed at   Access Point bean name   Archival URL prefix
ROOT                 8080:wayback             http://wayback.somehost.org:8080/wayback/
ROOT                 8080:wayback2            http://wayback.somehost.org:8080/wayback2/
wb-webapp            8080:wayback             http://wayback.somehost.org:8080/wb-webapp/wayback/
wb-webapp            8080:wayback2            http://wayback.somehost.org:8080/wb-webapp/wayback2/

The properties replay, parser, and uriConverter for Archival URL Access Points must be set to the following implementations:

    <property name="replay">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlReplayDispatcher">
        <property name="jsInserts">
          <list>
            <value>http://wayback.somehost.org:8080/wb-webapp/wm.js</value>
          </list>
        </property>
        <property name="jspInserts">
          <list>
            <value>/replay/Timeline.jsp</value>
          </list>
        </property>
      </bean>
    </property>

    <property name="parser">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser"
        init-method="init">
        <property name="maxRecords" value="1000" />
        <property name="earliestTimestamp" value="1996" />
      </bean>
    </property>

    <property name="uriConverter">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
        <property name="replayURIPrefix" value="http://wayback.somehost.org:8080/wb-webapp/wayback/" />
      </bean>
    </property>

          
  • jsInserts (required) This list must include a reference to the wm.js JavaScript file. Any additional JavaScript URLs listed here will also be referenced from all replayed HTML pages.
  • jspInserts (optional) If any values are listed here, those .jsp files will be invoked for every replayed document, and the resulting output will be included in replayed HTML pages. The example above results in a Timeline banner being included in each replayed HTML page, allowing navigation between different versions of the current URL.
  • maxRecords (optional) Sets the default maximum number of records requested for Archival URL query requests.
  • earliestTimestamp (optional) Sets the default start date of requested records for Archival URL query requests.
  • replayURIPrefix (required) Points to the Archival URL prefix of the Access Point, as illustrated in the preceding table.
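
Pulling these pieces together, a complete Archival URL Access Point might look like the following sketch. It assumes the webapp is deployed at the "wb-webapp" context on wayback.somehost.org:8080 and that a collection bean named localcollection exists. Supplying the Renderer bean as the value of the query property is an assumption based on the "Query .jsp configuration" section. Only the required properties are shown.

```xml
<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint">
  <property name="collection" ref="localcollection" />

  <!-- Assumption: query takes a Renderer bean; the values shown are
       the documented defaults. -->
  <property name="query">
    <bean class="org.archive.wayback.query.Renderer">
      <property name="captureJsp" value="/jsp/HTMLResults.jsp" />
      <property name="urlJsp" value="/jsp/HTMLResults.jsp" />
    </bean>
  </property>

  <property name="replay">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlReplayDispatcher">
      <property name="jsInserts">
        <list>
          <!-- required: the URL-rewriting script -->
          <value>http://wayback.somehost.org:8080/wb-webapp/wm.js</value>
        </list>
      </property>
    </bean>
  </property>

  <property name="parser">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser"
      init-method="init" />
  </property>

  <property name="uriConverter">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
      <!-- context "wb-webapp" plus Access Point name "wayback" -->
      <property name="replayURIPrefix" value="http://wayback.somehost.org:8080/wb-webapp/wayback/" />
    </bean>
  </property>
</bean>
```

Note how replayURIPrefix combines the host, the port from the bean name, the servlet context, and the Access Point name. If the webapp were deployed as ROOT instead, the prefix would drop the wb-webapp segment.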

Proxy

Wayback can be configured to act as an HTTP proxy server. To use this mode, the Wayback webapp must be deployed as the ROOT context, and the client browser must be configured to proxy all HTTP requests through the Wayback Machine application. Instead of retrieving documents from the live web, the Wayback Machine will retrieve documents from the local repository of ARC files.



Proxy Replay mode does not suffer from the shortcomings of the inserted Javascript that the Archival URL mode uses, but it has one major drawback: there is no way to specify which version of a captured document should be replayed. Only the URL to be replayed is sent from the client browser to the Wayback Machine - no date information is sent with the request.



In Proxy Replay mode, the Wayback Machine will return the most recent version captured of any requested page. This behavior can be changed by using the experimental Firefox-specific plugin developed by Oskar Grenholm. You can find out more about this plugin and download it here.



Thanks Oskar!



The following is an example Proxy Replay Access Point definition. It assumes that Wayback is running on the host wayback.somehost.org, that a Tomcat Connector has been added for port 8090, that the Wayback webapp has been deployed at the ROOT context, and that another Archival URL Access Point named "8080:wayback" has been configured.

<bean name="8090" parent="8080:wayback">
  <property name="useServerName" value="true" />
  <property name="replay">
    <bean class="org.archive.wayback.proxy.ProxyReplayDispatcher" />
  </property>
  <property name="uriConverter">
    <bean class="org.archive.wayback.proxy.RedirectResultURIConverter">
      <property name="redirectURI" value="http://wayback.somehost.org:8090/jsp/Redirect.jsp" />
    </bean>
  </property>
  <property name="parser">
    <bean class="org.archive.wayback.proxy.ProxyRequestParser" init-method="init">
      <property name="localhostNames">
        <list>
          <value>wayback.somehost.org</value>
        </list>
      </property>
      <property name="maxRecords" value="1000" />
    </bean>
  </property>
</bean>

          




redirectURI is required, and its host must be the name of the host where the Wayback application is running. If this is not the primary name of the machine running the Wayback application, you may also need to list the hostname used for the Wayback application in the localhostNames configuration list.

Excluding Documents within an AccessPoint

Excluding Documents with live Robots.txt

Documents may be excluded from access within an Access Point by retroactively enforcing the policies in a web site's live robots.txt document. To do so, add the following configuration to the Access Point:

<property name="exclusionFactory" ref="excluder-factory-robot" />

        


Please see the default wayback.xml packaged with this software for an example bean definition for the referenced excluder-factory-robot bean.

Excluding Documents with an Administrative List

Documents may be excluded from access within an Access Point by using a plain text file listing URL prefixes which should be blocked. If this option is used with a non-zero value for checkInterval, the Wayback software will monitor the external file, and will automatically reload the file when it changes.

The following Spring configuration defines a static exclusion file that causes URLs listed in the file /tmp/exclude.txt to be blocked, with the file being checked for updates every 10 minutes.

<bean id="static-exclusion" class="org.archive.wayback.accesscontrol.staticmap.StaticMapExclusionFilterFactory" init-method="init">
  <property name="file" value="/tmp/exclude.txt" />
  <property name="checkInterval" value="600" />
</bean>

        


Adding the following configuration to an Access Point will cause the excluded URLs named in /tmp/exclude.txt to be inaccessible:

<property name="exclusionFactory" ref="static-exclusion" />

        

Restricting who can interact with an AccessPoint

Limiting Access based on IP Addresses

Access to a particular Access Point can be limited to a specific IP address range by adding the following configuration to an Access Point definition.

<property name="authentication">
  <bean class="org.archive.wayback.authenticationcontrol.IPMatchesBooleanOperator">
    <property name="allowedRanges">
      <list>
        <value>192.168.1.16/24</value>
      </list>
    </property>
  </bean>
</property>

        
This has the effect of blocking users outside the 192.168.1.16/24 network.
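
The scenario described earlier, where users inside a subnet see everything while everyone else is subject to robots.txt exclusions, can be sketched with two Access Points on the same collection. The Access Point name 8080:internal is hypothetical, the webapp is assumed to be deployed as ROOT, and the properties elided in the comments are the same as in the Archival URL section.

```xml
<!-- Public Access Point: live robots.txt exclusions are enforced. -->
<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint">
  <property name="collection" ref="localcollection" />
  <!-- query, replay, parser and uriConverter go here, with
       replayURIPrefix http://wayback.somehost.org:8080/wayback/ -->
  <property name="exclusionFactory" ref="excluder-factory-robot" />
</bean>

<!-- Hypothetical internal Access Point: same collection, no exclusions,
     reachable only from the allowed network. -->
<bean name="8080:internal" class="org.archive.wayback.webapp.AccessPoint">
  <property name="collection" ref="localcollection" />
  <!-- query, replay, parser and uriConverter go here, with
       replayURIPrefix http://wayback.somehost.org:8080/internal/ -->
  <property name="authentication">
    <bean class="org.archive.wayback.authenticationcontrol.IPMatchesBooleanOperator">
      <property name="allowedRanges">
        <list>
          <value>192.168.1.16/24</value>
        </list>
      </property>
    </bean>
  </property>
</bean>
```

Each Access Point needs its own replayURIPrefix, because rewritten links must lead back to the Access Point that served the page.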

Limiting Access based on HTTP BASIC Authentication

Access to a particular Access Point can be restricted using Tomcat's built-in configuration options. Add the following configuration to the web.xml; it assumes an Access Point named "8080:secure" (the port may be any port):

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Secured-Wayback</web-resource-name>
    <url-pattern>/secure/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>wayback</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Secured-Wayback</realm-name>
</login-config>

        




Then add the user configuration to the tomcat-users.xml file:

<role rolename="wayback"/>
<user password="changeM3" roles="wayback" username="brad"/>

        

Adding Additional Configurations to an AccessPoint

The following configuration can be added to an Access Point:

<property name="configs">
        <props>
                <prop key="inst">Acrobatic Association</prop>
                <prop key="logo">http://images.somehost.com/logos/acro.jpg</prop>
        </props>
</property>

        
These configurations are then accessible in the common .jsp rendering pages, which allows Collection-specific or Access Point-specific text to be passed to shared .jsp files. A .jsp file can retrieve the Access Point-specific configuration with the following code:

UIResults results = UIResults.getFromRequest(request);
String instString = results.getContextConfig("inst");
String logoString = results.getContextConfig("logo");

        

External Tools

The Wayback distribution includes several command-line tools that assist in creating and testing index files, and in populating the ArcProxy location DB. All of the command-line tools can be found in the bin directory underneath the directory where you unpacked your distribution (example: bin/location-client). You will need to change permissions on the tools to allow them to be executed:

chmod a+x bin/*

bdb-client

This tool allows several maintenance operations to be performed on BDB files. There are two primary modes, read and write.
  1. bin/bdb-client -r BDB_DIR BDB_NAME [PREFIX]
     Read mode: output records from a BDB database on STDOUT, where:
    • BDB_DIR Open the BDB in this directory.
    • BDB_NAME Open the BDB with this name.
    • PREFIX (optional) If present, only output records whose KEY begins with PREFIX. If this option is omitted, all records will be output from the BDB. Records are always output in sorted order.
  2. bin/bdb-client -w BDB_DIR BDB_NAME
     Write mode: read CDX format lines from STDIN and insert them into a BDB, creating the BDB if needed, where:
    • BDB_DIR Open the BDB in this directory.
    • BDB_NAME Open the BDB with this name.

bin-search

This tool allows binary searching against large sorted text files. It will output lines prefixed with a particular key on STDOUT.

bin/bin-search KEY FILE [FILE2 ...]

where:
  • KEY String prefix for lines that should be output.
  • FILE [FILE2 ...] Sequentially search through each file specified, outputting the lines prefixed with KEY for each file. Note that the complete output of bin-search will be sorted when used with a single file, but when multiple files are searched, the results may not be sorted completely.

arc-indexer

This tool creates a CDX format index for the ARC file at ARC_PATH, either on STDOUT, or at the path specified by CDX_PATH. The resulting file can be sorted and merged with other CDX format index files to generate a CDX format ResourceIndex.

bin/arc-indexer ARC_PATH [CDX_PATH]

location-client

Use this tool if you have already populated your ResourceIndex and just need to inform the ArcProxy LocationDB of where ARC files are located. It synchronizes the ArcProxy LocationDB with the directories holding your ARC files. Execute the script once for each directory containing ARC files (on each machine containing ARC files). This script will not index the content of the ARC files; it only populates the ArcProxy LocationDB with the locations of ARC files.

bin/location-client sync LOCATION_URL ARC_DIR ARC_URL_PREFIX

where:
  • LOCATION_URL is the absolute URL where the ArcProxy can be accessed. Example: http://wayback-webapp.your-archive.org:8080/locationdb/locationDB
  • ARC_DIR is the absolute path to the directory on the local machine which holds ARC files. Example: /2/arc-collection-1
  • ARC_URL_PREFIX is the absolute URL where the directory ARC_DIR can be accessed. Example: http://arc-storage-node-1.your-archive.org/2/arc-collection-1/

url-client

URLs stored in BDB and CDX format ResourceIndexes are canonicalized to a more generic form. Before performing a lookup operation on the ResourceIndex, the same canonicalization function is applied to requested URLs. This tool reads space (" ") delimited lines from STDIN and outputs the same lines on STDOUT, but with one column altered. The column that is changed is assumed to be a URL, and the version output is the canonicalized form of the input URL. This tool is mostly useful for debugging the canonicalization function, but if the canonicalization function is altered, it can also be used to update an existing CDX index without recreating CDX files from the original ARCs.

bin/url-client [-cdx] [-f FIELD]

where:
  • -cdx Pass through lines prefixed with " CDX " unchanged.
  • -f FIELD Alter column FIELD of each line, instead of the default column 1.

ArcProxy and LocationDB application

The Wayback software includes an additional application, the ArcProxy, which can simplify some distributed ResourceStore implementations. The ArcProxy application exposes two external services: the locationdb service, used to configure the underlying database that maps ARC filenames to their actual, fully qualified HTTP 1.1 URLs, and the arcproxy service, which reverse proxies incoming HTTP 1.1 range requests to the appropriate back-end storage nodes.

The arcproxy reverse proxy service allows one or more HttpARCResourceStore instances to configure a single URL prefix where all ARC files are assumed to be located. The reverse proxy then uses a BDB JE database to find the actual current location of the ARC file, and forwards the request to the host holding the ARC file.

The locationdb service allows population and management of the BDB JE database (the locationDB) used by the arcproxy service. There is also a command-line tool, location-client, described elsewhere in this document, which provides command-line access to the management of the locationDB.

Adding the following configuration to wayback.xml will expose the arcproxy and locationdb services:

<bean id="filelocationdb" class="org.archive.wayback.resourcestore.http.FileLocationDB"
  init-method="init">
  <property name="bdbPath" value="/tmp/wayback/arc-db" />
  <property name="bdbName" value="DB1" />
  <property name="logPath" value="/tmp/wayback/arc-db.log" />
</bean>

<bean name="8080:arcproxy" class="org.archive.wayback.resourcestore.http.ArcProxyServlet">
  <property name="locationDB" ref="filelocationdb" />
</bean>

<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.http.FileLocationDBServlet">
  <property name="locationDB" ref="filelocationdb" />
</bean>