Installing
Installing Tomcat
Please refer to the README file included with your Tomcat distribution.
Installing Wayback
Once you have downloaded the .tar.gz file from
SourceForge, you will need to unpack it to access the
webapp file, wayback.war.
Installation and configuration of this software involves the
following steps:
-
Placing the .war file in the appropriate location.
-
Waiting for Tomcat to unpack the .war file.
-
Customizing the base wayback.xml file.
-
Restarting Tomcat.
Wayback Configuration Overview
The wayback software provides Search and Replay access to documents
contained in a WaybackCollection. Search access allows users to
query a collection to locate documents, and is presently limited
to URL based queries. Replay access allows users to view archived
content in collections within a web browser. A WaybackCollection is
a combination of a ResourceStore, which contains the actual archived
documents, and a ResourceIndex, which provides URL based search of the
documents in the ResourceStore.
Wayback is configured using Spring IoC, which is used to specify and
configure concrete implementations of several basic modules. For
information about using Spring, please see this page.
Defining WaybackCollections
The XML configuration template for a Wayback collection follows:
<bean id="localbdbcollection"
class="org.archive.wayback.webapp.WaybackCollection">
<property name="resourceStore" ... />
<property name="resourceIndex" ... />
</bean>
The resourceStore property refers to a bean implementing org.archive.wayback.ResourceStore.
The resourceIndex property refers to a bean implementing org.archive.wayback.ResourceIndex.
org.archive.wayback.ResourceStore implementations
LocalARCResourceStore
This implementation works well for small
collections, where all the ARC files can be placed in a single
directory on the same computer running the wayback application.
Using NFS or another network filesystem technology and symbolic
links can allow this implementation to deal with ARC files in
multiple directories, or across multiple storage nodes. This
implementation also includes the capability to run a background
thread to automatically notice new ARC files appearing, index
those ARC files, and hand off the index data for merging with
a BDBResourceIndex.
The XML configuration template for a LocalARCResourceStore follows:
<property name="resourceStore">
<bean class="org.archive.wayback.resourcestore.LocalARCResourceStore"
init-method="init">
<property name="arcDir" value="/tmp/wayback/arcs/" />
<property name="queuedDir" value="/tmp/wayback/arc-indexer/queued" />
<property name="workDir" value="/tmp/wayback/arc-indexer/work" />
<property name="runInterval" value="10000" />
<property name="indexClient">
<bean class="org.archive.wayback.resourceindex.indexer.IndexClient">
<property name="tmpDir" value="/tmp/wayback/arc-indexer/tmp" />
<property name="target" value="/tmp/wayback/index-data/incoming" />
</bean>
</property>
</bean>
</property>
Required configuration:
-
arcDir
is the local directory where ARC files will be
located.
Optional configuration (only needed for automatic indexing)
-
queuedDir
names a local directory where the indexer will maintain state
about ARC files that have already been indexed.
-
workDir
names a local directory where the indexer will maintain state
about ARC files that are about to be indexed.
-
runInterval
indicates the number of milliseconds between polling arcDir
for newly created ARC files. Default is 10000.
-
tmpDir
names a local directory where index data will be stored
temporarily before handing off to target.
-
target
names either:
-
a local directory where a BDBIndexUpdater is configured to
look for new index data to be merged with a BDBIndex, or
-
a remote http:// URL where index data should be PUT, for
merging with a remote BDBIndex.
HttpARCResourceStore
This implementation allows the wayback
application to access documents in remote ARC files via HTTP 1.1,
and scales to millions of ARC files.
The XML configuration template for an HttpARCResourceStore follows:
<property name="resourceStore">
<bean class="org.archive.wayback.resourcestore.HttpARCResourceStore">
<property name="urlPrefix" value="http://localhost:8080/arcproxy/" />
</bean>
</property>
Required configuration:
-
urlPrefix
is the http:// prefix under which ARC files are exported by an
ArcProxy installation. See the section "ArcProxy and LocationDB
application" for information about setting up an ArcProxy.
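As a rough illustration of what access to a remote ARC file over HTTP 1.1 can look like from the client side, the following Python sketch builds (but does not send) a ranged GET request. The ARC file name, the offset, and the open-ended Range header are assumptions made for this example; they are not taken from the HttpARCResourceStore implementation.

```python
import urllib.request

def build_arc_request(url_prefix, arc_name, offset):
    """Build (but do not send) a ranged GET for part of one ARC file."""
    # url_prefix plays the role of the urlPrefix property shown above.
    url = url_prefix.rstrip("/") + "/" + arc_name
    # HTTP 1.1 Range header: ask for everything from `offset` onward (assumed form).
    return urllib.request.Request(url, headers={"Range": "bytes=%d-" % offset})

# "example.arc.gz" and 1024 are hypothetical values.
req = build_arc_request("http://localhost:8080/arcproxy/", "example.arc.gz", 1024)
print(req.full_url, req.get_header("Range"))
```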
org.archive.wayback.ResourceIndex implementations
LocalResourceIndex
This ResourceIndex implementation allows wayback to search one of
several index formats hosted on the same machine as the wayback
application. See below for details on which specific index formats
are available.
The XML configuration template for a LocalResourceIndex follows:
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
<property name="source" ... />
<property name="maxRecords" value="10000" />
</bean>
</property>
maxRecords
specifies the maximum number of records to process, and thus that can
be returned, during a single query.
source
defines the format to be used for storing and searching records in
the ResourceIndex. There are several possible implementations
available:
-
BDBIndex
This implementation is good for smaller scale installations, up
to tens of millions of documents, and allows for fast incremental
updates to the index. It also allows for automated index updating.
<bean class="org.archive.wayback.resourceindex.bdb.BDBIndex"
init-method="init">
<property name="bdbName" value="DB1" />
<property name="bdbPath" value="/tmp/wayback/index/" />
<property name="updater">
<bean class="org.archive.wayback.resourceindex.bdb.BDBIndexUpdater">
<property name="incoming" value="/tmp/wayback/index-data/incoming/" />
<property name="failed" value="/tmp/wayback/index-data/failed/" />
<property name="merged" value="/tmp/wayback/index-data/merged/" />
<property name="runInterval" value="10000" />
</bean>
</property>
</bean>
The updater property is optional. If used, a background
index merging thread will be started. Every runInterval
milliseconds, the thread will look for new files in the
incoming directory. Any files present are assumed to be
in CDX file format, and will be merged into the index and become
immediately available for access. Files that are not successfully
merged with the index are left in place (or moved to the
failed directory, if it is specified). Files that are
successfully merged are deleted (or moved to the merged
directory, if it is specified).
-
CDXIndex
This implementation is good for larger scale installations,
bounded mostly by the size of the index you can (first create,
and later) store on a single machine. Using the command line tool
arc-indexer, and the standard UNIX sort tool
(see note below on LC_ALL), you create a sorted flat text file
that is searched on each request. Building these sorted files
and updating the index are currently manual operations.
<bean id="cdxsearchresultsource" class="org.archive.wayback.resourceindex.cdx.CDXIndex">
<property name="path" value="/tmp/wayback/cdx-index/index.cdx" />
</bean>
-
CompositeSearchResultSource
This implementation allows for searching multiple CDXIndex text
files for each request. For optimal search efficiency, multiple
index files should be merged (sort -mu) prior to production use,
but this implementation allows a trade-off in simplified index
management for a decrease in search performance.
<bean id="compositecdxresultsource" class="org.archive.wayback.resourceindex.CompositeSearchResultSource">
<property name="CDXSources">
<list>
<value>/tmp/wayback/cdx-index/index.cdx.1</value>
<value>/tmp/wayback/cdx-index/index.cdx.2</value>
</list>
</property>
</bean>
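The sorting and merging steps mentioned for CDXIndex and CompositeSearchResultSource can be sketched in Python. This is only an illustration and not part of the Wayback distribution: it assumes the index must be ordered by raw byte value (the ordering that sort produces in the C locale), and the sample lines are invented rather than real CDX records.

```python
import heapq

def sort_cdx(lines):
    """Sort lines by raw byte value, dropping exact duplicates."""
    return sorted(set(lines))

def merge_sorted(*sources):
    """Merge already-sorted sources into one sorted, duplicate-free list
    (comparable in spirit to `sort -mu`)."""
    merged = []
    for line in heapq.merge(*sources):
        if not merged or merged[-1] != line:
            merged.append(line)
    return merged

# Invented sample lines; bytes objects compare by byte value, independent of locale.
a = sort_cdx([b"b.org/ 2004", b"a.org/ 1999", b"a.org/ 1999"])
b = sort_cdx([b"c.org/ 2001", b"a.org/ 1999"])
print(merge_sorted(a, b))
```

Merging ahead of time yields a single file to search per request, which is the trade-off the CompositeSearchResultSource description refers to.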
RemoteResourceIndex configuration
This ResourceIndex option allows hosting of a ResourceIndex on a
machine other than the machine hosting the Wayback webapp.
The XML configuration template for a RemoteResourceIndex follows:
<bean id="remoteindex" class="org.archive.wayback.resourceindex.RemoteResourceIndex" init-method="init">
<property name="searchUrlBase" value="http://wayback-index.archive.org:8080/wayback/xmlquery" />
</bean>
searchUrlBase indicates the URL prefix to which OpenSearchQuery
parameters are appended to access a Wayback AccessPoint running a
LocalResourceIndex on a host remote from the Wayback application.
NutchResourceIndex configuration
This ResourceIndex option allows Wayback to query a Nutch
full-text search engine. This ResourceIndex option is highly
experimental. For help setting up a NutchResourceIndex, please see
this page.
The XML configuration template for a NutchResourceIndex follows:
<property name="remotenutchindex">
<bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init">
<property name="searchUrlBase" value="http://webteam-ws.us.archive.org:8080/katrina/opensearch" />
<property name="maxRecords" value="100" />
</bean>
</property>
searchUrlBase indicates the URL prefix to which OpenSearchQuery
parameters are appended to access a Nutch server's XML query interface.
Defining AccessPoints for WaybackCollections
Once you have defined one or more WaybackCollections, you need to
specify how those collections are exposed to end users. Collections are
exposed by defining an AccessPoint for that collection.
An AccessPoint is a combination of a WaybackCollection, a Query User
Interface, a Replay User Interface, and a URL by which users interact
with that AccessPoint. AccessPoints can also describe mechanisms for
excluding documents, and for limiting what users are allowed to
interact with the AccessPoint.
AccessPoints can be used to provide different levels and types of
access to the same collection for different users. For example, you
can provide both Proxy and Archival URL mode access to a single
collection by defining 2 AccessPoints with different Replay User
Interfaces but the same WaybackCollection. Using AccessPoints, you can
also provide different levels of access to a collection. For example,
users within a particular subnet may be able to access all documents
within a collection via one AccessPoint, but users outside that subnet
may be restricted to viewing only documents currently allowed by a
web site's current robots.txt file.
The XML configuration template for an AccessPoint follows:
<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint">
<property name="collection" ... />
<property name="query" ... />
<property name="replay" ... />
<property name="parser" ... />
<property name="uriConverter" ... />
<property name="exclusionFactory" ... />
<property name="authentication" ... />
<property name="configs" ... />
</bean>
Required property configurations:
-
collection
is a reference to the WaybackCollection for this AccessPoint.
-
query
defines what .jsp files to use to render results for queries to
this AccessPoint. See the section "Query .jsp configuration" for
more information.
-
replay
defines what Replay User Interface to use for this AccessPoint. See
the section "Setting up the Replay User Interface within an
AccessPoint" for more information.
-
parser
defines how incoming requests are parsed and subsequently processed,
and is usually dependent on the Replay User Interface being used
with this AccessPoint. See the section "Setting up the Replay User
Interface within an AccessPoint" for more information.
-
uriConverter
defines how public URLs are constructed to provide Replay access
to this AccessPoint. This is usually dependent on the Replay User
Interface used with this AccessPoint. See the section "Setting up
the Replay User Interface within an AccessPoint" for more
information.
Optional property configurations:
-
exclusionFactory
defines how documents are excluded within this AccessPoint. See the
section "Excluding Documents within an AccessPoint" for more
information.
-
authentication
defines who is allowed to interact with this AccessPoint. See the
section "Restricting who can interact with an AccessPoint" for more
information.
-
configs
allows additional customizations within this AccessPoint. See the
section "Adding Additional Configurations to an AccessPoint" for
more information.
Query .jsp configuration
Wayback provides query results to a .jsp handler page, which is
responsible for rendering final output to users. The actual .jsp file
invoked for the various response types can be configured as described
below. Included with the Wayback package are several reference .jsp
implementations, including one which outputs XML. This XML interface is
used by the Wayback software in distributed index configurations, but
can also be used as an extension point for further user interface
customizations.
The XML configuration template for the query Renderer follows below,
including the default configuration for each value. The values indicate
the path to the .jsp file that will be executed to generate the output
for each class of query.
<bean class="org.archive.wayback.query.Renderer">
<property name="errorJsp" value="/jsp/HTMLError.jsp" />
<property name="xmlErrorJsp" value="/jsp/XMLError.jsp" />
<property name="captureJsp" value="/jsp/HTMLResults.jsp" />
<property name="urlJsp" value="/jsp/HTMLResults.jsp" />
<property name="xmlJsp" value="/jsp/XMLResults.jsp" />
</bean>
The following list indicates when each .jsp is executed:
-
errorJsp
will be executed when any type of expected error condition occurs
during handling of a request.
-
xmlErrorJsp
will be executed when any type of expected error condition occurs
during handling of a request indicating that xml response data is
desired.
-
captureJsp
will be executed when results listing captures for a specific,
single URL are requested in HTML format.
-
urlJsp
will be executed when results listing captures for multiple URLs,
each URL having one or more captures, are requested in HTML format.
-
xmlJsp
will be executed when results are requested in XML format.
Setting up the Replay User Interface within an AccessPoint
There are presently 2 Replay modes supported by the Wayback software,
Archival URL mode, and Proxy mode.
Archival URL
Archival URL Replay mode uses a modified URL to designate
documents stored in ARC files. The general form of an
Archival URL is:
http://HOSTNAME:PORT/CONTEXT/TIMESTAMP/URL
where
-
HOSTNAME is the host where the Wayback Machine is
running.
-
PORT is the port where Tomcat is listening for
incoming HTTP requests, which also refers to part of the name of
the Access Point. See below for example CONTEXT mappings.
-
CONTEXT is the context where the Wayback Machine
webapp has been deployed, plus the name of the Access Point. See
below for example CONTEXT mappings.
-
TIMESTAMP is 0 to 14 digits of a date, possibly
followed by an asterisk ('*'). The format of a
TIMESTAMP is:
YYYYMMDDHHmmss
where
-
YYYY represents a 4-digit year
-
MM represents a 2-digit, 1-based month
(Jan = 01, Dec = 12)
-
DD represents a 2-digit day of the month
(01-31)
-
HH represents a 2-digit hour (00-23)
-
mm represents a 2-digit minute (00-59)
-
ss represents a 2-digit second (00-59)
The following are example dates expressed as
14-digit Timestamps:
Jan 13, 1999 03:34:35 (am UTC) - 19990113033435
Dec 31, 2004 23:01:00 (pm UTC) - 20041231230100
-
URL represents the actual URL that should be
replayed.
Here is an example Archival URL, on an assumed host
wayback.somehost.org, with a wayback webapp deployed as
ROOT, via the Access Point named 80:archive, for the page
http://www.yahoo.com/ on Dec 31, 1999 at 12:00:00 UTC:
http://wayback.somehost.org/archive/19991231120000/http://www.yahoo.com/
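The TIMESTAMP format and the general form of an Archival URL can be illustrated with a short Python sketch. The function names are hypothetical, and the datetime values are treated as UTC; the sketch only shows how the pieces fit together, not how Wayback builds URLs internally.

```python
from datetime import datetime

def to_timestamp(moment):
    """Format a datetime as a 14-digit YYYYMMDDHHmmss timestamp."""
    return moment.strftime("%Y%m%d%H%M%S")

def archival_url(prefix, moment, url):
    """Join an Access Point prefix, a timestamp and the original URL."""
    return prefix.rstrip("/") + "/" + to_timestamp(moment) + "/" + url

print(to_timestamp(datetime(1999, 1, 13, 3, 34, 35)))
print(archival_url("http://wayback.somehost.org/archive",
                   datetime(1999, 12, 31, 12, 0, 0),
                   "http://www.yahoo.com/"))
```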
Archival URL mode allows replay of all versions captured
of a particular URL, by modifying the Timestamp. When an
Archival URL Replay request is received for a URL, the
Wayback Machine will replay the closest version in time
to the Timestamp requested of the particular URL.
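The idea of replaying the capture closest in time to the requested Timestamp can be sketched as follows. This is a simplified model that assumes full 14-digit timestamps and breaks ties by taking the first candidate; how Wayback itself handles partial timestamps or ties is not covered here.

```python
from datetime import datetime

FORMAT = "%Y%m%d%H%M%S"

def closest_capture(captures, requested):
    """Return the capture timestamp nearest in time to `requested`."""
    target = datetime.strptime(requested, FORMAT)
    return min(captures,
               key=lambda ts: abs(datetime.strptime(ts, FORMAT) - target))

# Hypothetical capture dates for one URL.
captures = ["19990113033435", "20000101000000", "20041231230100"]
print(closest_capture(captures, "19991231120000"))
```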
HTML documents returned in Archival URL Replay mode are
modified from the original version to provide a replay
experience more consistent with viewing the original
content. This is accomplished by the insertion of
Javascript, which executes in the client browser after
the page has loaded. This Javascript modifies most URLs
within the HTML page, both Anchors (links) as well as
embedded content (images, applets, etc) so that they
become appropriate Archival URL requests back to the Wayback
application.
This Javascript is imperfect: sometimes requests
"leak" to the live web temporarily, before the
Javascript has executed. Also, not all URLs are
rewritten correctly, especially URLs that are created
by Javascript that was in the original page, and
specialized file types containing links like Flash and
PDF documents.
The name of the Access Point bean in the Spring configuration
file determines the CONTEXT and PORT used in Archival URLs within
that Access Point. The Servlet context name where the Wayback
application is deployed also factors into the CONTEXT used within
Archival URLs for each Access Point.
The following examples show the Archival URL prefix for
two Access Points when the Wayback webapp is
deployed in two different contexts, "ROOT" and "wb-webapp".
If the following Access Point definitions are present in the
wayback.xml:
<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint">
<property name="collection" ref="localcollection" />
...
</bean>
<bean name="8080:wayback2" class="org.archive.wayback.webapp.AccessPoint">
<property name="collection" ref="localcollection" />
...
</bean>
then the following table shows the Archival URL prefixes to access
each collection on the host "wayback.somehost.org" assuming a
Tomcat Connector listening on port 8080:
| webapp deployed at | Access Point bean name | Archival URL prefix |
|---|---|---|
| ROOT | 8080:wayback | http://wayback.somehost.org:8080/wayback/ |
| ROOT | 8080:wayback2 | http://wayback.somehost.org:8080/wayback2/ |
| wb-webapp | 8080:wayback | http://wayback.somehost.org:8080/wb-webapp/wayback/ |
| wb-webapp | 8080:wayback2 | http://wayback.somehost.org:8080/wb-webapp/wayback2/ |
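The mapping shown in the table above can be expressed as a small Python function. This is a sketch inferred from the examples in this document, not Wayback's own code; in particular, leaving port 80 implicit is an assumption based on the earlier 80:archive example.

```python
def archival_url_prefix(host, webapp_context, bean_name):
    """Derive the Archival URL prefix from a 'PORT:NAME' bean name."""
    port, _, name = bean_name.partition(":")
    # Assumption: port 80 is left implicit, as in the 80:archive example.
    authority = host if port == "80" else host + ":" + port
    # A webapp deployed as ROOT contributes no path segment.
    segments = [] if webapp_context == "ROOT" else [webapp_context]
    if name:
        segments.append(name)
    return "http://" + authority + "/" + "".join(s + "/" for s in segments)

print(archival_url_prefix("wayback.somehost.org", "wb-webapp", "8080:wayback2"))
```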
The properties replay, parser, and uriConverter
for Archival URL Access Points must be set to the following
implementations:
<property name="replay">
<bean class="org.archive.wayback.archivalurl.ArchivalUrlReplayDispatcher">
<property name="jsInserts">
<list>
<value>http://wayback.somehost.org:8080/wb-webapp/wm.js</value>
</list>
</property>
<property name="jspInserts">
<list>
<value>/replay/Timeline.jsp</value>
</list>
</property>
</bean>
</property>
<property name="parser">
<bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser"
init-method="init">
<property name="maxRecords" value="1000" />
<property name="earliestTimestamp" value="1996" />
</bean>
</property>
<property name="uriConverter">
<bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
<property name="replayURIPrefix" value="http://wayback.somehost.org:8080/wb-webapp/wayback/" />
</bean>
</property>
| configuration | optional/required | description |
|---|---|---|
| jsInserts | required | This list must include a reference to the wm.js javascript file. References to additional javascript files here will result in a reference to those javascript URLs within all replayed HTML pages. |
| jspInserts | optional | If any values are referenced here, then those .jsp files will be invoked for every replayed document, and the resulting output will be included in replayed HTML pages. The example included here will result in a Timeline banner being included in each replayed HTML page, allowing navigation between different versions of the current URL. |
| maxRecords | optional | Sets the default maximum number of records requested for Archival URL query requests. |
| earliestTimestamp | optional | Sets the default start date for requested records for Archival URL query requests. |
| replayURIPrefix | required | Points to the Archival URL prefix of the Access Point as illustrated in the preceding table. |
Proxy
Wayback can be configured to act as an HTTP proxy server. To utilize
this mode, the wayback webapp must be deployed as the ROOT context,
and the client browser must be configured to proxy all HTTP requests
through the Wayback Machine application. Instead of retrieving
documents from the live web, the Wayback Machine will retrieve
documents from the local repository of ARC files.
Proxy Replay mode does not suffer from the shortcomings of
the inserted Javascript that the Archival URL mode uses,
but it has one major drawback: there is no way to
specify which version of a captured document should
be replayed. Only the URL to be replayed is sent from the
client browser to the Wayback Machine - no date information
is sent with the request.
In Proxy Replay mode, the Wayback Machine will return the
most recent version captured of any requested page. This
behavior can be changed by using the experimental Firefox-specific
plugin developed by Oskar Grenholm. You can find out more about
this plugin and download it here. Thanks Oskar!
The following is an example Proxy Replay Access Point definition. It
assumes that Wayback is running on the host wayback.somehost.org, that a
Tomcat Connector has been added for port 8090,
that the Wayback webapp has been deployed at the ROOT context, and
that another Archival URL Access Point named "8080:wayback" has been
configured.
<bean name="8090" parent="8080:wayback">
<property name="useServerName" value="true" />
<property name="replay">
<bean class="org.archive.wayback.proxy.ProxyReplayDispatcher" />
</property>
<property name="uriConverter">
<bean class="org.archive.wayback.proxy.RedirectResultURIConverter">
<property name="redirectURI" value="http://wayback.somehost.org:8090/jsp/Redirect.jsp" />
</bean>
</property>
<property name="parser">
<bean class="org.archive.wayback.proxy.ProxyRequestParser" init-method="init">
<property name="localhostNames">
<list>
<value>wayback.somehost.org</value>
</list>
</property>
<property name="maxRecords" value="1000" />
</bean>
</property>
</bean>
redirectURI is required, and must use the name of the
host where the Wayback application is running. If this is not the
primary name of the machine running the Wayback application, then you
may also need to list the hostname used for the Wayback application
in the localhostNames configuration list.
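On the client side, any HTTP client that supports proxies can be pointed at a Proxy Replay Access Point, not only a browser. The following Python sketch uses the standard library to build such a client; the host and port are the example values from this section, and the request itself is left commented out because it needs a running Wayback installation.

```python
import urllib.request

def wayback_proxy_opener(proxy_url):
    """Return an opener that sends plain HTTP requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)

opener = wayback_proxy_opener("http://wayback.somehost.org:8090")
# opener.open("http://www.yahoo.com/") would then be answered by the proxy
# instead of the live web (not executed here; it needs a running server).
```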
Excluding Documents within an AccessPoint
Excluding Documents with live Robots.txt
Documents may be excluded from access within an Access Point by
retroactively enforcing the policies in a web site's live robots.txt
document. To do so, add the following configuration to the Access Point:
<property name="exclusionFactory" ref="excluder-factory-robot" />
Please see the default wayback.xml packaged with this software for an
example bean definition for the referenced excluder-factory-robot bean.
Excluding Documents with an Administrative List
Documents may be excluded from access within an Access Point by
using a plain text file listing URL prefixes which should be blocked.
If this option is used with a non-zero value for checkInterval,
the Wayback software will monitor the external file, and will
automatically reload the file when it changes.
The following Spring configuration defines a static exclusion file that
causes URLs listed in the file /tmp/exclude.txt to be blocked,
with the file being checked for updates every 10 minutes.
<bean id="static-exclusion" class="org.archive.wayback.accesscontrol.staticmap.StaticMapExclusionFilterFactory" init-method="init">
<property name="file" value="/tmp/exclude.txt" />
<property name="checkInterval" value="600" />
</bean>
Adding the following configuration to an Access Point will cause the
excluded URLs named in /tmp/exclude.txt to be inaccessible:
<property name="exclusionFactory" ref="static-exclusion" />
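The prefix-based blocking described in this section can be pictured with the following Python sketch. It assumes one URL prefix per line and a plain string comparison; the real StaticMapExclusionFilterFactory may normalize URLs or accept a different file layout, so treat this as a conceptual model only. The example hosts are invented.

```python
def load_prefixes(text):
    """Parse exclusion file content: one URL prefix per non-empty line (assumed)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def is_excluded(url, prefixes):
    """True when the URL starts with any listed prefix."""
    return any(url.startswith(prefix) for prefix in prefixes)

# Hypothetical file content.
prefixes = load_prefixes("http://blocked.example.org/private/\nhttp://gone.example.com/\n")
print(is_excluded("http://blocked.example.org/private/a.html", prefixes))
print(is_excluded("http://blocked.example.org/public/", prefixes))
```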
Restricting who can interact with an AccessPoint
Limiting Access based on IP Addresses
Access to a particular Access Point can be limited to a specific IP
address range by adding the following configuration to an Access Point
definition.
<property name="authentication">
<bean class="org.archive.wayback.authenticationcontrol.IPMatchesBooleanOperator">
<property name="allowedRanges">
<list>
<value>192.168.1.16/24</value>
</list>
</property>
</bean>
</property>
which would have the effect of blocking users outside the
192.168.1.16/24 network.
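To see which client addresses a range such as 192.168.1.16/24 covers under standard CIDR rules, the check can be reproduced with Python's ipaddress module. This sketch illustrates the CIDR arithmetic only; it assumes the range is interpreted in the usual way (a 24-bit network mask) and does not reflect the internals of IPMatchesBooleanOperator.

```python
import ipaddress

def is_allowed(client_ip, allowed_ranges):
    """True when client_ip falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    # strict=False accepts ranges such as 192.168.1.16/24, whose host bits are set.
    networks = [ipaddress.ip_network(r, strict=False) for r in allowed_ranges]
    return any(address in network for network in networks)

print(is_allowed("192.168.1.200", ["192.168.1.16/24"]))
print(is_allowed("192.168.2.1", ["192.168.1.16/24"]))
```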
Limiting Access based on HTTP BASIC Authentication
Access to a particular Access Point can be restricted using Tomcat's
built-in configuration options, by adding the following configuration to
the web.xml. The example assumes an Access Point named "8080:secure"
(the port may be any value):
<security-constraint>
<web-resource-collection>
<web-resource-name>Secured-Wayback</web-resource-name>
<url-pattern>/secure/*</url-pattern>
</web-resource-collection>
<auth-constraint>
<role-name>wayback</role-name>
</auth-constraint>
</security-constraint>
<login-config>
<auth-method>BASIC</auth-method>
<realm-name>Secured-Wayback</realm-name>
</login-config>
Then add the user configuration to the tomcat-users.xml file:
<role rolename="wayback"/>
<user password="changeM3" roles="wayback" username="brad"/>
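A client accessing the secured Access Point has to send HTTP BASIC credentials with each request. The following Python sketch shows how the Authorization header value is formed from the example user above; it is a generic illustration of BASIC authentication, not something specific to Wayback or Tomcat.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP Authorization header for BASIC auth."""
    token = base64.b64encode((username + ":" + password).encode("utf-8"))
    return "Basic " + token.decode("ascii")

# Credentials taken from the tomcat-users.xml example above.
header = basic_auth_header("brad", "changeM3")
print(header)
```

Because BASIC authentication only encodes the credentials and does not encrypt them, it is generally advisable to combine it with HTTPS.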
Adding Additional Configurations to an AccessPoint
The following configuration can be added to an Access Point:
<property name="configs">
<props>
<prop key="inst">Acrobatic Association</prop>
<prop key="logo">http://images.somehost.com/logos/acro.jpg</prop>
</props>
</property>
These configurations are then accessible in the common .jsp rendering
pages, allowing Collection or Access Point specific text to be relayed
to shared .jsp files, which can then retrieve the Access Point specific
configuration with the following code:
UIResults results = UIResults.getFromRequest(request);
String instString = results.getContextConfig("inst");
String logoString = results.getContextConfig("logo");
External Tools
The wayback distribution includes several command-line tools
that assist in creating and testing index files, and populating
the ArcProxy location db.
All the command line tools can be found
underneath the directory where you unpacked your distribution,
at bin/* (example: bin/location-client). You will
need to change permissions on the tools to allow them to be
executed:
chmod a+x bin/*
bdb-client
This tool allows several maintenance operations to be
performed on BDB files. There are two primary modes, read
and write.
-
bin/bdb-client -r BDB_DIR BDB_NAME [PREFIX]
Output records from a BDB database on STDOUT.
where:
-
BDB_DIR Open BDB in this
directory.
-
BDB_NAME Open BDB with this name.
-
PREFIX (optional) if present,
only output records whose KEY begins
with PREFIX. If this option is omitted,
all records will be output from the
BDB. Records are always output in sorted
order.
-
bin/bdb-client -w BDB_DIR BDB_NAME
Read CDX format lines from STDIN, and insert
into a BDB, creating the BDB if needed.
where:
-
BDB_DIR Open BDB in this
directory.
-
BDB_NAME Open BDB with this name.
bin-search
This tool allows binary searching against large sorted text
files. It will output lines prefixed with a particular
key on STDOUT.
bin/bin-search KEY FILE [FILE2 ...]
-
KEY String prefix for lines that should be
output.
-
FILE [FILE2 ...] Sequentially search through
each file specified, outputting the lines prefixed
with KEY for each file. Note that the complete
output of bin-search will be sorted when used with
a single file, but when multiple files are searched,
the results may not be sorted completely.
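The technique behind bin-search, a binary search for the first line at or after KEY followed by a forward scan, can be sketched in Python on an in-memory list. The real tool works on files on disk; the function name and sample lines here are invented for illustration.

```python
import bisect

def prefix_search(sorted_lines, key):
    """Return all lines starting with key, using binary search for the start."""
    index = bisect.bisect_left(sorted_lines, key)
    matches = []
    # Lines sharing a prefix are contiguous in a sorted list.
    while index < len(sorted_lines) and sorted_lines[index].startswith(key):
        matches.append(sorted_lines[index])
        index += 1
    return matches

# Invented sample lines, sorted by byte value.
lines = sorted([b"a.org/ 1999", b"b.org/ 2001", b"b.org/ 2004", b"c.org/ 2000"])
print(prefix_search(lines, b"b.org/"))
```

This only works when the input is sorted by the same ordering used for the comparison, which is why the index files must be sorted consistently.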
arc-indexer
This tool creates a CDX format index for the ARC file at ARC_PATH,
either on STDOUT, or at the path specified by CDX_PATH. The resulting
file can be sorted and merged with other CDX format index files to
generate a CDX format ResourceIndex.
bin/arc-indexer ARC_PATH [CDX_PATH]
location-client
If you have already populated your ResourceIndex and just
need to inform the ArcProxy LocationDB of where ARC files
are located, this script will allow you to synchronize the
ArcProxy LocationDB with the directories holding your ARC
files.
Execute the script once for each directory containing
ARC files (on each machine containing ARC files). Again,
this script will not index the content of the ARC
files, but will only populate the ArcProxy LocationDB with
the locations of ARC files.
bin/location-client sync LOCATION_URL ARC_DIR ARC_URL_PREFIX
where:
-
LOCATION_URL
is the absolute URL where the ArcProxy can be
accessed. ex.
http://wayback-webapp.your-archive.org:8080/locationdb/locationDB
-
ARC_DIR
is the absolute path to the directory on the local
machine which holds ARC files ex.
/2/arc-collection-1
-
ARC_URL_PREFIX
is the absolute URL where the directory ARC_DIR can
be accessed. ex.
http://arc-storage-node-1.your-archive.org/2/arc-collection-1/
url-client
URLs stored in BDB and CDX format ResourceIndexes are
canonicalized to a more generic form. Before
performing a lookup operation on the ResourceIndex, the same
canonicalization function is applied to requested URLs. This
tool will read space(" ") delimited lines from STDIN, and
output the same lines on STDOUT, but with one column
altered. The column that is changed is assumed to be a URL,
and the version output is the canonicalized form of the
input URL.
This tool is mostly useful for debugging the
canonicalization function, but can also be used, if the
canonicalization function is altered, to update an existing
CDX index, without recreating CDX files from original ARCs.
bin/url-client [-cdx] [-f FIELD]
-
-cdx Pass through lines prefixed with " CDX "
unchanged.
-
-f FIELD alter column FIELD of each line,
instead of the default column 1.
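The column rewriting that url-client performs can be modelled in a few lines of Python. The canonicalization function used here is a deliberately trivial stand-in (lowercasing) and is not Wayback's actual canonicalizer; the function and parameter names are hypothetical and only mirror the -cdx and -f options described above.

```python
def rewrite_column(line, canonicalize, field=1, cdx=False):
    """Replace one space-delimited column (1-based) with its canonical form."""
    if cdx and line.startswith(" CDX "):
        return line  # header line, passed through unchanged
    columns = line.split(" ")
    columns[field - 1] = canonicalize(columns[field - 1])
    return " ".join(columns)

def toy_canonicalize(url):
    """Placeholder only: NOT the canonicalization Wayback applies."""
    return url.lower()

print(rewrite_column("HTTP://Example.ORG/ 19990113033435", toy_canonicalize))
```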
ArcProxy and LocationDB application
The Wayback software includes an additional application, the ArcProxy,
which can simplify some distributed ResourceStore implementations. The
ArcProxy application exposes two external services, one used to
configure the underlying database mapping ARC filenames to the actual,
fully qualified HTTP 1.1 URL, and a second service which reverse proxies
incoming HTTP 1.1 range requests to appropriate back-end storage nodes.
The arcproxy reverse proxy service allows one or more HttpARCResourceStore
instances to configure a single URL prefix where all ARC files are
assumed to be located. This reverse proxy then uses a BDB JE database to find the
actual current location of the ARC file, and forwards the request to the
actual host holding the ARC file.
The locationdb service allows population and management of the
BDB JE database (the locationDB) used by the arcproxy service. There is also
a command line tool, location-client, described in the "External Tools"
section, which provides command line access
to the management of the locationDB.
Adding the following configuration to wayback.xml will expose the
arcproxy and locationdb services:
<bean id="filelocationdb" class="org.archive.wayback.resourcestore.http.FileLocationDB"
init-method="init">
<property name="bdbPath" value="/tmp/wayback/arc-db" />
<property name="bdbName" value="DB1" />
<property name="logPath" value="/tmp/wayback/arc-db.log" />
</bean>
<bean name="8080:arcproxy" class="org.archive.wayback.resourcestore.http.ArcProxyServlet">
<property name="locationDB" ref="filelocationdb" />
</bean>
<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.http.FileLocationDBServlet">
<property name="locationDB" ref="filelocationdb" />
</bean>