A full listing of changes and bug fixes is not currently available
for releases prior to 1.2.0.
- Memento integration.
Improved live-web fetching, enabling simpler external caching of
robots.txt documents, or other arbitrary content used to improve
function of a replay session.
Customizable logging, via a logging.properties configuration file.
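As a rough illustration of what such a file controls, the sketch below applies a properties-style configuration through the standard java.util.logging API. It assumes the file follows the java.util.logging format; the class name, the inline properties, and the logger name are hypothetical and chosen only for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.LogManager;
import java.util.logging.Logger;

final class LoggingConfigSketch {

    // Applies a java.util.logging configuration read from the given stream.
    static void configure(InputStream in) {
        try {
            LogManager.getLogManager().readConfiguration(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical contents of a logging.properties file.
        String props =
                "handlers=java.util.logging.ConsoleHandler\n"
              + ".level=WARNING\n"
              + "java.util.logging.ConsoleHandler.level=ALL\n"
              + "org.archive.wayback.level=FINE\n";
        configure(new ByteArrayInputStream(props.getBytes(StandardCharsets.ISO_8859_1)));
        // Loggers under the configured name now accept FINE messages.
        Logger.getLogger("org.archive.wayback").fine("fine-level message");
    }
}
```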
Vastly improved Server-side HTML rewriting capabilities, including
customizable rewriting of specific tags and attributes, rewriting
Snazzy embedded toolbar with "sparkline" indicating the distribution
of captures for a given HTML page, control elements enabling
navigation between various versions of the current page, and a
search box to navigate to other URLs directly from a replay session.
Improved hadoop CDX generation capabilities for large scale indexes.
SWF (Flash) rewriting, to contextualize absolute URLs embedded
within flash content.
ArchivalUrl mode now accepts identity ("id_") flag to indicate
transparent replaying of original content.
NotInArchive can now optionally trigger an attempt to fill in
content from the live web, on the fly.
Updated license to Apache 2.
Major Bug Fixes
More robust handling of chunk encoded resources.
Fixed problem with improperly resolving path-relative URLs found
Fixed problem with improperly escaping URLs within HTML when
Fixed problem where a misconfigured or missing administrative
exclusion file was allowing results to be returned, instead of
returning an appropriate error.
No longer extracts resources from the ResourceStore before
redirecting to the closest version, which was a major inefficiency.
Now provide closeMatches list of search results which were not
applicable given the user's request, but that may be useful for
Archival Url mode now allows rotating through several character
encoding detection schemes.
Proxy Replay mode now accepts ArchivalURL format requests, allowing
dates to be explicitly requested via proxy mode.
AccessPoints can now be configured to optionally require strict host
matching for queries and replay requests.
Now filters URLs which contain user-info (USER:PASSWORD@example.com)
from the ResourceIndex.
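A minimal sketch of how a URL with a user-info component could be detected, using only java.net.URI. The class and method names are hypothetical, and this is not necessarily how the ResourceIndex filter itself is written.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UserInfoFilterSketch {

    // Returns true when the URL carries a user-info component such as USER:PASSWORD@host.
    static boolean hasUserInfo(String url) {
        try {
            return new URI(url).getRawUserInfo() != null;
        } catch (URISyntaxException e) {
            // Assumption for this sketch: unparseable URLs are treated as having no user-info.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasUserInfo("http://USER:PASSWORD@example.com/"));
        System.out.println(hasUserInfo("http://example.com/"));
    }
}
```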
ArchivalURL mode requests without a datespec are now interpreted as
a request for the most recent capture of the URL.
Improvements in mapping incoming requests to AccessPoints, to allow
virtual hosts to target specific AccessPoints.
ResourceNotAvailable exceptions now include other close search
results, allowing the UI to offer other versions which may be
ArchivalURL mode now forwards request flags (cs_, js_, im_, etc)
when redirecting to a closer date.
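The datespec and flag handling described in the ArchivalURL items above can be sketched as follows. The assumed request shape is an optional datespec of up to 14 digits, an optional flag such as id_ or js_, a slash, and the original URL; the class, field, and method names are hypothetical and this is not Wayback's actual parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ArchivalUrlSketch {
    // Assumed shape: digits, an optional two-letter flag ending in "_", "/", then the URL.
    private static final Pattern DATESPEC =
            Pattern.compile("^(\\d{1,14})([a-z]{2}_)?/(.+)$");

    final String timestamp; // null when the request carries no datespec
    final String flag;      // e.g. "id_" or "js_"; null when absent
    final String url;

    private ArchivalUrlSketch(String timestamp, String flag, String url) {
        this.timestamp = timestamp;
        this.flag = flag;
        this.url = url;
    }

    static ArchivalUrlSketch parse(String path) {
        Matcher m = DATESPEC.matcher(path);
        if (m.matches()) {
            return new ArchivalUrlSketch(m.group(1), m.group(2), m.group(3));
        }
        // No datespec: a caller would treat this as a request for the most recent capture.
        return new ArchivalUrlSketch(null, null, path);
    }

    // Builds the redirect target for a closer capture date, carrying the request flag along.
    String redirectTo(String closerTimestamp) {
        return closerTimestamp + (flag == null ? "" : flag) + "/" + url;
    }
}
```

For example, a request parsed from 20100101000000js_/http://example.com/a.js and redirected to 20100315120000 keeps its js_ flag in the new path.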
ResourceStore implementation now allows retrying when confronted
with possibly-transient HTTP 502 errors.
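A retry of this kind might look like the sketch below. The fetch is modelled as a supplier of HTTP status codes, and the names and attempt limit are hypothetical; a real implementation would likely also wait between attempts.

```java
import java.util.function.IntSupplier;

final class RetrySketch {
    static final int HTTP_BAD_GATEWAY = 502;

    // Runs the attempt up to maxAttempts times, retrying only while it reports 502.
    // Returns the last status code observed.
    static int fetchWithRetry(IntSupplier attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        int status = attempt.getAsInt();
        for (int tries = 1; status == HTTP_BAD_GATEWAY && tries < maxAttempts; tries++) {
            status = attempt.getAsInt();
        }
        return status;
    }
}
```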
Minor Bug Fixes
cdx-indexer (replacement for arc-indexer and warc-indexer) tool now
returns accurate error code on failure.
No longer sets JVM-wide default timezone to GMT - now it is set
appropriately on Calendars when needed.
Hostname comparison is now case-insensitive.
Server-relative archival url redirects now include query arguments
Server-relative archival url redirects now include a Vary HTTP
header, to fix problems when a cache is used between clients and
the Wayback service.
Fixed problem with robots.txt caching within a single request,
which caused serious inefficiency.
Fixed problem with resources redirecting to alternate HTTP/HTTPS
version of themselves.
Fixed problem with accurately converting 14-digit Timestamps into
Date objects for later comparison.
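The sketch below shows one way to convert a 14-digit timestamp into a Date while setting the time zone on the formatter rather than on the whole JVM, in the spirit of the timezone fix above. It assumes timestamps are expressed in GMT; the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class TimestampSketch {

    // Parses a yyyyMMddHHmmss timestamp as GMT without touching TimeZone.setDefault().
    static Date parse14(String ts) {
        if (ts == null || !ts.matches("\\d{14}")) {
            throw new IllegalArgumentException("expected 14 digits: " + ts);
        }
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        fmt.setLenient(false); // reject impossible values such as month 13
        try {
            return fmt.parse(ts);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid timestamp: " + ts, e);
        }
    }
}
```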
Automatically remaps the oft-misused charset "iso-8859-1" to the
Added exactSchemeOnly configuration to AccessPoint, allowing
explicit distinction between http:// and https:// (ACC-32)
Now times out requests to a slow/non-responsive RemoteResourceIndex
and remote(HTTP 1.1) ResourceStore nodes.(ACC-38)
experimental OpenSearchQuery .jsp implementations(ACC-56)
FileProxyServlet now accepts /OFFSET trailing path in addition to
Content-Range HTTP header.(ACC-74)
warc-indexer now has -all option to produce a CDX line for ALL
records, not just captures and revisits(ACC-75)
now includes file+offset for all records, keying off mime-type of
warc/revisit to determine revisits at query time.(ACC-76)
Allow prefixing of original HTTP headers with a fixed string.
Now Wayback rewrites Content-Base HTTP headers.(ACC-78)
Timeline.jsp improvements which prevent Timeline from being severely
distorted on some pages.
Improvement to ArchivalUrl client-rewrite.js to preserve link text,
working around a bug in Internet Explorer.
Now all mime-types are escaped to prevent spaces from getting into
the CDX files.(ACC-45)
Some CSS URLs were being rewritten twice. (ACC-53)
No longer writing the original page's Content-Length HTTP header to
output, which caused original pages with Lower-Case "L" in
"Content-length" to return wrong length, truncating replayed
documents. This caused some replayed pages to not have embedded
Fixed severe problem with live web robots.txt retrieval where wrong
offset was being written into the live web ResourceIndex.
Charset extraction from HTTP headers is now case-insensitive.
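A case-insensitive extraction could be sketched with a regular expression, as below. The pattern and names are illustrative assumptions, not the code used by Wayback.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetHeaderSketch {
    // Matches charset=VALUE in any letter case, with optional double quotes around VALUE.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null when the header declares none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }
}
```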
No longer adding content to HTML pages with FrameSet tags, as they
were being broken.(ACC-65)
No longer set GMT as default timezone for entire JVM.(ACC-70)
Index filter which allows including/excluding records based on HTTP
response code field.(ACC-43)
Outputs log message instead of stack dump when failing to access
Some redirect records were not being located in index due to bad
logic in Duplicate record filter.(ACC-30)
Wayback was not throwing a NotInArchiveException when
Self-Redirect replay filter removes all records. (unreported)
Location HTTP header values were not being escaped before
placing in CDX, causing some records to have too many columns.
Search Result summary counts were incorrect in Url Prefix
Implemented NoCache.jsp, a replay insert which adds a
Cache-Control: no-cache HTTP header to all replayed
Timeline.jsp was using Request Date, not Capture date, which
caused Proxy Mode Timeline to show the wrong date.
Advanced Search reference implementation .jsp was broken.
AnchorDate and AnchorWindow functionality is now disabled by
default, and can be enabled via configuration on an AccessPoint.
- @ Completely new implementation of ResourceStore classes,
including recursive local directory scanning, scanning multiple
local directories, an experimental remote directory scanning
capability, and groundwork for future support of both non ARC/WARC
file formats and large scale automatic indexing.
- @ Complete overhaul of the Replay system, allowing
jspInserts within ArchivalUrl, DomainPrefix, and Proxy replay
modes. Also includes groundwork for future fine-grained mime-type
and url-based Replay customizations.
Added capability to explicitly set Locale to use for an
AccessPoint, overriding the default behavior of using the user
agent's specified preferred language.
New flat file implementation of FileLocationDB. See
CDXCollection.xml within the .war file for an example usage.
AnchorDate feature, tracking the date with which a user begins a
replay session. During this session, wayback will always attempt to
remain near this date, preventing time-drift within a replay
AnchorWindow feature, which allows users to specify a maximum time
window in either direction of the AnchorDate that they wish to view
replayed content. When a user has set this option, Wayback will not
display captures outside the specified window.
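The window test behind the AnchorWindow feature can be pictured as a simple distance check between two dates. The sketch below assumes the window is expressed in seconds; the names are hypothetical.

```java
import java.util.Date;

final class AnchorWindowSketch {

    // True when the capture lies within windowSeconds of the anchor, in either direction.
    static boolean withinWindow(Date anchor, Date capture, long windowSeconds) {
        long diffMs = Math.abs(capture.getTime() - anchor.getTime());
        return diffMs <= windowSeconds * 1000L;
    }
}
```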
New command line tool location-db to create a location DB
offline, populating with lines read from STDIN.
Added new AccessControlSettingOperation authentication control
component, allowing the configuration of the appropriate Exclusion
system per-request, as defined by arbitrary BooleanOperators. See
ComplexAccessPoint.xml within the .war file for an example usage.
Added .asx archival URL replay, which rewrites links inside
archived .asx files, attempting to make them point back into the
Now accept "http:/" as identical to "http://" in the beginning of
a URL, working around a browser bug which stripped multiple "/"s in
- @ Refactoring of ResourceIndex interfaces, to allow for
future update-able ResourceIndex implementations beyond BDBIndex
- * Major internal refactoring of WaybackRequest object,
providing more stable get/set methods for accessing the standard
internal fields with type-safety.
- * Major internal refactoring of SearchResults into
CaptureSearchResults and UrlSearchResults, which was previously
under-specified and often confusing. These new classes provide more
stable get/set methods for accessing the standard internal fields
- * Changed locations of replay, query, and exception .jsp
files within .war file to underneath WEB-INF, so they are not
directly accessible via HTTP.
German translation of default Wayback UI. Thanks Andreas!
Czech translation of default Wayback UI. Thanks Lukáš Matějka!
All threads now notified of shut downs, allowing resources to be
- * Refactor of all Request and Result related constants from
WaybackConstants to WaybackRequest and the *SearchResult(s)
- * Refactor of the various UI*Results classes, which are used
by Query, Replay, and Exception .jsp files to access context
information into the single class, UIResults, which has a more
New AccessPoint.urlRoot optional configuration, enabling explicit
control over URLs generated for the UI.
(ACC-24) Fixed bug in Proxy mode which prevented the correct number
of results from being returned from the index during Replay.
(ACC-21) fixed bug where some CSS import declarations were not
being correctly rewritten.
(ACC-26) fixed rare String OOB exception when marking up pages with
(ACC-28) verifies that detected encoding is supported in local JVM
before attempting to decode a resource into a String.
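A check of this kind can be written with java.nio.charset.Charset, as in the sketch below. The fallback behavior and the names are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class EncodingCheckSketch {

    // Returns the detected charset when this JVM supports it, otherwise the fallback.
    static Charset usableCharset(String detected, Charset fallback) {
        if (detected == null) {
            return fallback;
        }
        try {
            if (Charset.isSupported(detected)) {
                return Charset.forName(detected);
            }
        } catch (IllegalCharsetNameException e) {
            // The detected name is not even syntactically legal; use the fallback.
        }
        return fallback;
    }

    // Decodes bytes only after the charset has been verified.
    static String decode(byte[] bytes, String detected) {
        return new String(bytes, usableCharset(detected, StandardCharsets.ISO_8859_1));
    }
}
```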
(unreported) fixed declared page encoding of help, advanced search
and index page to UTF-8.
Explicitly set character encoding on returned documents, instead of
relying on Tomcat to return the correct encoding.
Migration notes to 1.4.0 from 1.2.X
Wayback 1.4.0 includes substantial code changes aimed at extending
current capabilities, enabling planned future features, and
stabilizing interfaces used in .jsp customizations. Since these
changes would already require a significant update of existing
customizations made to .jsp files, many non-vital cleanups to the
source tree were included. The goal of implementing all of these
features within this single release is to minimize future required
Below is a somewhat inclusive list of changes that will be required
when upgrading to Wayback 1.4.0 from 1.2.X, divided into two main
categories: changes required to Spring configuration, and changes
required for .jsp customizations. Depending on the scope of the
existing customizations in your installation, it may be simpler in
some cases to modify them to conform to the new interfaces and
packages, and in other cases to begin with the new reference
implementations and modify those to meet your needs.
If there are changes not addressed here, or if you have questions
regarding specific issues when upgrading, please direct these
questions to the archive-access-discuss forum.
Spring upgrade information
New features with the @ mark indicate features that will directly
impact Spring XML configuration files used with 1.2.X.
- org.archive.wayback.resourcestore.http.FileLocationDB now:
- org.archive.wayback.resourcestore.http.FileLocationDBServlet now:
- org.archive.wayback.resourcestore.http.ArcProxyServlet now:
All ReplayUI implementations changed completely, now located in:
ArchivalUrlReplay.xml, DomainPrefixReplay.xml, ProxyReplay.xml.
Customizations to jspInserts should be straightforward on
inspecting these files.
- org.archive.wayback.resourcestore.Http11ResourceStore now:
RemoteCollection.xml for configuration example.
The new automatic indexing is most simply upgraded by modifying
the new example in BDBCollection.xml with your custom paths.
.jsp upgrade information
New features with the * mark indicate features that will directly
impact customizations made to .jsp files used with 1.2.X. The bulk of
the changes fit three categories:
class name and package changes requiring import tag updates.
Please see .jsps in new distribution for updated packages.
.jsp path changes due to webapp directory tree cleanup. Again,
please see the current locations in the new distribution.
Java changes within .jsp files due to UIResults refactoring.
Previously each type of response page had a unique class used
to marshal context information to the .jsp files. These have all
been refactored into a single class,
org.archive.wayback.core.UIResults which has methods to
access the appropriate data in each case. Additionally, many
convenience methods that were present on the various UI*Results
classes have been removed, since convenience methods are now
available on the core classes:
As an example, the Timestamp class is no longer used in the .jsp
files, since all time information uses the Date class for
localization. All of the above classes now have methods to
directly return Dates.
For specific examples, please see the reference .jsp files
included with the new distribution.
Now explicitly sets the charset component of replayed HTML
page Content-Type HTTP headers in Archival URL mode. This
overrides Tomcat's default behavior of explicitly setting this value
to Tomcat's default encoding character set, if a document
does not set it explicitly. The original Content-Type HTTP
header value is now returned as HTTP header
error handling .jsps
now returns "closest" indicator on XML query results, fixing problem
with WAXToolbar/Proxy mode.(ACC-11)
- auto-indexer now closes ARC/WARC files after indexing, fixing
- location-client now syncs .warc and .warc.gz files with
locationDB, in addition to .arc and .arc.gz files.(ACC-13)
fixed problem which prevented captures archived after webapp was
deployed from being returned. Now captures up to the current moment
are returned. (ACC-14)
changed all .jsp files to return UTF-8(ACC-18)
now sending correct end Date to remote NutchWAX index.
fixed String OOB exception when attempting to rewrite some CSS text
now updates CSS "import 'URL';" and 'import "URL";' content.
Previously only updated "import url(URL);" content.
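The three import forms can be covered by one pattern, as in the sketch below. It assumes the standard CSS @import syntax; the regular expression, the names, and the /wb/ prefix used in the test are illustrative, not Wayback's own rewriting code.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssImportSketch {
    // Group 1: URL inside url(...); group 2: URL given as a bare quoted string.
    private static final Pattern IMPORT = Pattern.compile(
            "@import\\s+(?:url\\(\\s*['\"]?([^'\")\\s]+)['\"]?\\s*\\)|['\"]([^'\"]+)['\"])",
            Pattern.CASE_INSENSITIVE);

    // Replaces each imported URL with rewriter(URL), leaving all other CSS text untouched.
    static String rewriteImports(String css, UnaryOperator<String> rewriter) {
        Matcher m = IMPORT.matcher(css);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            int g = m.group(1) != null ? 1 : 2;
            out.append(css, last, m.start(g));
            out.append(rewriter.apply(m.group(g)));
            last = m.end(g);
        }
        out.append(css, last, css.length());
        return out.toString();
    }
}
```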
fixed Replay redirect loop when using RemoteResourceIndex
now supports compressed and uncompressed ARC and WARC files.
initial revision of "deduplicated" WARC record handling, which
returns the last version that was actually stored when
subsequent captures are not saved because they have not changed.
now filters (literal) duplicate records from the ResourceIndex,
in case the same capture (url + date) appears twice, or in two
UrlCanonicalizer is now pluggable, current functionality is now
implemented in AggressiveUrlCanonicalizer. Added
IdentityUrlCanonicalizer, which performs no canonicalization.
- bin-search command line tool now outputs a single stream of
sorted results from multiple files, instead of returning matches
from each file sequentially.
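Producing one sorted stream from several sorted inputs is a k-way merge. The sketch below shows the idea with in-memory iterators standing in for files; the names are hypothetical, and String ordering is used here, which matches byte-wise file ordering only for ASCII content.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class SortedMergeSketch {

    // Merges already-sorted sources into a single sorted list.
    static List<String> merge(List<? extends Iterator<String>> sources) {
        Comparator<Map.Entry<String, Iterator<String>>> byLine = Map.Entry.comparingByKey();
        // Each heap entry pairs the current head line of a source with that source.
        PriorityQueue<Map.Entry<String, Iterator<String>>> heap = new PriorityQueue<>(byLine);
        for (Iterator<String> it : sources) {
            if (it.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<String, Iterator<String>> head = heap.poll();
            out.add(head.getKey());
            Iterator<String> it = head.getValue();
            if (it.hasNext()) {
                // Refill the heap from the source that just supplied the smallest line.
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));
            }
        }
        return out;
    }
}
```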
extracted several replay features into separate jspInserts that
can now be mixed and matched.
now handles most text/css URL rewriting, both inside HTML pages,
and in externally linked .css files.
externalized comment embedded inside replayed HTML pages into
added two-month timeline partition.
root page of webapp now lists access points, when users make
a request that does not specify one. Also, now access point
"slash-pages" are available "without the slash".
Now rewrite Location and Content-Base HTTP headers in non-HTML
Archival URL replayed documents.
now rewrites all background attributes found in returned
pages (archival URL mode only) instead of just on BODY tags.
now rewrites src attributes on INPUT tags.
command line tools now allow whitespace arguments, important for
tools accepting delimiter arguments.
replay URLs in query results now include non-standard ports, if
Timezone is now explicitly set to GMT/UTC, fixing a Calendar
result partitioning problem.
uncaught character-encoding exceptions now handled, plus
slightly improved detection of correct character encoding by
removing internal whitespace in declared encoding names.
archival URL parsing of query end-date now assumes latest
possible date given a partial end-date, instead of earliest
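Padding a partial end-date to its latest possible value can be sketched as below. The sketch assumes partial dates arrive as whole fields (year, then month, day, hour, minute, second) and uses java.time to find the last day of a month; the names are hypothetical.

```java
import java.time.YearMonth;
import java.util.Locale;

final class EndDatePaddingSketch {

    // Expands a partial timestamp (4 to 14 digits, whole fields only) to the
    // latest 14-digit timestamp it could denote.
    static String padEnd(String partial) {
        if (partial == null || !partial.matches("\\d{4}(\\d{2}){0,5}")) {
            throw new IllegalArgumentException("unsupported partial date: " + partial);
        }
        int len = partial.length();
        int year = Integer.parseInt(partial.substring(0, 4));
        int month = len >= 6 ? field(partial, 4) : 12;
        int day = len >= 8 ? field(partial, 6) : YearMonth.of(year, month).lengthOfMonth();
        int hour = len >= 10 ? field(partial, 8) : 23;
        int minute = len >= 12 ? field(partial, 10) : 59;
        int second = len >= 14 ? field(partial, 12) : 59;
        return String.format(Locale.ROOT, "%04d%02d%02d%02d%02d%02d",
                year, month, day, hour, minute, second);
    }

    // Reads the two-digit field starting at the given index.
    private static int field(String s, int start) {
        return Integer.parseInt(s.substring(start, start + 2));
    }
}
```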
re-implemented lost "closest" indicator for XML results.
now supports multiple auto index threads, one per ResourceStore,
and also multiple auto index merge threads, one per BDB
fixed hard-coded maximum year issue.
reimplemented NotInArchive logging, which was lost in 1.0.0.